Re: [HACKERS] Proposal : REINDEX xxx VERBOSE
src/bin/psql/tab-complete.c:

    @@ -3335,7 +3335,6 @@ psql_completion(const char *text, int start, int end)
        /* REINDEX */
        else if (pg_strcasecmp(prev_wd, "REINDEX") == 0)
        {
    -       static const char *const list_REINDEX[] = {"TABLE", "INDEX", "SYSTEM", "SCHEMA", "DATABASE", NULL};
            COMPLETE_WITH_LIST(list_REINDEX);

The attached patch fixes it, and it now looks good to me. Just one last note: IMHO we should add a regression test to src/bin/scripts/090_reindexdb.pl.

Thank you for your patch! (Sorry, the patch I attached earlier still had a compile error.)

- 000_reindex_verbose_v13.patch

Looks good to me.

- 001_reindexdb_verbose_option_v1.patch

I noticed a bug in the reindexdb patch, so a fixed version is attached. The regression test for reindexdb is added as well.

Pushed.

Regards,

-- Fujii Masao

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
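The completion behavior under discussion can be illustrated with a minimal, self-contained sketch. The keyword list mirrors the patched list_REINDEX above (SCHEMA alongside TABLE, INDEX, SYSTEM, DATABASE), but complete_from_list() is a hypothetical stand-in for psql's COMPLETE_WITH_LIST machinery, not actual psql code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>            /* strncasecmp (POSIX) */

/* Keyword list mirroring the patched list_REINDEX in tab-complete.c. */
static const char *const list_REINDEX[] =
    {"TABLE", "INDEX", "SYSTEM", "SCHEMA", "DATABASE", NULL};

/*
 * Hypothetical helper: return the first keyword matching the typed
 * prefix case-insensitively, or NULL if none matches.  psql's real
 * completion walks its lists in essentially this way.
 */
static const char *
complete_from_list(const char *const *list, const char *prefix)
{
    size_t  len = strlen(prefix);

    for (; *list != NULL; list++)
        if (strncasecmp(*list, prefix, len) == 0)
            return *list;
    return NULL;
}
```

With the patched list, typing `REINDEX sch<TAB>` can now complete to SCHEMA, which is what the thread's fix enables.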
Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)
On Thu, Apr 30, 2015 at 12:57 PM, Sawada Masahiko sawada.m...@gmail.com wrote:
On Wed, Apr 29, 2015 at 12:17 AM, David Steele da...@pgmasters.net wrote:
On 4/28/15 2:14 AM, Sawada Masahiko wrote:
On Fri, Apr 24, 2015 at 3:23 AM, David Steele da...@pgmasters.net wrote:

I've also added some checking to make sure that if anything looks funny on the stack an error will be generated. Thanks for the feedback!

Thank you for updating the patch! I ran the PostgreSQL regression tests on a database with pg_audit enabled, and they work fine. Looks good to me. If no one has further review comments or bug reports, I will mark this as Ready for Committer.

Thank you! I appreciate all your work reviewing this patch, and of course everyone else who commented on, reviewed or tested the patch along the way.

I have changed the status of this to Ready for Committer.

The specification of session audit logging still seems unclear to me. For example, why doesn't session audit logging log queries that access a catalog like pg_class? Why doesn't it log queries executed in an aborted transaction state? Since there is no such information in the documentation, I'm afraid that users would easily get confused. Even if we document it, I'm not sure the current behavior is good for audit purposes. For example, some users may want to log even queries that access the catalogs.

I have no idea when the current CommitFest will end, but we probably don't have much time left. So I'm thinking that maybe we should pick a small, self-contained and useful part of the patch and focus on that. If we try to commit every feature that the patch provides, we might get nothing, at least in 9.5, I'm afraid.

Regards,

-- Fujii Masao
Re: [HACKERS] FPW compression leaks information
On Wed, Apr 15, 2015 at 12:00 AM, Magnus Hagander mag...@hagander.net wrote:
On Tue, Apr 14, 2015 at 4:50 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
On 04/14/2015 05:42 AM, Robert Haas wrote:
On Sun, Apr 12, 2015 at 8:38 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

As to RLS - yeah, that's where I think a lot of the possible covert channel attacks are. But it doesn't have to be RLS per se. It can be, for example, a table some of whose contents you expose via a security barrier view. It can be a security-definer function that you call and it returns some data that you don't have rights to view directly. Basically, it can be any table to which you have some access, but not complete access. Then you can use timing attacks, scrutinize EXPLAIN plans for clues, look at CTIDs, and so on. Basically, I'm worried that it's going to be completely impractical to prevent a certain amount of information leakage when you have partial access to a table, and that in a largely-futile effort to try to prevent it, we'll end up making a whole bunch of things (like the WAL insert position) super-user only, and that this will in fact be a net reduction in security because it'll encourage people to use the superuser account more. Perhaps that concern is ill-founded, but that's what I'm worried about.

I'm not a big fan of locking down WAL position information either. If we have to treat the current WAL position as security-sensitive information, we're doing something wrong. But I don't want to just give up either. One option is to make this controllable on a per-table basis, as Amit suggested. Then we could always disable it for pg_authid, add a suitable warning to the docs, and let the DBA enable/disable it for other tables. It's a bit of a cop-out, but it would fix the immediate problem.

I think it's a fairly narrow attack vector. As long as we document it clearly, and make it easy enough to turn it off, I think that's definitely enough. Most people will not care at all, I'm sure - but we need to cater to those that do. I think we could probably even get away with just documenting the risk and having people turn off the compression *completely* if they care about it, but if we can do it at a table level, that's obviously a lot better.

+1

Regards,

-- Fujii Masao
Re: [HACKERS] FPW compression leaks information
On Wed, Apr 15, 2015 at 11:55 AM, Michael Paquier michael.paqu...@gmail.com wrote:
On Wed, Apr 15, 2015 at 11:10 AM, Fujii Masao masao.fu...@gmail.com wrote:
On Wed, Apr 15, 2015 at 12:00 AM, Magnus Hagander mag...@hagander.net wrote:

I think it's a fairly narrow attack vector. As long as we document it clearly, and make it easy enough to turn it off, I think that's definitely enough. Most people will not care at all, I'm sure - but we need to cater to those that do. I think we could probably even get away with just documenting the risk and having people turn off the compression *completely* if they care about it, but if we can do it at a table level, that's obviously a lot better.

+1

OK. I am fine with implementing anything required here, meaning the following:

1) A doc patch mentioning that compression can give hints to attackers when working on sensitive fields that have a non-fixed size.

I think that this patch is enough as the first step.

2) A switch at relation level to control wal_compression.

ALTER TABLE SET is not allowed on system catalogs like pg_authid. So should we change it so that a user can change the flag even on a system catalog? I'm afraid that the change might cause another problem, though. We could probably disable the compression on every system catalog by default, but I can imagine that someone may want to enable the compression even on a system catalog. For example, pg_largeobject may cause lots of FPWs.

This needs a change in XLogRecordAssemble, adding some new logic to check whether a relation enforces WAL compression or not. As a reloption, there are three possible values: true, false, and fall back to the system default. Also, I think that we should simply extend XLogRegisterBuffer() and pass it the reloption flag, which is then stored in registered_buffer, and XLogRecordAssemble() decides from this flag whether the block is compressed or not. Do we want to add this reloption switch to indexes as well?

Maybe.

Regards,

-- Fujii Masao
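The proposed tri-state reloption (true, false, fall back to the system default) can be sketched as follows. This is a hedged illustration, not the patch under discussion: the enum and function names (RelWalCompression, block_compression_enabled) are hypothetical, while wal_compression itself is the real server GUC.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tri-state reloption value for per-relation control. */
typedef enum
{
    RELOPT_COMPRESS_DEFAULT,    /* fall back to the wal_compression GUC */
    RELOPT_COMPRESS_ON,
    RELOPT_COMPRESS_OFF
} RelWalCompression;

/* System-wide GUC, as in the real server (default off in 9.5). */
static bool wal_compression = false;

/*
 * Decide whether a block registered for a WAL record should have its
 * full-page image compressed, given the relation's reloption.  In the
 * proposal, XLogRecordAssemble() would consult a flag like this that
 * was carried through XLogRegisterBuffer().
 */
static bool
block_compression_enabled(RelWalCompression relopt)
{
    switch (relopt)
    {
        case RELOPT_COMPRESS_ON:
            return true;
        case RELOPT_COMPRESS_OFF:
            return false;
        default:
            return wal_compression;
    }
}
```

The DEFAULT case is what lets an admin disable compression only for pg_authid while leaving the system-wide setting untouched, which is the mitigation discussed above.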
Re: [HACKERS] Proposal : REINDEX xxx VERBOSE
On Wed, Apr 8, 2015 at 10:53 PM, Sawada Masahiko sawada.m...@gmail.com wrote:
On Wed, Apr 8, 2015 at 1:09 PM, Fujii Masao masao.fu...@gmail.com wrote:

[...]

Changed status to Ready for Committer.

The patch adds new syntax like REINDEX ... WITH VERBOSE, i.e., () is not added after the WITH clause. Did we reach consensus on this syntax? The last email from Robert makes me think that () should be added to the syntax.

Thank you for taking the time for this patch!

I removed the FORCE option from REINDEX, so you will need to update the patch.

This was a quite complicated issue, since we already have a lot of command styles. We can consider many styles from various perspectives: kind of DDL, new or old command, maintenance command. And each command has its own story. I believe we have reached consensus on this style at least once (please see the previous discussion), but we might need to discuss it more...

Okay, another question: must WITH be specified whenever the options are given, or should it be omittable?

Regards,

-- Fujii Masao
Re: [HACKERS] Removal of FORCE option in REINDEX
On Thu, Apr 9, 2015 at 12:33 AM, Jim Nasby jim.na...@bluetreble.com wrote:
On 4/8/15 7:04 AM, Fujii Masao wrote:

I'm thinking to apply the attached patch. But does anyone want to keep supporting the option? Why?

Nuke it.

There seem to be no objections, so I just pushed it.

Regards,

-- Fujii Masao
Re: [HACKERS] pg_rewind and log messages
On Tue, Apr 7, 2015 at 11:59 AM, Michael Paquier michael.paqu...@gmail.com wrote:
On Mon, Apr 6, 2015 at 9:10 PM, Fujii Masao masao.fu...@gmail.com wrote:

I'm not familiar with native language support (sorry), but don't we need to add the gettext shortcut to every call of pg_log and pg_fatal, e.g., change pg_fatal(xxx) to pg_fatal(_(xxx))? I know that fprintf() in pg_log_v() has such a shortcut, but I'm not sure whether that's enough.

I think I addressed those things...

         case PG_FATAL:
    -        printf("\n%s", _(message));
    -        printf("%s", _("Failure, exiting\n"));
    +        fprintf(stderr, _("\n%s: fatal: %s\n"), progname, message);
    +        fprintf(stderr, _("Failure, exiting\n"));

Why do we need to output the error level, like fatal? This also seems inconsistent with the log message policy of the other tools.

initdb and psql do not for a couple of warning messages, but perhaps I am just taking consistency with the original code too seriously.

Why do we need to output \n at the head of the message? Is the second message "Failure, exiting" really required?

I think that those things were there to highlight the fact that this is a fatal bailout, but the error code would help the same way, I guess... I eliminated a bunch of newlines in the log messages that seemed really unnecessary to me, simplifying the whole a bit.

While hacking on this stuff, I noticed as well that pg_rewind could be called as root on non-Windows platforms; that's dangerous from a security point of view, as the process manipulates files in PGDATA. Hence let's block that. On Windows, a restricted token should be used.

Good catch! I think it's better to implement it as a separate patch because it's very different from the log message problem.

Attached are new patches:
- 0001 fixes the restriction issues: restricted token on Windows, and forbid running as root on non-Windows.

Thanks for the patch! I'd like to push this patch first. But one comment:

    +        "You must run %s as the PostgreSQL superuser.\n",

Isn't the term "PostgreSQL superuser" confusing? I'm afraid that a user might confuse the PostgreSQL superuser with a database superuser. I see you just borrowed that term from pg_resetxlog.c, though.

BTW, initdb and pg_ctl also have the same check, and their error message is as follows. Isn't this better? Thoughts?

    "%s: cannot be run as root\n"
    "Please log in (using, e.g., \"su\") as the (unprivileged) user that will\n"
    "own the server process.\n"

Regards,

-- Fujii Masao
Re: [HACKERS] pg_rewind and log messages
On Mon, Apr 6, 2015 at 10:01 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
Fujii Masao wrote:
On Mon, Apr 6, 2015 at 5:33 PM, Michael Paquier michael.paqu...@gmail.com wrote:
On Mon, Apr 6, 2015 at 1:41 PM, Michael Paquier wrote:

I guess that you are working on a patch? If not, are you looking for one?

Code-speaking, this gives the patch attached.

Thanks! Here are the review comments:

I'm not familiar with native language support (sorry), but don't we need to add the gettext shortcut to every call of pg_log and pg_fatal, e.g., change pg_fatal(xxx) to pg_fatal(_(xxx))? I know that fprintf() in pg_log_v() has such a shortcut, but I'm not sure whether that's enough.

It's not necessary for pg_fatal and the like, because those functions are marked to have their first argument automatically translated in nls.mk. This means that the string literal is automatically extracted into pg_rewind.pot for translators. Of course, the function itself must call _() (or some variant thereof) to actually fetch the translated string at run time.

Understood. Thanks!

BTW, as far as I read pg_rewind's nls.mk, it also has two problems: (1) file_ops.c should be added to GETTEXT_FILES, and (2) pg_log should be pg_log:2 in GETTEXT_TRIGGERS.

Another thing is compound messages like "foo has happened\nSee --help for usage.\n" and "bar didn't happen\nSee --help for usage." In those cases, the "See --help" part would need to be translated over and over, so it's best to separate them into phrases to avoid repetitive work for translators. Not sure how to do this -- maybe something like _("foo has happened\n") _("See --help"), but I'm not sure how to appease the compiler. Having them in two separate pg_log() calls (or whatever) was handy for this reason.

Yep.

Regards,

-- Fujii Masao
Re: [HACKERS] pg_rewind and log messages
On Wed, Apr 8, 2015 at 5:53 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
Heikki Linnakangas wrote:
On 04/07/2015 05:59 AM, Michael Paquier wrote:

Fix inconsistent handling of logs in pg_rewind. pg_rewind was handling a couple of things differently compared to the other src/bin utilities: - Logging output needs to be flushed to stderr, not stdout.

Agreed in general. But there's also precedent for printing some stuff to stdout: pg_ctl does that for status messages, like "server starting". As does initdb. I'm pretty unclear on what the rule here is.

One principle that sometimes helps is to consider what happens if you use the command as part of a larger pipeline: progress messages can be read by some other command further down (and perhaps reported in a dialog box, if you embed the program in a GUI, say), but error messages should probably be processed differently; normally the pipeline would be aborted as a whole.

Makes sense.

Regards,

-- Fujii Masao
Re: [HACKERS] Proposal : REINDEX xxx VERBOSE
On Wed, Apr 8, 2015 at 1:57 AM, Sawada Masahiko sawada.m...@gmail.com wrote:
On Wed, Apr 8, 2015 at 1:11 AM, Fabrízio de Royes Mello fabriziome...@gmail.com wrote:
On Tue, Apr 7, 2015 at 1:04 PM, Sawada Masahiko sawada.m...@gmail.com wrote:
On Tue, Apr 7, 2015 at 10:12 PM, Fabrízio de Royes Mello fabriziome...@gmail.com wrote:
On Tue, Apr 7, 2015 at 7:22 AM, Sawada Masahiko sawada.m...@gmail.com wrote:

Thank you for your review. I modified the patch; the latest version (v7) is attached. Please have a look at it.

Looks good to me. The attached patch (v8) just fixes a tab indentation issue in gram.y.

I had forgotten to fix the tab indentation, sorry. Thank you for reviewing! It looks good to me too. Can this patch be marked as Ready for Committer?

+1

Changed status to Ready for Committer.

The patch adds new syntax like REINDEX ... WITH VERBOSE, i.e., () is not added after the WITH clause. Did we reach consensus on this syntax? The last email from Robert makes me think that () should be added to the syntax.

Regards,

-- Fujii Masao
[HACKERS] Removal of FORCE option in REINDEX
Hi,

While reviewing the REINDEX VERBOSE patch, I felt inclined to remove FORCE option support from the REINDEX command. It has been marked obsolete since the very old version 7.4. I think that it's no longer worth supporting. Thoughts?

Regards,

-- Fujii Masao
Re: [HACKERS] pg_rewind and log messages
On Mon, Apr 6, 2015 at 5:33 PM, Michael Paquier michael.paqu...@gmail.com wrote:
On Mon, Apr 6, 2015 at 1:41 PM, Michael Paquier wrote:

I guess that you are working on a patch? If not, are you looking for one?

Code-speaking, this gives the patch attached.

Thanks! Here are the review comments:

I'm not familiar with native language support (sorry), but don't we need to add the gettext shortcut to every call of pg_log and pg_fatal, e.g., change pg_fatal(xxx) to pg_fatal(_(xxx))? I know that fprintf() in pg_log_v() has such a shortcut, but I'm not sure whether that's enough.

         case PG_FATAL:
    -        printf("\n%s", _(message));
    -        printf("%s", _("Failure, exiting\n"));
    +        fprintf(stderr, _("\n%s: fatal: %s\n"), progname, message);
    +        fprintf(stderr, _("Failure, exiting\n"));

Why do we need to output the error level, like fatal? This also seems inconsistent with the log message policy of the other tools. Why do we need to output \n at the head of the message? Is the second message "Failure, exiting" really required?

I eliminated a bunch of newlines in the log messages that seemed really unnecessary to me, simplifying the whole a bit. While hacking on this stuff, I noticed as well that pg_rewind could be called as root on non-Windows platforms; that's dangerous from a security point of view, as the process manipulates files in PGDATA. Hence let's block that. On Windows, a restricted token should be used.

Good catch! I think it's better to implement it as a separate patch because it's very different from the log message problem.

Regards,

-- Fujii Masao
Re: [HACKERS] Upper-case error in docs regarding PQmakeEmptyPGresult
On Mon, Apr 6, 2015 at 11:40 AM, Michael Paquier michael.paqu...@gmail.com wrote:

Hi all,

I just spotted that PQmakeEmptyPGresult is written as PQmakeEmptyPGResult (capital R instead of lowercase r) in one place in the docs. A patch is attached.

Pushed. Thanks!

Regards,

-- Fujii Masao
[HACKERS] pg_rewind and log messages
Hi,

I found that pg_rewind has several problems with its log messages.

(1) It outputs error messages to stdout, not stderr.

(2) The tool name should be added at the head of each log message, as the other tools do, but pg_rewind doesn't do this:

    pg_basebackup: no target directory specified

(3)

    if (datadir_source == NULL && connstr_source == NULL)
    {
        pg_fatal("no source specified (--source-pgdata or --source-server)\n");
        pg_fatal("Try \"%s --help\" for more information.\n", progname);
        exit(1);
    }

Since the first call of pg_fatal exits with 1, the subsequent pg_fatal and exit will never be called.

(4) ISTM that set_pglocale_pgservice() needs to be called, but pg_rewind doesn't do so.

(5) printf() is used to output errors in some files, e.g., timeline.c and parsexlog.c. Shouldn't these printf() calls be replaced with pg_log or pg_fatal?

Regards,

-- Fujii Masao
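Points (1)-(3) can be summarized in a small sketch: error output goes to stderr, every line is prefixed with the tool name, and pg_fatal() exits, so any statement placed after a pg_fatal() call is dead code. This is a hedged illustration with hypothetical helper names (format_log_line is not part of pg_rewind), not the tool's actual logging code.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

static const char *progname = "pg_rewind";

/*
 * Hypothetical helper: format "progname: message" into buf, so every
 * log line carries the tool name, as pg_basebackup's messages do.
 */
static void
format_log_line(char *buf, size_t buflen, const char *fmt, va_list ap)
{
    size_t  n = (size_t) snprintf(buf, buflen, "%s: ", progname);

    if (n < buflen)
        vsnprintf(buf + n, buflen - n, fmt, ap);
}

/*
 * Sketch of a pg_fatal(): print to stderr (not stdout) and exit.
 * Because it never returns, a second pg_fatal() or an exit() placed
 * after it -- as in item (3) above -- is unreachable.
 */
static void
pg_fatal(const char *fmt, ...)
{
    char        buf[1024];
    va_list     ap;

    va_start(ap, fmt);
    format_log_line(buf, sizeof(buf), fmt, ap);
    va_end(ap);
    fprintf(stderr, "%s", buf);
    exit(1);
}
```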
Re: [HACKERS] The return value of allocate_recordbuf()
On Fri, Apr 3, 2015 at 2:30 PM, Michael Paquier michael.paqu...@gmail.com wrote:
On Fri, Apr 3, 2015 at 12:56 PM, Fujii Masao wrote:

The first patch basically looks good to me. But I have one comment: shouldn't we expose pg_malloc_extended as a global function, as we did with pg_malloc? Some frontends might need to use it in the future.

Yes, it makes sense, as the other functions do it too. So I refactored the patch and defined a new static inline routine, pg_malloc_internal(), that all the other functions call, to keep this code path lean, since I suppose it can become quite hot even for frontends. In the second patch I added as well what was needed for pg_rewind.

Thanks for updating the patches! I pushed the first patch and a part of the second one.

Regarding the second patch, you added checks of the return value of XLogReaderAllocate(), but they seem half-baked. XLogReaderAllocate() still uses palloc(); don't we need to replace that with palloc_extended(), too?

Regards,

-- Fujii Masao
Re: [HACKERS] The return value of allocate_recordbuf()
On Fri, Apr 3, 2015 at 8:37 PM, Michael Paquier michael.paqu...@gmail.com wrote:
On Fri, Apr 3, 2015 at 6:35 PM, Fujii Masao wrote:

Regarding the second patch, you added checks of the return value of XLogReaderAllocate(), but they seem half-baked. XLogReaderAllocate() still uses palloc(); don't we need to replace that with palloc_extended(), too?

Doh, you are right. I missed three places. Attached is a new patch completing the fix.

Thanks for the patch! I updated two source code comments and changed the log message emitted when XLogReaderAllocate returns NULL within the XLOG_DEBUG block. Just pushed.

Regards,

-- Fujii Masao
Re: [HACKERS] The return value of allocate_recordbuf()
On Thu, Feb 12, 2015 at 4:02 PM, Michael Paquier michael.paqu...@gmail.com wrote:
On Wed, Feb 11, 2015 at 2:13 AM, Robert Haas robertmh...@gmail.com wrote:
On Mon, Feb 9, 2015 at 7:02 AM, Michael Paquier michael.paqu...@gmail.com wrote:
On Mon, Feb 9, 2015 at 7:58 PM, Fujii Masao masao.fu...@gmail.com wrote:

MemoryContextAllocExtended() was added, so isn't it time to replace palloc() with MemoryContextAllocExtended(CurrentMemoryContext, MCXT_ALLOC_NO_OOM) in allocate_recordbuf()?

Yeah, let's move on here, but not with this patch, I am afraid, as it breaks any client utility using xlogreader.c in frontend code, like pg_xlogdump or pg_rewind, because MemoryContextAllocExtended is not available in frontend code, only in the backend. We are going to need something like palloc_noerror, defined on both the backend (mcxt.c) and frontend (fe_memutils.c) side.

Unfortunately, that gets us back into the exact terminological dispute that we were hoping to avoid. Perhaps we could compromise on palloc_extended().

Yes, why not use palloc_extended instead of palloc_noerror, which has been clearly rejected in the other thread. Now, for palloc_extended we should copy the flags of MemoryContextAllocExtended to fe_memutils.h and have the same interface between frontend and backend (note that MCXT_ALLOC_HUGE has no real utility for frontends). Attached is a patch to achieve that, together with a second patch for the problem related to this thread. Note that in the second patch I have added an ERROR in logical.c after calling XLogReaderAllocate; this was missing.

The first patch basically looks good to me. But I have one comment: shouldn't we expose pg_malloc_extended as a global function, as we did with pg_malloc? Some frontends might need to use it in the future.

Regards,

-- Fujii Masao
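The frontend side of the proposal can be sketched as below: a pg_malloc_extended() that honors a no-OOM flag by returning NULL instead of bailing out, which is what allocate_recordbuf() needs for its potentially large record buffer. The flag names mirror MemoryContextAllocExtended, but their numeric values here are assumptions for illustration, not the actual PostgreSQL definitions.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Flags mirroring MemoryContextAllocExtended; values are assumed. */
#define MCXT_ALLOC_NO_OOM   0x01    /* return NULL on failure, don't exit */
#define MCXT_ALLOC_ZERO     0x02    /* zero the returned memory */

/*
 * Frontend-style sketch of pg_malloc_extended: with NO_OOM the caller
 * gets NULL back and can degrade gracefully (e.g., fail to read one
 * oversized WAL record instead of killing pg_xlogdump outright).
 */
static void *
pg_malloc_extended(size_t size, int flags)
{
    void   *ptr = malloc(size);

    if (ptr == NULL)
    {
        if (flags & MCXT_ALLOC_NO_OOM)
            return NULL;
        fprintf(stderr, "out of memory\n");
        exit(1);
    }
    if (flags & MCXT_ALLOC_ZERO)
        memset(ptr, 0, size);
    return ptr;
}
```

Exposing this as a global (rather than hiding it behind pg_malloc) is exactly the review comment above: future frontends can then opt into NULL-on-failure semantics.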
Re: [HACKERS] pg_basebackup may fail to send feedbacks.
On Tue, Mar 10, 2015 at 5:29 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote:

Hi, the attached is the v5 patch.

- Do feGetCurrentTimestamp() only when necessary.
- Rebased to current master.

At Mon, 2 Mar 2015 20:21:36 +0900, Fujii Masao masao.fu...@gmail.com wrote in cahgqgwg1tjhpg03ozgwokxt5wyd5v4s3hutgsx7rotbhhnj...@mail.gmail.com:
On Tue, Feb 24, 2015 at 6:44 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote:

Hello, the attached is the v4 patch that checks feedback timing at every WAL segment boundary. .. I said that checking whether to send feedback at every boundary between WAL segments seemed too long an interval, but after some rethinking, I changed my mind.

- The largest possible delay source in the busy-receive loop is fsyncing when closing the WAL segment file just written, which can take several seconds. A freeze longer than the timeout interval cannot be avoided there and is not worth avoiding anyway.

- 16MB disk-write intervals between gettimeofday() calls seem gentle enough even on platforms where gettimeofday() is rather heavy.

Sounds reasonable to me. So we don't need to address the problem so proactively on the walreceiver side, because it tries to send feedback after flushing the WAL records. IOW, the problem you observed is less likely to happen there. Right?

    +    now = feGetCurrentTimestamp();
    +    if (standby_message_timeout > 0

Surely it would hardly ever happen, especially with an ordinary configuration. However, continuous receiving of the replication stream is quite normal behavior, even if the problem rarely occurs. So the receiver should guarantee by logic that feedback is sent as long as it is working normally, provided the code for this special case isn't too large and doesn't take too long :). The current walreceiver doesn't appear to try to send feedback after flushing every WAL record: HandleCopyStream will restlessly process the records in a gapless replication stream, sending feedback only when a keepalive packet arrives. It is true that responding to keepalives helps greatly, but that is not what keepalives are for; they are intended to confirm whether a silent receiver is still alive. Even with this fix, the case where a single write or flush takes longer than the feedback interval cannot be avoided, but that should be OK since it should be regarded as a disorder.

Minor comment: should feGetCurrentTimestamp() be called after the check of standby_message_timeout > 0, to avoid unnecessary calls?

Ah, you're right. I'll fix it.

Even if the timeout has not elapsed yet, shouldn't we send feedback here if synchronous mode is true?

Regards,

-- Fujii Masao
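The decision being discussed — check at each WAL segment boundary whether the status interval has elapsed, and skip the clock read entirely when feedback is disabled — can be sketched as a pure helper. This is a hedged stand-in for the v5 patch's logic in pg_basebackup's receivelog.c; feedback_due() is a hypothetical name, and timestamps are microseconds as with feGetCurrentTimestamp().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Microsecond timestamps, like pg_basebackup's feGetCurrentTimestamp(). */
typedef int64_t TimestampTz;

/*
 * Decide, at a WAL segment boundary, whether a status packet is due.
 * The timeout check comes first, mirroring the review comment above:
 * when standby_message_timeout <= 0 the caller need not read the
 * clock at all.  timeout_ms is in milliseconds, timestamps in
 * microseconds.
 */
static bool
feedback_due(int timeout_ms, TimestampTz now, TimestampTz last_status)
{
    if (timeout_ms <= 0)
        return false;           /* feedback disabled: no clock read needed */
    return now - last_status >= (TimestampTz) timeout_ms * 1000;
}
```

In the real loop the caller would fetch `now` only after the `timeout_ms > 0` test succeeds, which is exactly the "do feGetCurrentTimestamp() only when necessary" item in the v5 patch notes.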
Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory
On Wed, Apr 1, 2015 at 7:00 PM, Jehan-Guillaume de Rorthais j...@dalibo.com wrote:

Hi,

As I'm writing a doc patch for 9.4 - 9.0, I'll discuss below this formula, as it is the last one accepted by most of you.

On Mon, 3 Nov 2014 12:39:26 -0800 Jeff Janes jeff.ja...@gmail.com wrote:

It looked to me that the formula, when descending from a previously stressed state, would be:

    greatest(1 + checkpoint_completion_target) * checkpoint_segments, wal_keep_segments) + 1 + 2 * checkpoint_segments + 1

It lacks a parenthesis. I guess the formula is:

    greatest((1 + checkpoint_completion_target) * checkpoint_segments, wal_keep_segments) + 1 + 2 * checkpoint_segments + 1

This assumes logs are filled evenly over a checkpoint cycle, which is probably not true because there is a spike in full-page writes right after a checkpoint starts. But I didn't have a great deal of confidence in my analysis.

The only problem I have with this formula is that, considering checkpoint_completion_target ~ 1 and wal_keep_segments = 0, it becomes:

    4 * checkpoint_segments + 2

which violates the well-known, observed and default one:

    3 * checkpoint_segments + 1

A value above this formula means the system cannot cope with the number of files to flush. The doc is right about that:

    If, due to a short-term peak of log output rate, there are more than
    3 * checkpoint_segments + 1 segment files, the unneeded segment files
    will be deleted

The formula in the doc is wrong when wal_keep_segments > 0.

The first line reflects the number of WAL files that will be retained as-is.

I agree that these files MUST be retained: the set of checkpoint_segments WAL files being flushed and the checkpoint_completion_target ones written in the meantime.

The second is the number that will be recycled for future use before starting to delete them.

I disagree, because the WAL files being written actually consume recycled WAL files in the meantime. Your formula expects the written files to be created and the recycled ones never touched, leading to this checkpoint_segments + 1 difference between the formulas.

My reading of the code is that wal_keep_segments is computed from the current end of WAL (i.e., the checkpoint record), not from the checkpoint redo point. If I distribute the part outside the 'greatest' into both branches of the 'greatest', I don't get the same answer as you do for either branch.

So the formula, using checkpoint_completion_target = 1, should be:

    greatest(checkpoint_segments, wal_keep_segments) + 2 * checkpoint_segments + 1

No. Please imagine how many WAL files can exist at the end of a checkpoint.

At the end of a checkpoint, we have to leave all the WAL files which were generated since the starting point of the previous checkpoint for future crash recovery. The number of these WAL files is:

    (1 + checkpoint_completion_target) * checkpoint_segments

or

    wal_keep_segments

In addition to these files, at the end of a checkpoint, old WAL files which were generated before the starting point of the previous checkpoint are recycled. The number of these WAL files is at most:

    2 * checkpoint_segments + 1

Note that there are *usually* not that many WAL files at the end of a checkpoint, but this can happen after a peak of WAL logging generates too many WAL files.

So the sum of those is the right formula, i.e.,

    (3 + checkpoint_completion_target) * checkpoint_segments + 1

or

    wal_keep_segments + 2 * checkpoint_segments + 1

If checkpoint_completion_target is 1 and wal_keep_segments is 0, it can become 4 * checkpoint_segments + 1.

Regards,

-- Fujii Masao
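As a sanity check on the arithmetic, the final formula from this exchange can be evaluated directly. The function below is a hypothetical helper, simply transcribing greatest((1 + checkpoint_completion_target) * checkpoint_segments, wal_keep_segments) + 2 * checkpoint_segments + 1; with checkpoint_completion_target = 1 and wal_keep_segments = 0 it gives 4 * checkpoint_segments + 1, matching the conclusion above.

```c
#include <assert.h>

/*
 * Upper bound on pg_xlog segment files at the end of a checkpoint,
 * per the formula agreed in this thread: WAL retained for crash
 * recovery (or wal_keep_segments, whichever is greater), plus at most
 * 2 * checkpoint_segments + 1 recycled segments.
 */
static double
max_wal_files(double checkpoint_segments,
              double checkpoint_completion_target,
              double wal_keep_segments)
{
    double  retained = (1.0 + checkpoint_completion_target) * checkpoint_segments;

    if (wal_keep_segments > retained)
        retained = wal_keep_segments;   /* greatest(...) branch */
    return retained + 2.0 * checkpoint_segments + 1.0;
}
```

For example, with the 9.x default checkpoint_segments = 3, completion target 1 and no kept segments, the bound is 13 files (4 * 3 + 1); raising wal_keep_segments above the retained count simply substitutes it into the first term.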
Re: [HACKERS] Table-level log_autovacuum_min_duration
On Thu, Mar 19, 2015 at 1:43 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Mar 19, 2015 at 12:40 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Mar 19, 2015 at 12:23 PM, Fujii Masao masao.fu...@gmail.com wrote: Are you planning to update the patch so that it's based on the commit 0d83138? Yes... Very soon. And here is the rebased patch. Thanks for rebasing the patch! Looks good to me. One concern about this patch is: currently log_autovacuum_min_duration can be changed even while an autovacuum worker is running. So, for example, when the admin notices that autovacuum is taking a very long time, he or she can enable logging of autovacuum activity on the fly. But this patch completely prevents us from doing that, because, with the patch, an autovacuum worker always picks up the latest setting value at its starting time and then keeps using it to the end. Probably I can live with this. But does anyone have other thoughts? Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Table-level log_autovacuum_min_duration
On Fri, Mar 6, 2015 at 1:07 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Mar 6, 2015 at 12:44 PM, Fujii Masao masao.fu...@gmail.com wrote: On Thu, Mar 5, 2015 at 9:49 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Mar 5, 2015 at 7:10 PM, Fujii Masao wrote: With the patch, VACUUM ANALYZE VERBOSE doesn't emit any verbose message. Why did you remove that functionality? Oops. Sorry about that. In gram.y, the combination of VacuumStmt with AnalyzeStmt overwrote the value of log_min_duration incorrectly. I also found another bug related to logging of ANALYZE not working correctly because of the use of IsAutoVacuumWorkerProcess() instead of VACOPT_VERBOSE (this is reducing the diffs of the patch btw). All those things are fixed in the attached. Thanks for updating the patch! Why does log_min_duration need to be set even when manual VACUUM command is executed? Per the latest version of the patch, log_min_duration is checked only when the process is autovacuum worker. So ISTM that log_min_duration doesn't need to be set in gram.y. It's even confusing to me. Or if you're going to implement something like VACUUM VERBOSE DURATION n (i.e., verbose message is output if n seconds have been elapsed), that might be necessary, though... Thanks for reminding. The DURATION-like clause was exactly a point mentioned by Anzai-san upthread, and it made sense to me to be in-line with the other parameters controlling the freeze (discussion somewhat related to that = http://www.postgresql.org/message-id/CAB7nPqRZX7Pv2B-R7xHmAh52tfjAQGfy9btqwFstgQgXks=i...@mail.gmail.com) but we can live without it for this patch as VACOPT_VERBOSE is used only by manual VACUUM and not by autovacuum to choose the log elevel. Are you planning to update the patch so that it's based on the commit 0d83138? Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] xloginsert.c hole_length warning on gcc 4.8.3
On Sat, Mar 14, 2015 at 8:39 AM, Tomas Vondra tomas.von...@2ndquadrant.com wrote: Hi there, with gcc 4.8.3, I'm getting this warning in xloginsert.c: Thanks for the report! I fixed this problem at the commit cd6c45c. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Wed, Mar 11, 2015 at 7:08 AM, Rahila Syed rahilasye...@gmail.com wrote: Hello, I have some minor comments The comments have been implemented in the attached patch. Thanks for updating the patch! I just changed a bit and finally pushed it. Thanks everyone involved in this patch! Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Mon, Mar 9, 2015 at 9:08 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Mon, Mar 9, 2015 at 4:29 PM, Fujii Masao masao.fu...@gmail.com wrote: On Thu, Mar 5, 2015 at 10:08 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Mar 5, 2015 at 9:14 PM, Syed, Rahila rahila.s...@nttdata.com wrote: Please find attached a patch. As discussed, flag to denote compression and presence of hole in block image has been added in XLogRecordImageHeader rather than block header. Thanks for updating the patch! Attached is the refactored version of the patch. Cool. Thanks! I have some minor comments: Thanks for the comments! +Turning this parameter on can reduce the WAL volume without Turning valueon/ this parameter That tag is not used in other place in config.sgml, so I'm not sure if that's really necessary. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Thu, Mar 5, 2015 at 10:08 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Mar 5, 2015 at 9:14 PM, Syed, Rahila rahila.s...@nttdata.com wrote: Please find attached a patch. As discussed, flag to denote compression and presence of hole in block image has been added in XLogRecordImageHeader rather than block header. Thanks for updating the patch! Attached is the refactored version of the patch.

Regards,

-- Fujii Masao

*** a/contrib/pg_xlogdump/pg_xlogdump.c
--- b/contrib/pg_xlogdump/pg_xlogdump.c
***************
*** 359,376 **** XLogDumpCountRecord(XLogDumpConfig *config, XLogDumpStats *stats,
  	rec_len = XLogRecGetDataLen(record) + SizeOfXLogRecord;

  	/*
! 	 * Calculate the amount of FPI data in the record. Each backup block
! 	 * takes up BLCKSZ bytes, minus the hole length.
  	 *
  	 * XXX: We peek into xlogreader's private decoded backup blocks for the
! 	 * hole_length. It doesn't seem worth it to add an accessor macro for
! 	 * this.
  	 */
  	fpi_len = 0;
  	for (block_id = 0; block_id <= record->max_block_id; block_id++)
  	{
  		if (XLogRecHasBlockImage(record, block_id))
! 			fpi_len += BLCKSZ - record->blocks[block_id].hole_length;
  	}

  	/* Update per-rmgr statistics */
--- 359,375 ----
  	rec_len = XLogRecGetDataLen(record) + SizeOfXLogRecord;

  	/*
! 	 * Calculate the amount of FPI data in the record.
  	 *
  	 * XXX: We peek into xlogreader's private decoded backup blocks for the
! 	 * bimg_len indicating the length of FPI data. It doesn't seem worth it to
! 	 * add an accessor macro for this.
  	 */
  	fpi_len = 0;
  	for (block_id = 0; block_id <= record->max_block_id; block_id++)
  	{
  		if (XLogRecHasBlockImage(record, block_id))
! 			fpi_len += record->blocks[block_id].bimg_len;
  	}

  	/* Update per-rmgr statistics */
***************
*** 465,473 **** XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
  					   blk);
  				if (XLogRecHasBlockImage(record, block_id))
  				{
! 					printf(" (FPW); hole: offset: %u, length: %u\n",
! 						   record->blocks[block_id].hole_offset,
! 						   record->blocks[block_id].hole_length);
  				}
  				putchar('\n');
  			}
--- 464,485 ----
  					   blk);
  				if (XLogRecHasBlockImage(record, block_id))
  				{
! 					if (record->blocks[block_id].bimg_info &
! 						BKPIMAGE_IS_COMPRESSED)
! 					{
! 						printf(" (FPW); hole: offset: %u, length: %u, compression saved: %u\n",
! 							   record->blocks[block_id].hole_offset,
! 							   record->blocks[block_id].hole_length,
! 							   BLCKSZ -
! 							   record->blocks[block_id].hole_length -
! 							   record->blocks[block_id].bimg_len);
! 					}
! 					else
! 					{
! 						printf(" (FPW); hole: offset: %u, length: %u\n",
! 							   record->blocks[block_id].hole_offset,
! 							   record->blocks[block_id].hole_length);
! 					}
  				}
  				putchar('\n');
  			}
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2282,2287 **** include_dir 'conf.d'
--- 2282,2311 ----
       </listitem>
      </varlistentry>

+     <varlistentry id="guc-wal-compression" xreflabel="wal_compression">
+      <term><varname>wal_compression</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>wal_compression</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When this parameter is <literal>on</>, the <productname>PostgreSQL</>
+        server compresses a full page image written to WAL when
+        <xref linkend="guc-full-page-writes"> is on or during a base backup.
+        A compressed page image will be decompressed during WAL replay.
+        The default value is <literal>off</>
+       </para>
+
+       <para>
+        Turning this parameter on can reduce the WAL volume without
+        increasing the risk of unrecoverable data corruption,
+        but at the cost of some extra CPU time by the compression during
+        WAL logging and the decompression during WAL replay.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
       <term><varname>wal_buffers</varname> (<type>integer</type>)
       <indexterm>
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 89,94 **** char	   *XLogArchiveCommand = NULL;
--- 89,95 ----
  bool		EnableHotStandby = false;
  bool		fullPageWrites = true;
  bool		wal_log_hints = false;
+ bool		wal_compression = false;
  bool		log_checkpoints = false;
  int			sync_method = DEFAULT_SYNC_METHOD;
  int			wal_level = WAL_LEVEL_MINIMAL;
*** a/src/backend/access/transam/xloginsert.c
--- b/src/backend/access/transam/xloginsert.c
***************
*** 24,35 ****
--- 24,39 ----
  #include "access/xlog_internal.h"
  #include "access/xloginsert.h"
  #include "catalog/pg_control.h"
+ #include "common/pg_lzcompress.h"
  #include "miscadmin.h"
  #include "storage/bufmgr.h"
  #include "storage/proc.h"
  #include "utils/memutils.h"
  #include "pg_trace.h"

+ /* Buffer size required to store a compressed version
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Mon, Feb 16, 2015 at 9:08 PM, Andres Freund and...@2ndquadrant.com wrote: On 2015-02-16 20:55:20 +0900, Michael Paquier wrote: On Mon, Feb 16, 2015 at 8:30 PM, Syed, Rahila rahila.s...@nttdata.com wrote: Regarding the sanity checks that have been added recently. I think that they are useful but I am suspecting as well that only a check on the record CRC is done because that's reliable enough and not doing those checks accelerates a bit replay. So I am thinking that we should simply replace them by assertions. Removing the checks makes sense as CRC ensures correctness. Moreover, as an error message for invalid length of record is present in the code, messages for invalid block length can be redundant. Checks have been replaced by assertions in the attached patch. After more thinking, we may as well simply remove them, an error with CRC having high chances to complain before reaching this point... Surely not. The existing code explicitly does it like

if (blk->has_data && blk->data_len == 0)
    report_invalid_record(state,
                          "BKPBLOCK_HAS_DATA set, but no data included at %X/%X",
                          (uint32) (state->ReadRecPtr >> 32),
                          (uint32) state->ReadRecPtr);

these cross checks are important. And I see no reason to deviate from that. The CRC sum isn't foolproof - we intentionally do checks at several layers. And, as you can see from some other locations, we actually try to *not* fatally error out when hitting them at times - so an Assert also is wrong. Heikki:

/* cross-check that the HAS_DATA flag is set iff data_length > 0 */
if (blk->has_data && blk->data_len == 0)
    report_invalid_record(state,
                          "BKPBLOCK_HAS_DATA set, but no data included at %X/%X",
                          (uint32) (state->ReadRecPtr >> 32),
                          (uint32) state->ReadRecPtr);
if (!blk->has_data && blk->data_len != 0)
    report_invalid_record(state,
                          "BKPBLOCK_HAS_DATA not set, but data length is %u at %X/%X",
                          (unsigned int) blk->data_len,
                          (uint32) (state->ReadRecPtr >> 32),
                          (uint32) state->ReadRecPtr);

those look like they're missing a goto err; to me. Yes. I pushed the fix. Thanks!
Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Thu, Mar 5, 2015 at 10:28 PM, Andres Freund and...@2ndquadrant.com wrote: On 2015-03-05 12:14:04 +, Syed, Rahila wrote: Please find attached a patch. As discussed, flag to denote compression and presence of hole in block image has been added in XLogRecordImageHeader rather than block header. FWIW, I personally won't commit it with things done that way. I think it's going the wrong way, leading to a harder to interpret and less flexible format. I'm not going to further protest if Fujii or Heikki commit it this way though. I'm pretty sure that we can discuss the *better* WAL format even after committing this patch. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WALWriter active during recovery
On Thu, Dec 18, 2014 at 6:43 PM, Fujii Masao masao.fu...@gmail.com wrote: On Tue, Dec 16, 2014 at 3:51 AM, Simon Riggs si...@2ndquadrant.com wrote: Currently, WALReceiver writes and fsyncs data it receives. Clearly, while we are waiting for an fsync we aren't doing any other useful work. The following patch starts WALWriter during recovery and makes it responsible for fsyncing data, allowing WALReceiver to proceed with other useful actions. With the patch, replication didn't work fine on my machine. I started the standby server after removing all the WAL files from the standby. ISTM that the patch doesn't handle that case. That is, in that case, the standby tries to start up walreceiver and replication to retrieve the REDO-starting checkpoint record *before* starting up walwriter (IOW, before reaching the consistent point). Then since walreceiver works without walwriter, no received WAL data can be fsync'd in the standby. So replication cannot advance any further. I think that walwriter needs to start before walreceiver starts. I just marked this patch as Waiting on Author. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Table-level log_autovacuum_min_duration
On Thu, Feb 19, 2015 at 3:32 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Feb 19, 2015 at 2:13 PM, Naoya Anzai anzai-na...@mxu.nes.nec.co.jp wrote: As a result, I think you should not delete VACOPT_VERBOSE. In v8 it is not deleted. It is still declared, and its use is isolated in gram.y, similarly to VACOPT_FREEZE. According to the last mail I have posted, there is clearly a difference between the manual-vacuum log and the autovacuum log. Did you test the latest patch v8? I have added checks in it to see if a process is an autovacuum worker to control elevel, and the extra logs of v7 do not show up. So, at least you should not touch the mechanism of VACOPT_VERBOSE until both vacuum log formats are unified into the same format. If you mean that we should have the same kind of log outputs for autovacuum and manual vacuum, I think that this is not going to happen. Autovacuum entries are kept less verbose on purpose, a contract that v7 clearly broke. If you agree with me, please undo your removal of VACOPT_VERBOSE. Well, I don't agree :) And I am guessing that you did not look at v8 either. Centralizing the control of logs using log_min_duration is more extensible than simply having VACOPT_VERBOSE. With the patch, VACUUM ANALYZE VERBOSE doesn't emit any verbose message. Why did you remove that functionality? Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Table-level log_autovacuum_min_duration
On Thu, Mar 5, 2015 at 9:49 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Mar 5, 2015 at 7:10 PM, Fujii Masao wrote: With the patch, VACUUM ANALYZE VERBOSE doesn't emit any verbose message. Why did you remove that functionality? Oops. Sorry about that. In gram.y, the combination of VacuumStmt with AnalyzeStmt overwrote the value of log_min_duration incorrectly. I also found another bug related to logging of ANALYZE not working correctly because of the use of IsAutoVacuumWorkerProcess() instead of VACOPT_VERBOSE (this is reducing the diffs of the patch btw). All those things are fixed in the attached. Thanks for updating the patch! Why does log_min_duration need to be set even when manual VACUUM command is executed? Per the latest version of the patch, log_min_duration is checked only when the process is autovacuum worker. So ISTM that log_min_duration doesn't need to be set in gram.y. It's even confusing to me. Or if you're going to implement something like VACUUM VERBOSE DURATION n (i.e., verbose message is output if n seconds have been elapsed), that might be necessary, though... Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] File based Incremental backup v8
On Thu, Mar 5, 2015 at 1:59 AM, Marco Nenciarini marco.nenciar...@2ndquadrant.it wrote: Hi Fujii, On 03/03/15 11:48, Fujii Masao wrote: On Tue, Mar 3, 2015 at 12:36 AM, Marco Nenciarini marco.nenciar...@2ndquadrant.it wrote: On 02/03/15 14:21, Fujii Masao wrote: On Thu, Feb 12, 2015 at 10:50 PM, Marco Nenciarini marco.nenciar...@2ndquadrant.it wrote: Hi, I've attached an updated version of the patch.

basebackup.c:1565: warning: format '%lld' expects type 'long long int', but argument 8 has type '__off_t'
basebackup.c:1565: warning: format '%lld' expects type 'long long int', but argument 8 has type '__off_t'
pg_basebackup.c:865: warning: ISO C90 forbids mixed declarations and code

I'll add an explicit cast at those two lines. When I applied the three patches and compiled the code, I got the above warnings. How can we get the full backup that we can use for the archive recovery, from the first full backup and subsequent incremental backups? What commands should we use for that, for example? It's better to document that. I've sent a python PoC that supports the plain format only (not the tar one). I'm currently rewriting it in C (also with tar support) and I'll send a new patch containing it ASAP. Yeah, if a special tool is required for that purpose, the patch should include it. I'm working on it. The interface will be exactly the same as the PoC script I've attached to 54c7cdad.6060...@2ndquadrant.it What does the "1" in the heading line of backup_profile mean? Nothing. It's a version number. If you think it's misleading I will remove it. A version number of the file format of the backup profile? If it's required for the validation of the backup profile file as a safe-guard, it should be included in the profile file. For example, it might be useful to check whether the pg_basebackup executable is compatible with the source backup that you specify. But more info might be needed for such validation.
The current implementation bails out with an error if the header line is different from what it expects. It also reports an error if the 2nd line is not the start WAL location. That's all that pg_basebackup needs to start a new incremental backup. All the other information is useful to reconstruct a full backup in case of an incremental backup, or maybe to check the completeness of an archived full backup. Initially the profile was present only in incremental backups, but after some discussion on list we agreed to always write it. Don't we need more checks about the compatibility of the backup-target database cluster and the source incremental backup? Without more such checks, I'm afraid we can easily get corrupted incremental backups. For example, shouldn't pg_basebackup emit an error if the target and source have different system IDs, like walreceiver does? What happens if the timeline ID is different between the source and target? What happens if the source was taken from the standby but the new incremental backup will be taken from the master? Do we need to check them? Sorry if this has been already discussed so far. Why is a backup profile file necessary? Maybe it's necessary in the future, but currently it seems not. It's necessary because it's the only way to detect deleted files. Maybe I'm missing something. Seems we can detect that even without a profile. For example, please imagine the case where the file has been deleted since the last full backup and then the incremental backup is taken. In this case, that deleted file exists only in the full backup. We can detect the deletion of the file by checking both full and incremental backups. When you take an incremental backup, only changed files are sent. Without the backup_profile in the incremental backup, you cannot detect a deleted file, because it's indistinguishable from a file that is not changed. Yeah, you're right!
Have we really reached a consensus on the current design, especially that every file basically needs to be read to check whether it has been modified since the last backup, even when *no* modification has happened since the last backup? The real problem here is that there is currently no way to detect that a file is not changed since the last backup. We agreed to not use file system timestamps as they are not reliable for that purpose. TBH I prefer the timestamp-based approach in the first version of incremental backup even if it's less reliable than the LSN-based one. I think that some users who are using timestamp-based rsync (i.e., default mode) for the backup would be satisfied with the timestamp-based one. The original design was to compare size+timestamp+checksums (only if everything else matches and the file has been modified after the start of the backup), but the feedback from the list was that we cannot trust the filesystem mtime and we must use LSN instead. Using LSN has a significant advantage over using checksums, as we can start
Re: [HACKERS] Patch: raise default for max_wal_segments to 1GB
On Tue, Mar 3, 2015 at 4:25 PM, Heikki Linnakangas hlinn...@iki.fi wrote: On 03/03/2015 01:43 AM, Andres Freund wrote: On 2015-03-02 15:40:27 -0800, Josh Berkus wrote: ! #max_wal_size = 1GB # in logfile segments Independent of the rest of the changes, the "in logfile segments" bit should probably be changed. The base unit is still logfile segments, so if you write just max_wal_size = 10 without a unit, that means 160MB. That's what the comment means. Is it needed? Many settings have a similar comment, but most don't. Most of the ones that do have a magic value 0, -1 as default, so that it's not obvious from the default what the unit is. For example:

#tcp_keepalives_idle = 0		# TCP_KEEPIDLE, in seconds;
					# 0 selects the system default
#tcp_keepalives_interval = 0		# TCP_KEEPINTVL, in seconds;
					# 0 selects the system default
#temp_file_limit = -1			# limits per-session temp file space
					# in kB, or -1 for no limit
#commit_delay = 0			# range 0-100000, in microseconds
#log_temp_files = -1			# log temporary files equal or larger
					# than the specified size in kilobytes;
					# -1 disables, 0 logs all temp files
#statement_timeout = 0			# in milliseconds, 0 is disabled
#lock_timeout = 0			# in milliseconds, 0 is disabled
#wal_keep_segments = 0			# in logfile segments, 16MB each; 0 disables

But there are also:

#wal_receiver_timeout = 60s		# time that receiver waits for
					# communication from master
					# in milliseconds; 0 disables
#autovacuum_vacuum_cost_delay = 20ms	# default vacuum cost delay for
					# autovacuum, in milliseconds;
					# -1 means use vacuum_cost_delay

I propose that we remove the comment from max_wal_size, and also remove the "in milliseconds" from wal_receiver_timeout and autovacuum_vacuum_cost_delay. Seems OK to me. BTW, we can also remove "in milliseconds" from wal_sender_timeout.
(and we should also change wal_keep_segments to accept MB/GB, as discussed already) +1 Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Tue, Mar 3, 2015 at 9:34 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Tue, Mar 3, 2015 at 9:24 AM, Andres Freund and...@2ndquadrant.com wrote: On 2015-03-03 08:59:30 +0900, Michael Paquier wrote: Already mentioned upthread, but I agree with Fujii-san here: adding information related to the state of a block image in XLogRecordBlockHeader makes little sense because we are not sure to have a block image, perhaps there is only data associated to it, and that we should control that exclusively in XLogRecordBlockImageHeader and let the block ID alone for now. This argument doesn't make much sense to me. The flag byte could very well indicate 'block reference without image following' vs 'block reference with data + hole following' vs 'block reference with compressed data following'. Information about the state of a block is decoupled with its existence, aka in the block header, we should control if: - record has data - record has a block And in the block image header, we control if the block is: - compressed or not - has a hole or not. Are there any other flag bits that we should or are planning to add into WAL header newly, except the above two? If yes and they are required by even a block which doesn't have an image, I will change my mind and agree to add something like chunk ID to a block header. But I guess the answer of the question is No. Since the flag bits now we are thinking to add are required only by a block having an image, adding them into a block header (instead of block image header) seems a waste of bytes in WAL. So I concur with Michael. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Obsolete SnapshotNow reference within snapbuild.c
On Wed, Mar 4, 2015 at 10:31 AM, Peter Geoghegan p...@heroku.com wrote: SnapBuildCommitTxn() has what I gather is an obsolete reference to SnapshotNow(). Attached patch corrects this. Pushed. Thanks! Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] File based Incremental backup v8
On Tue, Mar 3, 2015 at 12:36 AM, Marco Nenciarini marco.nenciar...@2ndquadrant.it wrote: On 02/03/15 14:21, Fujii Masao wrote: On Thu, Feb 12, 2015 at 10:50 PM, Marco Nenciarini marco.nenciar...@2ndquadrant.it wrote: Hi, I've attached an updated version of the patch.

basebackup.c:1565: warning: format '%lld' expects type 'long long int', but argument 8 has type '__off_t'
basebackup.c:1565: warning: format '%lld' expects type 'long long int', but argument 8 has type '__off_t'
pg_basebackup.c:865: warning: ISO C90 forbids mixed declarations and code

I'll add an explicit cast at those two lines. When I applied the three patches and compiled the code, I got the above warnings. How can we get the full backup that we can use for the archive recovery, from the first full backup and subsequent incremental backups? What commands should we use for that, for example? It's better to document that. I've sent a python PoC that supports the plain format only (not the tar one). I'm currently rewriting it in C (also with tar support) and I'll send a new patch containing it ASAP. Yeah, if a special tool is required for that purpose, the patch should include it. What does the "1" in the heading line of backup_profile mean? Nothing. It's a version number. If you think it's misleading I will remove it. A version number of the file format of the backup profile? If it's required for the validation of the backup profile file as a safe-guard, it should be included in the profile file. For example, it might be useful to check whether the pg_basebackup executable is compatible with the source backup that you specify. But more info might be needed for such validation. Sorry if this has been already discussed so far. Why is a backup profile file necessary? Maybe it's necessary in the future, but currently it seems not. It's necessary because it's the only way to detect deleted files. Maybe I'm missing something. Seems we can detect that even without a profile.
For example, please imagine the case where the file has been deleted since the last full backup and then the incremental backup is taken. In this case, that deleted file exists only in the full backup. We can detect the deletion of the file by checking both full and incremental backups. Have we really reached a consensus on the current design, especially that every file basically needs to be read to check whether it has been modified since the last backup, even when *no* modification has happened since the last backup? The real problem here is that there is currently no way to detect that a file is not changed since the last backup. We agreed to not use file system timestamps as they are not reliable for that purpose. TBH I prefer the timestamp-based approach in the first version of incremental backup even if it's less reliable than the LSN-based one. I think that some users who are using timestamp-based rsync (i.e., default mode) for the backup would be satisfied with the timestamp-based one. Using LSN has a significant advantage over using checksums, as we can start the full copy as soon as we find a block with an LSN greater than the threshold. There are two cases: 1) the file is changed, so we can assume that we detect it after reading 50% of the file, then we send it taking advantage of the file system cache; 2) the file is not changed, so we read it without sending anything. It will end up producing I/O comparable to a normal backup. Yeah, it might make the situation better than today. But I'm afraid that many users might get disappointed by that behavior of an incremental backup after the release... Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: raise default for max_wal_segments to 1GB
On Tue, Mar 3, 2015 at 8:51 AM, Josh Berkus j...@agliodbs.com wrote: On 03/02/2015 03:43 PM, Andres Freund wrote: Hi, On 2015-03-02 15:40:27 -0800, Josh Berkus wrote: ! #max_wal_size = 1GB # in logfile segments Independent of the rest of the changes, the "in logfile segments" bit should probably be changed. Point! Although I think it's fair to point out that that wasn't my omission, but Heikki's. Version C, with that correction, attached. Minor comments: "The default settings are 5 minutes and 128 MB, respectively." in wal.sgml needs to be updated.

int		max_wal_size = 8;	/* 128 MB */

It's better to update the above code in xlog.c. That's not essential, though. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] File based Incremental backup v8
On Thu, Feb 12, 2015 at 10:50 PM, Marco Nenciarini marco.nenciar...@2ndquadrant.it wrote: Hi, I've attached an updated version of the patch.

basebackup.c:1565: warning: format '%lld' expects type 'long long int', but argument 8 has type '__off_t'
basebackup.c:1565: warning: format '%lld' expects type 'long long int', but argument 8 has type '__off_t'
pg_basebackup.c:865: warning: ISO C90 forbids mixed declarations and code

When I applied the three patches and compiled the code, I got the above warnings. How can we get the full backup that we can use for the archive recovery, from the first full backup and subsequent incremental backups? What commands should we use for that, for example? It's better to document that. What does the "1" in the heading line of backup_profile mean? Sorry if this has been already discussed so far. Why is a backup profile file necessary? Maybe it's necessary in the future, but currently it seems not. Several pieces of info, like LSN, modification time, size, etc., are tracked in the backup profile file for every backup file, but they are not used for now. If it's not required now, I'm inclined to remove it to simplify the code. Have we really reached a consensus on the current design, especially that every file basically needs to be read to check whether it has been modified since the last backup, even when *no* modification has happened since the last backup? Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] remove pg_standby?
On Mon, Mar 2, 2015 at 8:37 PM, Heikki Linnakangas hlinn...@iki.fi wrote: On 03/02/2015 11:53 AM, Fujii Masao wrote: On Sat, Feb 28, 2015 at 5:00 AM, Josh Berkus j...@agliodbs.com wrote: On 11/10/2014 10:54 AM, Magnus Hagander wrote: On Mon, Nov 10, 2014 at 7:48 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: pg_standby is more configurable than the built-in standby_mode=on. You can set the sleep time, for example, while standby_mode=on uses a hard-coded delay of 5 s. And pg_standby has a configurable maximum wait time. And as Fujii pointed out, the built-in system will print an annoying message to the log every time it attempts to restore a file. Nevertheless, 99% of users would probably be happy with the built-in thing. As long as pg_standby has features that are actually useful and that are not in the built-in system, we shouldn't remove it. We should, however, try to fix those in the main system so we can get rid of it after that :) As of current 9.5, we have configurable retries and standby delay in mainstream. Is there some reason we still need pg_standby? Yes, it's not easy to perform fast failover without pg_standby for now. What is fast failover? Per document, -- In fast failover, the server is brought up immediately. Any WAL files in the archive that have not yet been applied will be ignored, and all transactions in those files are lost. To trigger a fast failover, create a trigger file and write the word fast into it. pg_standby can also be configured to execute a fast failover automatically if no new WAL file appears within a defined interval. -- Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] remove pg_standby?
On Sat, Feb 28, 2015 at 5:00 AM, Josh Berkus j...@agliodbs.com wrote: On 11/10/2014 10:54 AM, Magnus Hagander wrote: On Mon, Nov 10, 2014 at 7:48 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: pg_standby is more configurable than the built-in standby_mode=on. You can set the sleep time, for example, while standby_mode=on uses a hard-coded delay of 5 s. And pg_standby has a configurable maximum wait time. And as Fujii pointed out, the built-in system will print an annoying message to the log every time it attempts to restore a file. Nevertheless, 99% of users would probably be happy with the built-in thing. As long as pg_standby has features that are actually useful and that are not in the built-in system, we shouldn't remove it. We should, however, try to fix those in the main system so we can get rid of it after that :) As of current 9.5, we have configurable retries and standby delay in mainstream. Is there some reason we still need pg_standby? Yes, it's not easy to perform fast failover without pg_standby for now. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_basebackup may fail to send feedbacks.
On Tue, Feb 24, 2015 at 6:44 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Hello, the attached is the v4 patch that checks feedback timing at every WAL segment boundary. At Fri, 20 Feb 2015 17:29:14 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote in 20150220.172914.241732690.horiguchi.kyot...@lab.ntt.co.jp Some users may complain about the performance impact of such frequent calls and we may want to get rid of them from the walreceiver loop in the future. If we adopt your idea now, I'm afraid that it would tie our hands in that case. How much impact can such frequent calls of gettimeofday() have on replication performance? If it's not negligible, probably we should remove them first and find another idea to fix the problem you pointed out. ISTM that it's not so difficult to remove them. Thought? Do you have any numbers which can prove that such frequent gettimeofday() calls have only an ignorable impact on performance? The attached patch is 'the more sober' version of the SIGALRM patch. I said that checking whether to send feedback at every boundary between WAL segments seemed too long, but after some rethinking, I changed my mind. - The largest possible delay source in the busy-receive loop is fsyncing at closing the WAL segment file just written, which can take several seconds. Freezing longer than the timeout interval cannot be avoided and is not worth avoiding anyway. - 16 MB disk-write intervals between gettimeofday() calls seem gentle enough even on platforms where gettimeofday() is rather heavy. Sounds reasonable to me. So we don't need to address the problem so proactively on the walreceiver side, because it tries to send feedback after every flush of WAL records. IOW, the problem you observed is less likely to happen. Right?
+ now = feGetCurrentTimestamp();
+ if (standby_message_timeout > 0

Minor comment: should feGetCurrentTimestamp() be called after the check of standby_message_timeout > 0, to avoid unnecessary calls of it?

ProcessXLogDataMsg(PGconn *conn, char *copybuf, int len, XLogRecPtr *blockpos, uint32 timeline, char *basedir, stream_stop_callback stream_stop,
- char *partial_suffix, bool mark_done)
+ char *partial_suffix, bool mark_done,
+ int standby_message_timeout, int64 *last_status)

Maybe it's time to refactor this ugly coding (i.e., currently many arguments need to be given to each function, which looks ugly)... Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Fri, Feb 27, 2015 at 12:44 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Feb 27, 2015 at 8:01 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Feb 27, 2015 at 6:54 AM, Rahila Syed rahilasye...@gmail.com wrote: Even this patch doesn't work fine. The standby emits the following error messages. Yes, this bug remains unsolved. I am still working on resolving this. The following chunk IDs have been added in the attached patch as suggested upthread.

+#define XLR_CHUNK_BLOCK_REFERENCE 0x10
+#define XLR_CHUNK_BLOCK_HAS_IMAGE 0x04
+#define XLR_CHUNK_BLOCK_HAS_DATA 0x08

XLR_CHUNK_BLOCK_REFERENCE denotes the chunk ID of block references. XLR_CHUNK_BLOCK_HAS_IMAGE is a replacement of BKPBLOCK_HAS_IMAGE, and XLR_CHUNK_BLOCK_HAS_DATA a replacement of BKPBLOCK_HAS_DATA. Before sending a new version, be sure that this gets fixed by, for example, building up a master with a standby replaying WAL, and running make installcheck-world or similar. If the standby does not complain at all, you have good chances of not having bugs. You could also build with WAL_DEBUG to check record consistency. +1 When I test WAL- or replication-related features, I usually run make installcheck and pgbench against the master at the same time after setting up the replication environment.

typedef struct XLogRecordBlockHeader
{
+ uint8 chunk_id; /* xlog fragment id */
  uint8 id; /* block reference ID */

Seems this increases the header size of the WAL record even if no backup block image is included. Right? Isn't it better to add the flag info about the backup block image into XLogRecordBlockImageHeader rather than XLogRecordBlockHeader? Originally we borrowed one or two bits from its existing fields to minimize the header size, but we can just add a new flag field if we prefer the extensibility and readability of the code. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Tue, Feb 24, 2015 at 6:46 PM, Syed, Rahila rahila.s...@nttdata.com wrote: Hello, I've not read this logic yet, but ISTM there is a bug in that new WAL format because I got the following error and the startup process could not replay any WAL records when I set up replication and enabled wal_compression.

LOG: record with invalid length at 0/3B0
LOG: record with invalid length at 0/3000518
LOG: Invalid block length in record 0/30005A0
LOG: Invalid block length in record 0/3000D60
...

Please find attached a patch which replays WAL records. Even this patch doesn't work fine. The standby emits the following error messages.

LOG: invalid block_id 255 at 0/3B0
LOG: record with invalid length at 0/30017F0
LOG: invalid block_id 255 at 0/3001878
LOG: record with invalid length at 0/30027D0
LOG: record with invalid length at 0/3002E58
...

Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Refactoring GUC unit conversions
On Fri, Feb 13, 2015 at 10:26 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: In the redesigned checkpoint_segments patch, Robert suggested keeping the setting's base unit as a number of segments, but allowing conversions from MB, GB etc. I started looking into that and found that adding a new unit to guc.c is quite messy. The conversions are done with complicated if-switch-case constructs. Attached is a patch to refactor that, making the conversions table-driven. This doesn't change functionality, it just makes the code nicer. Isn't it a good idea to allow even wal_keep_segments to be converted from MB, GB etc.? Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
On Tue, Feb 24, 2015 at 1:29 AM, David Steele da...@pgmasters.net wrote: On 2/18/15 10:25 AM, David Steele wrote: On 2/18/15 6:11 AM, Fujii Masao wrote: The pg_audit doesn't log BIND parameter values when a prepared statement is used. Seems this is an oversight of the patch. Or is this intentional? It's actually intentional - following the model I talked about in my earlier emails, the idea is to log statements only. This also follows on 2ndQuadrant's implementation. Unfortunately, I think it's beyond the scope of this module to log bind variables. Maybe I can live with that at least in the first version. I'm following not only 2ndQuadrant's implementation, but Oracle's as well. Oracle's audit_trail (e.g., = db, extended) can log even bind values. Also, log_statement=on in PostgreSQL can log bind values. Maybe we can reuse the same technique that log_statement uses. Logging values is interesting, but I'm sure the user would want to specify which columns to log, which I felt was beyond the scope of the patch. The pg_audit cannot log a statement like SELECT 1 which doesn't access any database object. Is this intentional? I think that there are many users who want to audit even such statements. I think I see how to make this work. I'll work on it for the next version of the patch. This has been fixed in the v2 patch. Thanks! Imagine the case where you call a user-defined function which executes many nested statements. In this case, pg_audit logs only the top-level statement (i.e., the one issued directly by the client) every time a nested statement is executed. In fact, one call of such a UDF can cause lots of *identical* log messages. I think this is problematic. I agree - not sure how to go about addressing it, though. I've tried to cut down on the verbosity of the logging in general, but of course it can still be a problem. Using security definer and a different logging GUC for the defining role might work. I'll add that to my unit tests and see what happens.
That didn't work - but I didn't really expect it to. Here are two options I thought of: 1) Follow Oracle's as session option and only log each statement type against an object the first time it happens in a session. This would greatly reduce logging, but might be too little detail. It would increase the memory footprint of the module to add the tracking. 2) Only log once per call to the backend. Essentially, we would only log the statement you see in pg_stat_activity. This could be a good option because it logs what the user accesses directly, rather than functions, views, etc. which hopefully are already going through a review process and can be audited that way. Would either of those address your concerns? Before discussing how to implement this, we probably need to consider the spec. For example, should we log even nested statements for the audit purpose? If yes, how should we treat the case where the user changes the setting so that only DDL is logged, and then a user-defined function which internally executes DDL is called? Since the top-level SQL (calling the function) is not the target of the audit, should we not log even the nested DDL? Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
On Thu, Feb 26, 2015 at 1:40 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Fujii Masao wrote: On Tue, Feb 24, 2015 at 1:29 AM, David Steele da...@pgmasters.net wrote: 1) Follow Oracle's as session option and only log each statement type against an object the first time it happens in a session. This would greatly reduce logging, but might be too little detail. It would increase the memory footprint of the module to add the tracking. Doesn't this mean that a user could conceal malicious activity simply by doing an innocuous query that silences all further activity? 2) Only log once per call to the backend. Essentially, we would only log the statement you see in pg_stat_activity. This could be a good option because it logs what the user accesses directly, rather than functions, views, etc. which hopefully are already going through a review process and can be audited that way. ... assuming the user does not create stuff on the fly behind your back. Would either of those address your concerns? Before discussing how to implement this, we probably need to consider the spec. For example, should we log even nested statements for the audit purpose? If yes, how should we treat the case where the user changes the setting so that only DDL is logged, and then a user-defined function which internally executes DDL is called? Since the top-level SQL (calling the function) is not the target of the audit, should we not log even the nested DDL? Clearly if you log only DROP TABLE, and then the malicious user hides one such call inside a function that executes the drop (which is called via a SELECT top-level SQL), you're not going to be happy. Yep, so what SQL should be logged in this case? Only the targeted nested DDL? Both the top-level and nested ones? The latter seems better to me. What about the case where function A calls function B, which executes DDL? Should every ancestor SQL of the targeted DDL be logged? Probably yes.
Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add generate_series(numeric, numeric)
On Wed, Jan 14, 2015 at 11:04 AM, Ali Akbar the.ap...@gmail.com wrote: 2014-12-18 19:35 GMT+07:00 Fujii Masao masao.fu...@gmail.com: On Mon, Dec 15, 2014 at 2:38 PM, Andrew Gierth and...@tao11.riddles.org.uk wrote: I was thinking something like this, added just after that para:

<warning>
 <para>
  While the actual arguments to the function remain unchanged between
  calls, if you detoast the argument values (which is normally done
  transparently by the
  <function>PG_GETARG_<replaceable>xxx</replaceable></function> macro)
  in the transient context then the detoasted copies will be freed on
  each cycle.  Accordingly, if you keep references to such values in
  your <structfield>user_fctx</>, you must either copy them into the
  <structfield>multi_call_memory_ctx</> after detoasting, or ensure
  that you detoast the values only in that context.
 </para>
</warning>

I'm OK with this. Wrapping the doc changes in a patch. Will add to next commitfest so it won't be lost. Pushed. Thanks! Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
On Fri, Feb 20, 2015 at 9:33 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Feb 20, 2015 at 9:12 PM, Fujii Masao masao.fu...@gmail.com wrote: On Tue, Feb 10, 2015 at 10:32 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Mon, Feb 9, 2015 at 8:29 PM, Fujii Masao masao.fu...@gmail.com wrote: On Sun, Feb 8, 2015 at 2:54 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote:

- * Wait for more WAL to arrive. Time out after 5 seconds,
- * like when polling the archive, to react to a trigger
- * file promptly.
+ * Wait for more WAL to arrive. Time out after the amount of
+ * time specified by wal_retrieve_retry_interval, like
+ * when polling the archive, to react to a trigger file promptly.
  */
  WaitLatch(XLogCtl->recoveryWakeupLatch, WL_LATCH_SET | WL_TIMEOUT,
- 5000L);
+ wal_retrieve_retry_interval * 1000L);

This change can prevent the startup process from reacting to a trigger file. Imagine the case where a large interval is set and the user wants to promote the standby by using the trigger file instead of pg_ctl promote. I think that the sleep time should be 5s if the interval is set to more than 5s. Thought? I disagree here. It is interesting to accelerate the check of WAL availability from a source in some cases for replication, but the opposite is true as well, as mentioned by Alexey at the beginning of the thread, to reduce the number of requests when requesting WAL archives from an external storage type like AWS. Hence a correct solution would be to check periodically for the trigger file with a maximum one-time wait of 5s to ensure backward-compatible behavior. We could reduce it to 1s or something like that as well. You seem to have misunderstood the code in question. Or I'm missing something. The timeout of the WaitLatch is just the interval to check for the trigger file while waiting for more WAL to arrive from streaming replication. Not related to the retry time to restore WAL from the archive.
[Re-reading the code...] Aah.. Yes you are right. Sorry for the noise. Yes, let's wait for a maximum of 5s then. I also noticed in the previous patch that the wait was capped at 5s. To begin with, a loop should have been used if it was a sleep, but as the patch now uses a latch this limit does not make much sense... An updated patch is attached. On second thought, the interval to check the trigger file is very different from the wait time to retry to retrieve WAL. So it seems strange and even confusing to control them by one parameter. If we really want to make the interval for the trigger file configurable, we should invent a new GUC for it. But I don't think that it's worth doing that. If someone wants to react to the trigger file more promptly for fast promotion, he or she can basically use the pg_ctl promote command instead. Thought? Hm, OK. Attached is the updated version of the patch. I changed the parameter so that it doesn't affect the interval of checking the trigger file.

- static pg_time_t last_fail_time = 0;
- pg_time_t now;
+ TimestampTz now = GetCurrentTimestamp();
+ TimestampTz last_fail_time = now;

I reverted the code here as it was. I don't think GetCurrentTimestamp() needs to be called for each WaitForWALToBecomeAvailable().

+ WaitLatch(XLogCtl->recoveryWakeupLatch,
+ WL_LATCH_SET | WL_TIMEOUT,
+ wait_time / 1000);

Don't we need to specify the WL_POSTMASTER_DEATH flag here? Added. Yeah, I am wondering why this has not been added after 89fd72cb though.

+ {"wal_retrieve_retry_interval", PGC_SIGHUP, WAL_SETTINGS,

Should WAL_SETTINGS be REPLICATION_STANDBY? Changed. Sure. So I pushed the patch. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Mon, Feb 23, 2015 at 5:28 PM, Rahila Syed rahilasye...@gmail.com wrote: Hello, Attached is a patch which has the following changes. As suggested above, block ID in xlog structs has been replaced by chunk ID. Chunk ID is used to distinguish between different types of xlog record fragments, like XLR_CHUNK_ID_DATA_SHORT, XLR_CHUNK_ID_DATA_LONG, XLR_CHUNK_BKP_COMPRESSED, and XLR_CHUNK_BKP_WITH_HOLE. In block references, block ID follows the chunk ID. Here block ID retains its functionality. This approach increases data by 1 byte for each block reference in an xlog record. This approach separates the IDs referring to different fragments of an xlog record from the actual block ID which is used to refer to block references in an xlog record. I've not read this logic yet, but ISTM there is a bug in that new WAL format because I got the following error and the startup process could not replay any WAL records when I set up replication and enabled wal_compression.

LOG: record with invalid length at 0/3B0
LOG: record with invalid length at 0/3000518
LOG: Invalid block length in record 0/30005A0
LOG: Invalid block length in record 0/3000D60
...

Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
On Tue, Feb 10, 2015 at 10:32 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Mon, Feb 9, 2015 at 8:29 PM, Fujii Masao masao.fu...@gmail.com wrote: On Sun, Feb 8, 2015 at 2:54 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote:

- * Wait for more WAL to arrive. Time out after 5 seconds,
- * like when polling the archive, to react to a trigger
- * file promptly.
+ * Wait for more WAL to arrive. Time out after the amount of
+ * time specified by wal_retrieve_retry_interval, like
+ * when polling the archive, to react to a trigger file promptly.
  */
  WaitLatch(XLogCtl->recoveryWakeupLatch, WL_LATCH_SET | WL_TIMEOUT,
- 5000L);
+ wal_retrieve_retry_interval * 1000L);

This change can prevent the startup process from reacting to a trigger file. Imagine the case where a large interval is set and the user wants to promote the standby by using the trigger file instead of pg_ctl promote. I think that the sleep time should be 5s if the interval is set to more than 5s. Thought? I disagree here. It is interesting to accelerate the check of WAL availability from a source in some cases for replication, but the opposite is true as well, as mentioned by Alexey at the beginning of the thread, to reduce the number of requests when requesting WAL archives from an external storage type like AWS. Hence a correct solution would be to check periodically for the trigger file with a maximum one-time wait of 5s to ensure backward-compatible behavior. We could reduce it to 1s or something like that as well. You seem to have misunderstood the code in question. Or I'm missing something. The timeout of the WaitLatch is just the interval to check for the trigger file while waiting for more WAL to arrive from streaming replication. Not related to the retry time to restore WAL from the archive. [Re-reading the code...] Aah.. Yes you are right. Sorry for the noise. Yes, let's wait for a maximum of 5s then.
I also noticed in the previous patch that the wait was capped at 5s. To begin with, a loop should have been used if it was a sleep, but as the patch now uses a latch this limit does not make much sense... An updated patch is attached. On second thought, the interval to check the trigger file is very different from the wait time to retry to retrieve WAL. So it seems strange and even confusing to control them by one parameter. If we really want to make the interval for the trigger file configurable, we should invent a new GUC for it. But I don't think that it's worth doing that. If someone wants to react to the trigger file more promptly for fast promotion, he or she can basically use the pg_ctl promote command instead. Thought? Attached is the updated version of the patch. I changed the parameter so that it doesn't affect the interval of checking the trigger file.

- static pg_time_t last_fail_time = 0;
- pg_time_t now;
+ TimestampTz now = GetCurrentTimestamp();
+ TimestampTz last_fail_time = now;

I reverted the code here as it was. I don't think GetCurrentTimestamp() needs to be called for each WaitForWALToBecomeAvailable().

+ WaitLatch(XLogCtl->recoveryWakeupLatch,
+ WL_LATCH_SET | WL_TIMEOUT,
+ wait_time / 1000);

Don't we need to specify the WL_POSTMASTER_DEATH flag here? Added.

+ {"wal_retrieve_retry_interval", PGC_SIGHUP, WAL_SETTINGS,

Should WAL_SETTINGS be REPLICATION_STANDBY? Changed.
Regards, -- Fujii Masao

*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2985,2990 **** include_dir 'conf.d'
--- 2985,3008 ----
      </listitem>
     </varlistentry>

+    <varlistentry id="guc-wal-retrieve-retry-interval" xreflabel="wal_retrieve_retry_interval">
+     <term><varname>wal_retrieve_retry_interval</varname> (<type>integer</type>)
+     <indexterm>
+      <primary><varname>wal_retrieve_retry_interval</> configuration parameter</primary>
+     </indexterm>
+     </term>
+     <listitem>
+      <para>
+       Specify the amount of time, in milliseconds, to wait when WAL is not
+       available from any sources (streaming replication,
+       local <filename>pg_xlog</> or WAL archive) before retrying to
+       retrieve WAL.  This parameter can only be set in the
+       <filename>postgresql.conf</> file or on the server command line.
+       The default value is 5 seconds.
+      </para>
+     </listitem>
+    </varlistentry>
+
    </variablelist>
   </sect2>
  </sect1>

*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 93,98 **** int sync_method = DEFAULT_SYNC_METHOD;
--- 93,99 ----
 int wal_level
Re: [HACKERS] pg_basebackup may fail to send feedbacks.
On Wed, Feb 18, 2015 at 5:34 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Hello, this is the last patch for pg_basebackup/pg_receivexlog on master (9.5). Prior versions don't have this issue. 4. basebackup_reply_fix_mst_v2.patch receivelog.c patch applyable on master. This is based on the same design as walrcv_reply_fix_91_v2.patch in the aspect of gettimeofday(). Thanks for updating the patches! But I'm still not sure if the idea depending on frequent calls of gettimeofday() for each WAL receive is good or not. Some users may complain about the performance impact of such frequent calls and we may want to get rid of them from the walreceiver loop in the future. If we adopt your idea now, I'm afraid that it would tie our hands in that case. How much impact can such frequent calls of gettimeofday() have on replication performance? If it's not negligible, probably we should remove them first and find another idea to fix the problem you pointed out. ISTM that it's not so difficult to remove them. Thought? Do you have any numbers which can prove that such frequent gettimeofday() calls have only an ignorable impact on the performance? Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
On Wed, Feb 18, 2015 at 1:26 AM, David Steele da...@pgmasters.net wrote: On 2/17/15 10:23 AM, Simon Riggs wrote: I vote to include pgaudit in 9.5, albeit with any changes. In particular, David may have some changes to recommend, but I haven't seen a spec or a patch, just a new version of code (which isn't how we do things...). I submitted the new patch in my name under a separate thread Auditing extension for PostgreSQL (Take 2) (54e005cc.1060...@pgmasters.net) I played with the patch version of pg_audit a bit and have basic comments about its spec. The pg_audit doesn't log BIND parameter values when a prepared statement is used. Seems this is an oversight of the patch. Or is this intentional? The pg_audit cannot log a statement like SELECT 1 which doesn't access any database object. Is this intentional? I think that there are many users who want to audit even such statements. Imagine the case where you call a user-defined function which executes many nested statements. In this case, pg_audit logs only the top-level statement (i.e., the one issued directly by the client) every time a nested statement is executed. In fact, one call of such a UDF can cause lots of *identical* log messages. I think this is problematic. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_basebackup -x/X doesn't play well with archive_mode wal_keep_segments
On Fri, Feb 13, 2015 at 9:18 AM, Andres Freund and...@2ndquadrant.com wrote: On 2015-02-12 11:44:05 -0800, Sergey Konoplev wrote: On Thu, Feb 12, 2015 at 11:40 AM, Andres Freund and...@2ndquadrant.com wrote: This obviously should not be the case. I'll have a look in a couple of hours. Until then you can likely just work around the problem by creating the archive_status directory. Thank you. Just let me know if you need some extra info or debugging. No need for debugging. It's plainly and simply a (cherry-pick) conflict I resolved wrongly during backpatching. 9.3, 9.4 and master do not have that problem. That whole fix was quite painful because every single release had significantly different code :(. pg_basebackup/ is pretty messy. I'm not sure why my testsuite didn't trigger that problem. Possibly because a retry makes things work :( Somewhat luckily it's 9.2 only (9.3, 9.4 and master look correct, earlier releases don't have pg_receivexlog) Are you planning to back-patch the fix to 9.2? Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_basebackup may fail to send feedbacks.
On Tue, Feb 10, 2015 at 7:48 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Hello, The attached patch is a fix for walreceiver not using gettimeofday, and a fix for receivelog using it. XLogWalRcvProcessMsg doesn't send feedback when processing a 'w' (WAL record) packet. So the same thing as with pg_basebackup and pg_receivexlog will occur on walsender. Exiting the for (;;) loop by TimestampDifferenceExceeds just before line 439 is an equivalent measure to what I proposed for receivelog.c, but calling XLogWalRcvSendHSFeedback(false) there is seemingly simpler (but I feel a bit uncomfortable with the latter). I'm concerned about the performance degradation from calling gettimeofday() per record. Or other measures? Introduce a maximum number of records that we can receive and process between feedbacks? For example, we can change XLogWalRcvSendHSFeedback() so that it's called per at least 8 records. I'm not sure what number is good, though... At the beginning of the while (len > 0) { if (len > 0) { block, last_recv_timestamp is always kept up to date (by using gettimeofday()). So we can use the variable instead of gettimeofday() in my previous proposal. Yes, but something like last_recv_timestamp doesn't exist in REL9_1_STABLE. So you don't think that your proposed fix should be back-patched to all supported versions. Right? The start time of the timeout could be last_recv_timestamp at the beginning of the while loop, since it is a bit earlier than the actual time at that point. Sounds strange to me. I think that last_recv_timestamp should be compared with the time when the last feedback message was sent, e.g., maybe you can expose sendTime in XLogWalRcvSendReply() as a global variable, and use it to compare with last_recv_timestamp. When we get out of the WAL receive loop due to the timeout check which your patch added, XLogWalRcvFlush() is always executed. I don't think that the current WAL file needs to be flushed at that time.
Another solution would be using RegisterTimeout/enable_timeout_after, and it seemed to work for me. It can throw out the time-counting stuff in the loop we are talking about and that of XLogWalRcvSendHSFeedback and XLogWalRcvSendReply, but it might be a bit too large for the gain. Yes, sounds overkill. Regards, -- Fujii Masao
Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
On Sun, Feb 8, 2015 at 2:54 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote:

- * Wait for more WAL to arrive. Time out after 5 seconds,
- * like when polling the archive, to react to a trigger
- * file promptly.
+ * Wait for more WAL to arrive. Time out after the amount of
+ * time specified by wal_retrieve_retry_interval, like
+ * when polling the archive, to react to a trigger file promptly.
  */
  WaitLatch(&XLogCtl->recoveryWakeupLatch,
            WL_LATCH_SET | WL_TIMEOUT,
-           5000L);
+           wal_retrieve_retry_interval * 1000L);

This change can prevent the startup process from reacting to a trigger file. Imagine the case where a large interval is set and the user wants to promote the standby by using the trigger file instead of pg_ctl promote. I think that the sleep time should be 5s if the interval is set to more than 5s. Thought? I disagree here. It is interesting to accelerate the check of WAL availability from a source in some cases for replication, but the opposite is true as well, as mentioned by Alexey at the beginning of the thread: reducing the number of requests when requesting WAL archives from an external storage type like AWS. Hence a correct solution would be to check periodically for the trigger file with a maximum one-time wait of 5s to ensure backward-compatible behavior. We could reduce it to 1s or something like that as well. You seem to have misunderstood the code in question. Or I'm missing something. The timeout of the WaitLatch is just the interval to check for the trigger file while waiting for more WAL to arrive from streaming replication. It is not related to the retry time to restore WAL from the archive. Regards, -- Fujii Masao
Re: [HACKERS] The return value of allocate_recordbuf()
On Mon, Jan 5, 2015 at 1:25 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Jan 1, 2015 at 1:10 AM, Robert Haas robertmh...@gmail.com wrote: On Mon, Dec 29, 2014 at 6:14 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Hmm. There is no way to check beforehand if a palloc() will fail because of OOM. We could check for MaxAllocSize, though. I think we need a version of palloc that returns NULL instead of throwing an error. The error-throwing behavior is for the best in almost every case, but I think the no-error version would find enough users to be worthwhile. Compression is one of those areas, be it compression of WAL or another type. The new API would allow falling back to the non-compression code path if buffer allocation for compression cannot be done because of an OOM. FWIW, I actually looked at how to do that a couple of weeks back, and you just need a wrapper function, whose content is the existing AllocSetAlloc, taking an additional boolean flag to trigger an ERROR or leave with NULL if an OOM appears. On top of that we will need a new method in MemoryContextMethods, let's call it alloc_safe, for its equivalent, the new palloc_safe. MemoryContextAllocExtended() was added, so isn't it time to replace palloc() with MemoryContextAllocExtended(CurrentMemoryContext, MCXT_ALLOC_NO_OOM) in allocate_recordbuf()? Regards, -- Fujii Masao

*** a/src/backend/access/transam/xlogreader.c
--- b/src/backend/access/transam/xlogreader.c
*** 149,155 **** allocate_recordbuf(XLogReaderState *state, uint32 reclength)
  if (state->readRecordBuf)
  pfree(state->readRecordBuf);
! state->readRecordBuf = (char *) palloc(newSize);
  state->readRecordBufSize = newSize;
  return true;
  }
--- 149,163 ----
  if (state->readRecordBuf)
  pfree(state->readRecordBuf);
! state->readRecordBuf =
!     (char *) MemoryContextAllocExtended(CurrentMemoryContext,
!                                         newSize,
!                                         MCXT_ALLOC_NO_OOM);
! if (state->readRecordBuf == NULL)
! {
!     state->readRecordBufSize = 0;
!     return false;
! }
  state->readRecordBufSize = newSize;
  return true;
  }
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Fri, Feb 6, 2015 at 3:42 AM, Michael Paquier michael.paqu...@gmail.com wrote: Fujii Masao wrote: I wrote: This is an inspiration from the lz4 APIs. Wouldn't it be buggy for a compression algorithm to return a size of 0 bytes as compressed or decompressed length, btw? We could as well make it return a negative value when a failure occurs, if you feel more comfortable with it. I feel that's better. Attached is the updated version of the patch. I changed pg_lzcompress() and pg_lzdecompress() so that they return -1 when a failure happens. Also I applied some cosmetic changes to the patch (e.g., shortened the long names of the newly-added macros). Barring any objection, I will commit this. I just had a look at your updated version, ran some sanity tests, and things look good to me. The new names of the macros at the top of tuptoaster.c are clearer as well. Thanks for the review! Pushed! Regards, -- Fujii Masao
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Fri, Feb 6, 2015 at 4:15 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Feb 5, 2015 at 11:06 PM, Syed, Rahila rahila.s...@nttdata.com wrote:

+ /*
+ * We recheck the actual size even if pglz_compress() reports success,
+ * because it might be satisfied with having saved as little as one byte
+ * in the compressed data.
+ */
+ *len = (uint16) compressed_len;
+ if (*len >= orig_len - 1)
+     return false;
+ return true;
+ }

As per the latest code, when compression is 'on' we introduce two additional bytes in the header of each block image for storing the raw_length of the compressed block. In order to achieve compression while accounting for these two additional bytes, we must ensure that the compressed length is less than the original length - 2. So, IIUC, the above condition should rather be: if (*len >= orig_len - 2) return false; The 2 should be replaced with a macro indicating the size of the extra header for a compressed backup block. Do we always need the extra two bytes for a compressed backup block? ISTM that the extra bytes are not necessary when the hole length is zero. In this case the length of the original backup block (i.e., uncompressed) must be BLCKSZ, so we don't need to save the original size in the extra bytes. Furthermore, when fpw compression is disabled and the hole length is zero, we seem to be able to save one byte from the header of the backup block. Currently we use 4 bytes for the header: 2 bytes for the length of the backup block, 15 bits for the hole offset and 1 bit for the flag indicating whether the block is compressed or not. But in that case, the length of the backup block doesn't need to be stored because it must be BLCKSZ. Shouldn't we optimize the header in this way? Thought?

+ int page_len = BLCKSZ - hole_length;
+ char *scratch_buf;
+ if (hole_length != 0)
+ {
+     scratch_buf = compression_scratch;
+     memcpy(scratch_buf, page, hole_offset);
+     memcpy(scratch_buf + hole_offset,
+            page + (hole_offset + hole_length),
+            BLCKSZ - (hole_length + hole_offset));
+ }
+ else
+     scratch_buf = page;
+
+ /* Perform compression of block */
+ if (XLogCompressBackupBlock(scratch_buf,
+                             page_len,
+                             regbuf->compressed_page,
+                             &compress_len))
+ {
+     /* compression is done, add record */
+     is_compressed = true;
+ }

You can refactor XLogCompressBackupBlock() and move all the above code into it for more simplicity. Regards, -- Fujii Masao
Re: [HACKERS] pg_basebackup may fail to send feedbacks.
On Fri, Feb 6, 2015 at 3:22 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Sorry, I misunderstood that. At Wed, 4 Feb 2015 19:22:39 +0900, Fujii Masao masao.fu...@gmail.com wrote in cahgqgwgudgcmnhzinkd37i+jijdkruecrea1ncrs1mmte3r...@mail.gmail.com On Wed, Feb 4, 2015 at 4:58 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: I'm very sorry for the confused report. The problem was found in 9.4.0 and the diagnosis was mistakenly done on master. 9.4.0 has no problem of feedback delay caused by slow xlog receiving on pg_basebackup mentioned in the previous mail. But the current master still has this problem. Seems walreceiver has the same problem. No? pg_receivexlog.c would have the same problem since it uses the same function as pg_basebackup.c. The correspondent of HandleCopyStream in walreceiver is WalReceiverMain, and it doesn't seem to have the same kind of loop shown below. It seems to surely send feedback per one record.

| r = stream_reader();
| while (r > 0)
| {
|     ... wal record processing stuff without sending feedback ...
|     r = stream_reader();
| }

WalReceiverMain() has similar code, as follows:

len = walrcv_receive(NAPTIME_PER_CYCLE, &buf);
if (len != 0)
{
    for (;;)
    {
        if (len > 0)
        {
            len = walrcv_receive(0, &buf);
        }
    }
}

The loop seems a bit different but is eventually the same as far as this discussion is concerned.

408     len = walrcv_receive(NAPTIME_PER_CYCLE, &buf);
409     if (len != 0)
410     {
415         for (;;)
416         {
417             if (len > 0)
418             {
425                 XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
426             }
427-438         else { break; }
439             len = walrcv_receive(0, &buf);
440         }
441     }

XLogWalRcvProcessMsg doesn't send feedback when processing a 'w'=WAL record packet. So the same thing as in pg_basebackup and pg_receivexlog will occur on walreceiver. Exiting the for(;;) loop by TimestampDifferenceExceeds just before line 439 is an equivalent measure to the one I proposed for receivelog.c, but calling XLogWalRcvSendHSFeedback(false) there is seemingly simpler (but I feel a bit uncomfortable with the latter). I'm concerned about the performance degradation from calling gettimeofday() per record. Or other measures? Introduce a maximum number of records that we can receive and process between feedbacks? For example, we can change XLogWalRcvSendHSFeedback() so that it's called per at least 8 records. I'm not sure what number is good, though... Regards, -- Fujii Masao
Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
On Fri, Feb 6, 2015 at 1:23 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Feb 5, 2015 at 11:58 PM, Michael Paquier wrote: An updated patch is attached. I just noticed that the patch I sent was incorrect: - Parameter name was still wal_availability_check_interval and not wal_retrieve_retry_interval - Documentation was incorrect. Please use the patch attached instead for further review. Thanks for updating the patch!

timestamp.c:1708: warning: implicit declaration of function 'HandleStartupProcInterrupts'

I got the above warning at compilation.

+ pg_usleep(wait_time);
+ HandleStartupProcInterrupts();
+ total_time -= wait_time;

It seems strange to call the startup-process-specific function in a generic function like TimestampSleepDifference().

* 5. Sleep 5 seconds, and loop back to 1.

In WaitForWALToBecomeAvailable(), the above comment should be updated.

- * Wait for more WAL to arrive. Time out after 5 seconds,
- * like when polling the archive, to react to a trigger
- * file promptly.
+ * Wait for more WAL to arrive. Time out after the amount of
+ * time specified by wal_retrieve_retry_interval, like
+ * when polling the archive, to react to a trigger file promptly.
  */
  WaitLatch(&XLogCtl->recoveryWakeupLatch,
            WL_LATCH_SET | WL_TIMEOUT,
-           5000L);
+           wal_retrieve_retry_interval * 1000L);

This change can prevent the startup process from reacting to a trigger file. Imagine the case where a large interval is set and the user wants to promote the standby by using the trigger file instead of pg_ctl promote. I think that the sleep time should be 5s if the interval is set to more than 5s. Thought?

+# specifies an optional internal to wait for WAL to become available when

s/internal/interval?

+This parameter specifies the amount of time to wait when a failure
+occurred when reading WAL from a source (be it via streaming
+replication, local filenamepg_xlog/ or WAL archive) for a node
+in standby mode, or when WAL is expected from a source.

At least to me, it seems not easy to understand what the parameter is from this description. What about the following, for example? This parameter specifies the amount of time to wait when WAL is not available from any sources (streaming replication, local filenamepg_xlog/ or WAL archive) before retrying to retrieve WAL. Isn't it better to place this parameter in postgresql.conf rather than recovery.conf? I'd like to change the value of this parameter without restarting the server. Regards, -- Fujii Masao
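Setting the placement question aside (recovery.conf vs. postgresql.conf was still unsettled in this thread), the parameter as proposed would be used like the following recovery.conf sketch; the values and the restore_command path are hypothetical:

```
# recovery.conf -- hypothetical example; parameter placement was still
# under discussion in this thread
standby_mode = 'on'
restore_command = 'cp /mnt/archive/%f "%p"'
wal_retrieve_retry_interval = '10s'   # wait 10s before retrying WAL retrieval
```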
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Tue, Jan 6, 2015 at 11:09 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Mon, Jan 5, 2015 at 10:29 PM, Fujii Masao masao.fu...@gmail.com wrote: On Sun, Dec 28, 2014 at 10:57 PM, Michael Paquier wrote: The patch 1 cannot be applied to the master successfully because of recent change. Yes, that's caused by ccb161b. Attached are rebased versions. - The real stuff comes with patch 2, that implements the removal of PGLZ_Header, changing the APIs of compression and decompression to pglz to not have anymore toast metadata, this metadata being now localized in tuptoaster.c. Note that this patch protects the on-disk format (tested with pg_upgrade from 9.4 to a patched HEAD server). Here is how the APIs of compression and decompression look like with this patch, simply performing operations from a source to a destination: extern int32 pglz_compress(const char *source, int32 slen, char *dest, const PGLZ_Strategy *strategy); extern int32 pglz_decompress(const char *source, char *dest, int32 compressed_size, int32 raw_size); The return value of those functions is the number of bytes written in the destination buffer, and 0 if operation failed. So it's guaranteed that 0 is never returned in success case? I'm not sure if that case can really happen, though. This is an inspiration from lz4 APIs. Wouldn't it be buggy for a compression algorithm to return a size of 0 bytes as compressed or decompressed length btw? We could as well make it return a negative value when a failure occurs if you feel more comfortable with it. I feel that's better. Attached is the updated version of the patch. I changed the pg_lzcompress and pg_lzdecompress so that they return -1 when failure happens. Also I applied some cosmetic changes to the patch (e.g., shorten the long name of the newly-added macros). Barring any objection, I will commit this. 
Regards, -- Fujii Masao

*** a/src/backend/access/heap/tuptoaster.c
--- b/src/backend/access/heap/tuptoaster.c
*** 35,43 ****
  #include "access/tuptoaster.h"
  #include "access/xact.h"
  #include "catalog/catalog.h"
  #include "miscadmin.h"
  #include "utils/fmgroids.h"
- #include "utils/pg_lzcompress.h"
  #include "utils/rel.h"
  #include "utils/typcache.h"
  #include "utils/tqual.h"
--- 35,43 ----
  #include "access/tuptoaster.h"
  #include "access/xact.h"
  #include "catalog/catalog.h"
+ #include "common/pg_lzcompress.h"
  #include "miscadmin.h"
  #include "utils/fmgroids.h"
  #include "utils/rel.h"
  #include "utils/typcache.h"
  #include "utils/tqual.h"
*** 45,50 **** --- 45,70 ----
  #undef TOAST_DEBUG
+ /*
+  * The information at the start of the compressed toast data.
+  */
+ typedef struct toast_compress_header
+ {
+     int32 vl_len_;   /* varlena header (do not touch directly!) */
+     int32 rawsize;
+ } toast_compress_header;
+
+ /*
+  * Utilities for manipulation of header information for compressed
+  * toast entries.
+  */
+ #define TOAST_COMPRESS_HDRSZ ((int32) sizeof(toast_compress_header))
+ #define TOAST_COMPRESS_RAWSIZE(ptr) (((toast_compress_header *) ptr)->rawsize)
+ #define TOAST_COMPRESS_RAWDATA(ptr) \
+     (((char *) ptr) + TOAST_COMPRESS_HDRSZ)
+ #define TOAST_COMPRESS_SET_RAWSIZE(ptr, len) \
+     (((toast_compress_header *) ptr)->rawsize = len)
+
  static void toast_delete_datum(Relation rel, Datum value);
  static Datum toast_save_datum(Relation rel, Datum value, struct varlena * oldexternal, int options);
*** 53,58 **** --- 73,79 ----
  static bool toastid_valueid_exists(Oid toastrelid, Oid valueid);
  static struct varlena *toast_fetch_datum(struct varlena * attr);
  static struct varlena *toast_fetch_datum_slice(struct varlena * attr, int32 sliceoffset, int32 length);
+ static struct varlena *toast_decompress_datum(struct varlena * attr);
  static int toast_open_indexes(Relation toastrel, LOCKMODE lock, Relation **toastidxs,
*** 138,148 **** heap_tuple_untoast_attr(struct varlena * attr)
  /* If it's compressed, decompress it */
  if (VARATT_IS_COMPRESSED(attr))
  {
!     PGLZ_Header *tmp = (PGLZ_Header *) attr;
!
!     attr = (struct varlena *) palloc(PGLZ_RAW_SIZE(tmp) + VARHDRSZ);
!     SET_VARSIZE(attr, PGLZ_RAW_SIZE(tmp) + VARHDRSZ);
!     pglz_decompress(tmp, VARDATA(attr));
      pfree(tmp);
  }
  }
--- 159,166 ----
  /* If it's compressed, decompress it */
  if (VARATT_IS_COMPRESSED(attr))
  {
!     struct varlena *tmp = attr;
!     attr = toast_decompress_datum(tmp);
      pfree(tmp);
  }
  }
*** 163,173 **** heap_tuple_untoast_attr(struct varlena * attr)
  /*
   * This is a compressed value inside of the main tuple
   */
!     PGLZ_Header *tmp = (PGLZ_Header *) attr;
!
!     attr = (struct varlena *) palloc(PGLZ_RAW_SIZE(tmp) + VARHDRSZ);
!     SET_VARSIZE(attr, PGLZ_RAW_SIZE(tmp) + VARHDRSZ);
!     pglz_decompress(tmp, VARDATA(attr));
  } else if (VARATT_IS_SHORT(attr
Re: [HACKERS] pg_basebackup may fail to send feedbacks.
On Thu, Feb 5, 2015 at 10:20 AM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Hello, At Wed, 4 Feb 2015 19:22:39 +0900, Fujii Masao masao.fu...@gmail.com wrote in cahgqgwgudgcmnhzinkd37i+jijdkruecrea1ncrs1mmte3r...@mail.gmail.com On Wed, Feb 4, 2015 at 4:58 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: I'm very sorry for the confused report. The problem was found in 9.4.0 and the diagnosis was mistakenly done on master. 9.4.0 has no problem of feedback delay caused by slow xlog receiving on pg_basebackup mentioned in the previous mail. But the current master still has this problem. Seems walreceiver has the same problem. No? pg_receivexlog.c would have the same problem since it uses the same function as pg_basebackup.c. The correspondent of HandleCopyStream in walreceiver is WalReceiverMain, and it doesn't seem to have the same kind of loop shown below. It seems to surely send feedback per one record.

| r = stream_reader();
| while (r > 0)
| {
|     ... wal record processing stuff without sending feedback ...
|     r = stream_reader();
| }

WalReceiverMain() has similar code, as follows:

len = walrcv_receive(NAPTIME_PER_CYCLE, &buf);
if (len != 0)
{
    for (;;)
    {
        if (len > 0)
        {
            len = walrcv_receive(0, &buf);
        }
    }
}

Regards, -- Fujii Masao
Re: [HACKERS] Docs: CREATE TABLESPACE minor markup fix
On Wed, Feb 4, 2015 at 5:27 PM, Ian Barwick i...@2ndquadrant.com wrote: Hi A superfluous '/' in an xref tag is producing an unintended '' in the Warning box on this page: http://www.postgresql.org/docs/current/interactive/sql-createtablespace.html I found that logicaldecoding.sgml also has the same typo. Fixed both. Thanks! Regards, -- Fujii Masao
Re: [HACKERS] pg_basebackup may fail to send feedbacks.
On Wed, Feb 4, 2015 at 4:58 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: I'm very sorry for the confused report. The problem was found in 9.4.0 and the diagnosis was mistakenly done on master. 9.4.0 has no problem of feedback delay caused by slow xlog receiving on pg_basebackup mentioned in the previous mail. But the current master still has this problem. Seems walreceiver has the same problem. No? Regards, -- Fujii Masao
Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
On Tue, Jan 20, 2015 at 12:57 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Tue, Jan 20, 2015 at 1:56 AM, Andres Freund and...@2ndquadrant.com wrote: On 2015-01-19 17:16:11 +0900, Michael Paquier wrote: On Mon, Jan 19, 2015 at 4:10 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Sat, Jan 17, 2015 at 2:44 AM, Andres Freund and...@2ndquadrant.com wrote: Not this patch's fault, but I'm getting a bit tired of seeing the above open-coded. How about adding a function that does the sleeping based on a timestamptz and a ms interval? You mean in plugins, right? I don't recall seeing similar patterns in other code paths of the backend. But I think that we can use something like that in timestamp.c then, because we need to leverage that between two timestamps, the last failure and now(): TimestampSleepDifference(start_time, stop_time, interval_ms); Perhaps you have something else in mind? Attached is an updated patch. Actually I came up with something better than the last patch by using a boolean flag as the return value of TimestampSleepDifference and using TimestampDifferenceExceeds directly inside it.

Subject: [PATCH] Add wal_availability_check_interval to control WAL fetching on failure

I think that name isn't very good. And it isn't very accurate either. How about wal_retrieve_retry_interval? Not very nice, but I think it's still better than the above. Fine for me.

+ varlistentry id=wal-availability-check-interval xreflabel=wal_availability_check_interval
+ termvarnamewal_availability_check_interval/varname (typeinteger/type)
+ indexterm
+ primaryvarnamewal_availability_check_interval/ recovery parameter/primary
+ /indexterm
+ /term
+ listitem
+ para
+ This parameter specifies the amount of time to wait when
+ WAL is not available for a node in recovery. Default value is
+ literal5s/.
+ /para
+ para
+ A node in recovery will wait for this amount of time if
+ varnamerestore_command/ returns a nonzero exit status code when
+ fetching new WAL segment files from archive or when a WAL receiver
+ is not able to fetch a WAL record when using streaming replication.
+ /para
+ /listitem
+ /varlistentry
+ /variablelist

Walreceiver doesn't wait that amount, but rather for how long the connection is intact. And restore_command may or may not retry. I changed this paragraph as follows:

+ This parameter specifies the amount of time to wait when a failure
+ occurred when reading WAL from a source (be it via streaming
+ replication, local filenamepg_xlog/ or WAL archive) for a node
+ in standby mode, or when WAL is expected from a source. Default
+ value is literal5s/.

Pondering more about this patch, I am getting the feeling that it is not that much of a win to explain in detail in the doc each failure depending on the source from which WAL is taken. But it is essential to mention that the wait is done only for a node using standby mode. Btw, I noticed two extra things that I think should be done for consistency: - We should use wal_retrieve_retry_interval as well when waiting for WAL to arrive after checking for the failures:

  /*
- * Wait for more WAL to arrive. Time out after 5 seconds,
- * like when polling the archive, to react to a trigger
- * file promptly.
+ * Wait for more WAL to arrive. Time out after
+ * wal_retrieve_retry_interval ms, like when polling the archive
+ * to react to a trigger file promptly.
  */
  WaitLatch(&XLogCtl->recoveryWakeupLatch,
            WL_LATCH_SET | WL_TIMEOUT,
-           5000L);
+           wal_retrieve_retry_interval * 1000L);

- Updated patch attached.

+ TimestampDifference(start_time, stop_time, &secs, &microsecs);
+ pg_usleep(interval_msec * 1000L - (1000000L * secs + 1L * microsecs));

What if a large interval is set and a signal arrives during the sleep? I'm afraid that a signal cannot terminate the sleep on some platforms. This problem exists even now because pg_usleep is used, but the sleep time is just 5 seconds, so it's not so bad. But the patch allows a user to set a large sleep time. Shouldn't we use WaitLatch or split the pg_usleep like recoveryPausesHere() does? Regards, -- Fujii Masao
Re: [HACKERS] pg_basebackup -x/X doesn't play well with archive_mode wal_keep_segments
On Mon, Jan 5, 2015 at 6:22 PM, Andres Freund and...@2ndquadrant.com wrote: On 2015-01-05 16:22:56 +0900, Fujii Masao wrote: On Sun, Jan 4, 2015 at 5:47 AM, Andres Freund and...@2ndquadrant.com wrote: On 2015-01-03 16:03:36 +0100, Andres Freund wrote: pg_basebackup really changed heavily since its introduction. And desperately needs some restructuring. The patch seems to break pg_receivexlog. I got the following error message while running pg_receivexlog. pg_receivexlog: could not create archive status file mmm/archive_status/00010003.done: No such file or directory Dang. Stupid typo. And my tests didn't catch it, because I had archive_directory in the target directory :( At least it's only broken in master :/ Thanks for the catch. Do you have an additional testsuite or did you catch it manually? Manually... I just tested the tools and options which the patch may affect... Regards, -- Fujii Masao
Re: [HACKERS] Compression of full-page-writes
On Sat, Jan 3, 2015 at 1:52 AM, Bruce Momjian br...@momjian.us wrote: On Fri, Jan 2, 2015 at 10:15:57AM -0600, k...@rice.edu wrote: On Fri, Jan 02, 2015 at 01:01:06PM +0100, Andres Freund wrote: On 2014-12-31 16:09:31 -0500, Bruce Momjian wrote: I still don't understand the value of adding WAL compression, given the high CPU usage and minimal performance improvement. The only big advantage is WAL storage, but again, why not just compress the WAL file when archiving. before: pg_xlog is 800GB after: pg_xlog is 600GB. I'm damned sure that many people would be happy with that, even if the *per backend* overhead is a bit higher. And no, compression of archives when archiving helps *zap* with that (streaming, wal_keep_segments, checkpoint_timeout). As discussed before. Greetings, Andres Freund +1 On an I/O constrained system, assuming 50:50 table:WAL I/O, in the case above you can process 100GB of transaction data at the cost of a bit more CPU. OK, so given your stats, the feature gives a 12.5% reduction in I/O. If that is significant, shouldn't we see a performance improvement? If we don't see a performance improvement, is I/O reduction worthwhile? Is it valuable in that it gives non-database applications more I/O to use? Is that all? I suggest we at least document that this feature is mostly useful for I/O reduction, and maybe say CPU usage and performance might be negatively impacted. OK, here is the email I remember from Fujii Masao in this same thread that showed a performance improvement for WAL compression: http://www.postgresql.org/message-id/CAHGQGwGqG8e9YN0fNCUZqTTT=hnr7ly516kft5ffqf4pp1q...@mail.gmail.com Why are we not seeing the 33% compression and 15% performance improvement he saw? Because the benchmarks Michael and I used are very different. I just used pgbench, but he used his simple test SQLs (see http://www.postgresql.org/message-id/cab7npqsc97o-ue5paxfmukwcxe_jioyxo1m4a0pmnmyqane...@mail.gmail.com).
Furthermore, the data type of the pgbench_accounts.filler column is character(84) and its content is empty, so pgbench_accounts is very compressible. This is one of the reasons I could see a good performance improvement and a high compression ratio. Regards, -- Fujii Masao
Re: [HACKERS] Compression of full-page-writes
On Sat, Jan 3, 2015 at 2:24 AM, Bruce Momjian br...@momjian.us wrote: On Fri, Jan 2, 2015 at 02:18:12PM -0300, Claudio Freire wrote: On Fri, Jan 2, 2015 at 2:11 PM, Andres Freund and...@2ndquadrant.com wrote: I now see the compression patch as something that has negatives, so has to be set by the user, and only wins in certain cases. I am disappointed, and am trying to figure out how this became such a marginal win for 9.5. :-( I find the notion that a multi-digit space reduction is a marginal win pretty ridiculous and way too narrowly focused. Our WAL volume is a *significant* problem in the field. And it mostly consists of FPWs space-wise. One thing I'd like to point out is that in cases where WAL I/O is an issue (ie: WAL archiving), people usually already compress the segments during archiving. I know I do, and I know it's recommended on the web, and by some consultants. So, I wouldn't want this FPW compression, which is desirable in replication scenarios if you can spare the CPU cycles (because of streaming), adversely affecting WAL compression during archiving. To be specific, desirable in streaming replication scenarios that don't use SSL compression. (What percentage is that?) Is it something we should mention in the docs for this feature? Even if SSL is used in replication, FPW compression is useful. It can reduce the amount of I/O on the standby side. Sometimes I've seen the walreceiver's I/O become a performance bottleneck, especially in synchronous replication cases. FPW compression can be useful for those cases, for example. Regards, -- Fujii Masao
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Sun, Dec 28, 2014 at 10:57 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Dec 26, 2014 at 4:16 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Dec 26, 2014 at 3:24 PM, Fujii Masao masao.fu...@gmail.com wrote: pglz_compress() and pglz_decompress() still use PGLZ_Header, so the frontend which uses those functions needs to handle PGLZ_Header. But it basically should be handled via the varlena macros. That is, the frontend still seems to need to understand the varlena datatype. I think we should avoid that. Thought? Hm, yes it may be wiser to remove it and make the data passed to pglz for varlena 8 bytes shorter.. OK, here is the result of this work, made of 3 patches. Thanks for updating the patches! The first two patches move pglz stuff to src/common and make it a frontend utility entirely independent on varlena and its related metadata. - Patch 1 is a simple move of pglz to src/common, with PGLZ_Header still present. There is nothing amazing here, and that's the broken version that has been reverted in 966115c. The patch 1 cannot be applied to the master successfully because of recent change. - The real stuff comes with patch 2, that implements the removal of PGLZ_Header, changing the APIs of compression and decompression to pglz to not have anymore toast metadata, this metadata being now localized in tuptoaster.c. Note that this patch protects the on-disk format (tested with pg_upgrade from 9.4 to a patched HEAD server). Here is how the APIs of compression and decompression look like with this patch, simply performing operations from a source to a destination: extern int32 pglz_compress(const char *source, int32 slen, char *dest, const PGLZ_Strategy *strategy); extern int32 pglz_decompress(const char *source, char *dest, int32 compressed_size, int32 raw_size); The return value of those functions is the number of bytes written in the destination buffer, and 0 if operation failed. 
So is it guaranteed that 0 is never returned in the success case? I'm not sure if that case can really happen, though. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
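The calling convention being discussed - return the number of bytes written on success, 0 on failure - can be illustrated with a toy run-length coder. This is purely illustrative, not the pglz code; all names here are invented:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Toy compressor following the convention discussed above: return the
 * number of bytes written to "dest", or 0 if the input cannot be packed
 * into fewer bytes than "slen" (i.e. it is incompressible).  The
 * "compression" here is a trivial (byte, run-length) pair encoding.
 */
static int32_t
toy_compress(const char *source, int32_t slen, char *dest)
{
    int32_t written = 0;
    int32_t i = 0;

    while (i < slen)
    {
        char    c = source[i];
        int32_t run = 1;

        while (i + run < slen && source[i + run] == c && run < 255)
            run++;
        if (written + 2 > slen)
            return 0;           /* would not shrink the data: failure */
        dest[written++] = c;
        dest[written++] = (char) run;
        i += run;
    }
    return written;
}

/* Counterpart: returns bytes written to dest, 0 on corrupted input. */
static int32_t
toy_decompress(const char *source, char *dest,
               int32_t compressed_size, int32_t raw_size)
{
    int32_t written = 0;

    for (int32_t i = 0; i + 1 < compressed_size; i += 2)
    {
        char    c = source[i];
        int32_t run = (unsigned char) source[i + 1];

        if (written + run > raw_size)
            return 0;           /* overruns the caller's buffer: failure */
        memset(dest + written, c, run);
        written += run;
    }
    return written;
}

/* Convenience helper: compress then decompress, returning raw bytes out. */
static int32_t
toy_roundtrip(const char *s, int32_t len)
{
    char    c[64], d[64];
    int32_t n = toy_compress(s, len, c);

    if (n == 0)
        return 0;
    return toy_decompress(c, d, n, 64);
}
```

The point of the convention is that a single int32 both reports success and tells the caller how much of the destination buffer is valid - which is why a legitimate 0-byte success result would be ambiguous, as questioned above.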
Re: [HACKERS] Typo in function header for recently added function errhidecontext
On Mon, Jan 5, 2015 at 3:19 PM, Amit Kapila amit.kapil...@gmail.com wrote:

/*
 * errhidestmt --- optionally suppress CONTEXT: field of log entry
 *
 * This should only be used for verbose debugging messages where the repeated
 * inclusion of CONTEXT: bloats the log volume too much.
 */
int errhidecontext(bool hide_ctx)

Here in the function header, the function name should be errhidecontext, not errhidestmt. Fixed. Thanks! Regards, -- Fujii Masao
Re: [HACKERS] pg_basebackup -x/X doesn't play well with archive_mode wal_keep_segments
On Sun, Jan 4, 2015 at 5:47 AM, Andres Freund and...@2ndquadrant.com wrote: On 2015-01-03 16:03:36 +0100, Andres Freund wrote: On 2014-12-31 16:32:19 +0100, Andres Freund wrote: On 2014-12-05 16:18:02 +0900, Fujii Masao wrote: On Fri, Dec 5, 2014 at 9:28 AM, Andres Freund and...@2ndquadrant.com wrote: So I think we just need to make pg_basebackup create the .ready files. s/.ready/.done? If yes, +1. That unfortunately requires changes to both backend and pg_basebackup to support fetch and stream modes respectively. I've attached a preliminary patch for this. I'd appreciate feedback. I plan to commit it in a couple of days, after some more testing/rereading. Attached are two updated patches that I am starting to backport now. I've fixed a couple of minor oversights and tested the patches. Pushed this after some major pain with backporting. Thanks! pg_basebackup has really changed heavily since its introduction, and desperately needs some restructuring. The patch seems to break pg_receivexlog. I got the following error message while running pg_receivexlog: pg_receivexlog: could not create archive status file mmm/archive_status/00010003.done: No such file or directory Regards, -- Fujii Masao
Re: [HACKERS] tracking commit timestamps
On Thu, Dec 4, 2014 at 12:08 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Dec 3, 2014 at 11:54 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Pushed with some extra cosmetic tweaks. I got the following assertion failure when I executed pg_xact_commit_timestamp() in the standby server.

=# select pg_xact_commit_timestamp('1000'::xid);
TRAP: FailedAssertion(!(((oldestCommitTs) != ((TransactionId) 0)) == ((newestCommitTs) != ((TransactionId) 0))), File: commit_ts.c, Line: 315)
server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.
The connection to the server was lost. Attempting reset:
2014-12-04 12:01:08 JST sby1 LOG: server process (PID 15545) was terminated by signal 6: Aborted
2014-12-04 12:01:08 JST sby1 DETAIL: Failed process was running: select pg_xact_commit_timestamp('1000'::xid);

The way to reproduce this problem is:
#1. set up and start the master and standby servers with track_commit_timestamp disabled
#2. enable track_commit_timestamp in the master and restart the master
#3. run some write transactions
#4. enable track_commit_timestamp in the standby and restart the standby
#5. execute select pg_xact_commit_timestamp('1000'::xid) in the standby

BTW, at step #4, I got the following log messages. This might be a hint for this problem.

LOG: file pg_commit_ts/ doesn't exist, reading as zeroes
CONTEXT: xlog redo Transaction/COMMIT: 2014-12-04 12:00:16.428702+09; inval msgs: catcache 59 catcache 58 catcache 59 catcache 58 catcache 45 catcache 44 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608 relcache 16384

This problem still happens in the master. Regards, -- Fujii Masao
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Fri, Dec 26, 2014 at 12:31 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Dec 25, 2014 at 10:10 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Wed, Dec 24, 2014 at 9:03 PM, Michael Paquier michael.paqu...@gmail.com wrote: Returning only a boolean is fine for me (that's what my first patch did), especially if we add at some point hooks for compression and decompression calls. Here is a patch rebased on current HEAD (60838df) for the core feature, with the APIs of pglz using booleans as return values. After the revert of the 1st patch moving pglz to src/common, I have reworked both patches, resulting in the attached. For pglz, the dependency on varlena has been removed to make the code able to run independently on both frontend and backend sides. In order to do that, the APIs of pglz_compress and pglz_decompress have been changed a bit: - pglz_compress returns the number of bytes compressed. - pglz_decompress takes as an additional argument the compressed length of the buffer, and returns the number of bytes decompressed instead of a simple boolean, for consistency with the compression API. PGLZ_Header is not modified to keep the on-disk format intact. pglz_compress() and pglz_decompress() still use PGLZ_Header, so a frontend which uses those functions needs to handle PGLZ_Header. But it basically should be handled via the varlena macros. That is, the frontend still seems to need to understand the varlena datatype. I think we should avoid that. Thought? Regards, -- Fujii Masao
[HACKERS] The return value of allocate_recordbuf()
Hi, While reviewing the FPW compression patch, I found that allocate_recordbuf() always returns TRUE, though its source code comment says that FALSE is returned if out of memory. Its return value is checked in two places, but those checks are clearly useless now. allocate_recordbuf() was introduced by 7fcbf6a, and then changed by 2c03216 so that palloc() is used instead of malloc() and FALSE is never returned even if out of memory. So this seems an oversight of 2c03216. Maybe we should change it so that it checks whether the requested amount of memory can be allocated before actually allocating it? Regards, -- Fujii Masao
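The direction being suggested might look roughly like the following standalone sketch, which mimics the shape of allocate_recordbuf() using plain malloc() so the failure path can actually be exercised. All names and the 8 kB rounding constant here are illustrative; the real code rounds to XLOG_BLCKSZ and lives in xlogreader.c, where the backend would need an allocation primitive that can report failure instead of erroring out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative stand-in for the relevant XLogReaderState fields. */
typedef struct
{
    char     *readRecordBuf;
    uint32_t  readRecordBufSize;
} ToyReaderState;

/*
 * Sketch of the behavior the comment promises: grow the record buffer,
 * returning false (rather than throwing an error) when the allocation
 * fails, so callers' checks of the return value are meaningful again.
 */
static bool
toy_allocate_recordbuf(ToyReaderState *state, uint32_t reclength)
{
    uint32_t newSize = reclength;

    /* round up to a multiple of 8 kB, as the real code rounds to XLOG_BLCKSZ */
    newSize += 8192 - (newSize % 8192);

    char *newBuf = malloc(newSize);

    if (newBuf == NULL)
        return false;           /* out of memory: report it to the caller */

    free(state->readRecordBuf);
    state->readRecordBuf = newBuf;
    state->readRecordBufSize = newSize;
    return true;
}
```

With this shape, the two existing call-site checks stop being dead code: a false return can propagate up as a normal read failure instead of an out-of-memory error.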
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Fri, Dec 19, 2014 at 12:19 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Dec 18, 2014 at 5:27 PM, Fujii Masao masao.fu...@gmail.com wrote: Thanks! Thanks for your input.

+else
+memcpy(compression_scratch, page, page_len);

I don't think the block image needs to be copied to the scratch buffer here. We can try to compress the page directly. Check.

+#include "utils/pg_lzcompress.h"
 #include "utils/memutils.h"

pg_lzcompress.h should be after memutils.h. Oops.

+/* Scratch buffer used to store block image to-be-compressed */
+static char compression_scratch[PGLZ_MAX_BLCKSZ];

Isn't it better to allocate the memory for compression_scratch in InitXLogInsert() like hdr_scratch? Because the OS would not touch it if wal_compression is never used, but now that you mention it, it may be better to get that in the context of xlog_insert..

+uncompressed_page = (char *) palloc(PGLZ_RAW_SIZE(header));

Why don't we allocate the buffer for the uncompressed page only once and keep reusing it like XLogReaderState->readBuf? The size of the uncompressed page is at most BLCKSZ, so we can allocate the memory for it even before knowing the real size of each block image. OK, this would save some cycles. I was trying to make the process allocate a minimum of memory, and only when necessary.

-printf(" (FPW); hole: offset: %u, length: %u\n",
- record->blocks[block_id].hole_offset,
- record->blocks[block_id].hole_length);
+if (record->blocks[block_id].is_compressed)
+printf(" (FPW); hole offset: %u, compressed length %u\n",
+ record->blocks[block_id].hole_offset,
+ record->blocks[block_id].bkp_len);
+else
+printf(" (FPW); hole offset: %u, length: %u\n",
+ record->blocks[block_id].hole_offset,
+ record->blocks[block_id].bkp_len);

We need to consider what info about FPWs we want pg_xlogdump to report. I'd like to calculate how many bytes an FPW was compressed by, from the report of pg_xlogdump. So I'd like to see both the length of the uncompressed FPW and that of the compressed one in the report.
OK, so let's add a parameter in the decoder for the uncompressed length. Sounds fine? In pg_config.h, the comment of BLCKSZ needs to be updated? Because the maximum size of BLCKSZ can be affected by not only itemid but also XLogRecordBlockImageHeader. Check.

 bool has_image;
+bool is_compressed;

Doesn't ResetDecoder need to reset is_compressed? Check.

+#wal_compression = off # enable compression of full-page writes

Currently wal_compression compresses only FPWs, so isn't it better to place it after full_page_writes in postgresql.conf.sample? Check.

+uint16 extra_data; /* used to store offset of bytes in hole, with
+ * last free bit used to check if block is
+ * compressed */

At least to me, defining something like the following seems easier to read:

uint16 hole_offset:15, is_compressed:1

Check++. Updated patches addressing all those things are attached. Thanks for updating the patch! First, I'm thinking to commit the 0001-Move-pg_lzcompress.c-to-src-common.patch. pg_lzcompress.h still exists in include/utils, but shouldn't it be moved to include/common? Do we really need PGLZ_Status? I'm not sure whether your categorization of the result status of the compress/decompress functions is right or not. For example, pglz_decompress() can return the PGLZ_INCOMPRESSIBLE status, which seems logically invalid... Maybe this needs to be revisited when we introduce other compression algorithms and create the wrapper function for those compression and decompression functions. Anyway, making pglz_decompress return a boolean value seems enough. I updated 0001-Move-pg_lzcompress.c-to-src-common.patch accordingly. Barring objections, I will push the attached patch first.
Regards, -- Fujii Masao

From 92a7c2ac8a6a8ec6ae84c6713f8ad30b7958de94 Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Tue, 25 Nov 2014 14:05:59 +0900
Subject: [PATCH] Move pg_lzcompress.c to src/common

Exposing compression and decompression APIs of pglz makes possible its use by extensions and contrib modules. pglz_decompress contained a call to elog to emit an error message in case of corrupted data. This function is changed to return a status code to let its callers return an error instead. Compression function is changed similarly to make the whole set consistent.
---
 src/backend/access/heap/tuptoaster.c | 11 +-
 src/backend/utils/adt/Makefile | 4 +-
 src/backend/utils/adt/pg_lzcompress.c | 779
 src/common/Makefile | 3 +-
 src/common
Re: [HACKERS] Moving RestoreBlockImage from xlogreader.c to xlogutils.c
On Wed, Dec 24, 2014 at 9:42 PM, Michael Paquier michael.paqu...@gmail.com wrote: Hi all, Commit 2c03216d has introduced RestoreBlockImage to restore a page from a given decoding state. ISTM that this is a backend-only operation, but it has been added in xlogreader.c, which can be used as well by frontend utilities like pg_xlogdump. Wouldn't it be better to declare it as a static routine in xlogutils.c? If we keep it in xlogreader.c, I think that we should at least wrap it with #ifndef FRONTEND. Thoughts? If we do this, pg_lzcompress.c doesn't need to be moved to src/common for the FPW compression patch which we're talking about in the other thread. Right? DecodeXLogRecord() seems also backend-only, so should we treat it in the same way as you proposed? Or does pg_rewind use it? Regards, -- Fujii Masao
Re: [HACKERS] Moving RestoreBlockImage from xlogreader.c to xlogutils.c
On Wed, Dec 24, 2014 at 10:41 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Wed, Dec 24, 2014 at 10:16 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Dec 24, 2014 at 9:42 PM, Michael Paquier michael.paqu...@gmail.com wrote: Wouldn't it be better to declare it as a static routine in xlogutils.c? If we keep it in xlogreader.c, I think that we should at least wrap it with #ifndef FRONTEND. If we do this, pg_lzcompress.c doesn't need to be moved to src/common for the FPW compression patch which we're talking about in the other thread. Right? Yes. This refactoring came to my mind while re-thinking about the WAL compression. This would also make the implementation of hooks for compression and decompression more straightforward. Fair enough. Anyway, I'll wait to apply the patch that moves pg_lzcompress.c until we have reached a consensus on this. DecodeXLogRecord() seems also backend-only, so should we treat it in the same way as you proposed? Or does pg_rewind use it? DecodeXLogRecord is used by XLogReadRecord, the latter being called by pg_xlogdump and also pg_rewind, so it is not backend-only. Yeah, you're right. Regards, -- Fujii Masao
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Thu, Dec 18, 2014 at 2:21 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Wed, Dec 17, 2014 at 11:33 PM, Rahila Syed rahilasye...@gmail.com wrote: I had a look at the code. I have a few minor points. Thanks!

+ bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
+
+ if (is_compressed)
 {
- rdt_datas_last->data = page;
- rdt_datas_last->len = BLCKSZ;
+ /* compressed block information */
+ bimg.length = compress_len;
+ bimg.extra_data = hole_offset;
+ bimg.extra_data |= XLR_BLCK_COMPRESSED_MASK;

For consistency with the existing code, how about renaming the macro XLR_BLCK_COMPRESSED_MASK as BKPBLOCK_HAS_COMPRESSED_IMAGE, on the lines of BKPBLOCK_HAS_IMAGE? OK, why not...

+ blk->hole_offset = extra_data & ~XLR_BLCK_COMPRESSED_MASK;

Here, I think that having the mask as BKPBLOCK_HOLE_OFFSET_MASK will be more indicative of the fact that the lower 15 bits of the extra_data field comprise the hole_offset value. This suggestion is also just to achieve consistency with the existing BKPBLOCK_FORK_MASK for the fork_flags field. Yeah, that seems clearer; let's define it as ~XLR_BLCK_COMPRESSED_MASK though. And a comment typo:

+ * First try to compress block, filling in the page hole with zeros
+ * to improve the compression of the whole. If the block is considered
+ * as incompressible, complete the block header information as if
+ * nothing happened.

As the hole is no longer being compressed, this needs to be changed. Fixed. As well as an additional comment block further down. A couple of things noticed on the fly: - Fixed pg_xlogdump being not completely correct when reporting the FPW information - A couple of typos and malformed sentences fixed - Added an assertion to check that the hole offset value does not use the bit reserved for the compression status - Reworked docs, mentioning as well that wal_compression is off by default - Removed stuff in pg_controldata and XLOG_PARAMETER_CHANGE (mentioned by Fujii-san) Thanks!
+else
+memcpy(compression_scratch, page, page_len);

I don't think the block image needs to be copied to the scratch buffer here. We can try to compress the page directly.

+#include "utils/pg_lzcompress.h"
 #include "utils/memutils.h"

pg_lzcompress.h should be after memutils.h.

+/* Scratch buffer used to store block image to-be-compressed */
+static char compression_scratch[PGLZ_MAX_BLCKSZ];

Isn't it better to allocate the memory for compression_scratch in InitXLogInsert() like hdr_scratch?

+uncompressed_page = (char *) palloc(PGLZ_RAW_SIZE(header));

Why don't we allocate the buffer for the uncompressed page only once and keep reusing it like XLogReaderState->readBuf? The size of the uncompressed page is at most BLCKSZ, so we can allocate the memory for it even before knowing the real size of each block image.

-printf(" (FPW); hole: offset: %u, length: %u\n",
- record->blocks[block_id].hole_offset,
- record->blocks[block_id].hole_length);
+if (record->blocks[block_id].is_compressed)
+printf(" (FPW); hole offset: %u, compressed length %u\n",
+ record->blocks[block_id].hole_offset,
+ record->blocks[block_id].bkp_len);
+else
+printf(" (FPW); hole offset: %u, length: %u\n",
+ record->blocks[block_id].hole_offset,
+ record->blocks[block_id].bkp_len);

We need to consider what info about FPWs we want pg_xlogdump to report. I'd like to calculate how many bytes an FPW was compressed by, from the report of pg_xlogdump. So I'd like to see both the length of the uncompressed FPW and that of the compressed one in the report. In pg_config.h, the comment of BLCKSZ needs to be updated? Because the maximum size of BLCKSZ can be affected by not only itemid but also XLogRecordBlockImageHeader.

 bool has_image;
+bool is_compressed;

Doesn't ResetDecoder need to reset is_compressed?

+#wal_compression = off # enable compression of full-page writes

Currently wal_compression compresses only FPWs, so isn't it better to place it after full_page_writes in postgresql.conf.sample?
+uint16 extra_data; /* used to store offset of bytes in hole, with
+ * last free bit used to check if block is
+ * compressed */

At least to me, defining something like the following seems easier to read:

uint16 hole_offset:15, is_compressed:1

Regards, -- Fujii Masao
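To make the two spellings under discussion concrete, here is a standalone sketch of packing a 15-bit hole offset plus a 1-bit compression flag into a uint16, both with an explicit mask and with the suggested bitfield. The mask value and all names here are invented for illustration; the actual patch defines its own macros:

```c
#include <assert.h>
#include <stdint.h>

/* Highest bit of the uint16 marks the block as compressed (illustrative). */
#define TOY_COMPRESSED_MASK  0x8000

/* Mask-based spelling: pack the flag into the top bit by hand. */
static uint16_t
pack_extra_data(uint16_t hole_offset, int is_compressed)
{
    uint16_t v = hole_offset & ~TOY_COMPRESSED_MASK;   /* keep low 15 bits */

    if (is_compressed)
        v |= TOY_COMPRESSED_MASK;
    return v;
}

static uint16_t
unpack_hole_offset(uint16_t extra_data)
{
    return extra_data & ~TOY_COMPRESSED_MASK;
}

/* Bitfield spelling suggested in the review: same 16 bits of storage,
 * but the compiler does the masking and shifting for us. */
typedef struct
{
    uint16_t hole_offset:15,
             is_compressed:1;
} ToyBlockImageHeader;
```

Both forms occupy the same 16 bits; the bitfield version trades explicit mask macros for readability, at the cost of an implementation-defined bit layout (which matters for an on-disk/WAL format, so the choice is not purely cosmetic).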
Re: [HACKERS] Minor improvement to explain.c
On Thu, Dec 18, 2014 at 12:52 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: Hi, The attached patch just removes one bad-looking blank line in the comments at the top of a function in explain.c. Applied. Regards, -- Fujii Masao
Re: [HACKERS] WALWriter active during recovery
On Tue, Dec 16, 2014 at 3:51 AM, Simon Riggs si...@2ndquadrant.com wrote: Currently, WALReceiver writes and fsyncs data it receives. Clearly, while we are waiting for an fsync we aren't doing any other useful work. The following patch starts WALWriter during recovery and makes it responsible for fsyncing data, allowing WALReceiver to progress other useful actions. +1 At present this is a WIP patch, for code comments only. Don't bother with anything other than code questions at this stage. Implementation questions are * How should we wake WALReceiver, since it waits on a poll(). Should we use SIGUSR1, which is already used for latch waits, or another signal? Probably we need to change libpqwalreceiver so that it uses the latch. This is useful even for the startup process to report the replay location to the walreceiver in real time. * Should we introduce some pacing delays if the WALreceiver gets too far ahead of apply? I don't think so, for now. Instead, we can support synchronous_commit = replay, and users can use that new mode if they are worried about the delay of WAL replay. * Other questions you may have? Who should wake the startup process so that it reads and replays the WAL data? Currently the walreceiver does. But if the walwriter is responsible for fsyncing WAL data, probably the walwriter should do that, because the startup process should not replay WAL data which has not been fsync'd yet. Regards, -- Fujii Masao
Re: [HACKERS] Streaming replication and WAL archive interactions
On Wed, Dec 17, 2014 at 4:11 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/16/2014 10:24 AM, Borodin Vladimir wrote: On Dec 12, 2014, at 16:46, Heikki Linnakangas hlinnakan...@vmware.com wrote: There have been a few threads on the behavior of WAL archiving, after a standby server is promoted [1] [2]. In short, it doesn't work as you might expect. The standby will start archiving after it's promoted, but it will not archive files that were replicated from the old master via streaming replication. If those files were not already archived in the master before the promotion, they are not archived at all. That's not good if you wanted to restore from a base backup + the WAL archive later. The basic setup is a master server, a standby, a WAL archive that's shared by both, and streaming replication between the master and standby. This should be a very common setup in the field, so how are people doing it in practice? Just live with the risk that you might miss some files in the archive if you promote? Don't even realize there's a problem? Something else? Yes, I do live like that (with streaming replication and a shared archive between master and replicas) and don't even realize there's a problem :( And I think I'm not the only one. Maybe at least a note should be added to the documentation? Let's try to figure out a way to fix this in master, but yeah, a note in the documentation is in order. +1 And how would we like it to work? Here's a plan: Have a mechanism in the standby, to track how far the master has archived its WAL, and don't throw away WAL in the standby that hasn't been archived in the master yet. This is similar to the physical replication slots, which prevent the master from recycling WAL that a standby hasn't received yet, but in reverse. I think we can use the .done and .ready files for this. Whenever a file is streamed (completely) from the master, create a .ready file for it.
When we get an acknowledgement from the master that it has archived it, create a .done file for it. To get the information from the master, add the last archived WAL segment e.g. in the streaming replication keep-alive message, or invent a new message type for it. Sounds OK to me. How does this work in the cascading replication case? Does the cascading walsender just relay the archive location to the downstream standby? What happens when WAL streaming is terminated and the startup process starts to read WAL files from the archive? After reading a WAL file from the archive, we would probably need to change the .ready files of every older WAL file to .done. Regards, -- Fujii Masao
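As a toy model of the bookkeeping proposed above (invented names; the real mechanism would of course be the .ready/.done marker files in pg_xlog/archive_status/, not an in-memory struct), the standby-side state machine could look like:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the proposed rule: a fully-streamed segment gets a .ready
 * marker on the standby and must not be recycled; once the master
 * acknowledges having archived it, .ready is replaced by .done and the
 * segment becomes recyclable (wal_keep_segments permitting).
 */
typedef struct
{
    bool ready;                 /* .ready marker exists */
    bool done;                  /* .done marker exists */
} ToyArchiveStatus;

/* Called when a segment has been completely streamed from the master. */
static void
toy_segment_streamed(ToyArchiveStatus *st)
{
    st->ready = true;
}

/* Called when the master reports it has archived the segment. */
static void
toy_master_archived(ToyArchiveStatus *st)
{
    st->ready = false;
    st->done = true;
}

/* Mirrors the recycling decision: only .done segments may be removed. */
static bool
toy_can_recycle(const ToyArchiveStatus *st)
{
    return st->done;
}
```

The point of the two-marker scheme is exactly the "replication slot in reverse" described above: the standby holds WAL until the master's archive is known to contain it, and promotion then archives only the segments still stuck in the .ready state.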
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Thu, Dec 18, 2014 at 7:31 PM, Rahila Syed rahilasye...@gmail.com wrote: Isn't it better to allocate the memory for compression_scratch in InitXLogInsert() like hdr_scratch? I think making compression_scratch a statically allocated global variable is the result of the following earlier discussion: http://www.postgresql.org/message-id/ca+tgmoaznbuwnls4bpwyqgqteeznoavy7kwdbm0a2-tbarn...@mail.gmail.com

/*
 * Permanently allocate readBuf. We do it this way, rather than just
 * making a static array, for two reasons: (1) no need to waste the
 * storage in most instantiations of the backend; (2) a static char array
 * isn't guaranteed to have any particular alignment, whereas palloc()
 * will provide MAXALIGN'd storage.
 */

The above source code comment in XLogReaderAllocate() makes me think that it's better to avoid using a static array. The point (1) seems less important in this case, though, because most processes need the buffer for WAL compression. Regards, -- Fujii Masao
Re: [HACKERS] .gitignore config.cache and comment about regression.(out|diff)
On Tue, Dec 16, 2014 at 9:47 AM, Jim Nasby jim.na...@bluetreble.com wrote: config.cache is created when you pass -C to configure, which speeds it up considerably (3.5s vs 16.5s on my laptop). It would be nice to just ignore the cache file it generates. Originally this patch also ignored the regression output files, until I realized why that was a bad idea. Add a comment about that. Applied. Regards, -- Fujii Masao
Re: [HACKERS] pgbench -f and vacuum
On Sun, Dec 14, 2014 at 11:43 AM, Tatsuo Ishii is...@postgresql.org wrote: If we care enough about that case to attempt the vacuum anyway then we need to do something about the error message; either squelch it or check for the existence of the tables before attempting to vacuum. Since there's no way to squelch it in the server logfile, I think checking for the table is the right answer. Fair enough. I will come up with the check-for-table-before-vacuum approach. +1 for this approach. Regards, -- Fujii Masao
Re: [HACKERS] Add generate_series(numeric, numeric)
On Mon, Dec 15, 2014 at 12:25 PM, Andrew Gierth and...@tao11.riddles.org.uk wrote: Fujii == Fujii Masao masao.fu...@gmail.com writes: Fujii Pushed. Bug found:

regression=# select count(*) from generate_series(1::numeric,10) v, generate_series(1,v) w;
 count
-------
     0
(1 row)

regression=# select count(*) from generate_series(1::numeric,10) v, generate_series(1,v+0) w;
 count
-------
    55
(1 row)

The error is in the use of PG_GETARG_NUMERIC and init_var_from_num when setting up the multi-call state; init_var_from_num points at the original num's digits rather than copying them, but start_num and stop_num have just been (potentially) detoasted in the per-call context, in which case the storage will have been freed by the next call. Obviously this could also be fixed by not detoasting the input until after switching to the multi-call context, but it looks to me like that would be unnecessarily complex. Suggested patch attached. Pushed. Thanks! Regards, -- Fujii Masao
Re: [HACKERS] Add generate_series(numeric, numeric)
On Mon, Dec 15, 2014 at 2:38 PM, Andrew Gierth and...@tao11.riddles.org.uk wrote: Ali == Ali Akbar the.ap...@gmail.com writes: Ali I think yes, it will be good. The alternative is restructuring Ali this paragraph in the SRF docs: The memory context that is current when the SRF is called is a transient context that will be cleared between calls. This means that you do not need to call pfree on everything you allocated using palloc; it will go away anyway. However, if you want to allocate any data structures to live across calls, you need to put them somewhere else. The memory context referenced by multi_call_memory_ctx is a suitable location for any data that needs to survive until the SRF is finished running. In most cases, this means that you should switch into multi_call_memory_ctx while doing the first-call setup. Ali The important part "However, if you want to allocate any data Ali structures to live across calls, you need to put them somewhere Ali else." is buried in the docs. Ali But i think explicit warning is more noticeable for new Ali developer like me. I was thinking something like this, added just after that para:

<warning>
<para>
While the actual arguments to the function remain unchanged between calls, if you detoast the argument values (which is normally done transparently by the <function>PG_GETARG_<replaceable>xxx</replaceable></function> macro) in the transient context then the detoasted copies will be freed on each cycle. Accordingly, if you keep references to such values in your <structfield>user_fctx</structfield>, you must either copy them into the <structfield>multi_call_memory_ctx</structfield> after detoasting, or ensure that you detoast the values only in that context.
</para>
</warning>

I'm OK with this. Regards, -- Fujii Masao
Re: [HACKERS] Function to know last log write timestamp
On Fri, Nov 28, 2014 at 9:07 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Nov 26, 2014 at 4:05 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Aug 15, 2014 at 8:17 PM, Fujii Masao masao.fu...@gmail.com wrote: On Fri, Aug 15, 2014 at 3:40 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-08-14 14:37:22 -0400, Robert Haas wrote: On Thu, Aug 14, 2014 at 2:21 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-08-14 14:19:13 -0400, Robert Haas wrote: That's about the idea. However, what you've got there is actually unsafe, because shmem->counter++ is not an atomic operation. It reads the counter (possibly even as two separate 4-byte loads if the counter is an 8-byte value), increments it inside the CPU, and then writes the resulting value back to memory. If two backends do this concurrently, one of the updates might be lost. All these are only written by one backend, so it should be safe. Note that that coding pattern, just without memory barriers, is all over pgstat.c. Ah, OK. If there's a separate slot for each backend, I agree that it's safe. We should probably add barriers to pgstat.c, too. Yea, definitely. I think this is rather borked on weaker architectures. It's just that the consequences of an out-of-date/torn value are rather low, so it's unlikely to be noticed. IMO we should encapsulate the changecount modifications/checks somehow instead of repeating the barriers, Asserts, comments et al everywhere. So what about applying the attached patch first, which adds the macros to load and store the changecount with the memory barriers, and changes pgstat.c to use them? Maybe this patch needs to be back-patched to at least 9.4? After applying the patch, I will rebase the pg_last_xact_insert_timestamp patch and post it again. Hm, what's the status of this patch? The addition of those macros to control count increment with a memory barrier seems like a good thing at least. Thanks for reminding me of that! Barring any objection, I will commit it.
Applied. Regards, -- Fujii Masao
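For reference, the changecount pattern under discussion (bump an odd/even counter around the writes, with write barriers on the writer side and read barriers on the reader side) can be sketched in portable C11 atomics roughly as follows. This is a toy model with invented names, not the actual pgstat.c macros; the fences stand in for where the backend would use pg_write_barrier()/pg_read_barrier():

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* One single-writer slot: the counter guards the payload. */
typedef struct
{
    atomic_uint      changecount;
    _Atomic uint64_t value;     /* the protected payload */
} ToySlot;

static void
writer_update(ToySlot *s, uint64_t v)
{
    /* bump to odd: readers now know an update is in progress */
    atomic_fetch_add_explicit(&s->changecount, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);  /* like pg_write_barrier() */

    atomic_store_explicit(&s->value, v, memory_order_relaxed);

    atomic_thread_fence(memory_order_release);  /* like pg_write_barrier() */
    /* bump back to even: the slot is consistent again */
    atomic_fetch_add_explicit(&s->changecount, 1, memory_order_relaxed);
}

static uint64_t
reader_read(ToySlot *s)
{
    for (;;)
    {
        unsigned before = atomic_load_explicit(&s->changecount,
                                               memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);  /* like pg_read_barrier() */

        uint64_t v = atomic_load_explicit(&s->value, memory_order_relaxed);

        atomic_thread_fence(memory_order_acquire);  /* like pg_read_barrier() */
        unsigned after = atomic_load_explicit(&s->changecount,
                                              memory_order_relaxed);

        /* even and unchanged count: the snapshot is consistent */
        if (before == after && (before & 1) == 0)
            return v;
        /* otherwise a writer was active: retry */
    }
}
```

This makes concrete why the single-writer claim upthread matters: with one writer per slot the counter increments never race each other, and the barriers alone are enough to keep readers from seeing a torn payload.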
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Wed, Dec 17, 2014 at 1:34 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Wed, Dec 17, 2014 at 12:00 AM, Michael Paquier michael.paqu...@gmail.com wrote: Actually, the original length of the compressed block is saved in PGLZ_Header, so if we are fine to not check the size of the decompressed block when decoding WAL, we can do without the hole filled with zeros, and use only 1 bit to see if the block is compressed or not. And.. After some more hacking, I have been able to come up with a patch that is able to compress blocks without the page hole, and that keeps the WAL record length the same as HEAD when the compression switch is off. The numbers are pretty good: CPU is saved in the same proportions as previous patches when compression is enabled, and there is zero delta with HEAD when the compression switch is off. Here are the actual numbers:

 test_name                     | pg_size_pretty | user_diff | system_diff
-------------------------------+----------------+-----------+-------------
 FPW on + 2 bytes, ffactor 50  | 582 MB         | 42.391894 |    0.807444
 FPW on + 2 bytes, ffactor 20  | 229 MB         | 14.330304 |    0.729626
 FPW on + 2 bytes, ffactor 10  | 117 MB         |  7.335442 |    0.570996
 FPW off + 2 bytes, ffactor 50 | 746 MB         | 25.330391 |    1.248503
 FPW off + 2 bytes, ffactor 20 | 293 MB         | 10.537475 |    0.755448
 FPW off + 2 bytes, ffactor 10 | 148 MB         |  5.762775 |    0.763761
 FPW on + 0 bytes, ffactor 50  | 582 MB         | 42.174297 |    0.790596
 FPW on + 0 bytes, ffactor 20  | 229 MB         | 14.424233 |    0.770459
 FPW on + 0 bytes, ffactor 10  | 117 MB         |  7.057195 |    0.584806
 FPW off + 0 bytes, ffactor 50 | 746 MB         | 25.261998 |    1.054516
 FPW off + 0 bytes, ffactor 20 | 293 MB         | 10.589888 |    0.860207
 FPW off + 0 bytes, ffactor 10 | 148 MB         |  5.827191 |    0.874285
 HEAD, ffactor 50              | 746 MB         | 25.181729 |    1.133433
 HEAD, ffactor 20              | 293 MB         |  9.962242 |    0.765970
 HEAD, ffactor 10              | 148 MB         |  5.693426 |    0.775371
 Record, ffactor 50            | 582 MB         | 54.904374 |    0.678204
 Record, ffactor 20            | 229 MB         | 19.798268 |    0.807220
 Record, ffactor 10            | 116 MB         |  9.401877 |    0.668454
(18 rows)

The new tests of this patch are the "FPW off + 0 bytes" ones.
Patches as well as results are attached. I think that neither pg_control nor xl_parameter_change need to have the info about WAL compression because each backup block has that entry. Will review the remaining part later. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
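The hole-skipping idea Michael describes can be sketched independently of the server code. The helpers below are illustrative only: the names and the fixed 8 kB block size are assumptions, and the real patch hands the packed image to pglz compression rather than stopping at packing. The key property is that the hole (which is all zeros on a valid page) is dropped before compression and refilled with zeros on decode, so no extra length check is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size */

/* Pack a page image, skipping the "hole" (the free space between
 * pd_lower and pd_upper).  The compressor then never sees the hole.
 * Returns the packed length. */
static size_t
pack_page_omitting_hole(const uint8_t *page, size_t hole_offset,
                        size_t hole_length, uint8_t *dest)
{
    memcpy(dest, page, hole_offset);
    memcpy(dest + hole_offset,
           page + hole_offset + hole_length,
           BLCKSZ - (hole_offset + hole_length));
    return BLCKSZ - hole_length;
}

/* Reverse operation at WAL decode time: rebuild the full page,
 * refilling the hole with zeros. */
static void
unpack_page_refilling_hole(const uint8_t *packed, size_t hole_offset,
                           size_t hole_length, uint8_t *page)
{
    memcpy(page, packed, hole_offset);
    memset(page + hole_offset, 0, hole_length);
    memcpy(page + hole_offset + hole_length,
           packed + hole_offset,
           BLCKSZ - (hole_offset + hole_length));
}
```

Because the hole is zero-filled by construction, packing followed by unpacking reproduces the original page byte for byte.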
Re: [HACKERS] pg_basebackup -x/X doesn't play well with archive_mode wal_keep_segments
On Fri, Dec 5, 2014 at 9:28 AM, Andres Freund and...@2ndquadrant.com wrote:

Hi,

We've recently observed a case where, after a promotion, a postgres server suddenly started to archive a large amount of old WAL. After some digging, the problem is this: pg_basebackup -X creates files in pg_xlog/ without creating the corresponding .done file. Note that walreceiver *does* create them. The standby in this case, just like the master, had a significant wal_keep_segments.

RemoveOldXlogFiles() then, during recovery restart points, calls XLogArchiveCheckDone(), which in turn does:

    /* Retry creation of the .ready file */
    XLogArchiveNotify(xlog);
    return false;

if there's neither a .done nor a .ready file present and archive_mode is enabled. These segments then aren't removed because there's a .ready present, and they're never archived as long as the node is a standby because we don't do archiving on standbys. Once the node is promoted, the archiver will be started and will suddenly archive all these files, which might be months old.

An additional, at first strange, detail is that a lot of the .ready files had nearly the same timestamps. Turns out that's due to wal_keep_segments: initially, RemoveOldXlogFiles() doesn't process the files because they're newer than allowed by wal_keep_segments. Then, at every checkpoint, a couple of segments become old enough to reach XLogArchiveCheckDone(), which then creates the .ready marker... but not all at once :)

So I think we just need to make pg_basebackup create to .ready files.

s/.ready/.done? If yes, +1.

Given that walreceiver and restore_command already unconditionally do XLogArchiveForceDone(), I think we'd follow the established precedent. Arguably it could make sense to archive files again on the standby after a promotion, as they aren't guaranteed to have been archived on the then-primary. But we don't have any infrastructure for that anyway, and walsender doesn't do so, so it doesn't seem to make any sense to do that for pg_basebackup.

Independent from this bug, there's also some debatable behaviour about what happens if a node with a high wal_keep_segments turns on archive_mode: suddenly all those old files are archived... I think it might be a good idea to simply always create .done files when a WAL segment is finished while archive_mode is disabled.

+1

Regards,

--
Fujii Masao
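For readers unfamiliar with the marker files being discussed, the on-disk protocol can be illustrated outside a real cluster. The paths below are made up for the demo; a real server keeps these markers under $PGDATA/pg_xlog/archive_status, and the touch here stands in for what XLogArchiveForceDone() does internally:

```shell
set -e

# Illustrative stand-in for a data directory's WAL area.
XLOGDIR=$(mktemp -d)
mkdir -p "$XLOGDIR/archive_status"

SEG=000000010000000000000001
: > "$XLOGDIR/$SEG"                      # a finished WAL segment

# The proposed fix: mark the segment as already archived, so the
# archiver skips it even after promotion (what walreceiver and
# restore_command already do via XLogArchiveForceDone()).
: > "$XLOGDIR/archive_status/$SEG.done"

# The buggy pg_basebackup left segments with neither marker; recovery
# restartpoints then created .ready markers, queueing months-old WAL
# for archiving at promotion time.
if [ -f "$XLOGDIR/archive_status/$SEG.done" ]; then
    echo "segment $SEG marked done; will not be re-archived"
fi
```

A segment with a .ready marker instead of .done is still waiting to be archived, which is exactly the state the old segments were left in here.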
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Thu, Dec 4, 2014 at 8:37 PM, Michael Paquier michael.paqu...@gmail.com wrote:

On Thu, Dec 4, 2014 at 7:36 PM, Rahila Syed rahilasyed...@gmail.com wrote: IIUC, forcibly-written FPWs are not exposed to the user, so is it worthwhile to add a GUC similar to full_page_writes in order to control a feature which is unexposed to the user in the first place? If full_page_writes is set 'off' by the user, the user probably cannot afford the overhead involved in writing large pages to disk. So, if a full page write is forcibly written in such a situation, it is better to compress it before writing, to alleviate the drawbacks of full page writes on servers with a heavy write load. The only scenario in which a user would not want to compress forcibly-written pages is when CPU utilization is high. But according to measurements done earlier, the CPU utilization with compress='on' and 'off' is not significantly different.

Yes, they are not visible to the user, but they still exist. I'd prefer that we have a safety net, though, to prevent any problems that may occur if the compression algorithm has a bug: if we enforce compression for forcibly-written blocks, all the backups of our users would be impacted.

I pondered something that Andres mentioned upthread: we may do the compression in WAL not only for blocks, but also at the record level. Hence, joining the two ideas together, I think that we should definitely have a different GUC to control the feature, consistently for all the images. Let's call it wal_compression, with the following possible values:

- on, meaning that a maximum of compression is done; for this feature, basically full_page_writes = on.
- full_page_writes, meaning that full page writes are compressed.
- off, the default value, disabling the feature completely.

This would leave room for another mode, 'record', to compress a whole record. For now, though, I think that a simple on/off switch would be fine for this patch. Let's keep things simple.

+1

Regards,

--
Fujii Masao
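For concreteness, the proposal above would translate into postgresql.conf settings along these lines. This is a hypothetical sketch of the modes as discussed in the thread, not the syntax that was ultimately committed:

```
# Hypothetical sketch of the proposed wal_compression modes; the thread
# settled on a plain on/off switch for the initial patch.
wal_compression = full_page_writes   # off | full_page_writes | on
full_page_writes = on
```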
Re: [HACKERS] tracking commit timestamps
On Wed, Dec 3, 2014 at 11:54 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Pushed with some extra cosmetic tweaks.

I got the following assertion failure when I executed pg_xact_commit_timestamp() in the standby server.

    =# select pg_xact_commit_timestamp('1000'::xid);
    TRAP: FailedAssertion(!(((oldestCommitTs) != ((TransactionId) 0)) == ((newestCommitTs) != ((TransactionId) 0))), File: commit_ts.c, Line: 315)
    server closed the connection unexpectedly
            This probably means the server terminated abnormally
            before or while processing the request.
    The connection to the server was lost. Attempting reset:
    2014-12-04 12:01:08 JST sby1 LOG:  server process (PID 15545) was terminated by signal 6: Aborted
    2014-12-04 12:01:08 JST sby1 DETAIL:  Failed process was running: select pg_xact_commit_timestamp('1000'::xid);

The way to reproduce this problem is:

#1. set up and start the master and standby servers with track_commit_timestamp disabled
#2. enable track_commit_timestamp in the master and restart the master
#3. run some write transactions
#4. enable track_commit_timestamp in the standby and restart the standby
#5. execute select pg_xact_commit_timestamp('1000'::xid) in the standby

BTW, at step #4, I got the following log messages. This might be a hint for this problem.

    LOG:  file pg_commit_ts/ doesn't exist, reading as zeroes
    CONTEXT:  xlog redo Transaction/COMMIT: 2014-12-04 12:00:16.428702+09; inval msgs: catcache 59 catcache 58 catcache 59 catcache 58 catcache 45 catcache 44 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608 relcache 16384

Regards,

--
Fujii Masao
Re: [HACKERS] tracking commit timestamps
On Thu, Dec 4, 2014 at 12:58 PM, Simon Riggs si...@2ndquadrant.com wrote:

On 4 December 2014 at 03:08, Fujii Masao masao.fu...@gmail.com wrote:

#1. set up and start the master and standby servers with track_commit_timestamp disabled
#2. enable track_commit_timestamp in the master and restart the master
#3. run some write transactions
#4. enable track_commit_timestamp in the standby and restart the standby
#5. execute select pg_xact_commit_timestamp('1000'::xid) in the standby

I'm not sure what step #4 is supposed to do. Surely, if steps 1-3 generate any WAL, then the standby should replay it whether or not track_commit_timestamp is enabled. So what effect does setting that parameter have on the standby?

At least, track_commit_timestamp seems to need to be enabled even in the standby when we want to call pg_xact_commit_timestamp() and pg_last_committed_xact() in the standby. I'm not sure if this is a good design, though.

Regards,

--
Fujii Masao
Re: [HACKERS] The problems of PQhost()
On Fri, Nov 28, 2014 at 3:43 AM, Noah Misch n...@leadboat.com wrote:

On Fri, Nov 28, 2014 at 03:11:06AM +0900, Fujii Masao wrote:

On Thu, Nov 27, 2014 at 12:38 PM, Noah Misch n...@leadboat.com wrote: Sure. I'll first issue "git revert 9f80f48", then apply the attached patch. Since libpq ignores a hostaddr parameter equal to the empty string, this implementation does likewise. Apart from that, I anticipate behavior identical to today's code.

    + fprintf(stderr, _("out of memory\n"));

psql_error() should be used instead of fprintf()?

I copied what pg_malloc() would do. Either way seems reasonable.

psql_error() seems better for the case where psql executes the specified input file. In this case, psql_error reports the message in a format like "psql:filename:lineno: message".

Regards,

--
Fujii Masao
Re: [HACKERS] Function to know last log write timestamp
On Wed, Nov 26, 2014 at 4:05 PM, Michael Paquier michael.paqu...@gmail.com wrote:

On Fri, Aug 15, 2014 at 8:17 PM, Fujii Masao masao.fu...@gmail.com wrote:

On Fri, Aug 15, 2014 at 3:40 AM, Andres Freund and...@2ndquadrant.com wrote:

On 2014-08-14 14:37:22 -0400, Robert Haas wrote:

On Thu, Aug 14, 2014 at 2:21 PM, Andres Freund and...@2ndquadrant.com wrote:

On 2014-08-14 14:19:13 -0400, Robert Haas wrote: That's about the idea. However, what you've got there is actually unsafe, because shmem->counter++ is not an atomic operation. It reads the counter (possibly even as two separate 4-byte loads if the counter is an 8-byte value), increments it inside the CPU, and then writes the resulting value back to memory. If two backends do this concurrently, one of the updates might be lost.

All these are only written by one backend, so it should be safe. Note that this coding pattern, just without memory barriers, is all over pgstat.c.

Ah, OK. If there's a separate slot for each backend, I agree that it's safe. We should probably add barriers to pgstat.c, too.

Yea, definitely. I think this is rather borked on weaker architectures. It's just that the consequences of an out-of-date/torn value are rather low, so it's unlikely to be noticed. Imo we should encapsulate the changecount modifications/checks somehow instead of repeating the barriers, Asserts, comments et al everywhere.

So what about applying the attached patch first, which adds macros to load and store the changecount with the memory barriers, and changes pgstat.c to use them? Maybe this patch needs to be back-patched to at least 9.4. After applying the patch, I will rebase the pg_last_xact_insert_timestamp patch and post it again.

Hm, what's the status of this patch? The addition of those macros to control count increments with a memory barrier seems like a good thing at least.

Thanks for reminding me of that! Barring any objection, I will commit it. The 2nd patch has not been rebased yet, but still...

The feature that this 2nd patch implements is very similar to part of what the committs patch does, i.e., tracking the timestamps of committed transactions. If the committs patch gets committed, I'd basically like to stop working on the 2nd patch, to avoid duplicate work. OTOH, I'm concerned about the performance impact of the committs patch. So, for a simple use case like checking replication lag, what the 2nd patch implements seems better, though...

Regards,

--
Fujii Masao
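The changecount protocol being discussed can be sketched in portable C11. This is illustrative only: the type and function names are invented for the example, PostgreSQL uses its own barrier primitives rather than stdatomic, and the real macros live in pgstat. The idea is that a single writer bumps a counter to an odd value, updates the payload, then bumps it back to even; readers retry until they see the same even value before and after copying:

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

/* One single-writer, many-reader slot (like a per-backend status entry). */
typedef struct
{
    atomic_uint changecount;    /* odd while an update is in progress */
    char        payload[64];    /* non-atomic data protected by the count */
} StatusSlot;

static void
slot_begin_write(StatusSlot *slot)
{
    /* seq_cst RMW: the payload stores that follow cannot be
     * reordered before this increment.  Count becomes odd. */
    atomic_fetch_add(&slot->changecount, 1);
}

static void
slot_end_write(StatusSlot *slot)
{
    /* Count becomes even again: payload is consistent. */
    atomic_fetch_add(&slot->changecount, 1);
}

/* Readers loop until a stable, even count brackets the copy. */
static void
slot_read(StatusSlot *slot, char *out)
{
    unsigned    before;
    unsigned    after;

    do
    {
        before = atomic_load(&slot->changecount);
        memcpy(out, slot->payload, sizeof(slot->payload));
        /* full fence: the payload loads above cannot be reordered
         * past the re-read of the counter below */
        atomic_thread_fence(memory_order_seq_cst);
        after = atomic_load(&slot->changecount);
    } while (before != after || (before & 1) != 0);
}
```

This is exactly the "encapsulate the modifications/checks" shape Andres asks for: callers never touch the counter or the barriers directly.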
Re: [HACKERS] pg_isready: Missing translation macros.
On Thu, Nov 27, 2014 at 9:32 PM, Mats Erik Andersson b...@gisladisker.se wrote: Hello there, the text response of pg_isready is hard-coded in English. These short snippets really ought to be localized as well.

Thanks for the patch! Committed.

Regards,

--
Fujii Masao
Re: [HACKERS] psql \watch always ignores \pset null
On Wed, Nov 19, 2014 at 3:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Will Leinweber w...@heroku.com writes:

On Tue, Nov 18, 2014 at 9:54 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Fujii Masao masao.fu...@gmail.com writes: Is there any reason why \watch must ignore the \pset null setting?

Hmmm... the comment offers a reasonable argument for forcing pager = 0, but I agree the nullPrint change is not adequately explained. Will, do you remember why you did that?

I tracked down the individual commit[1] from my history where I added that. What I added there is very similar to sections in src/bin/psql/describe.c. I can't remember specifically my reasoning then, but it's likely I copied the patterns there while getting things working. I do still think it's important to remove the pager, but the nullPrint is probably a mistake.

I took a quick look and noted that the other places where nullPrint is summarily forced to null are for \d and similar queries. For those, the code can reasonably have an opinion about what the presentation should be like, since it knows what SQL query it's issuing. That argument surely doesn't apply to \watch, so I'm in agreement with Fujii that it'd be better to respect the user's \pset setting.

Thanks! I've just fixed this problem.

Regards,

--
Fujii Masao
Re: [HACKERS] why is PG_AUTOCONF_FILENAME is pg_config_manual.h?
On Thu, Nov 27, 2014 at 11:58 PM, Peter Eisentraut pete...@gmx.net wrote: Surely that's not a value that we expect users to be able to edit. Is pg_config_manual.h just abused as a place that's included everywhere? (I suggest utils/guc.h as a better place.)

+1

Regards,

--
Fujii Masao
Re: [HACKERS] The problems of PQhost()
On Thu, Nov 27, 2014 at 12:38 PM, Noah Misch n...@leadboat.com wrote:

On Tue, Nov 25, 2014 at 09:53:10PM +0900, Fujii Masao wrote:

On Tue, Nov 25, 2014 at 12:37 PM, Noah Misch n...@leadboat.com wrote:

On Wed, Jan 22, 2014 at 11:48:26PM +0900, Fujii Masao wrote: (3) PQhost() cannot return the hostaddr. We can fix problem (3) by changing PQhost() so that it also returns the hostaddr. But this change might break existing applications using PQhost(). So, I added a new libpq function, PQhostaddr(), which returns the hostaddr, and changed \conninfo so that it reports correct connection information by using both PQhost() and PQhostaddr().

    + <varlistentry id="libpq-pqhostaddr">
    +  <term>
    +   <function>PQhostaddr</function>
    +   <indexterm>
    +    <primary>PQhostaddr</primary>
    +   </indexterm>
    +  </term>
    +
    +  <listitem>
    +   <para>
    +    Returns the server numeric IP address or host name of the connection.
    +    <synopsis>
    +     char *PQhostaddr(const PGconn *conn);
    +    </synopsis>
    +   </para>
    +  </listitem>
    + </varlistentry>

From reading this documentation, I assumed this function would return a non-empty value for every TCP connection. After all, every TCP connection has a peer IP address. In fact, PQhostaddr() returns the raw value of the hostaddr connection parameter, whether from a libpq function argument or from the PGHOSTADDR environment variable. (If the parameter and environment variable are absent, it returns NULL. Adding hostaddr= to the connection string makes it return the empty string.) A caller wanting the specific raw value of a parameter could already use PQconninfo(). I suspect this new function will confuse more than help. What do you think of reverting it and having \conninfo use PQconninfo() to discover any hostaddr value?

Sounds reasonable to me. Are you planning to implement and commit the patch?

Sure. I'll first issue "git revert 9f80f48", then apply the attached patch. Since libpq ignores a hostaddr parameter equal to the empty string, this implementation does likewise. Apart from that, I anticipate behavior identical to today's code.

    + fprintf(stderr, _("out of memory\n"));

psql_error() should be used instead of fprintf()?

    + if (host == NULL)    /* can't happen */

Is this comment true? ISTM that host can be NULL when the default socket directory is used, i.e., neither host nor hostaddr is supplied by the user.

Regards,

--
Fujii Masao
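As a rough illustration of the \conninfo approach Noah suggests, here is the kind of scan the patch would do over the array returned by PQconninfo(). To keep the sketch self-contained and compilable without libpq, it mirrors libpq's PQconninfoOption with a minimal local struct, and the helper name is made up; real code would call PQconninfo(conn), scan the result the same way, and release it with PQconninfoFree():

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal stand-in for libpq's PQconninfoOption; the real struct has
 * more fields, but keyword/val are what the scan needs.  The real
 * array is terminated by an entry with a NULL keyword. */
typedef struct
{
    const char *keyword;
    const char *val;
} ConninfoOption;

/* Return the hostaddr value if it was supplied and non-empty,
 * otherwise NULL.  The empty-string case matters because libpq
 * ignores hostaddr='' (see the discussion above). */
static const char *
find_hostaddr(const ConninfoOption *opts)
{
    for (; opts->keyword != NULL; opts++)
    {
        if (strcmp(opts->keyword, "hostaddr") == 0 &&
            opts->val != NULL && opts->val[0] != '\0')
            return opts->val;
    }
    return NULL;
}
```

With this rule, \conninfo reports a hostaddr only when the user actually supplied one, matching libpq's own treatment of the parameter.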
Re: [HACKERS] Add shutdown_at_recovery_target option to recovery.conf
On Wed, Nov 26, 2014 at 5:15 AM, Simon Riggs si...@2ndquadrant.com wrote:

On 25 November 2014 at 14:06, Fujii Masao masao.fu...@gmail.com wrote:

On Thu, Nov 20, 2014 at 5:35 AM, Simon Riggs si...@2ndquadrant.com wrote:

On 19 November 2014 16:41, Fujii Masao masao.fu...@gmail.com wrote:

On Wed, Nov 19, 2014 at 10:49 PM, Robert Haas robertmh...@gmail.com wrote:

On Wed, Nov 19, 2014 at 8:13 AM, Simon Riggs si...@2ndquadrant.com wrote: If we ask for PAUSE and we're not in Hot Standby, it just ignores it, which means it changes into PROMOTE. My feeling is that we should change that into a SHUTDOWN, not a PROMOTE.

To me, that seems like a definite improvement.

But changing the default will force us to set action_at_recovery_target to 'promote' when we want to just recover the database to the specified point.

If you explicitly request pause and then can't pause, ISTM the action we actually perform shouldn't be the exact opposite of what was requested. So if the user explicitly requests pause and we aren't in HS, then they should get shutdown, which is the closest action. If the user doesn't request anything at all, then we can default to promote, just like we do now.

Yes, this is what I was trying to suggest. +1 to do that.

Implemented. Patch committed.

Isn't it better to add a sample setting of this parameter to recovery.conf.sample?

Regards,

--
Fujii Masao
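A sample entry of the kind Fujii suggests might look like this in recovery.conf. The fragment is illustrative; the parameter spelling follows the thread, and the restore_command path is invented:

```
# Hypothetical recovery.conf fragment for the committed feature.
restore_command = 'cp /mnt/archive/%f %p'       # path is an example
recovery_target_time = '2014-11-25 12:00:00'
action_at_recovery_target = 'pause'             # pause | promote | shutdown
```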
Re: [HACKERS] The problems of PQhost()
On Tue, Nov 25, 2014 at 12:37 PM, Noah Misch n...@leadboat.com wrote:

On Wed, Jan 22, 2014 at 11:48:26PM +0900, Fujii Masao wrote: (3) PQhost() cannot return the hostaddr. We can fix problem (3) by changing PQhost() so that it also returns the hostaddr. But this change might break existing applications using PQhost(). So, I added a new libpq function, PQhostaddr(), which returns the hostaddr, and changed \conninfo so that it reports correct connection information by using both PQhost() and PQhostaddr().

    + <varlistentry id="libpq-pqhostaddr">
    +  <term>
    +   <function>PQhostaddr</function>
    +   <indexterm>
    +    <primary>PQhostaddr</primary>
    +   </indexterm>
    +  </term>
    +
    +  <listitem>
    +   <para>
    +    Returns the server numeric IP address or host name of the connection.
    +    <synopsis>
    +     char *PQhostaddr(const PGconn *conn);
    +    </synopsis>
    +   </para>
    +  </listitem>
    + </varlistentry>

From reading this documentation, I assumed this function would return a non-empty value for every TCP connection. After all, every TCP connection has a peer IP address. In fact, PQhostaddr() returns the raw value of the hostaddr connection parameter, whether from a libpq function argument or from the PGHOSTADDR environment variable. (If the parameter and environment variable are absent, it returns NULL. Adding hostaddr= to the connection string makes it return the empty string.) A caller wanting the specific raw value of a parameter could already use PQconninfo(). I suspect this new function will confuse more than help. What do you think of reverting it and having \conninfo use PQconninfo() to discover any hostaddr value?

Sounds reasonable to me. Are you planning to implement and commit the patch?

Regards,

--
Fujii Masao
Re: [HACKERS] tracking commit timestamps
On Tue, Nov 25, 2014 at 7:58 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:

And here is v10, which fixes conflicts with Heikki's WAL API changes (no changes otherwise).

After some slight additional changes, here's v11, which I intend to commit early tomorrow. The main change is moving the test module from contrib to src/test/modules.

When I specify the XID of an aborted transaction in pg_xact_commit_timestamp(), it always returns 2000-01-01 09:00:00+09. Is this intentional?

Can I check my understanding? Probably we cannot use this feature to calculate the actual replication lag by, for example, comparing the result of pg_last_committed_xact() in the master with that of pg_last_xact_replay_timestamp() in the standby, because pg_last_xact_replay_timestamp() can return even the timestamp of an aborted transaction, but pg_last_committed_xact() cannot. Right?

Regards,

--
Fujii Masao