Re: [HACKERS] Proposal : REINDEX xxx VERBOSE

2015-05-15 Thread Fujii Masao
/tab-complete.c
 @@ -3335,7 +3335,6 @@ psql_completion(const char *text, int start, int
 end)
  /* REINDEX */
 else if (pg_strcasecmp(prev_wd, "REINDEX") == 0)
 {
 -   static const char *const list_REINDEX[] =
 {"TABLE", "INDEX", "SYSTEM", "SCHEMA", "DATABASE", NULL};

 COMPLETE_WITH_LIST(list_REINDEX);


 The attached fixes it and now seems good to me.
 Just one last note: IMHO we should add a regression test to
 src/bin/scripts/090_reindexdb.pl.


 Thank you for your patch!
 (Sorry for attaching a patch that still had a compile error.)

 - 000_reindex_verbose_v13.patch
 Looks good to me.

 - 001_reindexdb_verbose_option_v1.patch
 I noticed a bug in the reindexdb patch, so a fixed version is attached.
 The regression test for reindexdb is added as well.

Pushed.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)

2015-04-30 Thread Fujii Masao
On Thu, Apr 30, 2015 at 12:57 PM, Sawada Masahiko sawada.m...@gmail.com wrote:
 On Wed, Apr 29, 2015 at 12:17 AM, David Steele da...@pgmasters.net wrote:
 On 4/28/15 2:14 AM, Sawada Masahiko wrote:
 On Fri, Apr 24, 2015 at 3:23 AM, David Steele da...@pgmasters.net wrote:
 I've also added some checking to make sure that if anything looks funny
 on the stack an error will be generated.

 Thanks for the feedback!


 Thank you for updating the patch!
 I ran the postgres regression test on a database with pg_audit
 enabled; it works fine.
 Looks good to me.

 If no one has review comments or bug reports, I will mark this
 as Ready for Committer.

 Thank you!  I appreciate all your work reviewing this patch and of
 course everyone else who commented on, reviewed or tested the patch
 along the way.


 I have changed the status of this to Ready for Committer.

The specification of session audit logging still seems unclear to me.
For example, why doesn't session audit logging log queries accessing
a catalog like pg_class? Why doesn't it log queries executed in an aborted
transaction state? Since there is no such information in the documentation,
I'm afraid that users would easily get confused by it. Even if we document it,
I'm not sure the current behavior is good for audit purposes. For example,
some users may want to log even queries accessing the catalogs.

I have no idea when the current CommitFest will end. But probably
we don't have that much time left. So I'm thinking that maybe we should pick
a small, self-contained and useful part of the patch and focus on that.
If we try to commit every feature that the patch provides, we might get
nothing into 9.5 at all, I'm afraid.

Regards,

-- 
Fujii Masao




Re: [HACKERS] FPW compression leaks information

2015-04-14 Thread Fujii Masao
On Wed, Apr 15, 2015 at 12:00 AM, Magnus Hagander mag...@hagander.net wrote:
 On Tue, Apr 14, 2015 at 4:50 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 04/14/2015 05:42 AM, Robert Haas wrote:

 On Sun, Apr 12, 2015 at 8:38 PM, Heikki Linnakangas hlinn...@iki.fi
 wrote:

 As to RLS - yeah, that's where I think a lot of the possible covert

 channel attacks are.  But it doesn't have to be RLS per se.  It can
 be, for example, a table some of whose contents you expose via a
 security barrier view.  It can be a security-definer function that you
 call and it returns some data that you don't have rights to view
 directly.  Basically, it can be any table to which you have some
 access, but not complete access.  Then you can use timing attacks,
 scrutinize EXPLAIN plans for clues, look at CTIDs, and so on.

 Basically, I'm worried that it's going to be completely impractical to
 prevent a certain amount of information leakage when you have partial
 access to a table, and that in a largely-futile effort to try to
 prevent it, we'll end up making a whole bunch of things (like the WAL
 insert position) super-user only, and that this will in fact be a net
 reduction in security because it'll encourage people to use the
 superuser account more.

 Perhaps that concern is ill-founded, but that's what I'm worried about.


 I'm not a big fan of locking down WAL position information either. If we
 have to treat the current WAL position as security-sensitive information,
 we're doing something wrong.

 But I don't want to just give up either. One option is to make this
 controllable on a per-table basis, as Amit suggested. Then we could always
 disable it for pg_authid, add a suitable warning to the docs, and let the
 DBA enable/disable it for other tables. It's a bit of a cop-out, but would
 fix the immediate problem.


 I think it's a fairly narrow attack vector. As long as we document it
 clearly, and make it easy enough to turn it off, I think that's definitely
 enough. Most people will not care at all, I'm sure - but we need to cater to
 those that do.

 I think we could probably even get away with just documenting the risk and
 having people turn off the compression *completely* if they care about it,
 but if we can do it at a table level, that's obviously a lot better.

+1

Regards,

-- 
Fujii Masao




Re: [HACKERS] FPW compression leaks information

2015-04-14 Thread Fujii Masao
On Wed, Apr 15, 2015 at 11:55 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Wed, Apr 15, 2015 at 11:10 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Apr 15, 2015 at 12:00 AM, Magnus Hagander mag...@hagander.net 
 wrote:
 On Tue, Apr 14, 2015 at 4:50 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 04/14/2015 05:42 AM, Robert Haas wrote:

 On Sun, Apr 12, 2015 at 8:38 PM, Heikki Linnakangas hlinn...@iki.fi
 wrote:

 As to RLS - yeah, that's where I think a lot of the possible covert

 channel attacks are.  But it doesn't have to be RLS per se.  It can
 be, for example, a table some of whose contents you expose via a
 security barrier view.  It can be a security-definer function that you
 call and it returns some data that you don't have rights to view
 directly.  Basically, it can be any table to which you have some
 access, but not complete access.  Then you can use timing attacks,
 scrutinize EXPLAIN plans for clues, look at CTIDs, and so on.

 Basically, I'm worried that it's going to be completely impractical to
 prevent a certain amount of information leakage when you have partial
 access to a table, and that in a largely-futile effort to try to
 prevent it, we'll end up making a whole bunch of things (like the WAL
 insert position) super-user only, and that this will in fact be a net
 reduction in security because it'll encourage people to use the
 superuser account more.

 Perhaps that concern is ill-founded, but that's what I'm worried about.


 I'm not a big fan of locking down WAL position information either. If we
 have to treat the current WAL position as security-sensitive information,
 we're doing something wrong.

 But I don't want to just give up either. One option is to make this
 controllable on a per-table basis, as Amit suggested. Then we could always
 disable it for pg_authid, add a suitable warning to the docs, and let the
 DBA enable/disable it for other tables. It's a bit of a cop-out, but would
 fix the immediate problem.


 I think it's a fairly narrow attack vector. As long as we document it
 clearly, and make it easy enough to turn it off, I think that's definitely
 enough. Most people will not care at all, I'm sure - but we need to cater to
 those that do.

 I think we could probably even get away with just documenting the risk and
 having people turn off the compression *completely* if they care about it,
 but if we can do it at a table level, that's obviously a lot better.

 +1

 OK. I am fine with implementing anything required here, meaning
 the following:
 1) A doc patch to mention that compression can give
 hints to attackers when working on sensitive fields that have a
 non-fixed size.

I think that this patch is enough as the first step.

 2) Switch at relation level to control wal_compression.

ALTER TABLE SET is not allowed on a system catalog like pg_authid. So should we
change it so that a user can change the flag even on system catalogs? I'm afraid
that the change might cause another problem, though. Probably we can disable
the compression on every system catalog by default. But I can imagine that
someone wants to enable the compression even on a system catalog. For example,
pg_largeobject may cause lots of FPWs.

 This needs to
 change XLogRecordAssemble by adding some new logic to check whether a
 relation is enforcing WAL compression or not. As a reloption, there
 are three possible values: true, false, and fall back to the system default.
 Also, I think that we should simply extend XLogRegisterBuffer() and
 pass to it the reloption flag that is then registered in
 registered_buffer, and XLogRecordAssemble() decides with this flag whether a
 block is compressed or not. Do we want to add this reloption switch to
 indexes as well?

Maybe.
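
To make the proposal above concrete, here is a rough, hypothetical sketch of
the plumbing Michael describes. The RelWalCompression enum, the
registered_buffer field, and BlockImageNeedsCompression() are invented for
illustration; only the wal_compression GUC from this thread is real.

    extern bool wal_compression;    /* the GUC added by this patch series */

    /* hypothetical tri-state reloption for per-relation WAL compression */
    typedef enum RelWalCompression
    {
        WAL_COMPRESSION_DEFAULT,    /* fall back to the wal_compression GUC */
        WAL_COMPRESSION_ON,
        WAL_COMPRESSION_OFF
    } RelWalCompression;

    /*
     * XLogRegisterBuffer() would stash the relation's setting into its
     * registered_buffer entry, and XLogRecordAssemble() would then consult
     * it per block image:
     */
    static bool
    BlockImageNeedsCompression(RelWalCompression setting)
    {
        if (setting == WAL_COMPRESSION_DEFAULT)
            return wal_compression;
        return (setting == WAL_COMPRESSION_ON);
    }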

Regards,

-- 
Fujii Masao




Re: [HACKERS] Proposal : REINDEX xxx VERBOSE

2015-04-08 Thread Fujii Masao
On Wed, Apr 8, 2015 at 10:53 PM, Sawada Masahiko sawada.m...@gmail.com wrote:
 On Wed, Apr 8, 2015 at 1:09 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Apr 8, 2015 at 1:57 AM, Sawada Masahiko sawada.m...@gmail.com 
 wrote:
 On Wed, Apr 8, 2015 at 1:11 AM, Fabrízio de Royes Mello
 fabriziome...@gmail.com wrote:

 On Tue, Apr 7, 2015 at 1:04 PM, Sawada Masahiko sawada.m...@gmail.com
 wrote:

 On Tue, Apr 7, 2015 at 10:12 PM, Fabrízio de Royes Mello
 fabriziome...@gmail.com wrote:
 
  On Tue, Apr 7, 2015 at 7:22 AM, Sawada Masahiko sawada.m...@gmail.com
  wrote:
 
  Thank you for your review.
  I modified the patch and attached the latest version (v7).
  Please have a look at it.
 
 
  Looks good to me. Attached patch (v8) just fixes a tab indentation in
  gram.y.
 

 I had forgotten to fix a tab indentation, sorry.
 Thank you for reviewing!
 It looks good to me too.
 Can this patch be marked as Ready for Committer?


 +1


 Changed status to Ready for Committer.

 The patch adds new syntax like REINDEX ... WITH VERBOSE, i.e., "()" is not
 added after the WITH clause. Did we reach a consensus about this syntax?
 The last email from Robert makes me think that "()" should be added
 to the syntax.


 Thank you for taking time for this patch!

I removed the FORCE option from REINDEX, so you would need to update the patch.

 This was a quite complicated issue since we already have a lot of
 command styles.
 We can think of many styles from various perspectives: kind of DDL, new
 or old command, maintenance command. And each command has its own
 story.
 I believe we reached consensus on this style at least once
 (please see the previous discussion), but we might need to discuss
 more...

Okay, another question: must WITH be required whenever the options
are specified, or should it be omittable?

Regards,

-- 
Fujii Masao




Re: [HACKERS] Removal of FORCE option in REINDEX

2015-04-08 Thread Fujii Masao
On Thu, Apr 9, 2015 at 12:33 AM, Jim Nasby jim.na...@bluetreble.com wrote:
 On 4/8/15 7:04 AM, Fujii Masao wrote:

 I'm thinking to apply the attached patch.
 But does anyone want to keep supporting the option? Why?


 Nuke it.

There seem to be no objections, so I just pushed it.

Regards,

-- 
Fujii Masao




Re: [HACKERS] pg_rewind and log messages

2015-04-07 Thread Fujii Masao
On Tue, Apr 7, 2015 at 11:59 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Mon, Apr 6, 2015 at 9:10 PM, Fujii Masao masao.fu...@gmail.com wrote:
 I'm not familiar with native language support (sorry), but don't we need to
 add the gettext shortcut to every call of pg_log and pg_fatal, e.g.,
 change pg_fatal("xxx") to pg_fatal(_("xxx"))? I know that fprintf() in
 pg_Log_v() has such a shortcut, but I'm not sure whether that's enough.

 I think I addressed those things...

  case PG_FATAL:
 -printf("\n%s", _(message));
 -printf("%s", _("Failure, exiting\n"));
 +fprintf(stderr, _("\n%s: fatal: %s\n"), progname, message);
 +fprintf(stderr, _("Failure, exiting\n"));

 Why do we need to output the error level, like "fatal"? This also seems
 inconsistent with the log message policy of other tools.

 initdb and psql do not for a couple of warning messages, but perhaps I
 am just taking consistency with the original code too seriously.

 Why do we need to output "\n" at the head of the message?
 Is the second message "Failure, exiting" really required?

 I think that those things were here to highlight the fact that this is
 a fatal bailout, but the error code would help the same way I guess...

 I eliminated a bunch of
 newlines in the log messages that seemed really unnecessary to me,
 simplifying the whole a bit. While hacking on this stuff, I noticed as
 well that pg_rewind could be called as root on non-Windows platforms;
 that's dangerous from a security point of view as the process manipulates
 files in PGDATA. Hence let's block that. On Windows, a restricted
 token should be used.

 Good catch! I think it's better to implement it as a separate patch
 because it's very different from the log message problem.

 Attached are new patches:
 - 0001 fixes the restriction issues: restricted token on Windows, and
 forbid non-root user on non-Windows.

Thanks for the patch! I'd like to push this patch first. But one comment:

+ "You must run %s as the PostgreSQL superuser.\n",

Isn't the term "PostgreSQL superuser" confusing? I'm afraid that a user might
confuse "PostgreSQL superuser" with a database superuser. I see you just
borrowed that term from pg_resetxlog.c, though. BTW, initdb and pg_ctl also
have the same check, and their error message is as follows. Isn't this better?
Thoughts?

"%s: cannot be run as root\n"
"Please log in (using, e.g., \"su\") as the "
"(unprivileged) user that will\n"
"own the server process.\n"

Regards,

-- 
Fujii Masao




Re: [HACKERS] pg_rewind and log messages

2015-04-07 Thread Fujii Masao
On Mon, Apr 6, 2015 at 10:01 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Fujii Masao wrote:
 On Mon, Apr 6, 2015 at 5:33 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
  On Mon, Apr 6, 2015 at 1:41 PM, Michael Paquier wrote:
  I guess that you are working on a patch? If not, are you looking for one?
 
  Code-speaking, this gives the patch attached.

 Thanks! Here are the review comments:

  I'm not familiar with native language support (sorry), but don't we need to
  add the gettext shortcut to every call of pg_log and pg_fatal, e.g.,
  change pg_fatal("xxx") to pg_fatal(_("xxx"))? I know that fprintf() in
  pg_Log_v() has such a shortcut, but I'm not sure whether that's enough.

 It's not necessary for pg_fatal and the like, because those functions
 are marked to have their first argument automatically translated in
 nls.mk.  This means that the string literal is automatically extracted
 into pg_rewind.pot for translators.  Of course, the function itself must
 call _() (or some variant thereof) to actually fetch the translated
 string at run time.

Understood. Thanks!

BTW, if I read pg_rewind's nls.mk correctly, it also has two problems.

(1) file_ops.c should be added to GETTEXT_FILES.
(2) pg_log should be pg_log:2 in GETTEXT_TRIGGERS.

 Another thing is compound messages like "foo has happened\nSee --help
 for usage.\n" and "bar didn't happen.\nSee --help for usage."  In those
 cases, the "see --help" part would need to be translated over and over,
 so it's best to separate them into phrases to avoid repetitive work for
 translators.  Not sure how to do this -- maybe something like
 _("foo has happened\n") _("See --help")
 but I'm not sure how to appease the compiler.  Having them in two
 separate pg_log() calls (or whatever) was handy for this reason.

Yep.
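
For what it's worth, a minimal sketch of the two-separate-calls approach,
assuming pg_rewind's pg_log() interface (the level and wording here are
illustrative only); each literal is then extracted separately for
translators via nls.mk:

    pg_log(PG_FATAL, "foo has happened\n");
    pg_log(PG_FATAL, "See \"%s --help\" for usage.\n", progname);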

Regards,

-- 
Fujii Masao




Re: [HACKERS] pg_rewind and log messages

2015-04-07 Thread Fujii Masao
On Wed, Apr 8, 2015 at 5:53 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 Heikki Linnakangas wrote:
 On 04/07/2015 05:59 AM, Michael Paquier wrote:

 Fix inconsistent handling of logs in pg_rewind
 
 pg_rewind was handling a couple of things differently compared to the
 other src/bin utilities:
 - Logging output needs to be flushed on stderr, not stdout

 Agreed in general. But there's also precedent for printing some stuff to
 stdout: pg_ctl does that for status messages, like "server starting". As
 does initdb.

 I'm pretty unclear on what the rule here is.

 One principle that sometimes helps is to consider what happens if you
 use the command as part of a larger pipeline; progress messages can be
 read by some other command further down (and perhaps report them in a
 dialog box, if you embed the program in a GUI, say), but error messages
 should probably be processed differently; normally the pipeline would be
 aborted as a whole.

Makes sense.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Proposal : REINDEX xxx VERBOSE

2015-04-07 Thread Fujii Masao
On Wed, Apr 8, 2015 at 1:57 AM, Sawada Masahiko sawada.m...@gmail.com wrote:
 On Wed, Apr 8, 2015 at 1:11 AM, Fabrízio de Royes Mello
 fabriziome...@gmail.com wrote:

 On Tue, Apr 7, 2015 at 1:04 PM, Sawada Masahiko sawada.m...@gmail.com
 wrote:

 On Tue, Apr 7, 2015 at 10:12 PM, Fabrízio de Royes Mello
 fabriziome...@gmail.com wrote:
 
  On Tue, Apr 7, 2015 at 7:22 AM, Sawada Masahiko sawada.m...@gmail.com
  wrote:
 
   Thank you for your review.
   I modified the patch and attached the latest version (v7).
   Please have a look at it.
 
 
   Looks good to me. Attached patch (v8) just fixes a tab indentation in
   gram.y.
 

  I had forgotten to fix a tab indentation, sorry.
 Thank you for reviewing!
 It looks good to me too.
 Can this patch be marked as Ready for Committer?


 +1


 Changed status to Ready for Committer.

The patch adds new syntax like REINDEX ... WITH VERBOSE, i.e., "()" is not
added after the WITH clause. Did we reach a consensus about this syntax?
The last email from Robert makes me think that "()" should be added
to the syntax.

Regards,

-- 
Fujii Masao




[HACKERS] Removal of FORCE option in REINDEX

2015-04-07 Thread Fujii Masao
Hi,

While reviewing the REINDEX VERBOSE patch, I felt inclined to remove the FORCE
option from the REINDEX command. It has been marked obsolete since the
very old version 7.4. I think that it's no longer worth supporting.
Thoughts?

Regards,

-- 
Fujii Masao




Re: [HACKERS] pg_rewind and log messages

2015-04-06 Thread Fujii Masao
On Mon, Apr 6, 2015 at 5:33 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Mon, Apr 6, 2015 at 1:41 PM, Michael Paquier wrote:
 I guess that you are working on a patch? If not, are you looking for one?

 Code-speaking, this gives the patch attached.

Thanks! Here are the review comments:

I'm not familiar with native language support (sorry), but don't we need to
add the gettext shortcut to every call of pg_log and pg_fatal, e.g.,
change pg_fatal("xxx") to pg_fatal(_("xxx"))? I know that fprintf() in
pg_Log_v() has such a shortcut, but I'm not sure whether that's enough.

 case PG_FATAL:
-printf("\n%s", _(message));
-printf("%s", _("Failure, exiting\n"));
+fprintf(stderr, _("\n%s: fatal: %s\n"), progname, message);
+fprintf(stderr, _("Failure, exiting\n"));

Why do we need to output the error level, like "fatal"? This also seems
inconsistent with the log message policy of other tools.

Why do we need to output "\n" at the head of the message?

Is the second message "Failure, exiting" really required?

 I eliminated a bunch of
 newlines in the log messages that seemed really unnecessary to me,
 simplifying the whole a bit. While hacking on this stuff, I noticed as
 well that pg_rewind could be called as root on non-Windows platforms;
 that's dangerous from a security point of view as the process manipulates
 files in PGDATA. Hence let's block that. On Windows, a restricted
 token should be used.

Good catch! I think it's better to implement it as a separate patch
because it's very different from the log message problem.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Upper-case error in docs regarding PQmakeEmptyPGresult

2015-04-05 Thread Fujii Masao
On Mon, Apr 6, 2015 at 11:40 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Hi all,

 I just spotted that PQmakeEmptyPGresult is named PQmakeEmptyPGResult
 in one place (not Result, but result) in the docs.
 A patch is attached.

Pushed. Thanks!

Regards,

-- 
Fujii Masao




[HACKERS] pg_rewind and log messages

2015-04-05 Thread Fujii Masao
Hi,

I found that pg_rewind has several problems about its log messages.

(1)
It outputs an error message to stdout, not stderr.

(2)
The tool name should be added at the head of each log message, as in the
following example, but pg_rewind doesn't do that.

pg_basebackup: no target directory specified

(3)
if (datadir_source == NULL  connstr_source == NULL)
{
pg_fatal(no source specified (--source-pgdata or --source-server)\n);
pg_fatal(Try \%s --help\ for more information.\n, progname);
exit(1);

Since the first call of pg_fatal exits with 1, the subsequent pg_fatal and exit
will never be called.

(4)
ISTM that set_pglocale_pgservice() needs to be called, but pg_rewind doesn't call it.

(5)
printf() is used to output an error in some files, e.g., timeline.c and
parsexlog.c. Shouldn't these printf() calls be replaced with pg_log or pg_fatal?
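
Regarding (3), a minimal sketch of one possible fix, keeping the names from
the snippet above (the fix actually committed may differ): emit the hint
before terminating, or fold everything into a single pg_fatal() call.

    if (datadir_source == NULL && connstr_source == NULL)
    {
        fprintf(stderr, _("%s: no source specified (--source-pgdata or --source-server)\n"),
                progname);
        fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
        exit(1);
    }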

Regards,

-- 
Fujii Masao




Re: [HACKERS] The return value of allocate_recordbuf()

2015-04-03 Thread Fujii Masao
On Fri, Apr 3, 2015 at 2:30 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Fri, Apr 3, 2015 at 12:56 PM, Fujii Masao wrote:
 The first patch basically looks good to me. But I have one comment:
 shouldn't we expose pg_malloc_extended as a global function like
 we did pg_malloc? Some frontends might need to use it in the future.

 Yes, it makes sense as the other functions do it too. So I refactored
 the patch and defined a new static inline routine,
 pg_malloc_internal(), that all the other functions call, to reduce the
 temperature in this code path, which I suppose can become quite hot even
 for frontends. In the second patch I also added what was needed for
 pg_rewind.

Thanks for updating the patches!
I pushed the first and a part of the second patch.

Regarding the second patch, you added the checks of the return value of
XLogReaderAllocate(). But it seems half-baked. XLogReaderAllocate() still
uses palloc(), but don't we need to replace it with palloc_extended(), too?
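
As a minimal sketch of the pattern under discussion, assuming the
palloc_extended() interface from the first patch (with the MCXT_ALLOC_*
flags mirrored into fe_memutils.h for frontends), the allocation inside
XLogReaderAllocate() would look roughly like:

    XLogReaderState *state;

    /* no-OOM allocation: returns NULL instead of erroring out on failure */
    state = (XLogReaderState *)
        palloc_extended(sizeof(XLogReaderState),
                        MCXT_ALLOC_NO_OOM | MCXT_ALLOC_ZERO);
    if (state == NULL)
        return NULL;    /* let the caller decide how to report the failure */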

Regards,

-- 
Fujii Masao




Re: [HACKERS] The return value of allocate_recordbuf()

2015-04-03 Thread Fujii Masao
On Fri, Apr 3, 2015 at 8:37 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Fri, Apr 3, 2015 at 6:35 PM, Fujii Masao wrote:
 Regarding the second patch, you added the checks of the return value of
 XLogReaderAllocate(). But it seems half-baked. XLogReaderAllocate() still
 uses palloc(), but don't we need to replace it with palloc_extended(), too?

 Doh, you are right. I missed three places. Attached is a new patch
 completing the fix.

Thanks for the patch! I updated two source code comments and
changed the log message emitted when XLogReaderAllocate returns NULL
within the XLOG_DEBUG block. Just pushed.

Regards,

-- 
Fujii Masao




Re: [HACKERS] The return value of allocate_recordbuf()

2015-04-02 Thread Fujii Masao
On Thu, Feb 12, 2015 at 4:02 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Wed, Feb 11, 2015 at 2:13 AM, Robert Haas robertmh...@gmail.com wrote:

 On Mon, Feb 9, 2015 at 7:02 AM, Michael Paquier
 michael.paqu...@gmail.com wrote:
  On Mon, Feb 9, 2015 at 7:58 PM, Fujii Masao masao.fu...@gmail.com
  wrote:
  MemoryContextAllocExtended() was added, so isn't it time to replace
  palloc()
  with MemoryContextAllocExtended(CurrentMemoryContext,
  MCXT_ALLOC_NO_OOM)
  in allocate_recordbuf()?
 
   Yeah, let's move on here, but not with this patch I am afraid, as it
   breaks any client utility using xlogreader.c in the frontend, like
   pg_xlogdump or pg_rewind, because MemoryContextAllocExtended is not
   available in the frontend, only in the backend. We are going to need something
   like palloc_noerror, defined on both the backend (mcxt.c) and frontend
   (fe_memutils.c) sides.

 Unfortunately, that gets us back into the exact terminological dispute
 that we were hoping to avoid.  Perhaps we could compromise on
 palloc_extended().


 Yes, why not use palloc_extended instead of palloc_noerror, which has been
 clearly rejected in the other thread. Now, for palloc_extended we should
 copy the flags of MemoryContextAllocExtended to fe_memutils.h and have the
 same interface between frontend and backend (note that MCXT_ALLOC_HUGE has
 no real utility for frontends). Attached is a patch to achieve that,
 completed with a second patch for the problem related to this thread. Note
 that in the second patch I have added an ERROR in logical.c after calling
 XLogReaderAllocate; this was missing.

The first patch basically looks good to me. But I have one comment:
shouldn't we expose pg_malloc_extended as a global function like
we did pg_malloc? Some frontends might need to use it in the future.

Regards,

-- 
Fujii Masao




Re: [HACKERS] pg_basebackup may fail to send feedbacks.

2015-04-01 Thread Fujii Masao
On Tue, Mar 10, 2015 at 5:29 PM, Kyotaro HORIGUCHI
horiguchi.kyot...@lab.ntt.co.jp wrote:
 Hi, the attached is the v5 patch.

 - Do feGetCurrentTimestamp() only when necessary.
 - Rebased to current master


 At Mon, 2 Mar 2015 20:21:36 +0900, Fujii Masao masao.fu...@gmail.com wrote 
 in cahgqgwg1tjhpg03ozgwokxt5wyd5v4s3hutgsx7rotbhhnj...@mail.gmail.com
 On Tue, Feb 24, 2015 at 6:44 PM, Kyotaro HORIGUCHI
 horiguchi.kyot...@lab.ntt.co.jp wrote:
  Hello, the attached is the v4 patch that checks feedback timing
  at every WAL segment boundary.
 ..
  I said that checking whether to send feedback at every boundary
  between WAL segments seemed too infrequent, but after some rethinking, I
  changed my mind.
  
   - The largest possible delay source in the busy-receive loop
 is fsyncing at the close of the WAL segment file just written; this can
 take several seconds.  A freeze longer than the timeout
 interval cannot be avoided and is not worth avoiding anyway.
  
   - A 16MB disk-write interval between gettimeofday() calls seems
 gentle enough even on platforms where gettimeofday() is
 rather heavy.

 Sounds reasonable to me.

 So we don't need to address the problem on the walreceiver side so proactively,
 because it tries to send feedback after flushing the WAL records.
 IOW, the problem you observed is less likely to happen. Right?

 +now = feGetCurrentTimestamp();
 +if (standby_message_timeout > 0 &&

 Surely it would hardly happen, especially with an ordinary
 configuration.

 However, continuously receiving the replication stream is
 quite normal behavior, even if the problem hardly ever happens.  So the receiver
 should guarantee by logic that feedback is sent as long as it
 is working normally, as long as the code for the special case
 isn't too large and doesn't take too long :).

  The current walreceiver doesn't seem to try to send feedback
  after flushing every WAL record. HandleCopyStream will
  restlessly process the records in a gapless replication stream,
  sending feedback only when a keepalive packet arrives. It is
  true that the response to the keepalive helps greatly, but it
  is not what the keepalives are for. They are intended to be used to
  confirm that a silent receiver is still alive.
 
  Even with this fix, the case where one write or flush takes
  longer than the feedback interval cannot be avoided, but that would be
  OK since it should be regarded as a disorder.


  Minor comment: should feGetCurrentTimestamp() be called after the check of
  standby_message_timeout > 0, to avoid unnecessary calls of it?

 Ah, you're right. I'll fix it.

Shouldn't we send feedback here when synchronous mode is true, even if
the timeout has not elapsed yet?
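
As a minimal sketch of the reordering suggested in the minor comment above
(the names follow pg_basebackup's receivelog.c; the exact v5 code may
differ):

    /* check the clock only when a feedback timeout is configured */
    if (standby_message_timeout > 0)
    {
        now = feGetCurrentTimestamp();
        if (feTimestampDifferenceExceeds(last_status, now,
                                         standby_message_timeout))
        {
            /* time to send a feedback message to the server */
            if (!sendFeedback(conn, blockpos, now, false))
                return false;
            last_status = now;
        }
    }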

Regards,

-- 
Fujii Masao




Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory

2015-04-01 Thread Fujii Masao
On Wed, Apr 1, 2015 at 7:00 PM, Jehan-Guillaume de Rorthais
j...@dalibo.com wrote:
 Hi,

 As I'm writing a doc patch for 9.4 - 9.0, I'll discuss this formula below,
 as it is the last one accepted by most of you.

 On Mon, 3 Nov 2014 12:39:26 -0800
 Jeff Janes jeff.ja...@gmail.com wrote:

 It looked to me that the formula, when descending from a previously
 stressed state, would be:

 greatest(1 + checkpoint_completion_target) * checkpoint_segments,
 wal_keep_segments) + 1 +
 2 * checkpoint_segments + 1

 It lacks a closing parenthesis. I guess the formula is:

   greatest (
 (1 + checkpoint_completion_target) * checkpoint_segments,
  wal_keep_segments
   )
   + 1 + 2 * checkpoint_segments + 1

 This assumes logs are filled evenly over a checkpoint cycle, which is
 probably not true because there is a spike in full page writes right after
 a checkpoint starts.

 But I didn't have a great deal of confidence in my analysis.

 The only problem I have with this formula is that, considering
 checkpoint_completion_target ~ 1 and wal_keep_segments = 0, it becomes:

   4 * checkpoint_segments + 2

 Which violates the well-known, observed, and default one:

   3 * checkpoint_segments + 1

 A value above this formula means the system cannot cope with the number of
 files to flush. The doc is right about that:

If, due to a short-term peak of log output rate, there
are more than 3 * <varname>checkpoint_segments</varname> + 1
segment files, the unneeded segment files will be deleted

 The formula is wrong in the doc when wal_keep_segments > 0

 The first line reflects the number of WAL that will be retained as-is,

 I agree that these files MUST be retained: the set of checkpoint_segments WALs
 being flushed and the checkpoint_completion_target ones written in
 the meantime.

 the second is the number that will be recycled for future use before starting
 to delete them.

 I disagree, because the WAL files being written are actually consuming recycled
 WALs in the meantime.

 Your formula expects written files to be created and recycled ones never touched,
 leading to this checkpoint_segments + 1 difference between formulas.

 My reading of the code is that wal_keep_segments is computed from the
 current end of WAL (i.e., the checkpoint record), not from the checkpoint
 redo point.  If I distribute the part outside the 'greatest' into both
 branches of the 'greatest', I don't get the same answer as you do for
 either branch.

 So the formula, using checkpoint_completion_target = 1, should be:

   greatest (
  checkpoint_segments,
  wal_keep_segments
   )
   + 2 * checkpoint_segments + 1

No. Please imagine how many WAL files can exist at the end of a checkpoint.
At the end of a checkpoint, we have to leave all the WAL files which were
generated since the starting point of the previous checkpoint, for future
crash recovery. The number of these WAL files is

(1 + checkpoint_completion_target) * checkpoint_segments

or

wal_keep_segments

In addition to these files, at the end of a checkpoint, old WAL files which were
generated before the starting point of the previous checkpoint are recycled.
The number of these WAL files is at most

2 * checkpoint_segments + 1

Note that *usually* there are not that many WAL files at the end of a
checkpoint. But this can happen after a peak of WAL logging generates
too many WAL files.

So the sum of those is the right formula, i.e.,

(3 + checkpoint_completion_target) * checkpoint_segments + 1

   or

wal_keep_segments + 2 * checkpoint_segments + 1

If checkpoint_completion_target is 1 and wal_keep_segments is 0,
it can become 4 * checkpoint_segments + 1.
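(With the default checkpoint_segments = 3, that worst case is
4 * 3 + 1 = 13 segment files, i.e., about 208MB at 16MB per segment.)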

Regards,

-- 
Fujii Masao




Re: [HACKERS] Table-level log_autovacuum_min_duration

2015-03-22 Thread Fujii Masao
On Thu, Mar 19, 2015 at 1:43 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Thu, Mar 19, 2015 at 12:40 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Thu, Mar 19, 2015 at 12:23 PM, Fujii Masao masao.fu...@gmail.com wrote:
 Are you planning to update the patch so that it's based on commit
 0d83138?

 Yes... Very soon.

 And here is the rebased patch.

Thanks for rebasing the patch! Looks good to me.

One concern about this patch is: currently log_autovacuum_min_duration can be
changed even while an autovacuum worker is running. So, for example, when
the admin notices that autovacuum is taking a very long time, he or she can
enable logging of autovacuum activity on the fly. But this patch completely
prevents us from doing that because, with the patch, an autovacuum worker always
picks up the latest setting value at its start time and then keeps using it
to the end. Probably I can live with this. But does anyone have other thoughts?

Regards,

-- 
Fujii Masao




Re: [HACKERS] Table-level log_autovacuum_min_duration

2015-03-18 Thread Fujii Masao
On Fri, Mar 6, 2015 at 1:07 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Fri, Mar 6, 2015 at 12:44 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Mar 5, 2015 at 9:49 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Thu, Mar 5, 2015 at 7:10 PM, Fujii Masao wrote:
 With the patch, VACUUM ANALYZE VERBOSE doesn't emit any verbose message.
 Why did you remove that functionality?

 Oops. Sorry about that. In gram.y, the combination of VacuumStmt with
 AnalyzeStmt overwrote the value of log_min_duration incorrectly. I
 also found another bug related to logging of ANALYZE not working
 correctly because of the use of IsAutoVacuumWorkerProcess() instead of
 VACOPT_VERBOSE (this is reducing the diffs of the patch btw). All
 those things are fixed in the attached.

 Thanks for updating the patch!

 Why does log_min_duration need to be set even when a manual VACUUM command is
 executed? Per the latest version of the patch, log_min_duration is checked only
 when the process is an autovacuum worker. So ISTM that log_min_duration doesn't
 need to be set in gram.y. It's even confusing to me. Or if you're going to
 implement something like VACUUM VERBOSE DURATION n (i.e., a verbose message
 is output if n seconds have elapsed), that might be necessary, though...

 Thanks for the reminder. The DURATION-like clause was exactly a point
 mentioned by Anzai-san upthread, and it made sense to me to be in line
 with the other parameters controlling the freeze (discussion somewhat
 related to that =
 http://www.postgresql.org/message-id/CAB7nPqRZX7Pv2B-R7xHmAh52tfjAQGfy9btqwFstgQgXks=i...@mail.gmail.com)
 but we can live without it for this patch, as VACOPT_VERBOSE is used
 only by manual VACUUM and not by autovacuum to choose the log elevel.

Are you planning to update the patch so that it's based on commit 0d83138?

Regards,

-- 
Fujii Masao




Re: [HACKERS] xloginsert.c hole_length warning on gcc 4.8.3

2015-03-17 Thread Fujii Masao
On Sat, Mar 14, 2015 at 8:39 AM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:
 Hi there,

 with gcc 4.8.3, I'm getting this warning in xloginsert.c:

Thanks for the report! I fixed this problem in commit cd6c45c.

Regards,

-- 
Fujii Masao




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2015-03-11 Thread Fujii Masao
On Wed, Mar 11, 2015 at 7:08 AM, Rahila Syed rahilasye...@gmail.com wrote:
 Hello,

I have some minor comments

 The comments have been addressed in the attached patch.

Thanks for updating the patch! I just changed it a bit and finally pushed it.
Thanks to everyone involved in this patch!

Regards,

-- 
Fujii Masao




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2015-03-10 Thread Fujii Masao
On Mon, Mar 9, 2015 at 9:08 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Mon, Mar 9, 2015 at 4:29 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Mar 5, 2015 at 10:08 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Thu, Mar 5, 2015 at 9:14 PM, Syed, Rahila rahila.s...@nttdata.com 
 wrote:
 Please find attached a patch. As discussed, a flag to denote compression
 and the presence of a hole in the block image has been added in
 XLogRecordImageHeader rather than the block header.

 Thanks for updating the patch! Attached is the refactored version of the 
 patch.

 Cool. Thanks!

 I have some minor comments:

Thanks for the comments!

 +Turning this parameter on can reduce the WAL volume without
 Turning <value>on</> this parameter

That tag is not used anywhere else in config.sgml, so I'm not sure if
that's really necessary.

Regards,

-- 
Fujii Masao




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2015-03-09 Thread Fujii Masao
On Thu, Mar 5, 2015 at 10:08 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Thu, Mar 5, 2015 at 9:14 PM, Syed, Rahila rahila.s...@nttdata.com wrote:
 Please find attached a patch. As discussed, a flag to denote compression and
 the presence of a hole in the block image has been added in XLogRecordImageHeader
 rather than the block header.

Thanks for updating the patch! Attached is the refactored version of the patch.

Regards,

-- 
Fujii Masao
*** a/contrib/pg_xlogdump/pg_xlogdump.c
--- b/contrib/pg_xlogdump/pg_xlogdump.c
***
*** 359,376  XLogDumpCountRecord(XLogDumpConfig *config, XLogDumpStats *stats,
  	rec_len = XLogRecGetDataLen(record) + SizeOfXLogRecord;
  
  	/*
! 	 * Calculate the amount of FPI data in the record. Each backup block
! 	 * takes up BLCKSZ bytes, minus the hole length.
  	 *
  	 * XXX: We peek into xlogreader's private decoded backup blocks for the
! 	 * hole_length. It doesn't seem worth it to add an accessor macro for
! 	 * this.
  	 */
  	fpi_len = 0;
  	for (block_id = 0; block_id <= record->max_block_id; block_id++)
  	{
  		if (XLogRecHasBlockImage(record, block_id))
! 			fpi_len += BLCKSZ - record->blocks[block_id].hole_length;
  	}
  
  	/* Update per-rmgr statistics */
--- 359,375 
  	rec_len = XLogRecGetDataLen(record) + SizeOfXLogRecord;
  
  	/*
! 	 * Calculate the amount of FPI data in the record.
  	 *
  	 * XXX: We peek into xlogreader's private decoded backup blocks for the
! 	 * bimg_len indicating the length of FPI data. It doesn't seem worth it to
! 	 * add an accessor macro for this.
  	 */
  	fpi_len = 0;
  	for (block_id = 0; block_id <= record->max_block_id; block_id++)
  	{
  		if (XLogRecHasBlockImage(record, block_id))
! 			fpi_len += record->blocks[block_id].bimg_len;
  	}
  
  	/* Update per-rmgr statistics */
***
*** 465,473  XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
     blk);
  			if (XLogRecHasBlockImage(record, block_id))
  			{
! printf(" (FPW); hole: offset: %u, length: %u\n",
! 	   record->blocks[block_id].hole_offset,
! 	   record->blocks[block_id].hole_length);
  			}
  			putchar('\n');
  		}
--- 464,485 
     blk);
  			if (XLogRecHasBlockImage(record, block_id))
  			{
! if (record->blocks[block_id].bimg_info &
! 	BKPIMAGE_IS_COMPRESSED)
! {
! 	printf(" (FPW); hole: offset: %u, length: %u, compression saved: %u\n",
! 		   record->blocks[block_id].hole_offset,
! 		   record->blocks[block_id].hole_length,
! 		   BLCKSZ -
! 		   record->blocks[block_id].hole_length -
! 		   record->blocks[block_id].bimg_len);
! }
! else
! {
! 	printf(" (FPW); hole: offset: %u, length: %u\n",
! 		   record->blocks[block_id].hole_offset,
! 		   record->blocks[block_id].hole_length);
! }
  			}
  			putchar('\n');
  		}
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 2282,2287  include_dir 'conf.d'
--- 2282,2311 
   </listitem>
  </varlistentry>
 
+  <varlistentry id="guc-wal-compression" xreflabel="wal_compression">
+   <term><varname>wal_compression</varname> (<type>boolean</type>)
+   <indexterm>
+    <primary><varname>wal_compression</> configuration parameter</primary>
+   </indexterm>
+   </term>
+   <listitem>
+    <para>
+     When this parameter is <literal>on</>, the <productname>PostgreSQL</>
+     server compresses a full page image written to WAL when
+     <xref linkend="guc-full-page-writes"> is on or during a base backup.
+     A compressed page image will be decompressed during WAL replay.
+     The default value is <literal>off</>
+    </para>
+ 
+    <para>
+     Turning this parameter on can reduce the WAL volume without
+     increasing the risk of unrecoverable data corruption,
+     but at the cost of some extra CPU time by the compression during
+     WAL logging and the decompression during WAL replay.
+    </para>
+   </listitem>
+  </varlistentry>
+ 
   <varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
    <term><varname>wal_buffers</varname> (<type>integer</type>)
    <indexterm>
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 89,94  char	   *XLogArchiveCommand = NULL;
--- 89,95 
  bool		EnableHotStandby = false;
  bool		fullPageWrites = true;
  bool		wal_log_hints = false;
+ bool		wal_compression = false;
  bool		log_checkpoints = false;
  int			sync_method = DEFAULT_SYNC_METHOD;
  int			wal_level = WAL_LEVEL_MINIMAL;
*** a/src/backend/access/transam/xloginsert.c
--- b/src/backend/access/transam/xloginsert.c
***
*** 24,35 
--- 24,39 
  #include "access/xlog_internal.h"
  #include "access/xloginsert.h"
  #include "catalog/pg_control.h"
+ #include "common/pg_lzcompress.h"
  #include "miscadmin.h"
  #include "storage/bufmgr.h"
  #include "storage/proc.h"
  #include "utils/memutils.h"
  #include "pg_trace.h"
  
+ /* Buffer size required to store a compressed version

Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2015-03-08 Thread Fujii Masao
On Mon, Feb 16, 2015 at 9:08 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2015-02-16 20:55:20 +0900, Michael Paquier wrote:
 On Mon, Feb 16, 2015 at 8:30 PM, Syed, Rahila rahila.s...@nttdata.com
 wrote:

 
   Regarding the sanity checks that have been added recently: I think that
   they are useful, but I suspect as well that only a check on the record
   CRC is done because that's reliable enough and not doing those checks
   accelerates replay a bit. So I am thinking that we should simply replace
   them with assertions.
 
   Removing the checks makes sense as the CRC ensures correctness. Moreover, as
   an error message for an invalid record length is present in the code,
   messages for an invalid block length can be redundant.
 
  Checks have been replaced by assertions in the attached patch.
 

 After more thinking, we may as well simply remove them, as an error on the CRC
 has a high chance of complaining before reaching this point...

 Surely not. The existing code explicitly does it like
 if (blk->has_data && blk->data_len == 0)
 report_invalid_record(state,
   "BKPBLOCK_HAS_DATA set, but no data included at "
 "%X/%X",
   (uint32) (state->ReadRecPtr >> 32), (uint32) 
 state->ReadRecPtr);
 these cross checks are important. And I see no reason to deviate from
 that. The CRC sum isn't foolproof - we intentionally do checks at
 several layers. And, as you can see from some other locations, we
 actually try to *not* fatally error out when hitting them at times - so
 an Assert also is wrong.

 Heikki:
 /* cross-check that the HAS_DATA flag is set iff data_length > 0 */
 if (blk->has_data && blk->data_len == 0)
 report_invalid_record(state,
   "BKPBLOCK_HAS_DATA set, but no data included at "
 "%X/%X",
   (uint32) 
 (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
 if (!blk->has_data && blk->data_len != 0)
 report_invalid_record(state,
  "BKPBLOCK_HAS_DATA not set, but data length is %u at %X/%X",
   (unsigned int) 
 blk->data_len,
   
 (uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
 those look like they're missing a goto err; to me.

Yes. I pushed the fix. Thanks!
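
The shape of that fix, sketched (the committed code may differ in details):
bail out to the error path right after reporting the invalid record.

    if (blk->has_data && blk->data_len == 0)
    {
        report_invalid_record(state,
                              "BKPBLOCK_HAS_DATA set, but no data included at %X/%X",
                              (uint32) (state->ReadRecPtr >> 32),
                              (uint32) state->ReadRecPtr);
        goto err;       /* the missing bail-out noted above */
    }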

Regards,

-- 
Fujii Masao




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2015-03-05 Thread Fujii Masao
On Thu, Mar 5, 2015 at 10:28 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2015-03-05 12:14:04 +, Syed, Rahila wrote:
 Please find attached a patch. As discussed, a flag to denote
 compression and the presence of a hole in the block image has been added in
 XLogRecordImageHeader rather than the block header.

 FWIW, I personally won't commit it with things done that way. I think
 it's going the wrong way, leading to a harder-to-interpret and less
 flexible format.  I'm not going to protest further if Fujii or Heikki
 commits it this way though.

I'm pretty sure that we can discuss the *better* WAL format even after
committing this patch.

Regards,

-- 
Fujii Masao




Re: [HACKERS] WALWriter active during recovery

2015-03-05 Thread Fujii Masao
On Thu, Dec 18, 2014 at 6:43 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Dec 16, 2014 at 3:51 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Currently, WALReceiver writes and fsyncs data it receives. Clearly,
 while we are waiting for an fsync we aren't doing any other useful
 work.

  The following patch starts WALWriter during recovery and makes it
  responsible for fsyncing data, allowing WALReceiver to proceed with other
  useful actions.

With the patch, replication didn't work fine on my machine. I started
the standby server after removing all the WAL files from the standby.
ISTM that the patch doesn't handle that case. That is, in that case,
the standby tries to start up walreceiver and replication to retrieve
the REDO-starting checkpoint record *before* starting up walwriter
(IOW, before reaching the consistent point). Then, since walreceiver works
without walwriter, no received WAL data can be fsync'd on the standby.
So replication cannot advance any further. I think that walwriter needs
to start before walreceiver starts.

I just marked this patch as Waiting on Author.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Table-level log_autovacuum_min_duration

2015-03-05 Thread Fujii Masao
On Thu, Feb 19, 2015 at 3:32 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Thu, Feb 19, 2015 at 2:13 PM, Naoya Anzai
 anzai-na...@mxu.nes.nec.co.jp wrote:
 As a result, I think you should not delete VACOPT_VERBOSE.

 In v8 it is not deleted. It is still declared, and its use is isolated
 in gram.y, similarly to VACOPT_FREEZE.

 According to the last mail I posted, there is clearly a difference
 between the manual-vacuum log and the autovacuum log.

 Did you test the latest patch, v8? I have added checks in it to see if
 a process is an autovacuum worker to control the elevel, and the extra logs
 of v7 do not show up.

 So, at least you should not touch the mechanism of VACOPT_VERBOSE
 until both vacuum log formats are unified into the same format.

 If you mean that we should have the same kind of log output for
 autovacuum and manual vacuum, I think that this is not going to
 happen. Autovacuum entries are kept less verbose on purpose, a contract
 that v7 clearly broke.

 If you agree with me, please undo the removal of VACOPT_VERBOSE.

 Well, I don't agree :) And I am guessing that you did not look at v8
 either. Centralizing the control of logs using log_min_duration is
 more extensible than simply having VACOPT_VERBOSE.

With the patch, VACUUM ANALYZE VERBOSE doesn't emit any verbose message.
Why did you remove that functionality?

Regards,

-- 
Fujii Masao




Re: [HACKERS] Table-level log_autovacuum_min_duration

2015-03-05 Thread Fujii Masao
On Thu, Mar 5, 2015 at 9:49 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Thu, Mar 5, 2015 at 7:10 PM, Fujii Masao wrote:
 With the patch, VACUUM ANALYZE VERBOSE doesn't emit any verbose message.
 Why did you remove that functionality?

 Oops. Sorry about that. In gram.y, the combination of VacuumStmt with
 AnalyzeStmt overwrote the value of log_min_duration incorrectly. I
 also found another bug related to logging of ANALYZE not working
 correctly because of the use of IsAutoVacuumWorkerProcess() instead of
 VACOPT_VERBOSE (this is reducing the diffs of the patch btw). All
 those things are fixed in the attached.

Thanks for updating the patch!

Why does log_min_duration need to be set even when a manual VACUUM command is
executed? Per the latest version of the patch, log_min_duration is checked only
when the process is an autovacuum worker. So ISTM that log_min_duration doesn't
need to be set in gram.y. It's even confusing to me. Or if you're going to
implement something like VACUUM VERBOSE DURATION n (i.e., a verbose message
is output if n seconds have elapsed), that might be necessary, though...

Regards,

-- 
Fujii Masao




Re: [HACKERS] File based Incremental backup v8

2015-03-04 Thread Fujii Masao
On Thu, Mar 5, 2015 at 1:59 AM, Marco Nenciarini
marco.nenciar...@2ndquadrant.it wrote:
 Hi Fujii,

 Il 03/03/15 11:48, Fujii Masao ha scritto:
 On Tue, Mar 3, 2015 at 12:36 AM, Marco Nenciarini
 marco.nenciar...@2ndquadrant.it wrote:
 Il 02/03/15 14:21, Fujii Masao ha scritto:
 On Thu, Feb 12, 2015 at 10:50 PM, Marco Nenciarini
 marco.nenciar...@2ndquadrant.it wrote:
 Hi,

 I've attached an updated version of the patch.

 basebackup.c:1565: warning: format '%lld' expects type 'long long
 int', but argument 8 has type '__off_t'
 basebackup.c:1565: warning: format '%lld' expects type 'long long
 int', but argument 8 has type '__off_t'
 pg_basebackup.c:865: warning: ISO C90 forbids mixed declarations and code


 I'll add an explicit cast at those two lines.

 When I applied the three patches and compiled the code, I got the above 
 warnings.

 How can we get a full backup that we can use for archive recovery from
 the first full backup and subsequent incremental backups? What commands should
 we use for that, for example? It's better to document that.


 I've sent a Python PoC that supports the plain format only (not the tar one).
 I'm currently rewriting it in C (with tar support as well) and I'll send a
 new patch containing it ASAP.

 Yeah, if a special tool is required for that purpose, the patch should include it.
 it.


 I'm working on it. The interface will be exactly the same as that of the PoC
 script I've attached to

 54c7cdad.6060...@2ndquadrant.it

 What does the "1" in the heading line of backup_profile mean?


 Nothing. It's a version number. If you think it's misleading I will remove 
 it.

 A version number of the file format of the backup profile? If it's required for
 the validation of the backup profile file as a safe-guard, it should be included
 in the profile file. For example, it might be useful to check whether the
 pg_basebackup executable is compatible with the source backup that
 you specify. But more info might be needed for such validation.


 The current implementation bails out with an error if the header line is 
 different from what it expects.
 It also reports an error if the 2nd line is not the start WAL location. 
 That's all that pg_basebackup needs to start a new incremental backup. All 
 the other information is useful to reconstruct a full backup in the case of an 
 incremental backup, or maybe to check the completeness of an archived full 
 backup.
 Initially the profile was present only in incremental backups, but after some 
 discussion on the list we agreed to always write it.

Don't we need more checks of the compatibility between the backup-target database
cluster and the source incremental backup? Without such checks, I'm afraid
we can easily get corrupted incremental backups. For example, shouldn't pg_basebackup
emit an error if the target and source have different system IDs,
like walreceiver does? What happens if the timeline ID is different between the
source and target? What happens if the source was taken from the standby but
the new incremental backup will be taken from the master? Do we need to check them?
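
To illustrate, a hypothetical sketch of the kind of safety check meant here,
comparing system identifiers the way walreceiver does; the function and
variable names are invented, and where the identifiers would come from
(backup profile, control file) is an open question:

    static void
    check_incremental_compatibility(uint64 source_sysid, uint64 target_sysid)
    {
        if (source_sysid != target_sysid)
        {
            fprintf(stderr,
                    "source backup and target cluster have different system identifiers\n");
            exit(1);
        }
    }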

 Sorry if this has already been discussed. Why is a backup profile
 file
 necessary? Maybe it will be necessary in the future, but currently it seems not.

 It's necessary because it's the only way to detect deleted files.

 Maybe I'm missing something. It seems we can detect that even without a profile.
 For example, imagine the case where a file has been deleted since
 the last full backup and then an incremental backup is taken. In this case,
 the deleted file exists only in the full backup. We can detect the deletion of
 the file by checking both the full and incremental backups.


 When you take an incremental backup, only changed files are sent. Without the 
 backup_profile in the incremental backup, you cannot detect a deleted file, 
 because it's indistinguishable from a file that has not changed.

Yeah, you're right!

 Have we really gotten consensus about the current design, especially that
 every file basically needs to be read to check whether it has been modified
 since the last backup, even when *no* modification has happened since the last
 backup?

 The real problem here is that there is currently no way to detect that a 
 file has not changed since the last backup. We agreed not to use file system 
 timestamps as they are not reliable for that purpose.

 TBH I prefer a timestamp-based approach in the first version of incremental 
 backup
 even if it's less reliable than the LSN-based one. I think that some users who are
 using timestamp-based rsync (i.e., default mode) for the backup would be
 satisfied with the timestamp-based one.

 The original design was to compare size+timestamp+checksums (only if 
 everything else matches and the file has been modified after the start of the 
 backup), but the feedback from the list was that we cannot trust the 
 filesystem mtime and we must use LSN instead.


 Using LSN has a significant advantage over using checksums, as we can start 
 the full copy as soon as we find a block with an LSN greater than the 
 threshold.
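
For illustration, the check being described could look like this minimal
standalone sketch (it assumes the standard page layout, with pd_lsn as the
first field of the page header; error handling omitted):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

typedef struct
{
    uint32_t    xlogid;
    uint32_t    xrecoff;
} PageXLogRecPtr;

static bool
file_changed_since(FILE *f, uint64_t threshold_lsn)
{
    char        page[BLCKSZ];

    while (fread(page, 1, BLCKSZ, f) == BLCKSZ)
    {
        PageXLogRecPtr pd_lsn;
        uint64_t    page_lsn;

        memcpy(&pd_lsn, page, sizeof(pd_lsn));
        page_lsn = ((uint64_t) pd_lsn.xlogid << 32) | pd_lsn.xrecoff;

        if (page_lsn > threshold_lsn)
            return true;        /* changed: start the full copy right away */
    }
    return false;               /* no newer page found: the file can be skipped */
}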

Re: [HACKERS] Patch: raise default for max_wal_segments to 1GB

2015-03-03 Thread Fujii Masao
On Tue, Mar 3, 2015 at 4:25 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
 On 03/03/2015 01:43 AM, Andres Freund wrote:

 On 2015-03-02 15:40:27 -0800, Josh Berkus wrote:

 ! #max_wal_size = 1GB   # in logfile segments


 Independent of the rest of the changes, the in logfile segments bit
 should probably be changed.


 The base unit is still logfile segments, so if you write just
 max_wal_size = 10 without a unit, that means 160MB. That's what the
 comment means.

 Is it needed? Many settings have a similar comment, but most don't. Most of
 the ones that do have a magic value 0, -1 as default, so that it's not
 obvious from the default what the unit is. For example:

 #tcp_keepalives_idle = 0               # TCP_KEEPIDLE, in seconds;
                                        # 0 selects the system default
 #tcp_keepalives_interval = 0           # TCP_KEEPINTVL, in seconds;
                                        # 0 selects the system default
 #temp_file_limit = -1                  # limits per-session temp file space
                                        # in kB, or -1 for no limit
 #commit_delay = 0                      # range 0-100000, in microseconds
 #log_temp_files = -1                   # log temporary files equal or larger
                                        # than the specified size in kilobytes;
                                        # -1 disables, 0 logs all temp files
 #statement_timeout = 0                 # in milliseconds, 0 is disabled
 #lock_timeout = 0                      # in milliseconds, 0 is disabled
 #wal_keep_segments = 0                 # in logfile segments, 16MB each; 0 disables

 But there are also:

 #wal_receiver_timeout = 60s            # time that receiver waits for
                                        # communication from master
                                        # in milliseconds; 0 disables
 #autovacuum_vacuum_cost_delay = 20ms   # default vacuum cost delay for
                                        # autovacuum, in milliseconds;
                                        # -1 means use vacuum_cost_delay


 I propose that we remove the comment from max_wal_size, and also remove the
 "in milliseconds" from wal_receiver_timeout and
 autovacuum_vacuum_cost_delay.

Seems OK to me. BTW, we can also remove "in milliseconds" from
wal_sender_timeout.

 (and we should also change wal_keep_segments to accept MB/GB, as discussed
 already)

+1

Regards,

-- 
Fujii Masao




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2015-03-03 Thread Fujii Masao
On Tue, Mar 3, 2015 at 9:34 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Tue, Mar 3, 2015 at 9:24 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2015-03-03 08:59:30 +0900, Michael Paquier wrote:
 Already mentioned upthread, but I agree with Fujii-san here: adding
 information related to the state of a block image in
 XLogRecordBlockHeader makes little sense because we are not sure to
 have a block image, perhaps there is only data associated with it, and
 that we should control that exclusively in XLogRecordBlockImageHeader
 and leave the block ID alone for now.

 This argument doesn't make much sense to me. The flag byte could very
 well indicate 'block reference without image following' vs 'block
 reference with data + hole following' vs 'block reference with
 compressed data following'.

 Information about the state of a block is decoupled from its
 existence, aka in the block header, we should control if:
 - record has data
 - record has a block
 And in the block image header, we control if the block is:
 - compressed or not
 - has a hole or not.

Are there any other flag bits that we should add, or are planning to add, into
the WAL header, except the above two? If yes, and they are required even by
a block which doesn't have an image, I will change my mind and agree to
add something like a chunk ID to the block header. But I guess the answer to that
question is no. Since the flag bits we are now thinking to add are required
only by a block having an image, adding them into the block header (instead of
the block image header) seems a waste of bytes in WAL. So I concur with Michael.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Obsolete SnapshotNow reference within snapbuild.c

2015-03-03 Thread Fujii Masao
On Wed, Mar 4, 2015 at 10:31 AM, Peter Geoghegan p...@heroku.com wrote:
 SnapBuildCommitTxn() has what I gather is an obsolete reference to
 SnapshotNow(). Attached patch corrects this.

Pushed. Thanks!

Regards,

-- 
Fujii Masao




Re: [HACKERS] File based Incremental backup v8

2015-03-03 Thread Fujii Masao
On Tue, Mar 3, 2015 at 12:36 AM, Marco Nenciarini
marco.nenciar...@2ndquadrant.it wrote:
 On 02/03/15 14:21, Fujii Masao wrote:
 On Thu, Feb 12, 2015 at 10:50 PM, Marco Nenciarini
 marco.nenciar...@2ndquadrant.it wrote:
 Hi,

 I've attached an updated version of the patch.

 basebackup.c:1565: warning: format '%lld' expects type 'long long
 int', but argument 8 has type '__off_t'
 basebackup.c:1565: warning: format '%lld' expects type 'long long
 int', but argument 8 has type '__off_t'
 pg_basebackup.c:865: warning: ISO C90 forbids mixed declarations and code


 I'll add an explicit cast at those two lines.

 When I applied three patches and compiled the code, I got the above warnings.

 How can we get the full backup that we can use for the archive recovery, from
 the first full backup and subsequent incremental backups? What commands 
 should
 we use for that, for example? It's better to document that.


 I've sent a python PoC that supports the plain format only (not the tar one).
 I'm currently rewriting it in C (also with tar support) and I'll send a 
 new patch containing it ASAP.

Yeah, if a special tool is required for that purpose, the patch should include it.

 What does the 1 in the heading line of backup_profile mean?


 Nothing. It's a version number. If you think it's misleading I will remove it.

A version number for the file format of the backup profile? If it's required for
the validation of backup profile file as a safe-guard, it should be included
in the profile file. For example, it might be useful to check whether
pg_basebackup executable is compatible with the source backup that
you specify. But more info might be needed for such validation.

 Sorry if this has already been discussed. Why is a backup profile file
 necessary? Maybe it will be necessary in the future, but currently it seems not.

 It's necessary because it's the only way to detect deleted files.

Maybe I'm missing something. Seems we can detect that even without a profile.
For example, please imagine the case where the file has been deleted since
the last full backup and then the incremental backup is taken. In this case,
that deleted file exists only in the full backup. We can detect the deletion of
the file by checking both full and incremental backups.

 Have we really reached consensus about the current design, especially that
 every file basically needs to be read to check whether it has been 
 modified
 since the last backup, even when *no* modification has happened since the last backup?

 The real problem here is that there is currently no way to detect that a file 
 has not changed since the last backup. We agreed not to use file system 
 timestamps as they are not reliable for that purpose.

TBH I prefer a timestamp-based approach in the first version of incremental backup
even if it's less reliable than the LSN-based one. I think that some users who are
using timestamp-based rsync (i.e., default mode) for the backup would be
satisfied with the timestamp-based one.

 Using LSN has a significant advantage over using checksums, as we can start 
 the full copy as soon as we find a block with an LSN greater than the 
 threshold.
 There are two cases: 1) the file is changed, so we can assume that we detect 
 it after reading 50% of the file, then we send it taking advantage of the file 
 system cache; 2) the file is not changed, so we read it without sending 
 anything.
 It will end up producing an I/O load comparable to a normal backup.

Yeah, it might make the situation better than today. But I'm afraid that
many users might get disappointed about that behavior of an incremental
backup after the release...

Regards,

-- 
Fujii Masao




Re: [HACKERS] Patch: raise default for max_wal_segments to 1GB

2015-03-03 Thread Fujii Masao
On Tue, Mar 3, 2015 at 8:51 AM, Josh Berkus j...@agliodbs.com wrote:
 On 03/02/2015 03:43 PM, Andres Freund wrote:
 Hi,

 On 2015-03-02 15:40:27 -0800, Josh Berkus wrote:
 ! #max_wal_size = 1GB    # in logfile segments

 Independent of the rest of the changes, the in logfile segments bit
 should probably be changed.

 Point!  Although I think it's fair to point out that that wasn't my
 omission, but Heikki's.

 Version C, with that correction, attached.

Minor comments:

"The default settings are 5 minutes and 128 MB, respectively." in wal.sgml
needs to be updated.


 int        max_wal_size = 8;        /* 128 MB */

It's better to update the above code in xlog.c. That's not essential, though.
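
For example, assuming the 16MB segment size stays the default, the updated
line would presumably become:

    int        max_wal_size = 64;       /* 1 GB */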

Regards,

-- 
Fujii Masao




Re: [HACKERS] File based Incremental backup v8

2015-03-02 Thread Fujii Masao
On Thu, Feb 12, 2015 at 10:50 PM, Marco Nenciarini
marco.nenciar...@2ndquadrant.it wrote:
 Hi,

 I've attached an updated version of the patch.

basebackup.c:1565: warning: format '%lld' expects type 'long long
int', but argument 8 has type '__off_t'
basebackup.c:1565: warning: format '%lld' expects type 'long long
int', but argument 8 has type '__off_t'
pg_basebackup.c:865: warning: ISO C90 forbids mixed declarations and code

When I applied three patches and compiled the code, I got the above warnings.

How can we get the full backup that we can use for the archive recovery, from
the first full backup and subsequent incremental backups? What commands should
we use for that, for example? It's better to document that.

What does the 1 in the heading line of backup_profile mean?

Sorry if this has already been discussed. Why is a backup profile file
necessary? Maybe it will be necessary in the future, but currently it seems not.
Several pieces of info like LSN, modification time, size, etc. are tracked in a backup
profile file for every backup file, but they are not used for now. If it's not
required now, I'm inclined to remove it to simplify the code.

Have we really reached consensus about the current design, especially that
every file basically needs to be read to check whether it has been modified
since the last backup, even when *no* modification has happened since the last backup?

Regards,

-- 
Fujii Masao




Re: [HACKERS] remove pg_standby?

2015-03-02 Thread Fujii Masao
On Mon, Mar 2, 2015 at 8:37 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
 On 03/02/2015 11:53 AM, Fujii Masao wrote:

 On Sat, Feb 28, 2015 at 5:00 AM, Josh Berkus j...@agliodbs.com wrote:

 On 11/10/2014 10:54 AM, Magnus Hagander wrote:

 On Mon, Nov 10, 2014 at 7:48 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:

 pg_standby is more configurable than the built-in standby_mode=on. You
 can
 set the sleep time, for example, while standby_mode=on uses a
 hard-coded
 delay of 5 s. And pg_standby has a configurable maximum wait time. And
 as
 Fujii pointed out, the built-in system will print an annoying message
 to the
 log every time it attempts to restore a file. Nevertheless, 99% of
 users
 would probably be happy with the built-in thing.


 As long as pg_standby has features that are actually useful and that
 are not in the built-in system, we shouldn't remove it. We should,
 however, try to fix those in the main system so we can get rid of it
 after that :)


 As of current 9.5, we have configurable retries and standby delay in
 mainstream.  Is there some reason we still need pg_standby?


 Yes, it's not easy to perform fast failover without pg_standby for now.


 What is fast failover?

Per the documentation:

--
In fast failover, the server is brought up immediately. Any WAL files
in the archive that have not yet been applied will be ignored, and all
transactions in those files are lost. To trigger a fast failover,
create a trigger file and write the word fast into it. pg_standby can
also be configured to execute a fast failover automatically if no new
WAL file appears within a defined interval.
--

Regards,

-- 
Fujii Masao




Re: [HACKERS] remove pg_standby?

2015-03-02 Thread Fujii Masao
On Sat, Feb 28, 2015 at 5:00 AM, Josh Berkus j...@agliodbs.com wrote:
 On 11/10/2014 10:54 AM, Magnus Hagander wrote:
 On Mon, Nov 10, 2014 at 7:48 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 pg_standby is more configurable than the built-in standby_mode=on. You can
 set the sleep time, for example, while standby_mode=on uses a hard-coded
 delay of 5 s. And pg_standby has a configurable maximum wait time. And as
 Fujii pointed out, the built-in system will print an annoying message to the
 log every time it attempts to restore a file. Nevertheless, 99% of users
 would probably be happy with the built-in thing.

 As long as pg_standby has features that are actually useful and that
 are not in the built-in system, we shouldn't remove it. We should,
 however, try to fix those in the main system so we can get rid of it
 after that :)

 As of current 9.5, we have configurable retries and standby delay in
 mainstream.  Is there some reason we still need pg_standby?

Yes, it's not easy to perform fast failover without pg_standby for now.

Regards,

-- 
Fujii Masao




Re: [HACKERS] pg_basebackup may fail to send feedbacks.

2015-03-02 Thread Fujii Masao
On Tue, Feb 24, 2015 at 6:44 PM, Kyotaro HORIGUCHI
horiguchi.kyot...@lab.ntt.co.jp wrote:
 Hello, the attached is the v4 patch that checks feedback timing
 at every WAL segment boundary.

 At Fri, 20 Feb 2015 17:29:14 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 horiguchi.kyot...@lab.ntt.co.jp wrote in 
 20150220.172914.241732690.horiguchi.kyot...@lab.ntt.co.jp
  Some users may complain about the performance impact by such
  frequent calls and we may want to get rid of them from
  walreceiver loop in the future.  If we adopt your idea now,
  I'm afraid that it would tie our hands in that case.
 
  How much impact can such frequent calls of gettimeofday()
  have on replication performance? If it's not negligible,
  probably we should remove them at first and find out another
  idea to fix the problem you pointed. ISTM that it's not so
  difficult to remove them. Thought? Do you have any numbers
  which can prove that such frequent gettimeofday() has only
  ignorable impact on the performance?

 The attached patch is 'the more sober' version of the SIGALRM patch.

 I said that checking whether to send feedback only at every boundary
 between WAL segments seemed too infrequent, but after some rethinking I
 changed my mind.

  - The largest possible delay source in the busy-receive loop
    is fsyncing the just-written WAL segment file on close; this can
    take several seconds.  A freeze longer than the timeout
    interval cannot be avoided and is not worth avoiding anyway.

  - A 16MB disk-write interval between gettimeofday() calls seems
    to be gentle enough even on platforms where gettimeofday() is
    rather heavy.

Sounds reasonable to me.

So we don't need to address the problem on the walreceiver side so proactively,
because it tries to send the feedback every time it flushes the WAL records.
IOW, the problem you observed is less likely to happen there. Right?

+now = feGetCurrentTimestamp();
+if (standby_message_timeout > 0 &&

Minor comment: shouldn't feGetCurrentTimestamp() be called after the check of
standby_message_timeout > 0, to avoid unnecessary calls?
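
Something like this is what I mean, as a standalone sketch (fe_now() and
fe_exceeds() are stand-ins for feGetCurrentTimestamp() and
feTimestampDifferenceExceeds()):

#include <stdbool.h>
#include <stdint.h>
#include <sys/time.h>

typedef int64_t TimestampTz;

static TimestampTz
fe_now(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (TimestampTz) tv.tv_sec * 1000000 + tv.tv_usec;
}

static bool
fe_exceeds(TimestampTz start, TimestampTz stop, int msec)
{
    return (stop - start) / 1000 >= msec;
}

static bool
feedback_due(TimestampTz *last_status, int standby_message_timeout)
{
    TimestampTz now;

    /* test the setting first: no clock read when the timeout is disabled */
    if (standby_message_timeout <= 0)
        return false;

    now = fe_now();
    if (fe_exceeds(*last_status, now, standby_message_timeout))
    {
        *last_status = now;
        return true;
    }
    return false;
}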

 ProcessXLogDataMsg(PGconn *conn, char *copybuf, int len,
XLogRecPtr *blockpos, uint32 timeline,
char *basedir, stream_stop_callback stream_stop,
-   char *partial_suffix, bool mark_done)
+   char *partial_suffix, bool mark_done,
+   int standby_message_timeout, int64 *last_status)

Maybe it's time to refactor this ugly coding (i.e., currently many arguments
need to be passed to each function)...

Regards,

-- 
Fujii Masao




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2015-03-02 Thread Fujii Masao
On Fri, Feb 27, 2015 at 12:44 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Fri, Feb 27, 2015 at 8:01 AM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Fri, Feb 27, 2015 at 6:54 AM, Rahila Syed rahilasye...@gmail.com wrote:
   Even this patch doesn't work correctly. The standby emits the following 
   error messages.

 Yes this bug remains unsolved. I am still working on resolving this.

 Following chunk IDs have been added in the attached patch as suggested
 upthread.
 +#define XLR_CHUNK_BLOCK_REFERENCE  0x10
 +#define XLR_CHUNK_BLOCK_HAS_IMAGE  0x04
 +#define XLR_CHUNK_BLOCK_HAS_DATA   0x08

 XLR_CHUNK_BLOCK_REFERENCE denotes chunk ID of block references.
 XLR_CHUNK_BLOCK_HAS_IMAGE is a replacement of BKPBLOCK_HAS_IMAGE
 and XLR_CHUNK_BLOCK_HAS_DATA a replacement of BKPBLOCK_HAS_DATA.

 Before sending a new version, be sure that this gets fixed by, for
 example, building a master with a standby replaying WAL and running
 make installcheck-world or similar. If the standby does not complain
 at all, you have a good chance of not having bugs. You could also build
 with WAL_DEBUG to check record consistency.

+1

When I test the WAL or replication related features, I usually run
make installcheck and pgbench against the master at the same time
after setting up the replication environment.

 typedef struct XLogRecordBlockHeader
 {
+    uint8        chunk_id;        /* xlog fragment id */
     uint8        id;              /* block reference ID */

Seems this increases the header size of a WAL record even if no backup block
image is included. Right? Isn't it better to add the flag info about the backup
block image into XLogRecordBlockImageHeader rather than XLogRecordBlockHeader?
Originally we borrowed one or two bits from its existing fields to minimize
the header size, but we can just add a new flag field if we prefer
extensibility and readability of the code.

Regards,

-- 
Fujii Masao




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2015-02-26 Thread Fujii Masao
On Tue, Feb 24, 2015 at 6:46 PM, Syed, Rahila rahila.s...@nttdata.com wrote:
 Hello ,

I've not read this logic yet, but ISTM there is a bug in that new WAL format 
because I got the following error and the startup process could not replay 
any WAL records when I set up replication and enabled wal_compression.

LOG:  record with invalid length at 0/3B0
LOG:  record with invalid length at 0/3000518
LOG:  Invalid block length in record 0/30005A0
LOG:  Invalid block length in record 0/3000D60 ...

 Please find attached a patch which replays WAL records.

Even this patch doesn't work correctly. The standby emits the following
error messages.

LOG:  invalid block_id 255 at 0/3B0
LOG:  record with invalid length at 0/30017F0
LOG:  invalid block_id 255 at 0/3001878
LOG:  record with invalid length at 0/30027D0
LOG:  record with invalid length at 0/3002E58
...

Regards,

-- 
Fujii Masao




Re: [HACKERS] Refactoring GUC unit conversions

2015-02-26 Thread Fujii Masao
On Fri, Feb 13, 2015 at 10:26 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 In the redesign checkpoint_segments patch, Robert suggested keeping the
 settings' base unit as number of segments, but allow conversions from MB,
 GB etc. I started looking into that and found that adding a new unit to
 guc.c is quite messy. The conversions are done with complicated
 if-switch-case constructs.

 Attached is a patch to refactor that, making the conversions table-driven.
 This doesn't change functionality, it just makes the code nicer.

Isn't it a good idea to allow wal_keep_segments, too, to be converted from MB, GB, etc?
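
With the table-driven conversions that should be mostly mechanical. For
illustration, a sketch (the struct, flag values and entries here are
illustrative, not the patch's exact contents):

#define GUC_UNIT_KB     0x1000  /* stand-in flag values */
#define GUC_UNIT_XSEGS  0x2000  /* base unit of 16MB WAL segments */

typedef struct
{
    const char *unit;           /* unit as spelled in postgresql.conf */
    int         base_unit;      /* base unit of the setting */
    int         multiplier;     /* multiply by this, or divide if negative */
} unit_conversion;

static const unit_conversion memory_units[] =
{
    {"GB", GUC_UNIT_KB, 1024 * 1024},
    {"MB", GUC_UNIT_KB, 1024},
    {"kB", GUC_UNIT_KB, 1},
    /* letting wal_keep_segments accept MB/GB is then just more entries: */
    {"GB", GUC_UNIT_XSEGS, 64},     /* 1 GB = 64 segments of 16MB */
    {"MB", GUC_UNIT_XSEGS, -16},    /* divide: one segment per 16 MB */
    {NULL, 0, 0}
};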

Regards,

-- 
Fujii Masao




Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-02-25 Thread Fujii Masao
On Tue, Feb 24, 2015 at 1:29 AM, David Steele da...@pgmasters.net wrote:
 On 2/18/15 10:25 AM, David Steele wrote:
 On 2/18/15 6:11 AM, Fujii Masao wrote:
 The pg_audit doesn't log BIND parameter values when a prepared statement is 
 used.
 Seems this is an oversight of the patch. Or is this intentional?

 It's actually intentional - following the model I talked about in my
 earlier emails, the idea is to log statements only.  This also follows
 on 2ndQuadrant's implementation.

 Unfortunately, I think it's beyond the scope of this module to log bind
 variables.

Maybe I can live with that at least in the first version.

 I'm following not only 2ndQuadrant's implementation, but
 Oracle's as well.

Oracle's audit_trail (e.g., = db, extended) can log even bind values.
log_statement=on in PostgreSQL can also log bind values.
Maybe we can reuse the same technique that log_statement uses.

 Logging values is interesting, but I'm sure the user would want to
 specify which columns to log, which I felt was beyond the scope of the
 patch.

 The pg_audit cannot log statements like SELECT 1 which don't access 
 any
 database object. Is this intentional? I think that there are many users 
 who
 want to audit even such statements.

 I think I see how to make this work.  I'll work on it for the next
 version of the patch.

 This has been fixed in the v2 patch.

Thanks!

 Imagine the case where you call a user-defined function which executes
 many nested statements. In this case, pg_audit logs only the top-level statement
 (i.e., the one issued directly by the client) every time a nested statement is executed.
 In fact, one call of such a UDF can cause lots of *identical* log messages. I think
 this is problematic.

 I agree - not sure how to go about addressing it, though.  I've tried to
 cut down on the verbosity of the logging in general, but of course it
 can still be a problem.

 Using security definer and a different logging GUC for the defining role
 might work.  I'll add that to my unit tests and see what happens.

 That didn't work - but I didn't really expect it to.

 Here are two options I thought of:

 1) Follow Oracle's "as session" option and only log each statement type
 against an object the first time it happens in a session.  This would
 greatly reduce logging, but might be too little detail.  It would
 increase the memory footprint of the module to add the tracking.

 2) Only log once per call to the backend.  Essentially, we would only
 log the statement you see in pg_stat_activity.  This could be a good
 option because it logs what the user accesses directly, rather than
 functions, views, etc. which hopefully are already going through a
 review process and can be audited that way.

 Would either of those address your concerns?

Before discussing how to implement this, we probably need to consider the
spec. For example, should we log even nested statements for
audit purposes? If yes, how should we treat the case where
the user changes the setting so that only DDL is logged, and then
a user-defined function which internally executes DDL is called?
Since the top-level SQL (calling the function) is not a target of
the audit, should we not log even the nested DDL?

Regards,

-- 
Fujii Masao




Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-02-25 Thread Fujii Masao
On Thu, Feb 26, 2015 at 1:40 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Fujii Masao wrote:
 On Tue, Feb 24, 2015 at 1:29 AM, David Steele da...@pgmasters.net wrote:

  1) Follow Oracle's "as session" option and only log each statement type
  against an object the first time it happens in a session.  This would
  greatly reduce logging, but might be too little detail.  It would
  increase the memory footprint of the module to add the tracking.

 Doesn't this mean that a user could conceal malicious activity simply by
 doing an innocuous query that silences all further activity?

  2) Only log once per call to the backend.  Essentially, we would only
  log the statement you see in pg_stat_activity.  This could be a good
  option because it logs what the user accesses directly, rather than
  functions, views, etc. which hopefully are already going through a
  review process and can be audited that way.

 ... assuming the user does not create stuff on the fly behind your back.

  Would either of those address your concerns?

 Before discussing how to implement, probably we need to consider the
 spec about this. For example, should we log even nested statements for
 the audit purpose? If yes, how should we treat the case where
 the user changes the setting so that only DDL is logged, and then
 the user-defined function which internally executes DDL is called?
 Since the top-level SQL (calling the function) is not the target of
 audit, we should not log even the nested DDL?

 Clearly if you log only DROP TABLE, and then the malicious user hides
 one such call inside a function that executes the drop (which is called
 via a SELECT top-level SQL), you're not going to be happy.

Yep, so what SQL should be logged in this case? Only the targeted nested DDL?
Both the top and nested ones? The latter seems better to me.

What about the case where function A calls function B, which executes DDL?
Should every ancestor SQL of the targeted DDL be logged? Probably yes.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Add generate_series(numeric, numeric)

2015-02-25 Thread Fujii Masao
On Wed, Jan 14, 2015 at 11:04 AM, Ali Akbar the.ap...@gmail.com wrote:

 2014-12-18 19:35 GMT+07:00 Fujii Masao masao.fu...@gmail.com:

 On Mon, Dec 15, 2014 at 2:38 PM, Andrew Gierth
 and...@tao11.riddles.org.uk wrote:
  I was thinking something like this, added just after that para:
 
   <warning>
    <para>
 While the actual arguments to the function remain unchanged
   between
 calls, if you detoast the argument values (which is normally done
 transparently by the
 <function>PG_GETARG_<replaceable>xxx</replaceable></function>
   macro)
 in the transient context then the detoasted copies will be freed
   on
 each cycle. Accordingly, if you keep references to such values in
 your <structfield>user_fctx</>, you must either copy them into the
 <structfield>multi_call_memory_ctx</> after detoasting, or ensure
 that you detoast the values only in that context.
    </para>
   </warning>

 I'm OK with this.


 Wrapping the doc changes in a patch. Will add to next commitfest so it won't
 be lost.

Pushed. Thanks!

Regards,

-- 
Fujii Masao




Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-23 Thread Fujii Masao
On Fri, Feb 20, 2015 at 9:33 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Fri, Feb 20, 2015 at 9:12 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Feb 10, 2015 at 10:32 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Mon, Feb 9, 2015 at 8:29 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sun, Feb 8, 2015 at 2:54 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote:
 - * Wait for more WAL to arrive. Time out after 5 
 seconds,
 - * like when polling the archive, to react to a 
 trigger
 - * file promptly.
 + * Wait for more WAL to arrive. Time out after
 the amount of
 + * time specified by wal_retrieve_retry_interval, 
 like
 + * when polling the archive, to react to a
 trigger file promptly.
   */
  WaitLatch(XLogCtl->recoveryWakeupLatch,
WL_LATCH_SET | WL_TIMEOUT,
 -  5000L);
 +  wal_retrieve_retry_interval * 1000L);

 This change can prevent the startup process from reacting to
 a trigger file. Imagine the case where the large interval is set
 and the user want to promote the standby by using the trigger file
 instead of pg_ctl promote. I think that the sleep time should be 5s
 if the interval is set to more than 5s. Thought?

 I disagree here. It is interesting to accelerate the check of WAL
 availability from a source in some cases for replication, but the
 opposite is true as well as mentioned by Alexey at the beginning of
 the thread to reduce the number of requests when requesting WAL
 archives from an external storage type AWS. Hence a correct solution
 would be to check periodically for the trigger file with a maximum
 one-time wait of 5s to ensure backward-compatible behavior. We could
 reduce it to 1s or something like that as well.

 You seem to have misunderstood the code in question. Or I'm missing 
 something.
 The timeout of the WaitLatch is just the interval to check for the trigger 
 file
 while waiting for more WAL to arrive from streaming replication. Not 
 related to
 the retry time to restore WAL from the archive.

 [Re-reading the code...]
 Aah.. Yes you are right. Sorry for the noise. Yes let's wait for a
  maximum of 5s then. I also noticed in the previous patch that the wait was
  maximized to 5s. To begin with, a loop should have been used if it was
  a sleep, but as the patch now uses a latch this limit does not make much
  sense... An updated patch is attached.

 On second thought, the interval to check the trigger file is very different
 from the wait time to retry to retrieve WAL. So it seems strange and even
 confusing to control them by one parameter. If we really want to make
 the interval for the trigger file configurable, we should invent a new GUC for 
 it.
 But I don't think that it's worth doing that. If someone wants to react to the
 trigger file more promptly for fast promotion, he or she can basically use
 the pg_ctl promote command instead. Thought?

 Hm, OK.

 Attached is the updated version of the patch. I changed the parameter so that
 it doesn't affect the interval of checking the trigger file.

 -static pg_time_t last_fail_time = 0;
 -pg_time_t now;
 +TimestampTz now = GetCurrentTimestamp();
 +TimestampTz last_fail_time = now;

 I reverted this code to what it was. I don't think GetCurrentTimestamp() needs
 to be called for each WaitForWALToBecomeAvailable().

 +WaitLatch(XLogCtl->recoveryWakeupLatch,
 +  WL_LATCH_SET | WL_TIMEOUT,
 +  wait_time / 1000);

 Don't we need to specify WL_POSTMASTER_DEATH flag here? Added.

 Yeah, I am wondering why this has not been added after 89fd72cb though.

 +{"wal_retrieve_retry_interval", PGC_SIGHUP, WAL_SETTINGS,

 WAL_SETTINGS should be REPLICATION_STANDBY? Changed.

 Sure.

So I pushed the patch.

Regards,

-- 
Fujii Masao




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2015-02-23 Thread Fujii Masao
On Mon, Feb 23, 2015 at 5:28 PM, Rahila Syed rahilasye...@gmail.com wrote:
 Hello,

 Attached is a patch which has following changes,

 As suggested above, the block ID in xlog structs has been replaced by a chunk ID.
 Chunk ID is used to distinguish between different types of xlog record
 fragments.
 Like,
 XLR_CHUNK_ID_DATA_SHORT
 XLR_CHUNK_ID_DATA_LONG
 XLR_CHUNK_BKP_COMPRESSED
 XLR_CHUNK_BKP_WITH_HOLE

 In block references, block ID follows the chunk ID. Here block ID retains
 its functionality.
 This approach increases data by 1 byte for each block reference in an xlog
 record. It separates the IDs referring to different fragments of an xlog
 record from the actual block ID, which is used to refer to block references in
 the xlog record.

I've not read this logic yet, but ISTM there is a bug in that new WAL format
because I got the following error and the startup process could not replay
any WAL records when I set up replication and enabled wal_compression.

LOG:  record with invalid length at 0/3B0
LOG:  record with invalid length at 0/3000518
LOG:  Invalid block length in record 0/30005A0
LOG:  Invalid block length in record 0/3000D60
...

Regards,

-- 
Fujii Masao




Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-20 Thread Fujii Masao
On Tue, Feb 10, 2015 at 10:32 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Mon, Feb 9, 2015 at 8:29 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sun, Feb 8, 2015 at 2:54 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote:
 - * Wait for more WAL to arrive. Time out after 5 
 seconds,
 - * like when polling the archive, to react to a 
 trigger
 - * file promptly.
 + * Wait for more WAL to arrive. Time out after
 the amount of
 + * time specified by wal_retrieve_retry_interval, like
 + * when polling the archive, to react to a
 trigger file promptly.
   */
  WaitLatch(XLogCtl->recoveryWakeupLatch,
WL_LATCH_SET | WL_TIMEOUT,
 -  5000L);
 +  wal_retrieve_retry_interval * 1000L);

 This change can prevent the startup process from reacting to
 a trigger file. Imagine the case where the large interval is set
 and the user want to promote the standby by using the trigger file
 instead of pg_ctl promote. I think that the sleep time should be 5s
 if the interval is set to more than 5s. Thought?

 I disagree here. It is interesting to accelerate the check of WAL
 availability from a source in some cases for replication, but the
 opposite is true as well as mentioned by Alexey at the beginning of
 the thread to reduce the number of requests when requesting WAL
 archives from an external storage like AWS. Hence a correct solution
 would be to check periodically for the trigger file with a maximum
 one-time wait of 5s to ensure backward-compatible behavior. We could
 reduce it to 1s or something like that as well.

 You seem to have misunderstood the code in question. Or I'm missing 
 something.
 The timeout of the WaitLatch is just the interval to check for the trigger 
 file
 while waiting for more WAL to arrive from streaming replication. Not related 
 to
 the retry time to restore WAL from the archive.

 [Re-reading the code...]
 Aah.. Yes you are right. Sorry for the noise. Yes let's wait for a
 maximum of 5s then. I also noticed in the previous patch that the wait was
 maximized to 5s. To begin with, a loop should have been used if it was
 a sleep, but as the patch now uses a latch this limit does not make much
 sense... An updated patch is attached.

On second thought, the interval to check the trigger file is very different
from the wait time to retry to retrieve WAL. So it seems strange and even
confusing to control them by one parameter. If we really want to make
the interval for the trigger file configurable, we should invent a new GUC for it.
But I don't think that it's worth doing that. If someone wants to react to the
trigger file more promptly for fast promotion, he or she can basically use
the pg_ctl promote command instead. Thought?

Attached is the updated version of the patch. I changed the parameter so that
it doesn't affect the interval of checking the trigger file.

-static pg_time_t last_fail_time = 0;
-pg_time_t now;
+TimestampTz now = GetCurrentTimestamp();
+TimestampTz last_fail_time = now;

I reverted this code to what it was. I don't think GetCurrentTimestamp() needs
to be called for each WaitForWALToBecomeAvailable().

+WaitLatch(XLogCtl->recoveryWakeupLatch,
+  WL_LATCH_SET | WL_TIMEOUT,
+  wait_time / 1000);

Don't we need to specify WL_POSTMASTER_DEATH flag here? Added.

+{"wal_retrieve_retry_interval", PGC_SIGHUP, WAL_SETTINGS,

WAL_SETTINGS should be REPLICATION_STANDBY? Changed.

Regards,

-- 
Fujii Masao
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 2985,2990  include_dir 'conf.d'
--- 2985,3008 
    </listitem>
   </varlistentry>
  
+  <varlistentry id="guc-wal-retrieve-retry-interval" xreflabel="wal_retrieve_retry_interval">
+   <term><varname>wal_retrieve_retry_interval</varname> (<type>integer</type>)
+   <indexterm>
+    <primary><varname>wal_retrieve_retry_interval</> configuration parameter</primary>
+   </indexterm>
+   </term>
+   <listitem>
+    <para>
+     Specify the amount of time, in milliseconds, to wait when WAL is not
+     available from any sources (streaming replication,
+     local <filename>pg_xlog</> or WAL archive) before retrying to
+     retrieve WAL.  This parameter can only be set in the
+     <filename>postgresql.conf</> file or on the server command line.
+     The default value is 5 seconds.
+    </para>
+   </listitem>
+  </varlistentry>
+ 
   </variablelist>
  </sect2>
 </sect1>
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 93,98  int			sync_method = DEFAULT_SYNC_METHOD;
--- 93,99 
  int			wal_level

Re: [HACKERS] pg_basebackup may fail to send feedbacks.

2015-02-19 Thread Fujii Masao
On Wed, Feb 18, 2015 at 5:34 PM, Kyotaro HORIGUCHI
horiguchi.kyot...@lab.ntt.co.jp wrote:
 Hello, this is the last patch for pg_basebackup/pg_receivexlog on
 master (9.5). Prior versions don't have this issue.

 4. basebackup_reply_fix_mst_v2.patch
   receivelog.c patch applyable on master.

 This is based on the same design as
 walrcv_reply_fix_91_v2.patch with respect to gettimeofday().

Thanks for updating the patches! But I'm still not sure whether the idea of
depending on frequent calls of gettimeofday() for each WAL receive is good or not.
Some users may complain about the performance impact of such frequent calls,
and we may want to get rid of them from the walreceiver loop in the future.
If we adopt your idea now, I'm afraid that it would tie our hands in that case.

How much impact can such frequent calls of gettimeofday() have on replication
performance? If it's not negligible, probably we should remove them first
and find another way to fix the problem you pointed out. ISTM that it's not so
difficult to remove them. Thoughts? Do you have any numbers which can prove
that such frequent gettimeofday() calls have only an ignorable impact on performance?
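
For what it's worth, the raw per-call cost is easy to measure with a
trivial standalone program like the one below (the numbers will of course
vary by platform and clock source):

#include <stdio.h>
#include <sys/time.h>

int
main(void)
{
    struct timeval start, stop, tmp;
    const long  n = 10 * 1000 * 1000;
    double      elapsed_us;
    long        i;

    gettimeofday(&start, NULL);
    for (i = 0; i < n; i++)
        gettimeofday(&tmp, NULL);
    gettimeofday(&stop, NULL);

    elapsed_us = (stop.tv_sec - start.tv_sec) * 1e6 +
                 (stop.tv_usec - start.tv_usec);
    printf("%.1f ns per gettimeofday() call\n", elapsed_us * 1000.0 / n);
    return 0;
}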

Regards,

-- 
Fujii Masao




Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-02-18 Thread Fujii Masao
On Wed, Feb 18, 2015 at 1:26 AM, David Steele da...@pgmasters.net wrote:
 On 2/17/15 10:23 AM, Simon Riggs wrote:
 I vote to include pgaudit in 9.5, albeit with any changes. In
 particular, David may have some changes to recommend, but I haven't
 seen a spec or a patch, just a new version of code (which isn't how we
 do things...).

 I submitted the new patch in my name under a separate thread "Auditing
 extension for PostgreSQL (Take 2)" (54e005cc.1060...@pgmasters.net)

I played with the patch version of pg_audit a bit and have some basic comments about
its spec.

The pg_audit doesn't log BIND parameter values when a prepared statement is used.
Seems this is an oversight of the patch. Or is this intentional?

The pg_audit cannot log statements like SELECT 1 which don't access any
database object. Is this intentional? I think that there are many users who
want to audit even such statements.

Imagine the case where you call a user-defined function which executes
many nested statements. In this case, pg_audit logs only the top-level statement
(i.e., the one issued directly by the client) every time a nested statement is executed.
In fact, one call of such a UDF can cause lots of *identical* log messages. I think
this is problematic.

Regards,

-- 
Fujii Masao




Re: [HACKERS] pg_basebackup -x/X doesn't play well with archive_mode & wal_keep_segments

2015-02-16 Thread Fujii Masao
On Fri, Feb 13, 2015 at 9:18 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2015-02-12 11:44:05 -0800, Sergey Konoplev wrote:
 On Thu, Feb 12, 2015 at 11:40 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  This obviously should not be the case. I'll have a look in a couple of 
  hours. Until then you can likely  just work around the problem by creating 
  the archive_status directory.

 Thank you. Just let me know if you need some extra info or debugging.

 No need for debugging. It's plainly and simply a (cherry-pick) conflict I
 resolved wrongly during backpatching. 9.3, 9.4 and master do not have
 that problem. That whole fix was quite painful because every single
 release had significantly different code :(. pg_basebackup/ is pretty
 messy.
 I'm not sure why my testsuite didn't trigger that problem. Possibly
 because a retry makes things work :(

 Somewhat luckily it's 9.2 only (9.3, 9.4 and master look correct; earlier
 releases don't have pg_receivexlog)

Are you planning to back-patch the fix to 9.2?

Regards,

-- 
Fujii Masao




Re: [HACKERS] pg_basebackup may fail to send feedbacks.

2015-02-13 Thread Fujii Masao
On Tue, Feb 10, 2015 at 7:48 PM, Kyotaro HORIGUCHI
horiguchi.kyot...@lab.ntt.co.jp wrote:
 Hello,

 The attached patch is a fix for walreceiver not using gettimeofday(),
 and a fix for receivelog using it.

  XLogWalRcvProcessMsg doesn't send feedback when processing
  'w'=WAL record packet. So the same thing as pg_basebackup and
  pg_receivexlog will occur on walsender.
 
  Exiting the for(;;) loop by TimestampDifferenceExceeds just
  before line 439 is an equivalent measure to what I proposed for
  receivelog.c, but calling XLogWalRcvSendHSFeedback(false) there
  is seemingly simpler (but I feel a bit uncomfortable for the
  latter)

 I'm concerned about the performance degradation caused by calling
 gettimeofday() per record.

  Or other measures?

 Introduce the maximum number of records that we can receive and
 process between feedbacks? For example, we can change
 XLogWalRcvSendHSFeedback() so that it's called at most once per
 8 records. I'm not sure what number is good, though...

 At the beginning of the while (len > 0) { if (len > 0) { block,
 last_recv_timestamp is always kept up to date (by using
 gettimeofday()). So we can use the variable instead of
 gettimeofday() in my previous proposal.

Yes, but something like last_recv_timestamp doesn't exist in
REL9_1_STABLE. So you don't think that your proposed fix
should be back-patched to all supported versions. Right?

 The start time of the timeout could be last_recv_timestamp at the
 beginning of the while loop, since it is a bit earlier than the
 actual time at the point.

Sounds strange to me. I think that last_recv_timestamp should be
compared with the time when the last feedback message was sent,
e.g., maybe you can expose sendTime in XLogWalRcvSendReply()
as a global variable, and use it to compare with last_recv_timestamp.

When we get out of the WAL receive loop due to the timeout check
which your patch added, XLogWalRcvFlush() is always executed.
I don't think that the current WAL file needs to be flushed at that time.

 Another solution would be using
 RegisterTimeout/enable_timeout_after, and it seemed to work for
 me. It could throw out the time-counting stuff in the loop we are
 talking about and that of XLogWalRcvSendHSFeedback and
 XLogWalRcvSendReply, but it might be a bit too large for the
 gain.

Yes, sounds overkill.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-09 Thread Fujii Masao
On Sun, Feb 8, 2015 at 2:54 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote:
 - * Wait for more WAL to arrive. Time out after 5 
 seconds,
 - * like when polling the archive, to react to a trigger
 - * file promptly.
 + * Wait for more WAL to arrive. Time out after
 the amount of
 + * time specified by wal_retrieve_retry_interval, like
 + * when polling the archive, to react to a
 trigger file promptly.
   */
 WaitLatch(XLogCtl->recoveryWakeupLatch,
WL_LATCH_SET | WL_TIMEOUT,
 -  5000L);
 +  wal_retrieve_retry_interval * 1000L);

 This change can prevent the startup process from reacting to
 a trigger file. Imagine the case where the large interval is set
 and the user want to promote the standby by using the trigger file
 instead of pg_ctl promote. I think that the sleep time should be 5s
 if the interval is set to more than 5s. Thought?

 I disagree here. It is interesting to accelerate the check of WAL
 availability from a source in some cases for replication, but the
 opposite is true as well as mentioned by Alexey at the beginning of
 the thread to reduce the number of requests when requesting WAL
 archives from an external storage like AWS. Hence a correct solution
 would be to check periodically for the trigger file with a maximum
 one-time wait of 5s to ensure backward-compatible behavior. We could
 reduce it to 1s or something like that as well.

You seem to have misunderstood the code in question. Or I'm missing something.
The timeout of the WaitLatch is just the interval to check for the trigger file
while waiting for more WAL to arrive from streaming replication. Not related to
the retry time to restore WAL from the archive.

Regards,

-- 
Fujii Masao




Re: [HACKERS] The return value of allocate_recordbuf()

2015-02-09 Thread Fujii Masao
On Mon, Jan 5, 2015 at 1:25 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Thu, Jan 1, 2015 at 1:10 AM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Dec 29, 2014 at 6:14 AM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 Hmm. There is no way to check beforehand if a palloc() will fail because of
 OOM. We could check for MaxAllocSize, though.

 I think we need a version of palloc that returns NULL instead of
 throwing an error.  The error-throwing behavior is for the best in
 almost every case, but I think the no-error version would find enough
 users to be worthwhile.
 Compression is one of those areas, be it compression of WAL or another
 type. The new API would allow falling back to the non-compression code
 path if buffer allocation for compression cannot be done because of an
 OOM.

 FWIW, I actually looked at how to do that a couple of weeks back, and
 you just need a wrapper function, whose content is the existing
 AllocSetAlloc, taking an additional boolean flag to trigger an ERROR
 or leave with NULL if an OOM appears. On top of that we will need a
 new method in MemoryContextMethods, let's call it alloc_safe, for its
 equivalent, the new palloc_safe.

MemoryContextAllocExtended() was added, so isn't it time to replace palloc()
with MemoryContextAllocExtended(CurrentMemoryContext, MCXT_ALLOC_NO_OOM)
in allocate_recordbuf()?

Regards,

-- 
Fujii Masao
*** a/src/backend/access/transam/xlogreader.c
--- b/src/backend/access/transam/xlogreader.c
***
*** 149,155  allocate_recordbuf(XLogReaderState *state, uint32 reclength)
  
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
! 	state->readRecordBuf = (char *) palloc(newSize);
 	state->readRecordBufSize = newSize;
  	return true;
  }
--- 149,163 
  
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
! 	state->readRecordBuf =
! 		(char *) MemoryContextAllocExtended(CurrentMemoryContext,
! 											newSize,
! 											MCXT_ALLOC_NO_OOM);
! 	if (state->readRecordBuf == NULL)
! 	{
! 		state->readRecordBufSize = 0;
! 		return false;
! 	}
 	state->readRecordBufSize = newSize;
  	return true;
  }



Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2015-02-08 Thread Fujii Masao
On Fri, Feb 6, 2015 at 3:42 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Fujii Masao wrote:
 I wrote
 This is inspired by the lz4 APIs. Wouldn't it be buggy for a
 compression algorithm to return a size of 0 bytes as compressed or
 decompressed length btw? We could as well make it return a negative
 value when a failure occurs if you feel more comfortable with it.

 I feel that's better. Attached is the updated version of the patch.
 I changed the pg_lzcompress and pg_lzdecompress so that they return -1
 when failure happens. Also I applied some cosmetic changes to the patch
 (e.g., shorten the long name of the newly-added macros).
 Barring any objection, I will commit this.
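
For reference, a sketch of the resulting calling convention (surrounding
code omitted; source, slen and dest are assumed to be set up by the
caller):

    int32       clen;

    clen = pglz_compress(source, slen, dest, PGLZ_strategy_default);
    if (clen < 0)
    {
        /* compression failed or saved nothing: store the data uncompressed */
    }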

 I just had a look at your updated version, ran some sanity tests, and
 things look good to me. The new names of the macros at the top of
 tuptoaster.c are clearer as well.

Thanks for the review! Pushed!

Regards,

-- 
Fujii Masao




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2015-02-06 Thread Fujii Masao
On Fri, Feb 6, 2015 at 4:15 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Thu, Feb 5, 2015 at 11:06 PM, Syed, Rahila rahila.s...@nttdata.com wrote:
/*
+* We recheck the actual size even if pglz_compress() reports success,
+* because it might be satisfied with having saved as little as one byte
+* in the compressed data.
+*/
+   *len = (uint16) compressed_len;
+   if (*len >= orig_len - 1)
+   return false;
+   return true;
+}

 As per the latest code, when compression is 'on' we introduce 2 additional bytes 
 in the header of each block image for storing the raw_length of the compressed 
 block.
 In order to achieve compression while accounting for these two additional 
 bytes, we must ensure that the compressed length is less than the original length - 
 2.
 So, IIUC the above condition should rather be

 If (*len >= orig_len - 2)
 return false;

2 should be replaced with a macro variable indicating the size of
the extra header for a compressed backup block.

Do we always need the extra two bytes for a compressed backup block?
ISTM that the extra bytes are not necessary when the hole length is zero.
In this case the length of the original backup block (i.e., uncompressed)
must be BLCKSZ, so we don't need to save the original size in
the extra bytes.

Furthermore, when FPW compression is disabled and the hole length
is zero, we seem to be able to save one byte from the header of
the backup block. Currently we use 4 bytes for the header: 2 bytes for
the length of the backup block, 15 bits for the hole offset and 1 bit for
the flag indicating whether the block is compressed or not. But in that case
the length of the backup block doesn't need to be stored because it must
be BLCKSZ. Shouldn't we optimize the header in this way? Thought?
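
To illustrate the rule I have in mind (the names here are mine, not the
patch's):

#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192

static size_t
bkp_image_extra_header_bytes(uint16_t hole_length, int is_compressed)
{
    if (!is_compressed)
        return 0;               /* the existing length fields suffice */
    if (hole_length == 0)
        return 0;               /* the raw size is implicitly BLCKSZ */
    return sizeof(uint16_t);    /* store raw size = BLCKSZ - hole_length */
}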

+    int page_len = BLCKSZ - hole_length;
+    char *scratch_buf;
+    if (hole_length != 0)
+    {
+        scratch_buf = compression_scratch;
+        memcpy(scratch_buf, page, hole_offset);
+        memcpy(scratch_buf + hole_offset,
+               page + (hole_offset + hole_length),
+               BLCKSZ - (hole_length + hole_offset));
+    }
+    else
+        scratch_buf = page;
+
+    /* Perform compression of block */
+    if (XLogCompressBackupBlock(scratch_buf,
+                                page_len,
+                                regbuf->compressed_page,
+                                &compress_len))
+    {
+        /* compression is done, add record */
+        is_compressed = true;
+    }

You can refactor XLogCompressBackupBlock() and move all the
above code to it for more simplicity.
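
For illustration, a minimal sketch of that refactoring under assumed
signatures (do_compress() is a stand-in for the real pglz call; the
hole-removal copy mirrors the quoted code):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

static char compression_scratch[BLCKSZ];

static bool
do_compress(const char *src, int slen, char *dest, uint16_t *dlen)
{
    (void) src;
    (void) dest;
    *dlen = (uint16_t) slen;
    return false;               /* stand-in: pretend compression never helps */
}

static bool
XLogCompressBackupBlock(const char *page, uint16_t hole_offset,
                        uint16_t hole_length, char *dest, uint16_t *dlen)
{
    const char *source = page;
    int         orig_len = BLCKSZ - hole_length;

    /* remove the hole here instead of in the caller */
    if (hole_length != 0)
    {
        memcpy(compression_scratch, page, hole_offset);
        memcpy(compression_scratch + hole_offset,
               page + hole_offset + hole_length,
               BLCKSZ - (hole_offset + hole_length));
        source = compression_scratch;
    }

    return do_compress(source, orig_len, dest, dlen);
}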

Regards,

-- 
Fujii Masao




Re: [HACKERS] pg_basebackup may fail to send feedbacks.

2015-02-06 Thread Fujii Masao
On Fri, Feb 6, 2015 at 3:22 PM, Kyotaro HORIGUCHI
horiguchi.kyot...@lab.ntt.co.jp wrote:
 Sorry, I misunderstood that.

  At Wed, 4 Feb 2015 19:22:39 +0900, Fujii Masao masao.fu...@gmail.com 
  wrote in 
  cahgqgwgudgcmnhzinkd37i+jijdkruecrea1ncrs1mmte3r...@mail.gmail.com
  On Wed, Feb 4, 2015 at 4:58 PM, Kyotaro HORIGUCHI
  horiguchi.kyot...@lab.ntt.co.jp wrote:
   I'm very sorry for confused report. The problem found in 9.4.0
   and the diagnosis was mistakenly done on master.
  
   9.4.0 has no problem of feedback delay caused by slow xlog
   receiving on pg_basebackup mentioned in the previous mail. But
   the current master still has this problem.
 
  Seems walreceiver has the same problem. No?
 
  pg_receivexlog.c would have the same problem since it uses the
  same function with pg_basebackup.c.
 
  The correspondent of HandleCopyStream in wansender is
  WalReceiverMain, and it doesn't seem to have the same kind of
  loop shown below. It seems to surely send feedback per one
  record.
 
  |   r = stream_reader();
  |   while (r  0)
  |   {
  |  ... wal record processing stuff without sending feedback..
  |  r = stream_reader();
  |   }

 WalReceiverMain() has similar code, as follows.

 len = walrcv_receive(NAPTIME_PER_CYCLE, &buf);
 if (len != 0)
 {
     for (;;)
     {
         if (len > 0)
         {
             ...
             len = walrcv_receive(0, &buf);
         }
     }

 The loop looks a bit different but is effectively the same as far as this
 discussion is concerned.

 408 len = walrcv_receive(NAPTIME_PER_CYCLE, &buf);
 409 if (len != 0)
 410 {
 415     for (;;)
 416     {
 417         if (len > 0)
 418         {
 425             XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
 426         }
 427-438     else {break;}
 439         len = walrcv_receive(0, &buf);
 440     }
 441 }

 XLogWalRcvProcessMsg doesn't send feedback when processing
 'w'=WAL record packet. So the same thing as pg_basebackup and
 pg_receivexlog will occur on walsender.

 Exiting the for(;;) loop by TimestampDifferenceExceeds just
 before line 439 is an equivalent measure to what I proposed for
 receivelog.c, but calling XLogWalRcvSendHSFeedback(false) there
 is seemingly simpler (but I feel a bit uncomfortable for the
 latter)

I'm concerned about the performance degradation caused by calling
gettimeofday() per record.

 Or other measures?

Introduce the maximum number of records that we can receive and
process between feedbacks? For example, we can change
XLogWalRcvSendHSFeedback() so that it's called at most once per
8 records. I'm not sure what number is good, though...
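
For illustration, such a throttle could be as simple as this (the constant
and helper names are made up):

#include <stdbool.h>

#define FEEDBACK_CHECK_EVERY_N_RECORDS 8

static int  records_since_check = 0;

static bool
feedback_check_due(void)
{
    if (++records_since_check < FEEDBACK_CHECK_EVERY_N_RECORDS)
        return false;

    records_since_check = 0;
    return true;    /* the caller then reads the clock and maybe sends feedback */
}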

Regards,

-- 
Fujii Masao




Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-06 Thread Fujii Masao
On Fri, Feb 6, 2015 at 1:23 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Thu, Feb 5, 2015 at 11:58 PM, Michael Paquier wrote:
 An updated patch is attached.
 I just noticed that the patch I sent was incorrect:
 - Parameter name was still wal_availability_check_interval and not
 wal_retrieve_retry_interval
 - Documentation was incorrect.
 Please use the patch attached instead for further review.

Thanks for updating the patch!

timestamp.c:1708: warning: implicit declaration of function
'HandleStartupProcInterrupts'

I got the above warning at the compilation.

+pg_usleep(wait_time);
+HandleStartupProcInterrupts();
+total_time -= wait_time;

It seems strange to call a startup-process-specific function in
a generic function like TimestampSleepDifference().

 * 5. Sleep 5 seconds, and loop back to 1.

In WaitForWALToBecomeAvailable(), the above comment should
be updated.

- * Wait for more WAL to arrive. Time out after 5 seconds,
- * like when polling the archive, to react to a trigger
- * file promptly.
+ * Wait for more WAL to arrive. Time out after the amount of
+ * time specified by wal_retrieve_retry_interval, like
+ * when polling the archive, to react to a trigger file promptly.
  */
 WaitLatch(&XLogCtl->recoveryWakeupLatch,
   WL_LATCH_SET | WL_TIMEOUT,
-  5000L);
+  wal_retrieve_retry_interval * 1000L);

This change can prevent the startup process from reacting to
a trigger file promptly. Imagine the case where a large interval is set
and the user wants to promote the standby by using the trigger file
instead of pg_ctl promote. I think that the sleep time should be capped
at 5s if the interval is set to more than 5s. Thoughts?
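
A sketch of that capping; the units are an assumption here (this assumes
the GUC value is in seconds, matching the "* 1000L" in the quoted diff):

    /* React to a trigger file within 5 seconds even if the retry
     * interval is configured to be much larger. */
    WaitLatch(&XLogCtl->recoveryWakeupLatch,
              WL_LATCH_SET | WL_TIMEOUT,
              Min(wal_retrieve_retry_interval, 5) * 1000L);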

+# specifies an optional internal to wait for WAL to become available when

s/internal/interval?

+This parameter specifies the amount of time to wait when a failure
+occurred when reading WAL from a source (be it via streaming
+replication, local <filename>pg_xlog</> or WAL archive) for a node
+in standby mode, or when WAL is expected from a source.

At least to me, it seems not easy to understand what the parameter is
from this description. What about the following, for example?

This parameter specifies the amount of time to wait when WAL is not
available from any sources (streaming replication, local
<filename>pg_xlog</> or WAL archive) before retrying to retrieve WAL

Isn't it better to place this parameter in postgresql.conf rather than
recovery.conf? I'd like to change the value of this parameter without
restarting the server.

Regards,

-- 
Fujii Masao




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2015-02-05 Thread Fujii Masao
On Tue, Jan 6, 2015 at 11:09 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Mon, Jan 5, 2015 at 10:29 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sun, Dec 28, 2014 at 10:57 PM, Michael Paquier wrote:
 The patch 1 cannot be applied to the master successfully because of
 recent change.
 Yes, that's caused by ccb161b. Attached are rebased versions.

 - The real stuff comes with patch 2, that implements the removal of
 PGLZ_Header, changing the APIs of compression and decompression to pglz to
 not have anymore toast metadata, this metadata being now localized in
 tuptoaster.c. Note that this patch protects the on-disk format (tested with
 pg_upgrade from 9.4 to a patched HEAD server). Here is how the APIs of
 compression and decompression look like with this patch, simply performing
 operations from a source to a destination:
 extern int32 pglz_compress(const char *source, int32 slen, char *dest,
   const PGLZ_Strategy *strategy);
 extern int32 pglz_decompress(const char *source, char *dest,
   int32 compressed_size, int32 raw_size);
 The return value of those functions is the number of bytes written in the
 destination buffer, and 0 if operation failed.

 So it's guaranteed that 0 is never returned in success case? I'm not sure
 if that case can really happen, though.
 This is an inspiration from lz4 APIs. Wouldn't it be buggy for a
 compression algorithm to return a size of 0 bytes as compressed or
 decompressed length btw? We could as well make it return a negative
 value when a failure occurs if you feel more comfortable with it.

I feel that's better. Attached is the updated version of the patch.
I changed pglz_compress() and pglz_decompress() so that they return -1
when a failure happens. Also I applied some cosmetic changes to the patch
(e.g., shortened the long names of the newly-added macros).
Barring any objection, I will commit this.
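
For reference, a minimal usage sketch of the reworked API, with the
argument order taken from the prototypes quoted earlier in this thread;
the PGLZ_MAX_OUTPUT output bound and the elog calls are illustrative
assumptions:

    char    compressed[PGLZ_MAX_OUTPUT(BLCKSZ)];
    char    restored[BLCKSZ];
    int32   clen;
    int32   rlen;

    /* -1 means the input was not compressible enough to bother */
    clen = pglz_compress(page, BLCKSZ, compressed, PGLZ_strategy_default);
    if (clen < 0)
        elog(DEBUG1, "block is incompressible, storing it raw");
    else
    {
        /* pass both the compressed size and the expected raw size */
        rlen = pglz_decompress(compressed, restored, clen, BLCKSZ);
        if (rlen < 0)
            elog(ERROR, "compressed data is corrupt");
        Assert(rlen == BLCKSZ);
    }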

Regards,

-- 
Fujii Masao
*** a/src/backend/access/heap/tuptoaster.c
--- b/src/backend/access/heap/tuptoaster.c
***
*** 35,43 
  #include "access/tuptoaster.h"
  #include "access/xact.h"
  #include "catalog/catalog.h"
  #include "miscadmin.h"
  #include "utils/fmgroids.h"
- #include "utils/pg_lzcompress.h"
  #include "utils/rel.h"
  #include "utils/typcache.h"
  #include "utils/tqual.h"
--- 35,43 
  #include "access/tuptoaster.h"
  #include "access/xact.h"
  #include "catalog/catalog.h"
+ #include "common/pg_lzcompress.h"
  #include "miscadmin.h"
  #include "utils/fmgroids.h"
  #include "utils/rel.h"
  #include "utils/typcache.h"
  #include "utils/tqual.h"
***
*** 45,50 
--- 45,70 
  
  #undef TOAST_DEBUG
  
+ /*
+  *	The information at the start of the compressed toast data.
+  */
+ typedef struct toast_compress_header
+ {
+ 	int32		vl_len_;	/* varlena header (do not touch directly!) */
+ 	int32		rawsize;
+ } toast_compress_header;
+ 
+ /*
+  * Utilities for manipulation of header information for compressed
+  * toast entries.
+  */
+ #define	TOAST_COMPRESS_HDRSZ		((int32) sizeof(toast_compress_header))
+ #define TOAST_COMPRESS_RAWSIZE(ptr)	(((toast_compress_header *) ptr)->rawsize)
+ #define TOAST_COMPRESS_RAWDATA(ptr) \
+ 	(((char *) ptr) + TOAST_COMPRESS_HDRSZ)
+ #define TOAST_COMPRESS_SET_RAWSIZE(ptr, len) \
+ 	(((toast_compress_header *) ptr)->rawsize = len)
+ 
  static void toast_delete_datum(Relation rel, Datum value);
  static Datum toast_save_datum(Relation rel, Datum value,
   struct varlena * oldexternal, int options);
***
*** 53,58  static bool toastid_valueid_exists(Oid toastrelid, Oid valueid);
--- 73,79 
  static struct varlena *toast_fetch_datum(struct varlena * attr);
  static struct varlena *toast_fetch_datum_slice(struct varlena * attr,
  		int32 sliceoffset, int32 length);
+ static struct varlena *toast_decompress_datum(struct varlena * attr);
  static int toast_open_indexes(Relation toastrel,
     LOCKMODE lock,
     Relation **toastidxs,
***
*** 138,148  heap_tuple_untoast_attr(struct varlena * attr)
  		/* If it's compressed, decompress it */
  		if (VARATT_IS_COMPRESSED(attr))
  		{
! 			PGLZ_Header *tmp = (PGLZ_Header *) attr;
! 
! 			attr = (struct varlena *) palloc(PGLZ_RAW_SIZE(tmp) + VARHDRSZ);
! 			SET_VARSIZE(attr, PGLZ_RAW_SIZE(tmp) + VARHDRSZ);
! 			pglz_decompress(tmp, VARDATA(attr));
  			pfree(tmp);
  		}
  	}
--- 159,166 
  		/* If it's compressed, decompress it */
  		if (VARATT_IS_COMPRESSED(attr))
  		{
! 			struct varlena *tmp = attr;
! 			attr = toast_decompress_datum(tmp);
  			pfree(tmp);
  		}
  	}
***
*** 163,173  heap_tuple_untoast_attr(struct varlena * attr)
  		/*
  		 * This is a compressed value inside of the main tuple
  		 */
! 		PGLZ_Header *tmp = (PGLZ_Header *) attr;
! 
! 		attr = (struct varlena *) palloc(PGLZ_RAW_SIZE(tmp) + VARHDRSZ);
! 		SET_VARSIZE(attr, PGLZ_RAW_SIZE(tmp) + VARHDRSZ);
! 		pglz_decompress(tmp, VARDATA(attr));
  	}
  	else if (VARATT_IS_SHORT(attr

Re: [HACKERS] pg_basebackup may fail to send feedbacks.

2015-02-04 Thread Fujii Masao
On Thu, Feb 5, 2015 at 10:20 AM, Kyotaro HORIGUCHI
horiguchi.kyot...@lab.ntt.co.jp wrote:
 Hello,

 At Wed, 4 Feb 2015 19:22:39 +0900, Fujii Masao masao.fu...@gmail.com wrote 
 in cahgqgwgudgcmnhzinkd37i+jijdkruecrea1ncrs1mmte3r...@mail.gmail.com
 On Wed, Feb 4, 2015 at 4:58 PM, Kyotaro HORIGUCHI
 horiguchi.kyot...@lab.ntt.co.jp wrote:
  I'm very sorry for confused report. The problem found in 9.4.0
  and the diagnosis was mistakenly done on master.
 
  9.4.0 has no problem of feedback delay caused by slow xlog
  receiving on pg_basebackup mentioned in the previous mail. But
  the current master still has this problem.

 Seems walreceiver has the same problem. No?

 pg_receivexlog.c would have the same problem since it uses the
 same function with pg_basebackup.c.

 The counterpart of HandleCopyStream in walreceiver is
 WalReceiverMain, and it doesn't seem to have the same kind of
 loop shown below. It seems to reliably send feedback per record.

 |   r = stream_reader();
 |   while (r > 0)
 |   {
 |  ... wal record processing stuff without sending feedback..
 |  r = stream_reader();
 |   }

WalReceiverMain() has similar code, as follows.

len = walrcv_receive(NAPTIME_PER_CYCLE, &buf);
if (len != 0)
{
for (;;)
{
if (len > 0)
{

len = walrcv_receive(0, &buf);
}
}

Regards,

-- 
Fujii Masao




Re: [HACKERS] Docs: CREATE TABLESPACE minor markup fix

2015-02-04 Thread Fujii Masao
On Wed, Feb 4, 2015 at 5:27 PM, Ian Barwick i...@2ndquadrant.com wrote:
 Hi

 A superfluous '/' in an xref tag is producing an unintended '>'
 in the Warning box on this page:

   http://www.postgresql.org/docs/current/interactive/sql-createtablespace.html

I found that logicaldecoding.sgml also has the same typo. Fixed both. Thanks!

Regards,

-- 
Fujii Masao




Re: [HACKERS] pg_basebackup may fail to send feedbacks.

2015-02-04 Thread Fujii Masao
On Wed, Feb 4, 2015 at 4:58 PM, Kyotaro HORIGUCHI
horiguchi.kyot...@lab.ntt.co.jp wrote:
 I'm very sorry for confused report. The problem found in 9.4.0
 and the diagnosis was mistakenly done on master.

 9.4.0 has no problem of feedback delay caused by slow xlog
 receiving on pg_basebackup mentioned in the previous mail. But
 the current master still has this problem.

Seems walreceiver has the same problem. No?

Regards,

-- 
Fujii Masao




Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-04 Thread Fujii Masao
On Tue, Jan 20, 2015 at 12:57 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Tue, Jan 20, 2015 at 1:56 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2015-01-19 17:16:11 +0900, Michael Paquier wrote:
 On Mon, Jan 19, 2015 at 4:10 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
  On Sat, Jan 17, 2015 at 2:44 AM, Andres Freund and...@2ndquadrant.com 
  wrote:
  Not this patch's fault, but I'm getting a bit tired seeing the above open
  coded. How about adding a function that does the sleeping based on a
  timestamptz and a ms interval?
  You mean in plugins, right? I don't recall seeing similar patterns in
  other code paths of backend. But I think that we can use something
  like that in timestamp.c then because we need to leverage that between
  two timestamps, the last failure and now():
  TimestampSleepDifference(start_time, stop_time, internal_ms);
  Perhaps you have something else in mind?
 
  Attached is an updated patch.

 Actually I came with better than last patch by using a boolean flag as
 return value of TimestampSleepDifference and use
 TimestampDifferenceExceeds directly inside it.

 Subject: [PATCH] Add wal_availability_check_interval to control WAL fetching
  on failure

 I think that name isn't very good. And it isn't very accurate
 either.
 How about wal_retrieve_retry_interval? Not very nice, but I think it's
 still better than the above.
 Fine for me.

 + <varlistentry id="wal-availability-check-interval" xreflabel="wal_availability_check_interval">
 +  <term><varname>wal_availability_check_interval</varname> (<type>integer</type>)
 +  <indexterm>
 +    <primary><varname>wal_availability_check_interval</> recovery parameter</primary>
 +  </indexterm>
 +  </term>
 +  <listitem>
 +   <para>
 +    This parameter specifies the amount of time to wait when
 +    WAL is not available for a node in recovery. Default value is
 +    <literal>5s</>.
 +   </para>
 +   <para>
 +    A node in recovery will wait for this amount of time if
 +    <varname>restore_command</> returns nonzero exit status code when
 +    fetching new WAL segment files from archive or when a WAL receiver
 +    is not able to fetch a WAL record when using streaming replication.
 +   </para>
 +  </listitem>
 + </varlistentry>
 +
  </variablelist>

 Walreceiver doesn't wait that amount, but rather how long the connection
 is intact. And restore_command may or may not retry.
 I changed this paragraph as follows:
 +This parameter specifies the amount of time to wait when a failure
 +occurred when reading WAL from a source (be it via streaming
 +replication, local <filename>pg_xlog</> or WAL archive) for a node
 +in standby mode, or when WAL is expected from a source. Default
 +value is <literal>5s</>.
 Pondering more about this patch, I am getting the feeling that it is
 not that much of a win to explain in detail in the docs each failure
 depending on the source from which WAL is taken. But it is essential
 to mention that the wait is done only for a node using standby mode.

 Btw, I noticed two extra things that I think should be done for consistency:
 - We should use as well wal_retrieve_retry_interval when waiting for
 WAL to arrive after checking for the failures:
 /*
 -* Wait for more WAL to arrive. Time out after 5 seconds,
 -* like when polling the archive, to react to a trigger
 -* file promptly.
 +* Wait for more WAL to arrive. Time out after
 +* wal_retrieve_retry_interval ms, like when polling the archive
 +* to react to a trigger file promptly.
  */

 WaitLatch(&XLogCtl->recoveryWakeupLatch,
           WL_LATCH_SET | WL_TIMEOUT,
 -         5000L);
 +         wal_retrieve_retry_interval * 1L);
 -

 Updated patch attached.

+TimestampDifference(start_time, stop_time, &secs, &microsecs);
+pg_usleep(interval_msec * 1000L - (1000000L * secs + 1L * microsecs));

What if a large interval is set and a signal arrives during the sleep?
I'm afraid that a signal cannot terminate the sleep on some platforms.
This problem exists even now because pg_usleep is used, but the sleep
time is just 5 seconds, so it's not so bad. But the patch allows a user to
set a large sleep time. Shouldn't we use WaitLatch or split the pg_usleep
like recoveryPausesHere() does?
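
A sketch of the splitting alternative, in the spirit of
recoveryPausesHere(); the 100ms slice size and the variable names are
assumptions:

    /* Sleep wait_ms milliseconds in short slices so that signals and
     * shutdown requests are noticed promptly even for large intervals. */
    while (wait_ms > 0)
    {
        long    slice_ms = Min(wait_ms, 100L);

        pg_usleep(slice_ms * 1000L);        /* pg_usleep takes microseconds */
        HandleStartupProcInterrupts();      /* react between slices */
        wait_ms -= slice_ms;
    }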

Regards,

-- 
Fujii Masao




Re: [HACKERS] pg_basebackup -x/X doesn't play well with archive_mode wal_keep_segments

2015-01-05 Thread Fujii Masao
On Mon, Jan 5, 2015 at 6:22 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2015-01-05 16:22:56 +0900, Fujii Masao wrote:
 On Sun, Jan 4, 2015 at 5:47 AM, Andres Freund and...@2ndquadrant.com wrote:
  On 2015-01-03 16:03:36 +0100, Andres Freund wrote:
  pg_basebackup really
  changed heavily since its introduction. And desperately needs some
  restructuring.

 The patch seems to break pg_receivexlog. I got the following error message
 while running pg_receivexlog.

 pg_receivexlog: could not create archive status file
 mmm/archive_status/00010003.done: No such file or
 directory

 Dang. Stupid typo. And my tests didn't catch it, because I had
 archive_directory in the target directory :(

 At least it's only broken in master :/

 Thanks for the catch. Do you have some additional testsuite or did you
 catch it manually?

Manually... I just tested the tools and options which the patch may affect...

Regards,

-- 
Fujii Masao




Re: [HACKERS] Compression of full-page-writes

2015-01-05 Thread Fujii Masao
On Sat, Jan 3, 2015 at 1:52 AM, Bruce Momjian br...@momjian.us wrote:
 On Fri, Jan  2, 2015 at 10:15:57AM -0600, k...@rice.edu wrote:
 On Fri, Jan 02, 2015 at 01:01:06PM +0100, Andres Freund wrote:
  On 2014-12-31 16:09:31 -0500, Bruce Momjian wrote:
   I still don't understand the value of adding WAL compression, given the
   high CPU usage and minimal performance improvement.  The only big
   advantage is WAL storage, but again, why not just compress the WAL file
   when archiving.
 
  before: pg_xlog is 800GB
  after: pg_xlog is 600GB.
 
  I'm damned sure that many people would be happy with that, even if the
  *per backend* overhead is a bit higher. And no, compression of archives
  when archiving helps *zap* with that (streaming, wal_keep_segments,
  checkpoint_timeout). As discussed before.
 
  Greetings,
 
  Andres Freund
 

 +1

 On an I/O constrained system assuming 50:50 table:WAL I/O, in the case
 above you can process 100GB of transaction data at the cost of a bit
 more CPU.

 OK, so given your stats, the feature gives a 12.5% reduction in I/O.  If
 that is significant, shouldn't we see a performance improvement?  If we
 don't see a performance improvement, is I/O reduction worthwhile?  Is it
 valuable in that it gives non-database applications more I/O to use?  Is
 that all?

 I suggest we at least document that this feature as mostly useful for
 I/O reduction, and maybe say CPU usage and performance might be
 negatively impacted.

 OK, here is the email I remember from Fujii Masao this same thread that
 showed a performance improvement for WAL compression:

 
 http://www.postgresql.org/message-id/CAHGQGwGqG8e9YN0fNCUZqTTT=hnr7ly516kft5ffqf4pp1q...@mail.gmail.com

 Why are we not seeing the 33% compression and 15% performance
 improvement he saw?

Because the benchmarks Michael and I used are very different.
I just used pgbench, but he used his simple test SQLs (see
http://www.postgresql.org/message-id/cab7npqsc97o-ue5paxfmukwcxe_jioyxo1m4a0pmnmyqane...@mail.gmail.com).

Furthermore, the data type of the pgbench_accounts.filler column is character(84)
and its content is empty, so pgbench_accounts is very compressible. This is
one of the reasons I could see a good performance improvement and a high
compression ratio.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Compression of full-page-writes

2015-01-05 Thread Fujii Masao
On Sat, Jan 3, 2015 at 2:24 AM, Bruce Momjian br...@momjian.us wrote:
 On Fri, Jan  2, 2015 at 02:18:12PM -0300, Claudio Freire wrote:
 On Fri, Jan 2, 2015 at 2:11 PM, Andres Freund and...@2ndquadrant.com wrote:
  , I now see the compression patch as something that has negatives, so
  has to be set by the user, and only wins in certain cases.  I am
  disappointed, and am trying to figure out how this became such a
  marginal win for 9.5.  :-(
 
  I find the notion that a multi digit space reduction is a marginal win
  pretty ridiculous and way too narrow focused. Our WAL volume is a
  *significant* problem in the field. And it mostly consists out of FPWs
  spacewise.

 One thing I'd like to point out, is that in cases where WAL I/O is an
 issue (ie: WAL archiving), usually people already compress the
 segments during archiving. I know I do, and I know it's recommended on
 the web, and by some consultants.

 So, I wouldn't want this FPW compression, which is desirable in
 replication scenarios if you can spare the CPU cycles (because of
 streaming), adversely affecting WAL compression during archiving.

 To be specific, desirable in streaming replication scenarios that don't
 use SSL compression.  (What percentage is that?)  It is something we
 should mention in the docs for this feature?

Even if SSL is used in replication, FPW compression is useful. It can reduce
the amount of I/O on the standby side. Sometimes I've seen walreceiver's I/O
become a performance bottleneck, especially in synchronous replication cases.
FPW compression can be useful for those cases, for example.

Regards,

-- 
Fujii Masao




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2015-01-05 Thread Fujii Masao
On Sun, Dec 28, 2014 at 10:57 PM, Michael Paquier
michael.paqu...@gmail.com wrote:


 On Fri, Dec 26, 2014 at 4:16 PM, Michael Paquier michael.paqu...@gmail.com
 wrote:
 On Fri, Dec 26, 2014 at 3:24 PM, Fujii Masao masao.fu...@gmail.com
 wrote:
 pglz_compress() and pglz_decompress() still use PGLZ_Header, so the
 frontend
 which uses those functions needs to handle PGLZ_Header. But it basically
 should
 be handled via the varlena macros. That is, the frontend still seems to
 need to
 understand the varlena datatype. I think we should avoid that. Thought?
 Hm, yes it may be wiser to remove it and make the data passed to pglz
 for varlena 8 bytes shorter..

 OK, here is the result of this work, made of 3 patches.

Thanks for updating the patches!

 The first two patches move pglz stuff to src/common and make it a frontend
 utility entirely independent on varlena and its related metadata.
 - Patch 1 is a simple move of pglz to src/common, with PGLZ_Header still
 present. There is nothing amazing here, and that's the broken version that
 has been reverted in 966115c.

The patch 1 cannot be applied to the master successfully because of
recent change.

 - The real stuff comes with patch 2, that implements the removal of
 PGLZ_Header, changing the APIs of compression and decompression to pglz to
 not have anymore toast metadata, this metadata being now localized in
 tuptoaster.c. Note that this patch protects the on-disk format (tested with
 pg_upgrade from 9.4 to a patched HEAD server). Here is how the APIs of
 compression and decompression look like with this patch, simply performing
 operations from a source to a destination:
 extern int32 pglz_compress(const char *source, int32 slen, char *dest,
   const PGLZ_Strategy *strategy);
 extern int32 pglz_decompress(const char *source, char *dest,
   int32 compressed_size, int32 raw_size);
 The return value of those functions is the number of bytes written in the
 destination buffer, and 0 if operation failed.

So it's guaranteed that 0 is never returned in success case? I'm not sure
if that case can really happen, though.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Typo in function header for recently added function errhidecontext

2015-01-04 Thread Fujii Masao
On Mon, Jan 5, 2015 at 3:19 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 /*
  * errhidestmt --- optionally suppress CONTEXT: field of log entry
  *
  * This should only be used for verbose debugging messages where the
 repeated
  * inclusion of CONTEXT: bloats the log volume too much.
  */
 int
 errhidecontext(bool hide_ctx)


 Here in function header, function name should be
 errhidecontext.

Fixed. Thanks!

Regards,

-- 
Fujii Masao




Re: [HACKERS] pg_basebackup -x/X doesn't play well with archive_mode wal_keep_segments

2015-01-04 Thread Fujii Masao
On Sun, Jan 4, 2015 at 5:47 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2015-01-03 16:03:36 +0100, Andres Freund wrote:
 On 2014-12-31 16:32:19 +0100, Andres Freund wrote:
  On 2014-12-05 16:18:02 +0900, Fujii Masao wrote:
   On Fri, Dec 5, 2014 at 9:28 AM, Andres Freund and...@2ndquadrant.com 
   wrote:
So I think we just need to make pg_basebackup create to .ready
files.
  
   s/.ready/.done? If yes, +1.
 
  That unfortunately requires changes to both backend and pg_basebackup to
  support fetch and stream modes respectively.
 
  I've attached a preliminary patch for this. I'd appreciate feedback. I
  plan to commit it in a couple of days, after some more
  testing/rereading.

 Attached are two updated patches that I am starting to backport
 now. I've fixed a couple minor oversights. And tested the patches.

 Pushed this after some major pain with backporting.

Thanks!

 pg_basebackup really
 changed heavily since its introduction. And desperately needs some
 restructuring.

The patch seems to break pg_receivexlog. I got the following error message
while running pg_receivexlog.

pg_receivexlog: could not create archive status file
mmm/archive_status/00010003.done: No such file or
directory

Regards,

-- 
Fujii Masao




Re: [HACKERS] tracking commit timestamps

2015-01-04 Thread Fujii Masao
On Thu, Dec 4, 2014 at 12:08 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Dec 3, 2014 at 11:54 PM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
 Pushed with some extra cosmetic tweaks.

 I got the following assertion failure when I executed 
 pg_xact_commit_timestamp()
 in the standby server.

 =# select pg_xact_commit_timestamp('1000'::xid);
 TRAP: FailedAssertion(!(((oldestCommitTs) != ((TransactionId) 0)) ==
 ((newestCommitTs) != ((TransactionId) 0))), File: commit_ts.c,
 Line: 315)
 server closed the connection unexpectedly
 This probably means the server terminated abnormally
 before or while processing the request.
 The connection to the server was lost. Attempting reset: 2014-12-04
 12:01:08 JST sby1 LOG:  server process (PID 15545) was terminated by
 signal 6: Aborted
 2014-12-04 12:01:08 JST sby1 DETAIL:  Failed process was running:
 select pg_xact_commit_timestamp('1000'::xid);

 The way to reproduce this problem is

 #1. set up and start the master and standby servers with
 track_commit_timestamp disabled
 #2. enable track_commit_timestamp in the master and restart the master
 #3. run some write transactions
 #4. enable track_commit_timestamp in the standby and restart the standby
 #5. execute select pg_xact_commit_timestamp('1000'::xid) in the standby

 BTW, at the step #4, I got the following log messages. This might be a hint 
 for
 this problem.

 LOG:  file pg_commit_ts/ doesn't exist, reading as zeroes
 CONTEXT:  xlog redo Transaction/COMMIT: 2014-12-04 12:00:16.428702+09;
 inval msgs: catcache 59 catcache 58 catcache 59 catcache 58 catcache
 45 catcache 44 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608
 relcache 16384

This problem still happens in the master.

Regards,

-- 
Fujii Masao




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-25 Thread Fujii Masao
On Fri, Dec 26, 2014 at 12:31 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Thu, Dec 25, 2014 at 10:10 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Wed, Dec 24, 2014 at 9:03 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 Returning only a boolean is fine for me (that's what my first patch
 did), especially if we add at some point hooks for compression and
 decompression calls.
 Here is a patch rebased on current HEAD (60838df) for the core feature
 with the APIs of pglz using booleans as return values.
 After the revert of 1st patch moving pglz to src/common, I have
 reworked both patches, resulting in the attached.

 For pglz, the dependency on varlena has been removed to make the code
 able to run independently on both frontend and backend sides. In order
 to do that the APIs of pglz_compress and pglz_decompress have been
 changed a bit:
 - pglz_compress returns the number of bytes compressed.
 - pglz_decompress takes as additional argument the compressed length
 of the buffer, and returns the number of bytes decompressed instead of
 a simple boolean for consistency with the compression API.
 PGLZ_Header is not modified to keep the on-disk format intact.

pglz_compress() and pglz_decompress() still use PGLZ_Header, so the frontend
which uses those functions needs to handle PGLZ_Header. But it basically should
be handled via the varlena macros. That is, the frontend still seems to need to
understand the varlena datatype. I think we should avoid that. Thought?

Regards,

-- 
Fujii Masao




[HACKERS] The return value of allocate_recordbuf()

2014-12-25 Thread Fujii Masao
Hi,

While reviewing the FPW compression patch, I found that allocate_recordbuf()
always returns TRUE though its source code comment says that FALSE is
returned if out of memory. Its return value is checked in two places,
but those checks are clearly useless.

allocate_recordbuf() was introduced by 7fcbf6a, and then changed by
2c03216 so that palloc() is used instead of malloc() and FALSE is never
returned even if out of memory. So this seems to be an oversight of 2c03216.
Maybe we should change it so that it checks whether we can enlarge the
memory by the requested size before actually allocating the memory?
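
One possible shape for that fix, as a sketch only: it assumes an
allocation primitive that reports failure by returning NULL instead of
raising an error, here spelled palloc_extended() with a hypothetical
MCXT_ALLOC_NO_OOM flag. The sizing logic follows the existing function.

    static bool
    allocate_recordbuf(XLogReaderState *state, uint32 reclength)
    {
        uint32      newSize = reclength;

        newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
        newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));

        if (state->readRecordBuf)
            pfree(state->readRecordBuf);
        state->readRecordBuf =
            (char *) palloc_extended(newSize, MCXT_ALLOC_NO_OOM);
        if (state->readRecordBuf == NULL)
        {
            state->readRecordBufSize = 0;
            return false;           /* the caller can now really report OOM */
        }
        state->readRecordBufSize = newSize;
        return true;
    }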

Regards,

-- 
Fujii Masao




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-24 Thread Fujii Masao
On Fri, Dec 19, 2014 at 12:19 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Thu, Dec 18, 2014 at 5:27 PM, Fujii Masao masao.fu...@gmail.com wrote:
 Thanks!
 Thanks for your input.

 +else
 +memcpy(compression_scratch, page, page_len);

 I don't think the block image needs to be copied to scratch buffer here.
 We can try to compress the page directly.
 Check.

 +#include "utils/pg_lzcompress.h"
  #include "utils/memutils.h"

 pg_lzcompress.h should be after memutils.h.
 Oops.

 +/* Scratch buffer used to store block image to-be-compressed */
 +static char compression_scratch[PGLZ_MAX_BLCKSZ];

 Isn't it better to allocate the memory for compression_scratch in
 InitXLogInsert()
 like hdr_scratch?
 Because the OS would not touch it if wal_compression is never used,
 but now that you mention it, it may be better to get that in the
 context of xlog_insert..

 +uncompressed_page = (char *) palloc(PGLZ_RAW_SIZE(header));

 Why don't we allocate the buffer for uncompressed page only once and
 keep reusing it like XLogReaderState->readBuf? The size of uncompressed
 page is at most BLCKSZ, so we can allocate the memory for it even before
 knowing the real size of each block image.
 OK, this would save some cycles. I was trying to make process allocate
 a minimum of memory only when necessary.

 -printf(" (FPW); hole: offset: %u, length: %u\n",
 -   record->blocks[block_id].hole_offset,
 -   record->blocks[block_id].hole_length);
 +if (record->blocks[block_id].is_compressed)
 +printf(" (FPW); hole offset: %u, compressed length %u\n",
 +   record->blocks[block_id].hole_offset,
 +   record->blocks[block_id].bkp_len);
 +else
 +printf(" (FPW); hole offset: %u, length: %u\n",
 +   record->blocks[block_id].hole_offset,
 +   record->blocks[block_id].bkp_len);

 We need to consider what info about FPW we want pg_xlogdump to report.
 I'd like to calculate how much bytes FPW was compressed, from the report
 of pg_xlogdump. So I'd like to see also the both length of uncompressed FPW
 and that of compressed one in the report.
 OK, so let's add a parameter in the decoder for the uncompressed
 length. Sounds fine?

 In pg_config.h, the comment of BLCKSZ needs to be updated? Because
 the maximum size of BLCKSZ can be affected by not only itemid but also
 XLogRecordBlockImageHeader.
 Check.

  bool		has_image;
 +bool		is_compressed;

 Doesn't ResetDecoder need to reset is_compressed?
 Check.

 +#wal_compression = off			# enable compression of full-page writes
 Currently wal_compression compresses only FPW, so isn't it better to place
 it after full_page_writes in postgresql.conf.sample?
 Check.

 +uint16		extra_data;	/* used to store offset of bytes in hole, with
 + * last free bit used to check if block is
 + * compressed */
 At least to me, defining something like the following seems easier to
 read.
 uint16		hole_offset:15,
 		is_compressed:1
 Check++.

 Updated patches addressing all those things are attached.

Thanks for updating the patch!

First, I'm thinking of committing the
0001-Move-pg_lzcompress.c-to-src-common.patch.

pg_lzcompress.h still exists in include/utils, but it should be moved to
include/common?

Do we really need PGLZ_Status? I'm not sure whether your categorization of
the result statuses of the compress/decompress functions is right. For example,
pglz_decompress() can return the PGLZ_INCOMPRESSIBLE status, which seems
logically invalid... Maybe this needs to be revisited when we introduce other
compression algorithms and create the wrapper functions for those compression
and decompression functions. Anyway, making pglz_decompress() return
a boolean value seems enough.

I updated 0001-Move-pg_lzcompress.c-to-src-common.patch accordingly.
Barring objections, I will push the attached patch firstly.

Regards,

-- 
Fujii Masao
From 92a7c2ac8a6a8ec6ae84c6713f8ad30b7958de94 Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Tue, 25 Nov 2014 14:05:59 +0900
Subject: [PATCH] Move pg_lzcompress.c to src/common

Exposing compression and decompression APIs of pglz makes possible its
use by extensions and contrib modules. pglz_decompress contained a call
to elog to emit an error message in case of corrupted data. This function
is changed to return a status code to let its callers return an error
instead. Compression function is changed similarly to make the whole set
consistent.
---
 src/backend/access/heap/tuptoaster.c  |   11 +-
 src/backend/utils/adt/Makefile|4 +-
 src/backend/utils/adt/pg_lzcompress.c |  779 
 src/common/Makefile   |3 +-
 src/common

Re: [HACKERS] Moving RestoreBlockImage from xlogreader.c to xlogutils.c

2014-12-24 Thread Fujii Masao
On Wed, Dec 24, 2014 at 9:42 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Hi all,

 Commit 2c03216d has introduced RestoreBlockImage to restore a page
 from a given decoding state. ISTM that this is a backend-only
 operation but it has been added in xlogreader.c which could be used as
 well by frontend utilities like pg_xlogdump.
 Wouldn't it be better to declare it as a static routine in
 xlogutils.c? If we keep it in xlogreader.c, I think that we should at
least wrap it with #ifndef FRONTEND.
 Thoughts?

If we do this, pg_lzcompress.c doesn't need to be moved to common for
FPW compression patch which we're talking about in other thread. Right?

DecodeXLogRecord() seems also a backend-only, so we should treat it
in the same way as you proposed? Or pg_rewind uses that?

Regards,

-- 
Fujii Masao




Re: [HACKERS] Moving RestoreBlockImage from xlogreader.c to xlogutils.c

2014-12-24 Thread Fujii Masao
On Wed, Dec 24, 2014 at 10:41 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Wed, Dec 24, 2014 at 10:16 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Dec 24, 2014 at 9:42 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 Wouldn't it be better to declare it as a static routine in
 xlogutils.c? If we keep it in xlogreader.c, I think that we should at
 least wrap it with #ifndef FRONTEND.

 If we do this, pg_lzcompress.c doesn't need to be moved to common for
 FPW compression patch which we're talking about in other thread. Right?
 Yes. This refactoring came to my mind while re-thinking about the WAL
 compression. This would also make more straight-forward the
 implementation of hooks for compression and decompression.

Fair enough. Anyway, I will hold off on applying the patch which moves
pg_lzcompress.c until we have reached a consensus about this.

 DecodeXLogRecord() seems also a backend-only, so we should treat it
 in the same way as you proposed? Or pg_rewind uses that?
 DecodeXLogRecord is used by XLogReadRecord, the latter being called by
 pg_xlogdump and also pg_rewind, so it is not backend-only.

Yeah, you're right.

Regards,

-- 
Fujii Masao




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-18 Thread Fujii Masao
On Thu, Dec 18, 2014 at 2:21 PM, Michael Paquier
michael.paqu...@gmail.com wrote:


 On Wed, Dec 17, 2014 at 11:33 PM, Rahila Syed rahilasye...@gmail.com
 wrote:

 I had a look at code. I have few minor points,

 Thanks!

 +   bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
 +
 +   if (is_compressed)
 {
 -   rdt_datas_last->data = page;
 -   rdt_datas_last->len = BLCKSZ;
 +   /* compressed block information */
 +   bimg.length = compress_len;
 +   bimg.extra_data = hole_offset;
 +   bimg.extra_data |= XLR_BLCK_COMPRESSED_MASK;

 For consistency with the existing code, how about renaming the macro
 XLR_BLCK_COMPRESSED_MASK as BKPBLOCK_HAS_COMPRESSED_IMAGE on the lines of
 BKPBLOCK_HAS_IMAGE.

 OK, why not...


 +   blk->hole_offset = extra_data & ~XLR_BLCK_COMPRESSED_MASK;
 Here, I think that having the mask as BKPBLOCK_HOLE_OFFSET_MASK will be
 more indicative of the fact that lower 15 bits of extra_data field comprises
 of hole_offset value. This suggestion is also just to achieve consistency
 with the existing BKPBLOCK_FORK_MASK for fork_flags field.

 Yeah that seems clearer, let's define it as ~XLR_BLCK_COMPRESSED_MASK
 though.

 And comment typo
 +* First try to compress block, filling in the page hole with
 zeros
 +* to improve the compression of the whole. If the block is
 considered
 +* as incompressible, complete the block header information as
 if
 +* nothing happened.

 As hole is no longer being compressed, this needs to be changed.

 Fixed. As well as an additional comment block down.

 A couple of things noticed on the fly:
 - Fixed pg_xlogdump being not completely correct to report the FPW
 information
 - A couple of typos and malformed sentences fixed
 - Added an assertion to check that the hole offset value does not use the bit
 used for compression status
 - Reworked docs, mentioning as well that wal_compression is off by default.
 - Removed stuff in pg_controldata and XLOG_PARAMETER_CHANGE (mentioned by
 Fujii-san)

Thanks!

+else
+memcpy(compression_scratch, page, page_len);

I don't think the block image needs to be copied to scratch buffer here.
We can try to compress the page directly.

+#include "utils/pg_lzcompress.h"
 #include "utils/memutils.h"

pg_lzcompress.h should be after memutils.h.

+/* Scratch buffer used to store block image to-be-compressed */
+static char compression_scratch[PGLZ_MAX_BLCKSZ];

Isn't it better to allocate the memory for compression_scratch in
InitXLogInsert()
like hdr_scratch?

+uncompressed_page = (char *) palloc(PGLZ_RAW_SIZE(header));

Why don't we allocate the buffer for uncompressed page only once and
keep reusing it like XLogReaderState->readBuf? The size of uncompressed
page is at most BLCKSZ, so we can allocate the memory for it even before
knowing the real size of each block image.

-printf(" (FPW); hole: offset: %u, length: %u\n",
-   record->blocks[block_id].hole_offset,
-   record->blocks[block_id].hole_length);
+if (record->blocks[block_id].is_compressed)
+printf(" (FPW); hole offset: %u, compressed length %u\n",
+   record->blocks[block_id].hole_offset,
+   record->blocks[block_id].bkp_len);
+else
+printf(" (FPW); hole offset: %u, length: %u\n",
+   record->blocks[block_id].hole_offset,
+   record->blocks[block_id].bkp_len);

We need to consider what info about FPW we want pg_xlogdump to report.
I'd like to calculate how much bytes FPW was compressed, from the report
of pg_xlogdump. So I'd like to see also the both length of uncompressed FPW
and that of compressed one in the report.

In pg_config.h, the comment of BLCKSZ needs to be updated? Because
the maximum size of BLCKSZ can be affected by not only itemid but also
XLogRecordBlockImageHeader.

 bool		has_image;
+bool		is_compressed;

Doesn't ResetDecoder need to reset is_compressed?

+#wal_compression = off			# enable compression of full-page writes

Currently wal_compression compresses only FPW, so isn't it better to place
it after full_page_writes in postgresql.conf.sample?

+uint16		extra_data;	/* used to store offset of bytes in hole, with
+ * last free bit used to check if block is
+ * compressed */

At least to me, defining something like the following seems easier to
read.

uint16		hole_offset:15,
		is_compressed:1
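
A compilable sketch of that suggestion (only the two fields are shown;
the other struct members are elided):

    typedef struct XLogRecordBlockImageHeader
    {
        uint16      hole_offset:15,     /* number of bytes before "hole" */
                    is_compressed:1;    /* compression status of the image */
        /* ... other members ... */
    } XLogRecordBlockImageHeader;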

Regards,

-- 
Fujii Masao




Re: [HACKERS] Minor improvement to explain.c

2014-12-18 Thread Fujii Masao
On Thu, Dec 18, 2014 at 12:52 PM, Etsuro Fujita
fujita.ets...@lab.ntt.co.jp wrote:
 Hi,

 The attached patch just removes one bad-looking blank line in the
 comments at the top of a function in explain.c.

Applied.

Regards,

-- 
Fujii Masao




Re: [HACKERS] WALWriter active during recovery

2014-12-18 Thread Fujii Masao
On Tue, Dec 16, 2014 at 3:51 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Currently, WALReceiver writes and fsyncs data it receives. Clearly,
 while we are waiting for an fsync we aren't doing any other useful
 work.

 Following patch starts WALWriter during recovery and makes it
 responsible for fsyncing data, allowing WALReceiver to progress other
 useful actions.

+1

 At present this is a WIP patch, for code comments only. Don't bother
 with anything other than code questions at this stage.

 Implementation questions are

 * How should we wake WALReceiver, since it waits on a poll(). Should
 we use SIGUSR1, which is already used for latch waits, or another
 signal?

Probably we need to change libpqwalreceiver so that it uses the latch.
This is useful even for the startup process to report the replay location to
the walreceiver in real time.

 * Should we introduce some pacing delays if the WALreceiver gets too
 far ahead of apply?

I don't think so for now. Instead, we can support synchronous_commit = replay,
and the users can use that new mode if they are worried about the delay of
WAL replay.

 * Other questions you may have?

Who should wake the startup process so that it reads and replays the WAL data?
Currently the walreceiver does. But if the walwriter becomes responsible for
fsyncing WAL data, probably the walwriter should do that, because the startup
process should not replay WAL data which has not been fsync'd yet.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Streaming replication and WAL archive interactions

2014-12-18 Thread Fujii Masao
On Wed, Dec 17, 2014 at 4:11 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 12/16/2014 10:24 AM, Borodin Vladimir wrote:

  12 Dec 2014, at 16:46, Heikki Linnakangas
  hlinnakan...@vmware.com wrote:

 There have been a few threads on the behavior of WAL archiving,
 after a standby server is promoted [1] [2]. In short, it doesn't
 work as you might expect. The standby will start archiving after
 it's promoted, but it will not archive files that were replicated
 from the old master via streaming replication. If those files were
 not already archived in the master before the promotion, they are
 not archived at all. That's not good if you wanted to restore from
 a base backup + the WAL archive later.

 The basic setup is a master server, a standby, a WAL archive that's
 shared by both, and streaming replication between the master and
 standby. This should be a very common setup in the field, so how
  are people doing it in practice? Just live with the risk that you
 might miss some files in the archive if you promote? Don't even
 realize there's a problem? Something else?


 Yes, I do live like that (with streaming replication and shared
 archive between master and replicas) and don’t even realize there’s a
 problem :( And I think I’m not the only one. Maybe at least a note
 should be added to the documentation?


 Let's try to figure out a way to fix this in master, but yeah, a note in the
 documentation is in order.

+1

 And how would we like it to work?


 Here's a plan:

 Have a mechanism in the standby, to track how far the master has archived
 its WAL, and don't throw away WAL in the standby that hasn't been archived
 in the master yet. This is similar to the physical replication slots, which
 prevent the master from recycling WAL that a standby hasn't received yet,
 but in reverse. I think we can use the .done and .ready files for this.
 Whenever a file is streamed (completely) from the master, create a .ready
 file for it. When we get an acknowledgement from the master that it has
 archived it, create a .done file for it. To get the information from the
 master, add the last archived WAL segment e.g. in the streaming
 replication keep-alive message, or invent a new message type for it.
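
A rough sketch of that bookkeeping on the standby side, reusing the
existing archive-status primitives; where exactly these calls would live
in the walreceiver, and the last_segno_archived_by_master variable, are
assumptions:

    /* After a segment has been fully streamed from the master */
    XLogArchiveNotify(xlogfname);        /* creates archive_status/<seg>.ready */

    /* Later, when the master reports that it has archived the segment */
    if (segno <= last_segno_archived_by_master)
        XLogArchiveForceDone(xlogfname); /* replaces .ready with .done */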

Sounds OK to me.

How does this work in the cascaded replication case? The cascading walsender
just relays the archive location to the downstream standby?

What happens when WAL streaming is terminated and the startup process starts to
read the WAL file from the archive? After reading the WAL file from the archive,
probably we would need to change the .ready files of every older WAL file to .done.

Regards,

-- 
Fujii Masao




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-18 Thread Fujii Masao
On Thu, Dec 18, 2014 at 7:31 PM, Rahila Syed rahilasye...@gmail.com wrote:
Isn't it better to allocate the memory for compression_scratch in
InitXLogInsert()
like hdr_scratch?

 I think making compression_scratch a statically allocated global variable
 is the result of the following discussion earlier:

 http://www.postgresql.org/message-id/ca+tgmoaznbuwnls4bpwyqgqteeznoavy7kwdbm0a2-tbarn...@mail.gmail.com

/*
 * Permanently allocate readBuf.  We do it this way, rather than just
 * making a static array, for two reasons: (1) no need to waste the
 * storage in most instantiations of the backend; (2) a static char array
 * isn't guaranteed to have any particular alignment, whereas palloc()
 * will provide MAXALIGN'd storage.
 */

The above source code comment in XLogReaderAllocate() makes me think that
it's better to avoid using a static array. The point (1) seems less important in
this case because most processes need the buffer for WAL compression,
though.
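
A minimal sketch of the palloc-based alternative, assuming the allocation
is done in InitXLogInsert() inside the WAL-insertion memory context (the
context variable name follows xloginsert.c; the rest is illustrative):

    static char *compression_scratch = NULL;

    void
    InitXLogInsert(void)
    {
        /* ... existing setup of the xloginsert context, hdr_scratch, ... */

        /* Allocated once per backend; palloc guarantees MAXALIGN'd storage */
        if (compression_scratch == NULL)
            compression_scratch = (char *)
                MemoryContextAlloc(xloginsert_cxt, PGLZ_MAX_BLCKSZ);
    }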

Regards,

-- 
Fujii Masao




Re: [HACKERS] .gitignore config.cache and comment about regression.(out|diff)

2014-12-18 Thread Fujii Masao
On Tue, Dec 16, 2014 at 9:47 AM, Jim Nasby jim.na...@bluetreble.com wrote:
 config.cache is created when you pass -C to configure, which speeds it up
 considerably (3.5s vs 16.5 on my laptop). It would be nice to just ignore
 the cache file it generates.

 Originally this patch also ignored the regression output files, until I
 realized why that was a bad idea. Add a comment about that.

Applied.

Regards,

-- 
Fujii Masao




Re: [HACKERS] pgbench -f and vacuum

2014-12-18 Thread Fujii Masao
On Sun, Dec 14, 2014 at 11:43 AM, Tatsuo Ishii is...@postgresql.org wrote:
 If we care enough about that case to attempt the vacuum anyway then we
 need to do something about the error message; either squelch it or
 check for the existence of the tables before attempting to
 vacuum. Since there's no way to squelch in the server logfile, I think
 checking for the table is the right answer.

  Fair enough. I will come up with a check-for-the-table-before-vacuum
  approach.

+1 for this approach.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Add generate_series(numeric, numeric)

2014-12-18 Thread Fujii Masao
On Mon, Dec 15, 2014 at 12:25 PM, Andrew Gierth
and...@tao11.riddles.org.uk wrote:
 Fujii == Fujii Masao masao.fu...@gmail.com writes:

  Fujii Pushed.

 Bug found:

 regression=# select count(*) from generate_series(1::numeric,10) v, generate_series(1,v) w;
  count
 -------
      0
 (1 row)

 regression=# select count(*) from generate_series(1::numeric,10) v, generate_series(1,v+0) w;
  count
 -------
     55
 (1 row)

 The error is in the use of PG_GETARG_NUMERIC and init_var_from_num
 when setting up the multi-call state; init_var_from_num points at the
 original num's digits rather than copying them, but start_num and
 stop_num have just been (potentially) detoasted in the per-call
 context, in which case the storage will have been freed by the next
 call.

 Obviously this could also be fixed by not detoasting the input until
 after switching to the multi-call context, but it looks to me like
 that would be unnecessarily complex.

 Suggested patch attached.
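
Separately from the attached patch, a minimal sketch of the safe pattern
for the general case (MyFctx and the text argument are purely
illustrative): copy the detoasted value into the multi-call context
instead of keeping a pointer into the transient copy.

    FuncCallContext *funcctx;

    if (SRF_IS_FIRSTCALL())
    {
        text       *arg = PG_GETARG_TEXT_PP(0); /* transient detoasted copy */
        MemoryContext oldcontext;
        MyFctx     *fctx;

        funcctx = SRF_FIRSTCALL_INIT();
        oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);

        fctx = (MyFctx *) palloc(sizeof(MyFctx));
        /* datumCopy() duplicates the varlena into the current context */
        fctx->value = (text *)
            DatumGetPointer(datumCopy(PointerGetDatum(arg), false, -1));
        funcctx->user_fctx = fctx;

        MemoryContextSwitchTo(oldcontext);
    }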

Pushed. Thanks!

Regards,

-- 
Fujii Masao




Re: [HACKERS] Add generate_series(numeric, numeric)

2014-12-18 Thread Fujii Masao
On Mon, Dec 15, 2014 at 2:38 PM, Andrew Gierth
and...@tao11.riddles.org.uk wrote:
 Ali == Ali Akbar the.ap...@gmail.com writes:

  Ali  I think yes, it will be good. The alternative is restructuring
  Ali this paragraph in the SRF docs:

   The memory context that is current when the SRF is called is a
   transient context that will be cleared between calls. This means
   that you do not need to call pfree on everything you allocated
   using palloc; it will go away anyway. However, if you want to
   allocate any data structures to live across calls, you need to put
   them somewhere else. The memory context referenced by
   multi_call_memory_ctx is a suitable location for any data that
   needs to survive until the SRF is finished running. In most cases,
   this means that you should switch into multi_call_memory_ctx while
   doing the first-call setup.

  Ali The important part "However, if you want to allocate any data
  Ali structures to live across calls, you need to put them somewhere
  Ali else." is buried in the docs.

  Ali But i think explicit warning is more noticeable for new
  Ali developer like me.

 I was thinking something like this, added just after that para:

 <warning>
  <para>
   While the actual arguments to the function remain unchanged between
   calls, if you detoast the argument values (which is normally done
   transparently by the
   <function>PG_GETARG_<replaceable>xxx</replaceable></function> macro)
   in the transient context then the detoasted copies will be freed on
   each cycle. Accordingly, if you keep references to such values in
   your <structfield>user_fctx</>, you must either copy them into the
   <structfield>multi_call_memory_ctx</> after detoasting, or ensure
   that you detoast the values only in that context.
  </para>
 </warning>

I'm OK with this.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Function to know last log write timestamp

2014-12-18 Thread Fujii Masao
On Fri, Nov 28, 2014 at 9:07 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Nov 26, 2014 at 4:05 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Fri, Aug 15, 2014 at 8:17 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Fri, Aug 15, 2014 at 3:40 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
 On 2014-08-14 14:37:22 -0400, Robert Haas wrote:
 On Thu, Aug 14, 2014 at 2:21 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  On 2014-08-14 14:19:13 -0400, Robert Haas wrote:
  That's about the idea. However, what you've got there is actually
  unsafe, because shmem-counter++ is not an atomic operation.  It reads
  the counter (possibly even as two separate 4-byte loads if the counter
  is an 8-byte value), increments it inside the CPU, and then writes the
  resulting value back to memory.  If two backends do this concurrently,
  one of the updates might be lost.
 
  All these are only written by one backend, so it should be safe. Note
  that that coding pattern, just without memory barriers, is all over
  pgstat.c

 Ah, OK.  If there's a separate slot for each backend, I agree that it's 
 safe.

 We should probably add barriers to pgstat.c, too.

 Yea, definitely. I think this is rather borked on weaker
 architectures. It's just that the consequences of an out of date/torn
 value are rather low, so it's unlikely to be noticed.

 Imo we should encapsulate the changecount modifications/checks somehow
 instead of repeating the barriers, Asserts, comments et al everywhere.

 So what about applying the attached patch first, which adds the macros
 to load and store the changecount with the memory barriers, and changes
 pgstat.c use them. Maybe this patch needs to be back-patch to at least 9.4?

 After applying the patch, I will rebase the pg_last_xact_insert_timestamp
 patch and post it again.
 Hm, what's the status on this patch? The addition of those macros to
 control count increment with a memory barrier seems like a good thing
 at least.

 Thanks for reminding me of that! Barring any objection, I will commit it.

Applied.
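
For reference, a hedged sketch of what such macros can look like; the
macro names here are invented, while st_changecount and the barrier
primitives follow the usual pgstat conventions:

    /*
     * Bracket every update of a backend's status entry with these, so a
     * reader that compares the changecount before and after its reads can
     * detect a torn or in-progress update and retry.
     */
    #define BEGIN_WRITE_STATUS(beentry) \
        do { \
            (beentry)->st_changecount++; \
            pg_write_barrier(); \
        } while (0)

    #define END_WRITE_STATUS(beentry) \
        do { \
            pg_write_barrier(); \
            (beentry)->st_changecount++; \
            Assert(((beentry)->st_changecount & 1) == 0); \
        } while (0)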

Regards,

-- 
Fujii Masao




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-17 Thread Fujii Masao
On Wed, Dec 17, 2014 at 1:34 AM, Michael Paquier
michael.paqu...@gmail.com wrote:


 On Wed, Dec 17, 2014 at 12:00 AM, Michael Paquier
 michael.paqu...@gmail.com wrote:

 Actually, the original length of the compressed block is saved in
 PGLZ_Header, so if we are fine with not checking the size of the block
 decompressed when decoding WAL, we can do without the hole filled with zeros,
 and use only 1 bit to see if the block is compressed or not.

 And.. After some more hacking, I have been able to come up with a patch that
 is able to compress blocks without the page hole, and that keeps the WAL
 record length the same as HEAD when compression switch is off. The numbers
 are pretty good, CPU is saved in the same proportions as previous patches
 when compression is enabled, and there is zero delta with HEAD when
 compression switch is off.

 Here are the actual numbers:
test_name   | pg_size_pretty | user_diff | system_diff
 -------------------------------+----------------+-----------+-------------
  FPW on + 2 bytes, ffactor 50  | 582 MB | 42.391894 |0.807444
  FPW on + 2 bytes, ffactor 20  | 229 MB | 14.330304 |0.729626
  FPW on + 2 bytes, ffactor 10  | 117 MB |  7.335442 |0.570996
  FPW off + 2 bytes, ffactor 50 | 746 MB | 25.330391 |1.248503
  FPW off + 2 bytes, ffactor 20 | 293 MB | 10.537475 |0.755448
  FPW off + 2 bytes, ffactor 10 | 148 MB |  5.762775 |0.763761
  FPW on + 0 bytes, ffactor 50  | 582 MB | 42.174297 |0.790596
  FPW on + 0 bytes, ffactor 20  | 229 MB | 14.424233 |0.770459
  FPW on + 0 bytes, ffactor 10  | 117 MB |  7.057195 |0.584806
  FPW off + 0 bytes, ffactor 50 | 746 MB | 25.261998 |1.054516
  FPW off + 0 bytes, ffactor 20 | 293 MB | 10.589888 |0.860207
  FPW off + 0 bytes, ffactor 10 | 148 MB |  5.827191 |0.874285
  HEAD, ffactor 50  | 746 MB | 25.181729 |1.133433
  HEAD, ffactor 20  | 293 MB |  9.962242 |0.765970
  HEAD, ffactor 10  | 148 MB |  5.693426 |0.775371
  Record, ffactor 50| 582 MB | 54.904374 |0.678204
  Record, ffactor 20| 229 MB | 19.798268 |0.807220
  Record, ffactor 10| 116 MB |  9.401877 |0.668454
 (18 rows)

 The new tests of this patch are FPW off + 0 bytes. Patches as well as
 results are attached.

I think that neither pg_control nor xl_parameter_change needs to have the info
about WAL compression, because each backup block has that entry.

Will review the remaining part later.
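
For illustration, the per-block image header could carry the flag like
this (struct and flag names are hypothetical at this point in the
thread):

    typedef struct XLogRecordBlockImageHeader
    {
        uint16      length;         /* length of the page image */
        uint16      hole_offset;    /* number of bytes before the "hole" */
        uint8       bimg_info;      /* flag bits, see below */
    } XLogRecordBlockImageHeader;

    #define BKPIMAGE_HAS_HOLE       0x01    /* page image has a "hole" */
    #define BKPIMAGE_IS_COMPRESSED  0x02    /* page image is compressed */

On replay the decoder just tests bimg_info & BKPIMAGE_IS_COMPRESSED for
each block, so no cluster-wide switch is needed.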

Regards,

-- 
Fujii Masao




Re: [HACKERS] pg_basebackup -x/X doesn't play well with archive_mode wal_keep_segments

2014-12-04 Thread Fujii Masao
On Fri, Dec 5, 2014 at 9:28 AM, Andres Freund and...@2ndquadrant.com wrote:
 Hi,

 We've recently observed a case where, after a promotion, a postgres
 server suddenly started to archive a large amount of old WAL.

 After some digging the problem is this:

 pg_basebackup -X creates files in pg_xlog/ without creating the
 corresponding .done file. Note that walreceiver *does* create them. The
 standby in this case, just like the master, had a significant
 wal_keep_segments.  RemoveOldXlogFiles() then, during recovery restart
 points, calls XLogArchiveCheckDone() which in turn does:
 /* Retry creation of the .ready file */
 XLogArchiveNotify(xlog);
 return false;
 if there's neither a .done nor a .ready file present and archive_mode is
 enabled. These segments then aren't removed because there's a .ready
 present and they're never archived as long as the node is a standby
 because we don't do archiving on standbys.
 Once the node is promoted, the archiver will be started and will suddenly
 archive all these files - which might be months old.

 An additional, at first strange, detail is that a lot of the
 .ready files had nearly the same timestamps. Turns out that's due to
 wal_keep_segments. Initially RemoveOldXlogFiles() doesn't process the
 files because they're newer than allowed by wal_keep_segments. Then,
 every checkpoint, a couple of segments would become old enough to reach
 XLogArchiveCheckDone(), which would then create the .ready marker... but
 not all at once :)


 So I think we just need to make pg_basebackup create the .ready
 files.

s/.ready/.done? If yes, +1.

 Given that the walreceiver and restore_command already
 unconditionally do XLogArchiveForceDone() I think we'd follow the
 established precedent. Arguably it could make sense to archive files
 again on the standby after a promotion as they aren't guaranteed to have
 been on the then primary. But we don't have any infrastructure anyway
 for that and walsender doesn't do so, so it doesn't seem to make any
 sense to do that for pg_basebackup.

 Independent of this bug, there's also some debatable behaviour about
 what happens if a node with a high wal_keep_segments turns on
 archive_mode. Suddenly all those old files are archived... I think it
 might be a good idea to simply always create .done files for WAL
 segments that are finished while archive_mode is disabled.

+1
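
To make that concrete, a frontend-side sketch of what pg_basebackup
could do after finishing each segment (function name and error handling
are illustrative):

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include "c.h"              /* for MAXPGPATH */

    static void
    mark_segment_done(const char *xlogdir, const char *segname)
    {
        char        path[MAXPGPATH];
        FILE       *f;

        /* zero-length marker, mirroring what XLogArchiveForceDone() creates */
        snprintf(path, sizeof(path), "%s/archive_status/%s.done",
                 xlogdir, segname);
        f = fopen(path, "w");
        if (f == NULL)
            fprintf(stderr, "could not create \"%s\": %s\n",
                    path, strerror(errno));
        else
            fclose(f);
    }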

Regards,

-- 
Fujii Masao




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-04 Thread Fujii Masao
On Thu, Dec 4, 2014 at 8:37 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Thu, Dec 4, 2014 at 7:36 PM, Rahila Syed rahilasyed...@gmail.com wrote:
 IIUC, forcibly written FPWs are not exposed to the user, so is it worthwhile to
 add a GUC similar to full_page_writes in order to control a feature which is
 not exposed to the user in the first place?

 If full_page_writes is set to 'off' by the user, the user probably cannot
 afford the overhead involved in writing large pages to disk. So, if a full
 page write is forcibly written in such a situation, it is better to compress
 it before writing, to alleviate the drawbacks of full page writes on servers
 with a heavy write load.

 The only scenario in which a user would not want to compress forcibly
 written pages is when CPU utilization is high. But according to measurements
 done earlier, the CPU utilization of compress='on' and 'off' is not
 significantly different.

 Yes, they are not visible to the user, but they still exist. I'd prefer that
 we have a safety net, though, to prevent any problems that may occur if the
 compression algorithm has a bug: if we enforce compression for forcibly-written
 blocks, all of our users' backups would be impacted.

 I pondered something that Andres mentioned upthread: we could do the
 compression in a WAL record not only for blocks, but also at the record
 level. Joining the two ideas together, I think that we should definitely
 have a separate GUC to control the feature, consistently for all the
 images. Let's call it wal_compression, with the following possible values:
 - on, meaning that a maximum of compression is done (for this feature,
   basically full_page_writes = on)
 - full_page_writes, meaning that full page writes are compressed
 - off, the default value, disabling the feature completely
 This would leave room for another mode, 'record', to compress a record
 completely. For now though, I think that a simple on/off switch would be
 fine for this patch. Let's keep things simple.

+1
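
FWIW, the proposal maps naturally onto the usual enum-GUC pattern from
guc.c; a sketch, with the C-level names made up here:

    typedef enum
    {
        WAL_COMPRESSION_OFF,        /* default */
        WAL_COMPRESSION_FPW,        /* compress full page writes only */
        WAL_COMPRESSION_ON          /* maximum compression */
    } WalCompression;

    static const struct config_enum_entry wal_compression_options[] = {
        {"off", WAL_COMPRESSION_OFF, false},
        {"full_page_writes", WAL_COMPRESSION_FPW, false},
        {"on", WAL_COMPRESSION_ON, false},
        {NULL, 0, false}
    };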

Regards,

-- 
Fujii Masao




Re: [HACKERS] tracking commit timestamps

2014-12-03 Thread Fujii Masao
On Wed, Dec 3, 2014 at 11:54 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Pushed with some extra cosmetic tweaks.

I got the following assertion failure when I executed pg_xact_commit_timestamp()
on the standby server.

=# select pg_xact_commit_timestamp('1000'::xid);
TRAP: FailedAssertion(!(((oldestCommitTs) != ((TransactionId) 0)) ==
((newestCommitTs) != ((TransactionId) 0))), File: commit_ts.c,
Line: 315)
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: 2014-12-04
12:01:08 JST sby1 LOG:  server process (PID 15545) was terminated by
signal 6: Aborted
2014-12-04 12:01:08 JST sby1 DETAIL:  Failed process was running:
select pg_xact_commit_timestamp('1000'::xid);
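
Decoded, the failing check in commit_ts.c is essentially:

    Assert(TransactionIdIsValid(oldestCommitTs) ==
           TransactionIdIsValid(newestCommitTs));

i.e., exactly one of the two bounds was valid when the lookup ran.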

The way to reproduce this problem is

#1. set up and start the master and standby servers with
track_commit_timestamp disabled
#2. enable track_commit_timestamp in the master and restart the master
#3. run some write transactions
#4. enable track_commit_timestamp in the standby and restart the standby
#5. execute select pg_xact_commit_timestamp('1000'::xid) in the standby

BTW, at step #4, I got the following log messages. This might be a hint about
this problem.

LOG:  file pg_commit_ts/ doesn't exist, reading as zeroes
CONTEXT:  xlog redo Transaction/COMMIT: 2014-12-04 12:00:16.428702+09;
inval msgs: catcache 59 catcache 58 catcache 59 catcache 58 catcache
45 catcache 44 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608
relcache 16384

Regards,

-- 
Fujii Masao




Re: [HACKERS] tracking commit timestamps

2014-12-03 Thread Fujii Masao
On Thu, Dec 4, 2014 at 12:58 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 4 December 2014 at 03:08, Fujii Masao masao.fu...@gmail.com wrote:

 #1. set up and start the master and standby servers with
 track_commit_timestamp disabled
 #2. enable track_commit_timestamp in the master and restart the master
 #3. run some write transactions
 #4. enable track_commit_timestamp in the standby and restart the standby
 #5. execute select pg_xact_commit_timestamp('1000'::xid) in the standby

 I'm not sure what step 4 is supposed to do?

 Surely if steps 1-3 generate any WAL then the standby should replay
 it, whether or not track_commit_timestamp is enabled.

 So what effect does setting that parameter have on the standby?

At least track_commit_timestamp seems to need to be enabled even in the standby
when we want to call pg_xact_commit_timestamp() and pg_last_committed_xact()
in the standby. I'm not sure if this is good design, though.

Regards,

-- 
Fujii Masao




Re: [HACKERS] The problems of PQhost()

2014-11-28 Thread Fujii Masao
On Fri, Nov 28, 2014 at 3:43 AM, Noah Misch n...@leadboat.com wrote:
 On Fri, Nov 28, 2014 at 03:11:06AM +0900, Fujii Masao wrote:
 On Thu, Nov 27, 2014 at 12:38 PM, Noah Misch n...@leadboat.com wrote:
  Sure.  I'll first issue git revert 9f80f48, then apply the attached 
  patch.
  Since libpq ignores a hostaddr parameter equal to the empty string, this
  implementation does likewise.  Apart from that, I anticipate behavior
  identical to today's code.

 +fprintf(stderr, _("out of memory\n"));

 psql_error() should be used instead of fprintf()?

 I copied what pg_malloc() would do.  Either way seems reasonable.

psql_error() seems better for the case where psql executes a
specified input file.
In that case, psql_error() reports the message in a format like
psql:filename:lineno: message.
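
Concretely, the change would be something like:

    /* instead of: */
    fprintf(stderr, _("out of memory\n"));
    /* use: */
    psql_error("out of memory\n");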

Regards,

-- 
Fujii Masao




Re: [HACKERS] Function to know last log write timestamp

2014-11-28 Thread Fujii Masao
On Wed, Nov 26, 2014 at 4:05 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Fri, Aug 15, 2014 at 8:17 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Fri, Aug 15, 2014 at 3:40 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
 On 2014-08-14 14:37:22 -0400, Robert Haas wrote:
 On Thu, Aug 14, 2014 at 2:21 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  On 2014-08-14 14:19:13 -0400, Robert Haas wrote:
  That's about the idea. However, what you've got there is actually
  unsafe, because shmem->counter++ is not an atomic operation.  It reads
  the counter (possibly even as two separate 4-byte loads if the counter
  is an 8-byte value), increments it inside the CPU, and then writes the
  resulting value back to memory.  If two backends do this concurrently,
  one of the updates might be lost.
 
  All these are only written by one backend, so it should be safe. Note
  that that coding pattern, just without memory barriers, is all over
  pgstat.c

 Ah, OK.  If there's a separate slot for each backend, I agree that it's 
 safe.

 We should probably add barriers to pgstat.c, too.

 Yea, definitely. I think this is rather borked on weaker
 architectures. It's just that the consequences of an out of date/torn
 value are rather low, so it's unlikely to be noticed.

 Imo we should encapsulate the changecount modifications/checks somehow
 instead of repeating the barriers, Asserts, comments et al everywhere.

 So what about applying the attached patch first, which adds the macros
 to load and store the changecount with the memory barriers, and changes
 pgstat.c to use them? Maybe this patch needs to be back-patched to at least 9.4?

 After applying the patch, I will rebase the pg_last_xact_insert_timestamp
 patch and post it again.
 Hm, what's the status on this patch? The addition of those macros to
 control count increment with a memory barrier seems like a good thing
 at least.

Thanks for reminding me of that! Barring any objection, I will commit it.

 The 2nd patch has not been rebased, but still...

The feature that this 2nd patch implements is very similar to a part of
what the committs patch does, i.e., tracking the timestamps of the committed
transactions. If the committs patch gets committed, I'd basically like to
stop working on the 2nd patch to avoid duplicate work.
OTOH, I'm concerned about the performance impact of the committs patch.
So, for a simple use case like checking replication lag, what the 2nd
patch implements seems better, though...

Regards,

-- 
Fujii Masao




Re: [HACKERS] pg_isready: Missing translation macros.

2014-11-27 Thread Fujii Masao
On Thu, Nov 27, 2014 at 9:32 PM, Mats Erik Andersson b...@gisladisker.se 
wrote:
 Hello there,

 the text response of pg_isready is hard-coded in English.
 These short snippets really ought to be localized as well.

Thanks for the patch! Committed.
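
For the record, the fix is essentially wrapping each user-facing string
in the gettext macro, e.g. (the string shown is illustrative):

    printf("%s\n", _("accepting connections"));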

Regards,

-- 
Fujii Masao




Re: [HACKERS] psql \watch always ignores \pset null

2014-11-27 Thread Fujii Masao
On Wed, Nov 19, 2014 at 3:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Will Leinweber w...@heroku.com writes:
 On Tue, Nov 18, 2014 at 9:54 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Fujii Masao masao.fu...@gmail.com writes:
 Is there any reason why \watch must ignore \pset null setting?

 Hmmm ... the comment offers a reasonable argument for forcing pager = 0,
 but I agree the nullPrint change is not adequately explained.
 Will, do you remember why you did that?

 I tracked down the individual commit[1] from my history where I added
 that. What I added there is very similar to sections in
 src/bin/psql/describe.c. I can't remember specifically my reasoning
 then, but it's likely I copied the patterns there while getting things
 working.
 I do still think it's important to remove the pager, but the nullPrint
 is probably a mistake.

 I took a quick look and noted that the other places where nullPrint is
 summarily forced to null are for \d and similar queries.  For those,
 the code can reasonably have an opinion about what the presentation should
 be like, since it knows what SQL query it's issuing.  That argument surely
 doesn't apply to \watch, so I'm in agreement with Fujii that it'd be
 better to respect the user's \pset setting.

Thanks! I've just fixed this problem.
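
The resulting pattern in do_watch() looks roughly like this (a sketch,
not the exact committed diff):

    printQueryOpt myopt = pset.popt;    /* start from the user's \pset state */

    /* the pager would interfere with the repeated redisplay loop */
    myopt.topt.pager = 0;
    /* no longer forcing myopt.nullPrint = NULL, so \pset null is respected */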

Regards,

-- 
Fujii Masao




Re: [HACKERS] why is PG_AUTOCONF_FILENAME is pg_config_manual.h?

2014-11-27 Thread Fujii Masao
On Thu, Nov 27, 2014 at 11:58 PM, Peter Eisentraut pete...@gmx.net wrote:
 Surely that's not a value that we expect users to be able to edit.  Is
 pg_config_manual.h just abused as a place that's included everywhere?

 (I suggest utils/guc.h as a better place.)

+1

Regards,

-- 
Fujii Masao




Re: [HACKERS] The problems of PQhost()

2014-11-27 Thread Fujii Masao
On Thu, Nov 27, 2014 at 12:38 PM, Noah Misch n...@leadboat.com wrote:
 On Tue, Nov 25, 2014 at 09:53:10PM +0900, Fujii Masao wrote:
 On Tue, Nov 25, 2014 at 12:37 PM, Noah Misch n...@leadboat.com wrote:
  On Wed, Jan 22, 2014 at 11:48:26PM +0900, Fujii Masao wrote:
  (3) PQhost() cannot return the hostaddr.
 
  We can fix the problem (3) by changing PQhost() so that it also
  returns the hostaddr. But this change might break existing
  applications using PQhost(). So, I added a new libpq function, PQhostaddr(),
  which returns the hostaddr, and changed \conninfo so that it reports
  correct connection information by using both PQhost() and PQhostaddr().
 
  + varlistentry id=libpq-pqhostaddr
  +  term
  +   functionPQhostaddr/function
  +   indexterm
  +primaryPQhostaddr/primary
  +   /indexterm
  +  /term
  +
  +  listitem
  +   para
  +Returns the server numeric IP address or host name of the 
  connection.
  + synopsis
  + char *PQhostaddr(const PGconn *conn);
  + /synopsis
  +   /para
  +  /listitem
  + /varlistentry
 
  From reading this documentation, I assumed this function would return a
  non-empty value for every TCP connection.  After all, every TCP connection 
  has
  a peer IP address.  In fact, PQhostaddr() returns the raw value of the
  hostaddr connection parameter, whether from a libpq function argument or
  from the PGHOSTADDR environment variable.  (If the parameter and 
  environment
  variable are absent, it returns NULL.  Adding hostaddr= to the connection
  string makes it return the empty string.)  A caller wanting the specific 
  raw
  value of a parameter could already use PQconninfo().  I suspect this new
  function will confuse more than help.  What do you think of reverting it 
  and
  having \conninfo use PQconninfo() to discover any hostaddr value?

 Sounds reasonable to me. Are you planning to implement and commit the patch?

 Sure.  I'll first issue git revert 9f80f48, then apply the attached patch.
 Since libpq ignores a hostaddr parameter equal to the empty string, this
 implementation does likewise.  Apart from that, I anticipate behavior
 identical to today's code.

+fprintf(stderr, _("out of memory\n"));

psql_error() should be used instead of fprintf()?

+if (host == NULL)    /* can't happen */

Is this comment true? ISTM that host can be NULL when the default socket
directory is used, i.e., neither host nor hostaddr is supplied by the user.
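
For reference, a minimal sketch of how \conninfo could discover the
hostaddr via PQconninfo(), per the approach above:

    char       *hostaddr = NULL;
    PQconninfoOption *opts = PQconninfo(pset.db);

    if (opts)
    {
        PQconninfoOption *opt;

        for (opt = opts; opt->keyword != NULL; opt++)
        {
            /* treat an empty hostaddr the same as an absent one */
            if (strcmp(opt->keyword, "hostaddr") == 0 &&
                opt->val != NULL && opt->val[0] != '\0')
                hostaddr = pg_strdup(opt->val);
        }
        PQconninfoFree(opts);
    }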

Regards,

-- 
Fujii Masao




Re: [HACKERS] Add shutdown_at_recovery_target option to recovery.conf

2014-11-26 Thread Fujii Masao
On Wed, Nov 26, 2014 at 5:15 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 25 November 2014 at 14:06, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Nov 20, 2014 at 5:35 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 19 November 2014 16:41, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Nov 19, 2014 at 10:49 PM, Robert Haas robertmh...@gmail.com 
 wrote:
 On Wed, Nov 19, 2014 at 8:13 AM, Simon Riggs si...@2ndquadrant.com 
 wrote:
 If we ask for PAUSE and we're not in HotStandby it just ignores it,
 which means it changes into PROMOTE. My feeling is that we should
 change that into a SHUTDOWN, not a PROMOTE.

 To me, that seems like a definite improvement.

 But changing the default will force us to set
 action_at_recovery_target to 'promote'
 when we want to just recover the database to the specified point.

 If you explicitly request pause and then can't pause, ISTM the action
 we actually perform shouldn't be the exact opposite of what was
 requested.

 So if the user explicitly requests pause and we aren't in HS, then they
 should get Shutdown, which is the closest action.

 If the user doesn't request anything at all then we can default to
 Promote, just like we do now.

 Yes, this is what I was trying to suggest. +1 to do that.

 Implemented.

 Patch committed.

Isn't it better to add a sample setting of this parameter to
recovery.conf.sample?
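
Something along these lines, say (exact comment wording would be up to
the committer):

    #action_at_recovery_target = 'pause'
    #
    # 'pause', 'promote', or 'shutdown'; 'pause' requires hot_standby,
    # and falls back to 'shutdown' when hot standby is not enabled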

Regards,

-- 
Fujii Masao




Re: [HACKERS] The problems of PQhost()

2014-11-25 Thread Fujii Masao
On Tue, Nov 25, 2014 at 12:37 PM, Noah Misch n...@leadboat.com wrote:
 On Wed, Jan 22, 2014 at 11:48:26PM +0900, Fujii Masao wrote:
 (3) PQhost() cannot return the hostaddr.

 We can fix the problem (3) by changing PQhost() so that it also
 returns the hostaddr. But this change might break existing
 applications using PQhost(). So, I added a new libpq function, PQhostaddr(),
 which returns the hostaddr, and changed \conninfo so that it reports
 correct connection information by using both PQhost() and PQhostaddr().

 + varlistentry id=libpq-pqhostaddr
 +  term
 +   functionPQhostaddr/function
 +   indexterm
 +primaryPQhostaddr/primary
 +   /indexterm
 +  /term
 +
 +  listitem
 +   para
 +Returns the server numeric IP address or host name of the 
 connection.
 + synopsis
 + char *PQhostaddr(const PGconn *conn);
 + /synopsis
 +   /para
 +  /listitem
 + /varlistentry

 From reading this documentation, I assumed this function would return a
 non-empty value for every TCP connection.  After all, every TCP connection has
 a peer IP address.  In fact, PQhostaddr() returns the raw value of the
 hostaddr connection parameter, whether from a libpq function argument or
 from the PGHOSTADDR environment variable.  (If the parameter and environment
 variable are absent, it returns NULL.  Adding hostaddr= to the connection
 string makes it return the empty string.)  A caller wanting the specific raw
 value of a parameter could already use PQconninfo().  I suspect this new
 function will confuse more than help.  What do you think of reverting it and
 having \conninfo use PQconninfo() to discover any hostaddr value?

Sounds reasonable to me. Are you planning to implement and commit the patch?

Regards,

-- 
Fujii Masao




Re: [HACKERS] tracking commit timestamps

2014-11-25 Thread Fujii Masao
On Tue, Nov 25, 2014 at 7:58 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 And here is v10 which fixes conflicts with Heikki's WAL API changes (no
 changes otherwise).

 After some slight additional changes, here's v11, which I intend to
 commit early tomorrow.  The main change is moving the test module from
 contrib to src/test/modules.

When I specify the XID of an aborted transaction in pg_xact_commit_timestamp(),
it always returns 2000-01-01 09:00:00+09. Is this intentional?

Can I check my understanding? We probably cannot use this feature to calculate
the actual replication lag by, for example, comparing the result of
pg_last_committed_xact() on the master with that of
pg_last_xact_replay_timestamp()
on the standby, because pg_last_xact_replay_timestamp() can return even
the timestamp of an aborted transaction, while pg_last_committed_xact()
cannot. Right?

Regards,

-- 
Fujii Masao



