Re: Review of pg_basebackup and pg_receivexlog to use non-blocking socket communication, was: Re: [HACKERS] Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

2013-01-28 Thread Magnus Hagander
On Tue, Jan 22, 2013 at 7:31 AM, Amit Kapila amit.kap...@huawei.com wrote:
 On Monday, January 21, 2013 6:22 PM Magnus Hagander
 On Fri, Jan 18, 2013 at 7:50 AM, Amit Kapila amit.kap...@huawei.com
 wrote:
  On Wednesday, January 16, 2013 4:02 PM Heikki Linnakangas wrote:
  On 07.01.2013 16:23, Boszormenyi Zoltan wrote:
   Since my other patch against pg_basebackup is now committed,
   this patch doesn't apply cleanly, patch rejects 2 hunks.
   The fixed up patch is attached.
 
  Now that I look at this a high-level perspective, why are we only
  worried about timeouts in the Copy-mode and when connecting? The
  initial
  checkpoint could take a long time too, and if the server turns into
 a
  black hole while the checkpoint is running, pg_basebackup will still
  hang. Then again, a short timeout on that phase would be a bad idea,
  because the checkpoint can indeed take a long time.
 
  True, but IMO, if somebody want to take basebackup, he should do that
 when
  the server is not loaded.

 A lot of installations don't have such an optino, because there is no
 time whe nthe server is not loaded.

 Good to know about it.
 I have always heard that customer will run background maintenance activities
 (Reindex, Vacuum Full, etc) when the server is less loaded.
 For example
 a. Billing applications in telecom, at night times they can be relatively
 less loaded.

That assumes there is a nighttime.. If you're operating in enough
timezones, that won't happen.


 b. Any databases used for Sensex transactions, they will be relatively free
 once the market is closed.
 c. Banking solutions, because transactions are done mostly in day times.

True. But those are definitely very very narrow usecases ;)

Don't get me wrong. There are a *lot* of people who have nighttimes to
do maintenance in. They are the lucky ones :) But we can't assume this
scenario.


 There will be many cases where Database server will be loaded all the times,
 if you can give some example, it will be a good learning for me.

Most internet based businesses that do business in multiple countries.
Or really, any business that has customers in multiple timezones
across the world. And even more to the point, any business who's
*customers* have customers in multiple timezones across the world,
provided they are services-based.


  In streaming replication, the keep-alive messages carry additional
  information, the timestamps and WAL locations, so a keepalive makes
  sense at that level. But otherwise, aren't we just trying to
  reimplement
  TCP keepalives? TCP keepalives are not perfect, but if we want to
 have
  an application level timeout, it should be implemented in the FE/BE
  protocol.
 
  I don't think we need to do anything specific to pg_basebackup. The
  user
  can simply specify TCP keepalive settings in the connection string,
  like
  with any libpq program.
 
  I think currently user has no way to specify TCP keepalive settings
 from
  pg_basebackup, please let me know if there is any such existing way?

 You can set it through environment variables. As was discussed
 elsewhere, it would be good to have the ability to do it natively to
 pg_basebackup as well.

 Sure, already modifying the existing patch to support connection string in
 pg_basebackup and pg_receivexlog.

Good.


  I think specifying TCP settings is very cumbersome for most users,
 that's
  the reason most standard interfaces (ODBC/JDBC) have such application
 level
  timeout mechanism.
 
  By implementing in FE/BE protocol (do you mean to say that make such
  non-blocking behavior inside Libpq or something else), it might be
 generic
  and can be used for others as well but it might need few interface
 changes.

 If it's specifying them that is cumbersome, then that's the part we
 should fix, rather than modifying the protocol, no?

 That can be done as part of point 2 of initial proposal
 (2. Support recv_timeout separately to provide a way to users who are not
 comfortable tcp keepalives).

Looking at the bigger picture, we should in that case support those on
*all* our frontend applications, and not just pg_basebackup. To me, it
makes more sense to just say use the connection string method to
connect when you need to set these parameters. There are always going
to be some parameters that require that.


 To achieve this there can be 2 ways.
 1. Change in FE/BE protocol - I am not sure exactly how this can be done,
 but as per Heikki this is better way of implementing it.
 2. Make the socket as non-blocking in pg_basebackup.

 Advantage of Approach-1 is that if we do in such a fashion that in lower
 layers (libpq) it is addressed then all other apps (pg_basebackup, etc) can
 use it, no need to handle separately in each application.

 So now as changes in Approach-1 seems to be invasive, we decided to do it
 later.

Ok - I haven't really been following the thread, but that doesn't seem
unreasonable. The thing I was objecting to is putting in special
parameters to 

Re: Review of pg_basebackup and pg_receivexlog to use non-blocking socket communication, was: Re: [HACKERS] Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

2013-01-21 Thread Magnus Hagander
On Fri, Jan 18, 2013 at 7:50 AM, Amit Kapila amit.kap...@huawei.com wrote:
 On Wednesday, January 16, 2013 4:02 PM Heikki Linnakangas wrote:
 On 07.01.2013 16:23, Boszormenyi Zoltan wrote:
  Since my other patch against pg_basebackup is now committed,
  this patch doesn't apply cleanly, patch rejects 2 hunks.
  The fixed up patch is attached.

 Now that I look at this a high-level perspective, why are we only
 worried about timeouts in the Copy-mode and when connecting? The
 initial
 checkpoint could take a long time too, and if the server turns into a
 black hole while the checkpoint is running, pg_basebackup will still
 hang. Then again, a short timeout on that phase would be a bad idea,
 because the checkpoint can indeed take a long time.

 True, but IMO, if somebody want to take basebackup, he should do that when
 the server is not loaded.

A lot of installations don't have such an optino, because there is no
time whe nthe server is not loaded.


 In streaming replication, the keep-alive messages carry additional
 information, the timestamps and WAL locations, so a keepalive makes
 sense at that level. But otherwise, aren't we just trying to
 reimplement
 TCP keepalives? TCP keepalives are not perfect, but if we want to have
 an application level timeout, it should be implemented in the FE/BE
 protocol.

 I don't think we need to do anything specific to pg_basebackup. The
 user
 can simply specify TCP keepalive settings in the connection string,
 like
 with any libpq program.

 I think currently user has no way to specify TCP keepalive settings from
 pg_basebackup, please let me know if there is any such existing way?

You can set it through environment variables. As was discussed
elsewhere, it would be good to have the ability to do it natively to
pg_basebackup as well.


 I think specifying TCP settings is very cumbersome for most users, that's
 the reason most standard interfaces (ODBC/JDBC) have such application level
 timeout mechanism.

 By implementing in FE/BE protocol (do you mean to say that make such
 non-blocking behavior inside Libpq or something else), it might be generic
 and can be used for others as well but it might need few interface changes.

If it's specifying them that is cumbersome, then that's the part we
should fix, rather than modifying the protocol, no?


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Review of pg_basebackup and pg_receivexlog to use non-blocking socket communication, was: Re: [HACKERS] Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

2013-01-21 Thread Amit Kapila
On Monday, January 21, 2013 6:22 PM Magnus Hagander
 On Fri, Jan 18, 2013 at 7:50 AM, Amit Kapila amit.kap...@huawei.com
 wrote:
  On Wednesday, January 16, 2013 4:02 PM Heikki Linnakangas wrote:
  On 07.01.2013 16:23, Boszormenyi Zoltan wrote:
   Since my other patch against pg_basebackup is now committed,
   this patch doesn't apply cleanly, patch rejects 2 hunks.
   The fixed up patch is attached.
 
  Now that I look at this a high-level perspective, why are we only
  worried about timeouts in the Copy-mode and when connecting? The
  initial
  checkpoint could take a long time too, and if the server turns into
 a
  black hole while the checkpoint is running, pg_basebackup will still
  hang. Then again, a short timeout on that phase would be a bad idea,
  because the checkpoint can indeed take a long time.
 
  True, but IMO, if somebody want to take basebackup, he should do that
 when
  the server is not loaded.
 
 A lot of installations don't have such an optino, because there is no
 time whe nthe server is not loaded.

Good to know about it. 
I have always heard that customer will run background maintenance activities
(Reindex, Vacuum Full, etc) when the server is less loaded. 
For example 
a. Billing applications in telecom, at night times they can be relatively
less loaded.
b. Any databases used for Sensex transactions, they will be relatively free
once the market is closed.
c. Banking solutions, because transactions are done mostly in day times.

There will be many cases where Database server will be loaded all the times,
if you can give some example, it will be a good learning for me.

  In streaming replication, the keep-alive messages carry additional
  information, the timestamps and WAL locations, so a keepalive makes
  sense at that level. But otherwise, aren't we just trying to
  reimplement
  TCP keepalives? TCP keepalives are not perfect, but if we want to
 have
  an application level timeout, it should be implemented in the FE/BE
  protocol.
 
  I don't think we need to do anything specific to pg_basebackup. The
  user
  can simply specify TCP keepalive settings in the connection string,
  like
  with any libpq program.
 
  I think currently user has no way to specify TCP keepalive settings
 from
  pg_basebackup, please let me know if there is any such existing way?
 
 You can set it through environment variables. As was discussed
 elsewhere, it would be good to have the ability to do it natively to
 pg_basebackup as well.

Sure, already modifying the existing patch to support connection string in
pg_basebackup and pg_receivexlog.

 
  I think specifying TCP settings is very cumbersome for most users,
 that's
  the reason most standard interfaces (ODBC/JDBC) have such application
 level
  timeout mechanism.
 
  By implementing in FE/BE protocol (do you mean to say that make such
  non-blocking behavior inside Libpq or something else), it might be
 generic
  and can be used for others as well but it might need few interface
 changes.
 
 If it's specifying them that is cumbersome, then that's the part we
 should fix, rather than modifying the protocol, no?

That can be done as part of point 2 of initial proposal
(2. Support recv_timeout separately to provide a way to users who are not
comfortable tcp keepalives).

To achieve this there can be 2 ways.
1. Change in FE/BE protocol - I am not sure exactly how this can be done,
but as per Heikki this is better way of implementing it.
2. Make the socket as non-blocking in pg_basebackup.

Advantage of Approach-1 is that if we do in such a fashion that in lower
layers (libpq) it is addressed then all other apps (pg_basebackup, etc) can
use it, no need to handle separately in each application.

So now as changes in Approach-1 seems to be invasive, we decided to do it
later. 

With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Review of pg_basebackup and pg_receivexlog to use non-blocking socket communication, was: Re: [HACKERS] Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

2013-01-17 Thread Amit Kapila
On Wednesday, January 16, 2013 4:02 PM Heikki Linnakangas wrote:
 On 07.01.2013 16:23, Boszormenyi Zoltan wrote:
  Since my other patch against pg_basebackup is now committed,
  this patch doesn't apply cleanly, patch rejects 2 hunks.
  The fixed up patch is attached.
 
 Now that I look at this a high-level perspective, why are we only
 worried about timeouts in the Copy-mode and when connecting? The
 initial
 checkpoint could take a long time too, and if the server turns into a
 black hole while the checkpoint is running, pg_basebackup will still
 hang. Then again, a short timeout on that phase would be a bad idea,
 because the checkpoint can indeed take a long time.

True, but IMO, if somebody want to take basebackup, he should do that when
the server is not loaded. 
 
 In streaming replication, the keep-alive messages carry additional
 information, the timestamps and WAL locations, so a keepalive makes
 sense at that level. But otherwise, aren't we just trying to
 reimplement
 TCP keepalives? TCP keepalives are not perfect, but if we want to have
 an application level timeout, it should be implemented in the FE/BE
 protocol.
 
 I don't think we need to do anything specific to pg_basebackup. The
 user
 can simply specify TCP keepalive settings in the connection string,
 like
 with any libpq program.

I think currently user has no way to specify TCP keepalive settings from
pg_basebackup, please let me know if there is any such existing way?

I think specifying TCP settings is very cumbersome for most users, that's
the reason most standard interfaces (ODBC/JDBC) have such application level
timeout mechanism.

By implementing in FE/BE protocol (do you mean to say that make such
non-blocking behavior inside Libpq or something else), it might be generic
and can be used for others as well but it might need few interface changes.

IMHO if by having such less impact changes for pg_basebackup, it makes
pg_basebackup network sensitive, the current approach can also be
considered.


With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Review of pg_basebackup and pg_receivexlog to use non-blocking socket communication, was: Re: [HACKERS] Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

2013-01-16 Thread Heikki Linnakangas

On 07.01.2013 16:23, Boszormenyi Zoltan wrote:

Since my other patch against pg_basebackup is now committed,
this patch doesn't apply cleanly, patch rejects 2 hunks.
The fixed up patch is attached.


Now that I look at this a high-level perspective, why are we only 
worried about timeouts in the Copy-mode and when connecting? The initial 
checkpoint could take a long time too, and if the server turns into a 
black hole while the checkpoint is running, pg_basebackup will still 
hang. Then again, a short timeout on that phase would be a bad idea, 
because the checkpoint can indeed take a long time.


In streaming replication, the keep-alive messages carry additional 
information, the timestamps and WAL locations, so a keepalive makes 
sense at that level. But otherwise, aren't we just trying to reimplement 
TCP keepalives? TCP keepalives are not perfect, but if we want to have 
an application level timeout, it should be implemented in the FE/BE 
protocol.


I don't think we need to do anything specific to pg_basebackup. The user 
can simply specify TCP keepalive settings in the connection string, like 
with any libpq program.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Review of pg_basebackup and pg_receivexlog to use non-blocking socket communication, was: Re: [HACKERS] Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

2013-01-08 Thread Hari Babu
On January 07, 2013 7:53 PM Boszormenyi Zoltan wrote:
Since my other patch against pg_basebackup is now committed,
this patch doesn't apply cleanly, patch rejects 2 hunks.
The fixed up patch is attached.

Patch is verified. Thanks for rebasing the patch. 

Regards,
Hari babu.




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Review of pg_basebackup and pg_receivexlog to use non-blocking socket communication, was: Re: [HACKERS] Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

2013-01-07 Thread Boszormenyi Zoltan

2013-01-04 13:43 keltezéssel, Hari Babu írta:

On January 02, 2013 12:41 PM Hari Babu wrote:

On January 01, 2013 10:19 PM Boszormenyi Zoltan wrote:

I am reviewing your patch.
• Is the patch in context diff format?
Yes.

Thanks for reviewing the patch.


• Does it apply cleanly to the current git master?
Not quite cleanly but it doesn't produce rejects or fuzz, only offset

warnings:

Will rebase the patch to head.


• Does it include reasonable tests, necessary doc patches, etc?
The test cases are not applicable. There is no test framework for
testing network outage in make check.

There are no documentation patches for the new --recvtimeout=INTERVAL
and --conntimeout=INTERVAL options for either pg_basebackup or
pg_receivexlog.

I will add the documentation for the same.


Per the previous comment, no. But those are for the backend
to notice network breakdowns and as such, they need a
separate patch.

I also think it is better to handle it as a separate patch for walsender.


• Are the comments sufficient and accurate?
This chunk below removes a comment which seems obvious enough
so it's not needed:
***
*** 518,524  ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos,

uint32 timeline,

 goto error;
 }
   
!   /* Check the message type. */

 if (copybuf[0] == 'k')
 {
 int pos;
--- 559,568 
 goto error;
 }
   
!   /* Set the last reply timestamp */

!   last_recv_timestamp = localGetCurrentTimestamp();
!   ping_sent = false;
!
 if (copybuf[0] == 'k')
 {
 int pos;
***

Other comments are sufficient and accurate.

I will fix and update the patch.

The attached V2 patch in the mail handles all the review comments identified
above.

Regards,
Hari babu.


Since my other patch against pg_basebackup is now committed,
this patch doesn't apply cleanly, patch rejects 2 hunks.
The fixed up patch is attached.

Best regards,
Zoltán Böszörményi

--
--
Zoltán Böszörményi
Cybertec Schönig  Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/

diff -dcrpN postgresql.orig/doc/src/sgml/ref/pg_basebackup.sgml postgresql/doc/src/sgml/ref/pg_basebackup.sgml
*** postgresql.orig/doc/src/sgml/ref/pg_basebackup.sgml	2013-01-05 17:34:30.742135371 +0100
--- postgresql/doc/src/sgml/ref/pg_basebackup.sgml	2013-01-07 15:11:40.787007890 +0100
*** PostgreSQL documentation
*** 400,405 
--- 400,425 
   /varlistentry
  
   varlistentry
+   termoption-r replaceable class=parameterinterval/replaceable/option/term
+   termoption--recvtimeout=replaceable class=parameterinterval/replaceable/option/term
+   listitem
+para
+ time that receiver waits for communication from server (in seconds).
+/para
+   /listitem
+  /varlistentry 
+ 
+  varlistentry
+   termoption-t replaceable class=parameterinterval/replaceable/option/term
+   termoption--conntimeout=replaceable class=parameterinterval/replaceable/option/term
+   listitem
+para
+ time that client wait for connection to establish with server (in seconds).
+/para
+   /listitem
+  /varlistentry  
+ 
+  varlistentry
termoption-U replaceableusername/replaceable/option/term
termoption--username=replaceable class=parameterusername/replaceable/option/term
listitem
diff -dcrpN postgresql.orig/doc/src/sgml/ref/pg_receivexlog.sgml postgresql/doc/src/sgml/ref/pg_receivexlog.sgml
*** postgresql.orig/doc/src/sgml/ref/pg_receivexlog.sgml	2012-11-08 13:13:04.152630639 +0100
--- postgresql/doc/src/sgml/ref/pg_receivexlog.sgml	2013-01-07 15:11:40.788007898 +0100
*** PostgreSQL documentation
*** 164,169 
--- 164,189 
   /varlistentry
  
   varlistentry
+   termoption-r replaceable class=parameterinterval/replaceable/option/term
+   termoption--recvtimeout=replaceable class=parameterinterval/replaceable/option/term
+   listitem
+para
+ time that receiver waits for communication from server (in seconds).
+/para
+   /listitem
+  /varlistentry 
+ 
+  varlistentry
+   termoption-t replaceable class=parameterinterval/replaceable/option/term
+   termoption--conntimeout=replaceable class=parameterinterval/replaceable/option/term
+   listitem
+para
+ time that client wait for connection to establish with server (in seconds).
+/para
+   /listitem
+  /varlistentry  
+  
+  varlistentry
termoption-U replaceableusername/replaceable/option/term
termoption--username=replaceable class=parameterusername/replaceable/option/term
 

Re: Review of pg_basebackup and pg_receivexlog to use non-blocking socket communication, was: Re: [HACKERS] Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

2013-01-04 Thread Hari Babu

On January 02, 2013 12:41 PM Hari Babu wrote:
On January 01, 2013 10:19 PM Boszormenyi Zoltan wrote:
I am reviewing your patch.
• Is the patch in context diff format? 
Yes.

Thanks for reviewing the patch.

• Does it apply cleanly to the current git master?
Not quite cleanly but it doesn't produce rejects or fuzz, only offset
warnings:

Will rebase the patch to head.

• Does it include reasonable tests, necessary doc patches, etc? 
The test cases are not applicable. There is no test framework for
testing network outage in make check.

There are no documentation patches for the new --recvtimeout=INTERVAL
and --conntimeout=INTERVAL options for either pg_basebackup or
pg_receivexlog.

I will add the documentation for the same.

Per the previous comment, no. But those are for the backend
to notice network breakdowns and as such, they need a
separate patch. 

I also think it is better to handle it as a separate patch for walsender.

• Are the comments sufficient and accurate?
This chunk below removes a comment which seems obvious enough
so it's not needed:
***
*** 518,524  ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos,
uint32 timeline,
    goto error;
    }
  
!   /* Check the message type. */
    if (copybuf[0] == 'k')
    {
    int pos;
--- 559,568 
    goto error;
    }
  
!   /* Set the last reply timestamp */
!   last_recv_timestamp = localGetCurrentTimestamp();
!   ping_sent = false;
!   
    if (copybuf[0] == 'k')
    {
    int pos;
***

Other comments are sufficient and accurate.

I will fix and update the patch.

The attached V2 patch in the mail handles all the review comments identified
above.

Regards,
Hari babu.



pg_basebkup_recvxlog_noblock_comm_v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Review of pg_basebackup and pg_receivexlog to use non-blocking socket communication, was: Re: [HACKERS] Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

2013-01-01 Thread Boszormenyi Zoltan

Hi,

2012-11-15 14:59 keltezéssel, Amit kapila írta:

On Monday, November 12, 2012 8:23 PM Fujii Masao wrote:
On Fri, Nov 9, 2012 at 3:03 PM, Amit Kapila amit.kap...@huawei.com wrote:

On Thursday, November 08, 2012 10:42 PM Fujii Masao wrote:

On Thu, Nov 8, 2012 at 5:53 PM, Amit Kapila amit.kap...@huawei.com
wrote:

On Thursday, November 08, 2012 2:04 PM Heikki Linnakangas wrote:

On 19.10.2012 14:42, Amit kapila wrote:

On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:

Are you planning to introduce the timeout mechanism in pg_basebackup
main process? Or background process? It's useful to implement both.

By background process, you mean ReceiveXlogStream?
For both.
I think for background process, it can be done in a way similar to what we
have done for walreceiver.

Yes.

But I have some doubts for how to do for main process:
Logic similar to walreceiver can not be used incase network goes down during
getting other database file from server.
The reason for the same is to receive the data files PQgetCopyData() is
called in synchronous mode, so it keeps waiting for infinite time till it
gets some data.
In order to solve this issue, I can think of following options:
1. Making this call also asynchronous (but now sure about impact of this).

+1
Walreceiver already calls PQgetCopyData() asynchronously. ISTM you can
solve the issue in the similar way to walreceiver's.

2. In function pqWait, instead of passing hard-code value -1 (i.e. infinite
wait), we can send some finite time. This time can be received as command
line argument
 from respective utility and set the same in PGconn structure.

Yes, I think that we should add something like --conninfo option to
pg_basebackup
and pg_receivexlog. We can easily set not only connect_timeout but also sslmode,
application_name, ... by using such option accepting conninfo string.

I have prepared an attached patch to make pg_basebackup and pg_receivexlog as 
non-blocking.
To do so I have to add new command line parameters in pg_basebackup and 
pg_receivexlog
for now added two more command line arguments
 a.  -r  for pg_basebackup and pg_receivexlog to take receive 
time-out value. Default value for this parameter is 60 sec.
 b. -t   for pg_basebackup and pg_receivexlog to take initial 
connection timeout value. Default value is infinite wait.
We can change to accept --conninfo as well.

I feel apart from above, remaining problem is for function call PQgetResult()
1. Wherever query is getting sent from BaseBackup, it calls the function 
PQgetResult to receive the result of query.
 As PQgetResult() is blocking function (it calls pqWait which can hang), so 
if network is down before sending the query itself,
 then there will not be any result, so it will keep hanging in PQgetResult .
IMO, it can be solved in below ways:
a. Create one corresponding non-blocking function. But this function is being 
called from inside some of the
  other libpq function (PQexec-PQexecFinish-PQgetResult). So it can be 
little tricky to solve this way.
b. Add the receive_timeout variable in PGconn structure and use it in pqWait 
for timeout whenever it is set.
c. any other better way?



BTW, IIRC the walsender has no timeout mechanism during sending
backup data to pg_basebackup. So it's also useful to implement the
timeout mechanism for the walsender during backup.

What about using pq_putmessage_noblock()?

I think may be some more functions also needs to be made as noblock. I am still 
evaluating.

I will upload the attached patch in commitfest if you don't have any objections?

More Suggestions/Comments?

With Regards,
Amit Kapila.


I am reviewing your patch.

 * Is the patch in context diff format 
http://en.wikipedia.org/wiki/Diff#Context_format?


Yes.

 * Does it apply cleanly to the current git master?


Not quite cleanly but it doesn't produce rejects or fuzz, only offset warnings:

[zozo@localhost postgresql]$ cat ../noblock_basebackup_and_receivexlog.patch | 
patch -p1
patching file src/bin/pg_basebackup/pg_basebackup.c
Hunk #1 succeeded at 41 (offset -6 lines).
Hunk #2 succeeded at 123 (offset -6 lines).
Hunk #3 succeeded at 239 (offset -6 lines).
Hunk #4 succeeded at 292 (offset -6 lines).
Hunk #5 succeeded at 470 (offset -6 lines).
Hunk #6 succeeded at 588 (offset -6 lines).
Hunk #7 succeeded at 601 (offset -6 lines).
Hunk #8 succeeded at 727 (offset -6 lines).
Hunk #9 succeeded at 779 (offset -6 lines).
Hunk #10 succeeded at 797 (offset -6 lines).
Hunk #11 succeeded at 811 (offset -6 lines).
Hunk #12 succeeded at 879 (offset -6 lines).
Hunk #13 succeeded at 1080 (offset -6 lines).
Hunk #14 succeeded at 1381 (offset -6 lines).
Hunk #15 succeeded at 1409 (offset -6 lines).
Hunk #16 succeeded at 1521 (offset -6 lines).
patching file src/bin/pg_basebackup/pg_receivexlog.c
Hunk #1 succeeded at 35 (offset -6 lines).
Hunk #2 succeeded at 65 (offset -6 lines).
Hunk #3 succeeded at 224 (offset -6 lines).
Hunk #4 succeeded at 281 (offset -6 

Re: Review of pg_basebackup and pg_receivexlog to use non-blocking socket communication, was: Re: [HACKERS] Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

2013-01-01 Thread Hari Babu
On January 01, 2013 10:19 PM Boszormenyi Zoltan wrote:
I am reviewing your patch.
• Is the patch in context diff format? 
Yes.

Thanks for reviewing the patch.

• Does it apply cleanly to the current git master?
Not quite cleanly but it doesn't produce rejects or fuzz, only offset
warnings:

Will rebase the patch to head.

• Does it include reasonable tests, necessary doc patches, etc? 
The test cases are not applicable. There is no test framework for
testing network outage in make check.

There are no documentation patches for the new --recvtimeout=INTERVAL
and --conntimeout=INTERVAL options for either pg_basebackup or
pg_receivexlog.

I will add the documentation for the same.


Per the previous comment, no. But those are for the backend
to notice network breakdowns and as such, they need a
separate patch. 

I also think it is better to handle it as a separate patch for walsender.

• Are the comments sufficient and accurate?
This chunk below removes a comment which seems obvious enough
so it's not needed:
***
*** 518,524  ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos,
uint32 timeline,
    goto error;
    }
  
!   /* Check the message type. */
    if (copybuf[0] == 'k')
    {
    int pos;
--- 559,568 
    goto error;
    }
  
!   /* Set the last reply timestamp */
!   last_recv_timestamp = localGetCurrentTimestamp();
!   ping_sent = false;
!   
    if (copybuf[0] == 'k')
    {
    int pos;
***

Other comments are sufficient and accurate.

I will fix and update the patch.

Please let me know if anything apart from above needs to be taken care.

Regards,
Hari babu.




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers