Re: [HACKERS] Streaming base backups

2011-01-17 Thread Dimitri Fontaine
Magnus Hagander mag...@hagander.net writes:
 With pg_basebackup, you can set up streaming replication in what's
 basically a single command (run the base backup, copy in a
 recovery.conf file). In my first version I even had a switch that
 would create the recovery.conf file for you - should we bring that
 back?

+1.  Well, make it optional maybe?

 It does require you to set a reasonable wal_keep_segments,
 but that's really all you need to do on the master side.

Until we get integrated WAL streaming while the base backup is ongoing.
We don't know when that is (9.1 or future), but that's what we're aiming
at now, right?

 What Fujii-san unsuccessfully proposed was to have the master restore
 segments from the archive and stream them to clients, on request.  It
 was deemed better to have the slave obtain them from the archive
 directly.

 Did Fujii-san agree on the conclusion?

 I can see the point of the master being able to do this, but it
 seems like a pretty narrow use case, really. I think we invented
 wal_keep_segments partially to solve this problem in a neater way?

Well I still think that the easiest setup we can offer here is to ship
with integrated libpq-based archive and restore commands.  Those could
be bin/pg_walsender and bin/pg_walreceiver.  They would have some
switches to make them suitable for running as subprocesses of either the
base backup utility or the default libpq-based archive daemon.
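
For illustration, the wiring might then look something like this - the
two binaries are the proposal above and the switch names are invented
here, so purely hypothetical:

    # master: archive over libpq instead of cp/scp (hypothetical switches)
    archive_command = 'pg_walsender --archive %p %f'

    # standby: restore over libpq from the archive daemon (hypothetical)
    restore_command = 'pg_walreceiver --restore %f %p'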

Again, all of that is not necessarily material for 9.1, despite all the
pieces being already coded and tested, mainly in Magnus's hands.  But
could we get agreement about going this route?

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-17 Thread Magnus Hagander
On Mon, Jan 17, 2011 at 11:18, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:
 Magnus Hagander mag...@hagander.net writes:
 With pg_basebackup, you can set up streaming replication in what's
 basically a single command (run the base backup, copy in a
 recovery.conf file). In my first version I even had a switch that
 would create the recovery.conf file for you - should we bring that
 back?

 +1.  Well, make it optional maybe?

It has always been optional. Basically it just creates a recovery.conf
file with:
    primary_conninfo = '<whatever pg_streamrecv was using>'
    standby_mode = 'on'
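
For example, a generated file might end up looking like this (host and
user invented for illustration):

    primary_conninfo = 'host=master.example.com port=5432 user=replication'
    standby_mode = 'on'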


 It does require you to set a reasonable wal_keep_segments,
 but that's really all you need to do on the master side.

 Until we get integrated WAL streaming while the base backup is ongoing.
 We don't know when that is (9.1 or future), but that's what we're aiming
 at now, right?

Yeah, it does sound like a plan. But we should still allow both - streaming
it in parallel will eat two connections, and I'm sure some people
might consider that a higher cost.


 What Fujii-san unsuccessfully proposed was to have the master restore
 segments from the archive and stream them to clients, on request.  It
 was deemed better to have the slave obtain them from the archive
 directly.

 Did Fujii-san agree on the conclusion?

 I can see the point of the master being able to do this, but it
 seems like a pretty narrow use case, really. I think we invented
 wal_keep_segments partially to solve this problem in a neater way?

 Well I still think that the easiest setup we can offer here is to ship
 with integrated libpq-based archive and restore commands.  Those could
 be bin/pg_walsender and bin/pg_walreceiver.  They would have some
 switches to make them suitable for running as subprocesses of either the
 base backup utility or the default libpq-based archive daemon.

Not sure why they'd run as an archive command and not like now as a
replication client - but let's keep that out of this thread and in a
new one :)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-17 Thread Dimitri Fontaine
Magnus Hagander mag...@hagander.net writes:
 Until we get integrated WAL streaming while the base backup is ongoing.
 We don't know when that is (9.1 or future), but that's what we're aiming
 at now, right?

 Yeah, it does sound like a plan. But we should still allow both - streaming
 it in parallel will eat two connections, and I'm sure some people
 might consider that a higher cost.

Sure.  Ah, tradeoffs :)

 Well I still think that the easiest setup we can offer here is to ship
 with integrated libpq-based archive and restore commands.  Those could
 be bin/pg_walsender and bin/pg_walreceiver.  They would have some
 switches to make them suitable for running as subprocesses of either the
 base backup utility or the default libpq-based archive daemon.

 Not sure why they'd run as an archive command and not like now as a
 replication client - but let's keep that out of this thread and in a
 new one :)

On the archive side you're right that it's not necessary, but it would
be needed to cater for the restore side.  Sure enough, thinking about it
some more, what we would like here is for the standby to be able to talk
to the archive server (pg_streamsendrecv) rather than the primary, in
order to offload it.  Ok, scratch all that and get cascading support
instead :)

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-16 Thread Robert Haas
On Sat, Jan 15, 2011 at 8:33 PM, Tatsuo Ishii is...@postgresql.org wrote:
 When does the standby launch its walreceiver? It would be extra-nice for
 the base backup tool to optionally continue streaming WAL until the
 standby starts doing it itself, so that wal_keep_segments is really
 deprecated.  No idea how feasible that is, though.

 Good point. I have always been wondering why we can't use the existing
 WAL transport infrastructure for sending/receiving WAL archive
 segments in streaming replication.
 If my memory serves, Fujii has already proposed such an idea but it was
 rejected for some reason I don't understand.

I must be confused, because you can use archive_command/restore_command
to transport WAL segments, in conjunction with streaming replication.
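
For concreteness, a minimal sketch of that combination - paths and host
invented for illustration:

    # master postgresql.conf
    archive_mode = on
    archive_command = 'cp %p /archive/%f'

    # standby recovery.conf
    standby_mode = 'on'
    primary_conninfo = 'host=master port=5432'
    restore_command = 'cp /archive/%f %p'

The standby alternates between restoring from the archive and streaming
as needed.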

What Fujii-san unsuccessfully proposed was to have the master restore
segments from the archive and stream them to clients, on request.  It
was deemed better to have the slave obtain them from the archive
directly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-16 Thread Tatsuo Ishii
 Good point. I have always been wondering why we can't use the existing
 WAL transport infrastructure for sending/receiving WAL archive
 segments in streaming replication.
 If my memory serves, Fujii has already proposed such an idea but it was
 rejected for some reason I don't understand.
 
 I must be confused, because you can use archive_command/restore_command
 to transport WAL segments, in conjunction with streaming replication.

Yes, but using restore_command is not terribly convenient. On
Linux/UNIX systems you have to enable ssh access, and that is extremely
hard to do on Windows.
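
Typically that means something like the following in recovery.conf, with
host and path invented for illustration:

    restore_command = 'scp master:/archive/%f %p'

which presumes working ssh keys between the machines - exactly the part
that is painful on Windows.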

IMO streaming replication is not yet easy enough for ordinary users to
set up. Making the base backup easier has already been proposed, and I
think that's good. Why don't we go a step beyond that?

 What Fujii-san unsuccessfully proposed was to have the master restore
 segments from the archive and stream them to clients, on request.  It
 was deemed better to have the slave obtain them from the archive
 directly.

Did Fujii-san agree on the conclusion?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-16 Thread Fujii Masao
On Mon, Jan 17, 2011 at 11:32 AM, Tatsuo Ishii is...@postgresql.org wrote:
 Good point. I have always been wondering why we can't use the existing
 WAL transport infrastructure for sending/receiving WAL archive
 segments in streaming replication.
 If my memory serves, Fujii has already proposed such an idea but it was
 rejected for some reason I don't understand.

 I must be confused, because you can use archive_command/restore_command
 to transport WAL segments, in conjunction with streaming replication.

 Yes, but using restore_command is not terribly convenient. On
 Linux/UNIX systems you have to enable ssh access, and that is extremely
 hard to do on Windows.

Agreed.

 IMO streaming replication is not yet easy enough for ordinary users to
 set up. Making the base backup easier has already been proposed, and I
 think that's good. Why don't we go a step beyond that?

 What Fujii-san unsuccessfully proposed was to have the master restore
 segments from the archive and stream them to clients, on request.  It
 was deemed better to have the slave obtain them from the archive
 directly.

 Did Fujii-san agree on the conclusion?

No. If that conclusion were true, we would not need a streaming backup feature.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-16 Thread Magnus Hagander
On Mon, Jan 17, 2011 at 03:32, Tatsuo Ishii is...@postgresql.org wrote:
 Good point. I have always been wondering why we can't use the existing
 WAL transport infrastructure for sending/receiving WAL archive
 segments in streaming replication.
 If my memory serves, Fujii has already proposed such an idea but it was
 rejected for some reason I don't understand.

 I must be confused, because you can use archive_command/restore_command
 to transport WAL segments, in conjunction with streaming replication.

 Yes, but using restore_command is not terribly convenient. On
 Linux/UNIX systems you have to enable ssh access, and that is extremely
 hard to do on Windows.

Agreed.


 IMO streaming replication is not yet easy enough for ordinary users to
 set up. Making the base backup easier has already been proposed, and I
 think that's good. Why don't we go a step beyond that?

With pg_basebackup, you can set up streaming replication in what's
basically a single command (run the base backup, copy in a
recovery.conf file). In my first version I even had a switch that
would create the recovery.conf file for you - should we bring that
back?

It does require you to set a reasonable wal_keep_segments,
but that's really all you need to do on the master side.
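
For concreteness, a sketch of the whole procedure - host, user and
directory names are made up for illustration, and option spellings may
differ:

    # master postgresql.conf
    wal_keep_segments = 128

    $ pg_basebackup -D /var/lib/postgresql/standby -h master -U replication
    $ cat > /var/lib/postgresql/standby/recovery.conf <<EOF
    standby_mode = 'on'
    primary_conninfo = 'host=master user=replication'
    EOF
    $ pg_ctl -D /var/lib/postgresql/standby start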


 What Fujii-san unsuccessfully proposed was to have the master restore
 segments from the archive and stream them to clients, on request.  It
 was deemed better to have the slave obtain them from the archive
 directly.

 Did Fujii-san agree on the conclusion?

I can see the point of the master being able to do this, but it
seems like a pretty narrow use case, really. I think we invented
wal_keep_segments partially to solve this problem in a neater way?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-15 Thread Heikki Linnakangas

On 14.01.2011 13:38, Magnus Hagander wrote:

On Fri, Jan 14, 2011 at 11:19, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

On 14.01.2011 08:45, Fujii Masao wrote:

1. Smart shutdown is requested while walsender is sending a backup.
2. Shutdown causes archiver to end.
   (Though shutdown sends SIGUSR2 to walsender to exit, a walsender
   running a backup doesn't respond to it for now.)
3. At the end of the backup, walsender calls do_pg_stop_backup, which
   forces a switch to a new WAL file and waits until the last WAL file
   has been archived.
   *BUT*, since the archiver is already dead, walsender waits for
   that forever.


Not only does it wait forever, but it writes the end-of-backup WAL record
after bgwriter has already exited and written the shutdown checkpoint
record.

I think postmaster should treat a walsender as a regular backend, until it
has started streaming.

We can achieve that by starting up the child as PM_CHILD_ACTIVE, and
changing the state to PM_CHILD_WALSENDER later, when streaming is started.
Looking at postmaster.c, that should be safe: postmaster will treat a
backend as a regular backend anyway until it has connected to shared memory.
It is *not* safe to switch a walsender back to a regular process, but we
have no need to do that.


Seems reasonable to me.

I've applied a patch that exits base backups when the postmaster is
shutting down - I'm happily waiting for Heikki to submit one that
changes the shutdown logic in the postmaster :-)


Ok, committed a fix for that.

BTW, I just spotted a small race condition between creating a new
tablespace and base backup. We take a snapshot of all the tablespaces in
pg_tblspc before calling pg_start_backup(). If someone creates a new
tablespace and puts some data in it in the window between the base backup
acquiring the list of tablespaces and starting the backup, the new
tablespace won't be included in the backup.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-15 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 BTW, I just spotted a small race condition between creating a new
 tablespace and base backup. We take a snapshot of all the tablespaces in
 pg_tblspc before calling pg_start_backup(). If someone creates a new
 tablespace and puts some data in it in the window between the base backup
 acquiring the list of tablespaces and starting the backup, the new
 tablespace won't be included in the backup.

So what?  The needed actions will be covered by WAL replay.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-15 Thread Heikki Linnakangas

On 15.01.2011 17:30, Tom Lane wrote:

Heikki Linnakangasheikki.linnakan...@enterprisedb.com  writes:

BTW, I just spotted a small race condition between creating a new
tablespace and base backup. We take a snapshot of all the tablespaces in
pg_tblspc before calling pg_start_backup(). If someone creates a new
tablespace and puts some data in it in the window between the base backup
acquiring the list of tablespaces and starting the backup, the new
tablespace won't be included in the backup.


So what?  The needed actions will be covered by WAL replay.


No, they won't, if pg_start_backup() is called *after* getting the list
of tablespaces.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-15 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 15.01.2011 17:30, Tom Lane wrote:
 So what?  The needed actions will be covered by WAL replay.

 No, they won't, if pg_start_backup() is called *after* getting the list
 of tablespaces.

Ah.  Then the fix is to change the order in which those things are done.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-15 Thread Magnus Hagander
On Sat, Jan 15, 2011 at 16:54, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 15.01.2011 17:30, Tom Lane wrote:
 So what?  The needed actions will be covered by WAL replay.

 No, they won't, if pg_start_backup() is called *after* getting the list
 of tablespaces.

 Ah.  Then the fix is to change the order in which those things are done.

Grumble. It used to be that way. For some reason I can't recall, I broke it.

Something like this to fix it? Or is this going to bring back those
warnings from stupid versions of gcc?


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 1ed5e2a..b4d5bbe 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -40,7 +40,7 @@ static void send_int8_string(StringInfoData *buf, int64 intval);
 static void SendBackupHeader(List *tablespaces);
 static void SendBackupDirectory(char *location, char *spcoid);
 static void base_backup_cleanup(int code, Datum arg);
-static void perform_base_backup(const char *backup_label, List *tablespaces);
+static void perform_base_backup(const char *backup_label, bool progress, DIR *tblspcdir);
 
 typedef struct
 {
@@ -67,13 +67,50 @@ base_backup_cleanup(int code, Datum arg)
  * clobbered by longjmp from stupider versions of gcc.
  */
 static void
-perform_base_backup(const char *backup_label, List *tablespaces)
+perform_base_backup(const char *backup_label, bool progress, DIR *tblspcdir)
 {
 	do_pg_start_backup(backup_label, true);
 
 	PG_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
 	{
+		List	   *tablespaces = NIL;
 		ListCell   *lc;
+		struct dirent *de;
+		tablespaceinfo *ti;
+
+		/* Add a node for the base directory */
+		ti = palloc0(sizeof(tablespaceinfo));
+		ti->size = progress ? sendDir(".", 1, true) : -1;
+		tablespaces = lappend(tablespaces, ti);
+
+		/* Collect information about all tablespaces */
+		while ((de = ReadDir(tblspcdir, "pg_tblspc")) != NULL)
+		{
+			char		fullpath[MAXPGPATH];
+			char		linkpath[MAXPGPATH];
+
+			/* Skip special stuff */
+			if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+				continue;
+
+			snprintf(fullpath, sizeof(fullpath), "pg_tblspc/%s", de->d_name);
+
+			MemSet(linkpath, 0, sizeof(linkpath));
+			if (readlink(fullpath, linkpath, sizeof(linkpath) - 1) == -1)
+			{
+				ereport(WARNING,
+						(errmsg("unable to read symbolic link %s: %m", fullpath)));
+				continue;
+			}
+
+			ti = palloc(sizeof(tablespaceinfo));
+			ti->oid = pstrdup(de->d_name);
+			ti->path = pstrdup(linkpath);
+			ti->size = progress ? sendDir(linkpath, strlen(linkpath), true) : -1;
+			tablespaces = lappend(tablespaces, ti);
+		}
+
 
 		/* Send tablespace header */
 		SendBackupHeader(tablespaces);
@@ -101,9 +138,6 @@ void
 SendBaseBackup(const char *backup_label, bool progress)
 {
 	DIR		   *dir;
-	struct dirent *de;
-	List	   *tablespaces = NIL;
-	tablespaceinfo *ti;
 	MemoryContext backup_context;
 	MemoryContext old_context;
 
@@ -134,41 +168,10 @@ SendBaseBackup(const char *backup_label, bool progress)
 		ereport(ERROR,
 				(errmsg("unable to open directory pg_tblspc: %m")));
 
-	/* Add a node for the base directory */
-	ti = palloc0(sizeof(tablespaceinfo));
-	ti->size = progress ? sendDir(".", 1, true) : -1;
-	tablespaces = lappend(tablespaces, ti);
-
-	/* Collect information about all tablespaces */
-	while ((de = ReadDir(dir, "pg_tblspc")) != NULL)
-	{
-		char		fullpath[MAXPGPATH];
-		char		linkpath[MAXPGPATH];
-
-		/* Skip special stuff */
-		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
-			continue;
-
-		snprintf(fullpath, sizeof(fullpath), "pg_tblspc/%s", de->d_name);
-
-		MemSet(linkpath, 0, sizeof(linkpath));
-		if (readlink(fullpath, linkpath, sizeof(linkpath) - 1) == -1)
-		{
-			ereport(WARNING,
-					(errmsg("unable to read symbolic link %s: %m", fullpath)));
-			continue;
-		}
+	perform_base_backup(backup_label, progress, dir);
 
-		ti = palloc(sizeof(tablespaceinfo));
-		ti->oid = pstrdup(de->d_name);
-		ti->path = pstrdup(linkpath);
-		ti->size = progress ? sendDir(linkpath, strlen(linkpath), true) : -1;
-		tablespaces = lappend(tablespaces, ti);
-	}
 	FreeDir(dir);
 
-	perform_base_backup(backup_label, tablespaces);
-
 	MemoryContextSwitchTo(old_context);
 	MemoryContextDelete(backup_context);
 }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-15 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 Something like this to fix it? Or is this going to bring back those
 warnings from stupid versions of gcc?

Possibly.  If so, I'll fix it --- I have an old gcc to test against
here.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-15 Thread Magnus Hagander
On Sat, Jan 15, 2011 at 19:27, Tom Lane t...@sss.pgh.pa.us wrote:
 Magnus Hagander mag...@hagander.net writes:
 Something like this to fix it? Or is this going to bring back those
 warnings from stupid versions of gcc?

 Possibly.  If so, I'll fix it --- I have an old gcc to test against
 here.

Ok, thanks, I'll commit this one then.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-15 Thread Tatsuo Ishii
 When does the standby launch its walreceiver? It would be extra-nice for
 the base backup tool to optionally continue streaming WAL until the
 standby starts doing it itself, so that wal_keep_segments is really
 deprecated.  No idea how feasible that is, though.

Good point. I have always been wondering why we can't use the existing
WAL transport infrastructure for sending/receiving WAL archive
segments in streaming replication.
If my memory serves, Fujii has already proposed such an idea but it was
rejected for some reason I don't understand.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-14 Thread Heikki Linnakangas

On 14.01.2011 08:45, Fujii Masao wrote:

On Fri, Jan 14, 2011 at 4:13 AM, Magnus Hagandermag...@hagander.net  wrote:

At the end of the backup by walsender, it forces a switch to a new
WAL file and waits until the last WAL file has been archived. So we
should change postmaster so that it doesn't cause the archiver to
end before walsender ends when shutdown is requested?


Um. I have to admit I'm not entirely following what you mean enough to
confirm it, but it *sounds* correct :-)

What scenario exactly is the problematic one?


1. Smart shutdown is requested while walsender is sending a backup.
2. Shutdown causes archiver to end.
   (Though shutdown sends SIGUSR2 to walsender to exit, a walsender
   running a backup doesn't respond to it for now.)
3. At the end of the backup, walsender calls do_pg_stop_backup, which
   forces a switch to a new WAL file and waits until the last WAL file
   has been archived.
   *BUT*, since the archiver is already dead, walsender waits for
   that forever.


Not only does it wait forever, but it writes the end-of-backup WAL 
record after bgwriter has already exited and written the shutdown 
checkpoint record.


I think postmaster should treat a walsender as a regular backend, until 
it has started streaming.


We can achieve that by starting up the child as PM_CHILD_ACTIVE, and
changing the state to PM_CHILD_WALSENDER later, when streaming is
started. Looking at postmaster.c, that should be safe: postmaster
will treat a backend as a regular backend anyway until it has connected
to shared memory. It is *not* safe to switch a walsender back to a
regular process, but we have no need to do that.
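
Something along these lines, as a sketch modeled on pmsignal.c (the
details of what actually gets committed may differ):

/*
 * Flip this walsender's postmaster-child slot from PM_CHILD_ACTIVE to
 * PM_CHILD_WALSENDER once streaming has started, so the postmaster
 * stops treating it as a regular backend.  Safe in this direction only.
 */
void
MarkPostmasterChildWalSender(void)
{
	int			slot = MyPMChildSlot;

	Assert(slot > 0 && slot <= PMSignalState->num_child_flags);
	slot--;
	Assert(PMSignalState->PMChildFlags[slot] == PM_CHILD_ACTIVE);
	PMSignalState->PMChildFlags[slot] = PM_CHILD_WALSENDER;
}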


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-14 Thread Magnus Hagander
On Fri, Jan 14, 2011 at 11:19, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 14.01.2011 08:45, Fujii Masao wrote:

 On Fri, Jan 14, 2011 at 4:13 AM, Magnus Hagandermag...@hagander.net
  wrote:

 At the end of the backup by walsender, it forces a switch to a new
 WAL file and waits until the last WAL file has been archived. So we
 should change postmaster so that it doesn't cause the archiver to
 end before walsender ends when shutdown is requested?

 Um. I have to admit I'm not entirely following what you mean enough to
 confirm it, but it *sounds* correct :-)

 What scenario exactly is the problematic one?

 1. Smart shutdown is requested while walsender is sending a backup.
 2. Shutdown causes archiver to end.
    (Though shutdown sends SIGUSR2 to walsender to exit, a walsender
    running a backup doesn't respond to it for now.)
 3. At the end of the backup, walsender calls do_pg_stop_backup, which
    forces a switch to a new WAL file and waits until the last WAL file
    has been archived.
    *BUT*, since the archiver is already dead, walsender waits for
    that forever.

 Not only does it wait forever, but it writes the end-of-backup WAL record
 after bgwriter has already exited and written the shutdown checkpoint
 record.

 I think postmaster should treat a walsender as a regular backend, until it
 has started streaming.

 We can achieve that by starting up the child as PM_CHILD_ACTIVE, and
 changing the state to PM_CHILD_WALSENDER later, when streaming is started.
 Looking at postmaster.c, that should be safe: postmaster will treat a
 backend as a regular backend anyway until it has connected to shared memory.
 It is *not* safe to switch a walsender back to a regular process, but we
 have no need to do that.

Seems reasonable to me.

I've applied a patch that exits base backups when the postmaster is
shutting down - I'm happily waiting for Heikki to submit one that
changes the shutdown logic in the postmaster :-)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-13 Thread Magnus Hagander
On Wed, Jan 12, 2011 at 10:39, Fujii Masao masao.fu...@gmail.com wrote:
 On Mon, Jan 10, 2011 at 11:09 PM, Magnus Hagander mag...@hagander.net wrote:
 I've committed the backend side of this, without that. Still working
 on the client, and on cleaning up Heikki's patch for grammar/parser
 support.

 Great work!

 I have some comments:

 While walsender is sending a base backup, WalSndWakeup should
 not send the signal to that walsender?

True, it's not necessary. How bad does it actually hurt things though?
Given that the walsender running the backup isn't actually waiting on
the latch, it doesn't actually send a signal, does it?


 In sendFile or elsewhere, we should periodically check whether
 postmaster is alive and whether the flag was set by the signal?

That, however, we probably should.


 At the end of the backup by walsender, it forces a switch to a new
 WAL file and waits until the last WAL file has been archived. So we
 should change postmaster so that it doesn't cause the archiver to
 end before walsender ends when shutdown is requested?

Um. I have to admit I'm not entirely following what you mean enough to
confirm it, but it *sounds* correct :-)

What scenario exactly is the problematic one?


 Also, when shutdown is requested, the walsender which is
 streaming WAL should not end before another walsender which
 is sending a backup ends, to stream the backup-end WAL?

Not sure I see the reason for that. If we're shutting down in the
middle of the base backup, we don't have any support for continuing
that one after we're back up - you have to start over.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-13 Thread Fujii Masao
On Fri, Jan 14, 2011 at 4:13 AM, Magnus Hagander mag...@hagander.net wrote:
 While walsender is sending a base backup, WalSndWakeup should
 not send the signal to that walsender?

 True, it's not necessary. How bad does it actually hurt things though?
 Given that the walsender running the backup isn't actually waiting on
 the latch, it doesn't actually send a signal, does it?

Yeah, you are right. Once WalSndWakeup sends the signal to a walsender,
latch->is_set is set, so WalSndWakeup does nothing more to that
walsender until latch->is_set is reset. Since ResetLatch is not called
while walsender is sending a base backup, that would be harmless.
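
A simplified sketch of the latch semantics involved, paraphrasing
src/backend/storage/ipc/latch.c rather than quoting it:

static void
set_latch_sketch(volatile Latch *latch)
{
	/* Already set: second and later wakeups are no-ops, no signal sent. */
	if (latch->is_set)
		return;

	latch->is_set = true;
	/* ... otherwise send SIGUSR1 to the process owning the latch ... */
}

So repeated WalSndWakeup calls cost nothing while is_set remains set.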

 At the end of the backup by walsender, it forces a switch to a new
 WAL file and waits until the last WAL file has been archived. So we
 should change postmaster so that it doesn't cause the archiver to
 end before walsender ends when shutdown is requested?

 Um. I have to admit I'm not entirely following what you mean enough to
 confirm it, but it *sounds* correct :-)

 What scenario exactly is the problematic one?

1. Smart shutdown is requested while walsender is sending a backup.
2. Shutdown causes archiver to end.
   (Though shutdown sends SIGUSR2 to walsender to exit, a walsender
   running a backup doesn't respond to it for now.)
3. At the end of the backup, walsender calls do_pg_stop_backup, which
   forces a switch to a new WAL file and waits until the last WAL file
   has been archived.
   *BUT*, since the archiver is already dead, walsender waits for
   that forever.

 Also, when shutdown is requested, the walsender which is
 streaming WAL should not end before another walsender which
 is sending a backup ends, to stream the backup-end WAL?

 Not sure I see the reason for that. If we're shutting down in the
 middle of the base backup, we don't have any support for continuing
 that one after we're back up - you have to start over.

For now, shutdown is designed to cause walsender to end only after
sending all the WAL records. That's why I suggested that.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-12 Thread Fujii Masao
On Mon, Jan 10, 2011 at 11:09 PM, Magnus Hagander mag...@hagander.net wrote:
 I've committed the backend side of this, without that. Still working
 on the client, and on cleaning up Heikki's patch for grammar/parser
 support.

Great work!

I have some comments:

While walsender is sending a base backup, WalSndWakeup should
not send the signal to that walsender?

In sendFile or elsewhere, we should periodically check whether
postmaster is alive and whether the flag was set by the signal?

At the end of the backup by walsender, it forces a switch to a new
WAL file and waits until the last WAL file has been archived. So we
should change postmaster so that it doesn't cause the archiver to
end before walsender ends when shutdown is requested?

Also, when shutdown is requested, the walsender which is
streaming WAL should not end before another walsender which
is sending a backup ends, to stream the backup-end WAL?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-11 Thread Magnus Hagander
On Tue, Jan 11, 2011 at 01:28, Cédric Villemain
cedric.villemain.deb...@gmail.com wrote:
 2011/1/10 Magnus Hagander mag...@hagander.net:
 On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
 cedric.villemain.deb...@gmail.com wrote:
 I've committed the backend side of this, without that. Still working
 on the client, and on cleaning up Heikki's patch for grammar/parser
 support.

 attached is a small patch fixing -d basedir when it's called with an
 absolute path.
 maybe we can use pg_mkdir_p() instead of mkdir ?

Heh, that was actually a hack to be able to run pg_basebackup on the
same machine as the database with the tablespaces. It will be removed
before commit :-) (It was also in the wrong place to work, I realize I
managed to break it in a refactor) I've put in a big ugly comment to
make sure it gets removed :-)

And yes, using pg_mkdir_p() is good. I used to do that, I think I
removed it by mistake when it was supposed to be removed elsewhere.
I've put it back.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-11 Thread Garick Hamlin
On Mon, Jan 10, 2011 at 09:09:28AM -0500, Magnus Hagander wrote:
 On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
 cedric.villemain.deb...@gmail.com wrote:
  2011/1/7 Magnus Hagander mag...@hagander.net:
  On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
  cedric.villemain.deb...@gmail.com wrote:
  2011/1/5 Magnus Hagander mag...@hagander.net:
  On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine dimi...@2ndquadrant.fr 
  wrote:
  Magnus Hagander mag...@hagander.net writes:
  * Stefan mentioned it might be useful to put some
  posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long 
  as that
    doesn't kick them out of the cache *completely*, for other backends 
  as well.
    Do we know if that is the case?
 
  Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
  not already in SHM?
 
  I think that's way more complex than we want to go here.
 
 
  DONTNEED will remove the block from the OS buffer cache every time.
 
  Then we definitely don't want to use it - because some other backend
  might well want the file. Better leave it up to the standard logic in
  the kernel.
 
  Looking at the patch, it is (very) easy to add the support for that in
  basebackup.c
   That supposes allowing mincore(), hence mmap(), and so probably switching
   the fopen() to an open() (or adding an open() just for the mmap
   requirement...)
 
  Let's go ?
 
 Per above, I still don't think we *should* do this. We don't want to
 kick things out of the cache underneath other backends, and we can't
 control that. Either way, it shouldn't happen in the beginning, and if
 it does, it should be backed by proper benchmarks.

Another option that occurs to me is to use direct IO (or another
means as needed) to bypass the cache.  So rather than kicking it out of
the cache, we just attempt not to pollute the cache: bypass it for cold
pages, and use either normal IO for 'hot pages' or a 'read()' to heat
the cache afterward.
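
A minimal sketch of the direct-IO variant, assuming Linux and a
filesystem where O_DIRECT behaves; the function name is invented and
most error handling is omitted:

#define _GNU_SOURCE				/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* len is assumed to be a multiple of the filesystem block size */
static ssize_t
read_without_caching(const char *path, char *out, size_t len)
{
	void	   *buf;
	ssize_t		n;
	int			fd = open(path, O_RDONLY | O_DIRECT);

	if (fd < 0)
		return -1;
	/* O_DIRECT requires a block-aligned buffer */
	if (posix_memalign(&buf, 4096, len) != 0)
	{
		close(fd);
		return -1;
	}
	n = read(fd, buf, len);		/* served from disk, page cache untouched */
	if (n > 0)
		memcpy(out, buf, (size_t) n);
	free(buf);
	close(fd);
	return n;
}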

Garick

 
 I've committed the backend side of this, without that. Still working
 on the client, and on cleaning up Heikki's patch for grammar/parser
 support.
 
 -- 
  Magnus Hagander
  Me: http://www.hagander.net/
  Work: http://www.redpill-linpro.com/
 
 -- 
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-11 Thread Cédric Villemain
2011/1/11 Garick Hamlin gham...@isc.upenn.edu:
 On Mon, Jan 10, 2011 at 09:09:28AM -0500, Magnus Hagander wrote:
 On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
 cedric.villemain.deb...@gmail.com wrote:
  2011/1/7 Magnus Hagander mag...@hagander.net:
  On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
  cedric.villemain.deb...@gmail.com wrote:
  2011/1/5 Magnus Hagander mag...@hagander.net:
  On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine dimi...@2ndquadrant.fr 
  wrote:
  Magnus Hagander mag...@hagander.net writes:
  * Stefan mentioned it might be useful to put some
  posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as 
  long as that
    doesn't kick them out of the cache *completely*, for other backends 
  as well.
    Do we know if that is the case?
 
  Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
  not already in SHM?
 
  I think that's way more complex than we want to go here.
 
 
  DONTNEED will remove the block from the OS buffer cache every time.
 
  Then we definitely don't want to use it - because some other backend
  might well want the file. Better leave it up to the standard logic in
  the kernel.
 
  Looking at the patch, it is (very) easy to add the support for that in
  basebackup.c
  That supposes allowing mincore(), hence mmap(), and so probably switching
  the fopen() to an open() (or adding an open() just for the mmap
  requirement...)
 
  Let's go ?

 Per above, I still don't think we *should* do this. We don't want to
 kick things out of the cache underneath other backends, and we can't
 control that. Either way, it shouldn't happen in the beginning, and if
 it does, it should be backed by proper benchmarks.

 Another option that occurs to me is an option to use direct IO (or another
 means as needed) to bypass the cache.  So rather than kicking it out of
 the cache, we attempt just not to pollute the cache by bypassing it for cold
 pages and use either normal io for 'hot pages', or use a 'read()' to heat
 the cache afterward.

AFAIR, even Linus rejected the idea of using it seriously, unless I am
shuffling things in my memory.


 Garick


 I've committed the backend side of this, without that. Still working
 on the client, and on cleaning up Heikki's patch for grammar/parser
 support.

 --
  Magnus Hagander
  Me: http://www.hagander.net/
  Work: http://www.redpill-linpro.com/

 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-11 Thread Garick Hamlin
On Tue, Jan 11, 2011 at 11:39:20AM -0500, Cédric Villemain wrote:
 2011/1/11 Garick Hamlin gham...@isc.upenn.edu:
  On Mon, Jan 10, 2011 at 09:09:28AM -0500, Magnus Hagander wrote:
  On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
  cedric.villemain.deb...@gmail.com wrote:
   2011/1/7 Magnus Hagander mag...@hagander.net:
   On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
   cedric.villemain.deb...@gmail.com wrote:
   2011/1/5 Magnus Hagander mag...@hagander.net:
   On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine 
   dimi...@2ndquadrant.fr wrote:
   Magnus Hagander mag...@hagander.net writes:
   * Stefan mentioned it might be useful to put some
   posix_fadvise(POSIX_FADV_DONTNEED)
     in the process that streams all the files out. Seems useful, as 
   long as that
     doesn't kick them out of the cache *completely*, for other 
   backends as well.
     Do we know if that is the case?
  
   Maybe have a look at pgfincore to only tag DONTNEED for blocks that 
   are
   not already in SHM?
  
   I think that's way more complex than we want to go here.
  
  
   DONTNEED will remove the block from the OS buffer cache every time.
  
   Then we definitely don't want to use it - because some other backend
   might well want the file. Better leave it up to the standard logic in
   the kernel.
  
   Looking at the patch, it is (very) easy to add the support for that in
   basebackup.c
    That supposes allowing mincore(), hence mmap(), and so probably switching
    the fopen() to an open() (or adding an open() just for the mmap
    requirement...)
  
   Let's go ?
 
  Per above, I still don't think we *should* do this. We don't want to
  kick things out of the cache underneath other backends, and we can't
  control that. Either way, it shouldn't happen in the beginning, and if
  it does, it should be backed by proper benchmarks.
 
  Another option that occurs to me is an option to use direct IO (or another
  means as needed) to bypass the cache.  So rather than kicking it out of
  the cache, we attempt just not to pollute the cache by bypassing it for cold
  pages and use either normal io for 'hot pages', or use a 'read()' to heat
  the cache afterward.
 
 AFAIR, even Linus is rejecting the idea to use it seriously, except if
 I shuffle in my memory.

Direct IO is generally a pain.

POSIX_FADV_NOREUSE is an alternative (I think).  Realistically I wasn't
sure which way(s) actually worked.  My gut was that direct IO would
likely work right on Linux and Solaris, at least.  If POSIX_FADV_NOREUSE
works then maybe that is the answer instead, but I haven't tested
either.

Garick


 
 
  Garick
 
 
  I've committed the backend side of this, without that. Still working
  on the client, and on cleaning up Heikki's patch for grammar/parser
  support.
 
  --
   Magnus Hagander
   Me: http://www.hagander.net/
   Work: http://www.redpill-linpro.com/
 
  --
  Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
  To make changes to your subscription:
  http://www.postgresql.org/mailpref/pgsql-hackers
 
 
 
 
 -- 
 Cédric Villemain               2ndQuadrant
 http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-11 Thread Florian Pflug
On Jan11, 2011, at 18:09 , Garick Hamlin wrote:
 My gut was that direct io would likely work right on Linux
 and Solaris, at least.

Didn't we discover recently that O_DIRECT fails for ext4 on Linux
if data=ordered, or something like that?

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-11 Thread Cédric Villemain
2011/1/11 Garick Hamlin gham...@isc.upenn.edu:
 On Tue, Jan 11, 2011 at 11:39:20AM -0500, Cédric Villemain wrote:
 2011/1/11 Garick Hamlin gham...@isc.upenn.edu:
  On Mon, Jan 10, 2011 at 09:09:28AM -0500, Magnus Hagander wrote:
  On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
  cedric.villemain.deb...@gmail.com wrote:
   2011/1/7 Magnus Hagander mag...@hagander.net:
   On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
   cedric.villemain.deb...@gmail.com wrote:
   2011/1/5 Magnus Hagander mag...@hagander.net:
   On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine 
   dimi...@2ndquadrant.fr wrote:
   Magnus Hagander mag...@hagander.net writes:
   * Stefan mentioned it might be useful to put some
   posix_fadvise(POSIX_FADV_DONTNEED)
     in the process that streams all the files out. Seems useful, as 
   long as that
     doesn't kick them out of the cache *completely*, for other 
   backends as well.
     Do we know if that is the case?
  
   Maybe have a look at pgfincore to only tag DONTNEED for blocks that 
   are
   not already in SHM?
  
   I think that's way more complex than we want to go here.
  
  
   DONTNEED will remove the block from the OS buffer cache every time.
  
   Then we definitely don't want to use it - because some other backend
   might well want the file. Better leave it up to the standard logic in
   the kernel.
  
   Looking at the patch, it is (very) easy to add the support for that in
   basebackup.c
    That supposes allowing mincore(), hence mmap(), and so probably switching
    the fopen() to an open() (or adding an open() just for the mmap
    requirement...)
  
   Let's go ?
 
  Per above, I still don't think we *should* do this. We don't want to
  kick things out of the cache underneath other backends, and we can't
  control that. Either way, it shouldn't happen in the beginning, and if
  it does, it should be backed by proper benchmarks.
 
  Another option that occurs to me is an option to use direct IO (or another
  means as needed) to bypass the cache.  So rather than kicking it out of
  the cache, we attempt just not to pollute the cache by bypassing it for 
  cold
  pages and use either normal io for 'hot pages', or use a 'read()' to heat
  the cache afterward.

 AFAIR, even Linus rejected the idea of using it seriously, unless I am
 shuffling things in my memory.

 Direct IO is generally a pain.

 POSIX_FADV_NOREUSE is an alternative (I think).  Realistically I wasn't
 sure which way(s) actually worked.  My gut was that direct IO would
 likely work right on Linux and Solaris, at least.  If POSIX_FADV_NOREUSE
 works then maybe that is the answer instead, but I haven't tested
 either.

Yes, it should be the best option; unfortunately it is a ghost flag, it
doesn't do anything.
At some point there were a libprefetch library and a Linux fincore()
syscall in the air. Unfortunately the actors behind those items stopped
communicating with open source, as far as I can see. (I didn't get
answers myself, and neither did the Linux ML.)



 Garick



 
  Garick
 
 
  I've committed the backend side of this, without that. Still working
  on the client, and on cleaning up Heikki's patch for grammar/parser
  support.
 
  --
   Magnus Hagander
   Me: http://www.hagander.net/
   Work: http://www.redpill-linpro.com/
 
  --
  Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
  To make changes to your subscription:
  http://www.postgresql.org/mailpref/pgsql-hackers
 



 --
 Cédric Villemain               2ndQuadrant
 http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support




-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-11 Thread Tom Lane
Florian Pflug f...@phlo.org writes:
 On Jan11, 2011, at 18:09 , Garick Hamlin wrote:
 My gut was that direct io would likely work right on Linux
 and Solaris, at least.

  Didn't we discover recently that O_DIRECT fails for ext4 on Linux
  if data=ordered, or something like that?

Quite.  Blithe assertions that something like this should work aren't
worth the electrons they're written on.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-11 Thread Garick Hamlin
On Tue, Jan 11, 2011 at 12:45:02PM -0500, Tom Lane wrote:
 Florian Pflug f...@phlo.org writes:
  On Jan11, 2011, at 18:09 , Garick Hamlin wrote:
  My gut was that direct io would likely work right on Linux
  and Solaris, at least.
 
   Didn't we discover recently that O_DIRECT fails for ext4 on Linux
   if data=ordered, or something like that?
 
 Quite.  Blithe assertions that something like this should work aren't
 worth the electrons they're written on.

Indeed.  I wasn't making such a claim, in case that wasn't clear.  I believe,
in fact, there is no single way that will work everywhere.  This isn't
needed for correctness of course; it is merely a tweak for performance, as
long as the 'not working' case on platform + filesystem X degrades to
something close to what would have happened if we hadn't tried.  I expected
POSIX_FADV_NOREUSE not to work on Linux, but I haven't looked at it recently,
and not all systems are Linux, so I mentioned it.  This is why I thought
direct IO might be more realistic.

I did not have a chance to test before I wrote this email, so I attempted to
make my uncertainty clear.  I _know_ it will not work in some environments,
but I thought it was worth looking at whether it worked on more than one sane
common setup; I can understand if you feel differently about that.

Garick

 
   regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-10 Thread Magnus Hagander
On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
cedric.villemain.deb...@gmail.com wrote:
 2011/1/7 Magnus Hagander mag...@hagander.net:
 On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
 cedric.villemain.deb...@gmail.com wrote:
 2011/1/5 Magnus Hagander mag...@hagander.net:
 On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine dimi...@2ndquadrant.fr 
 wrote:
 Magnus Hagander mag...@hagander.net writes:
 * Stefan mentioned it might be useful to put some
 posix_fadvise(POSIX_FADV_DONTNEED)
   in the process that streams all the files out. Seems useful, as long 
 as that
   doesn't kick them out of the cache *completely*, for other backends as 
 well.
   Do we know if that is the case?

 Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
 not already in SHM?

 I think that's way more complex than we want to go here.


 DONTNEED will remove the block from the OS buffer cache every time.

 Then we definitely don't want to use it - because some other backend
 might well want the file. Better leave it up to the standard logic in
 the kernel.

 Looking at the patch, it is (very) easy to add the support for that in
 basebackup.c
 That supposes allowing mincore(), hence mmap(), and so probably switching
 the fopen() to an open() (or adding an open() just for the mmap
 requirement...)

 Let's go ?

Per above, I still don't think we *should* do this. We don't want to
kick things out of the cache underneath other backends, and we can't
control that. Either way, it shouldn't happen in the beginning, and if
it does, it should be backed by proper benchmarks.

I've committed the backend side of this, without that. Still working
on the client, and on cleaning up Heikki's patch for grammar/parser
support.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-10 Thread Cédric Villemain
2011/1/10 Magnus Hagander mag...@hagander.net:
 On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
 cedric.villemain.deb...@gmail.com wrote:
 2011/1/7 Magnus Hagander mag...@hagander.net:
 On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
 cedric.villemain.deb...@gmail.com wrote:
 2011/1/5 Magnus Hagander mag...@hagander.net:
 On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine dimi...@2ndquadrant.fr 
 wrote:
 Magnus Hagander mag...@hagander.net writes:
 * Stefan mentioned it might be useful to put some
 posix_fadvise(POSIX_FADV_DONTNEED)
   in the process that streams all the files out. Seems useful, as long 
 as that
   doesn't kick them out of the cache *completely*, for other backends 
 as well.
   Do we know if that is the case?

 Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
 not already in SHM?

 I think that's way more complex than we want to go here.


 DONTNEED will remove the block from the OS buffer cache every time.

 Then we definitely don't want to use it - because some other backend
 might well want the file. Better leave it up to the standard logic in
 the kernel.

 Looking at the patch, it is (very) easy to add the support for that in
 basebackup.c
 That supposes allowing mincore(), hence mmap(), and so probably switching
 the fopen() to an open() (or adding an open() just for the mmap
 requirement...)

 Let's go ?

 Per above, I still don't think we *should* do this. We don't want to
 kick things out of the cache underneath other backends, and we

we are dropping stuff underneath other backends anyway, but I
understand your point.

 can't control that. Either way, it shouldn't happen in the beginning,
 and if it does, it should be backed by proper benchmarks.

I agree.


 I've committed the backend side of this, without that. Still working
 on the client, and on cleaning up Heikki's patch for grammar/parser
 support.

-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-10 Thread Stefan Kaltenbrunner

On 01/10/2011 08:13 PM, Cédric Villemain wrote:

2011/1/10 Magnus Hagandermag...@hagander.net:

On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
cedric.villemain.deb...@gmail.com  wrote:

2011/1/7 Magnus Hagandermag...@hagander.net:

On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
cedric.villemain.deb...@gmail.com  wrote:

2011/1/5 Magnus Hagandermag...@hagander.net:

On Wed, Jan 5, 2011 at 22:58, Dimitri Fontainedimi...@2ndquadrant.fr  wrote:

Magnus Hagandermag...@hagander.net  writes:

* Stefan mentioned it might be useful to put some
posix_fadvise(POSIX_FADV_DONTNEED)
   in the process that streams all the files out. Seems useful, as long as that
   doesn't kick them out of the cache *completely*, for other backends as well.
   Do we know if that is the case?


Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
not already in SHM?


I think that's way more complex than we want to go here.



DONTNEED will remove the block from the OS buffer cache every time.


Then we definitely don't want to use it - because some other backend
might well want the file. Better leave it up to the standard logic in
the kernel.


Looking at the patch, it is (very) easy to add the support for that in
basebackup.c
That supposes allowing mincore(), hence mmap(), and so probably switching
the fopen() to an open() (or adding an open() just for the mmap
requirement...)

Let's go ?


Per above, I still don't think we *should* do this. We don't want to
kick things out of the cache underneath other backends, and we


we are dropping stuff underneath other backends anyway, but I
understand your point.


can't control that. Either way, it shouldn't happen in the beginning,
and if it does, it should be backed by proper benchmarks.


I agree.


well I want to point out that the link I provided upthread actually
provides a (linux centric) way to get the property of interest for this:


* if the datablocks are in the OS buffercache just leave them alone; if
they are NOT, tell the OS that this current user is not interested in
having them there
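
A minimal sketch of that idea, Linux-centric (mincore()/posix_fadvise(),
as in the pgfincore approach discussed above); function names are
invented and error handling is omitted:

#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Snapshot which pages of an open file are already resident (bit 0 set). */
static unsigned char *
snapshot_residency(int fd, size_t filesize, size_t *npages)
{
	long		pagesize = sysconf(_SC_PAGESIZE);
	unsigned char *vec;
	void	   *map;

	*npages = (filesize + pagesize - 1) / pagesize;
	vec = malloc(*npages);
	map = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0);
	mincore(map, filesize, vec);
	munmap(map, filesize);
	return vec;
}

/* After streaming the file, drop only the pages that were cold before. */
static void
evict_what_we_added(int fd, const unsigned char *before, size_t npages)
{
	long		pagesize = sysconf(_SC_PAGESIZE);
	size_t		i;

	for (i = 0; i < npages; i++)
		if (!(before[i] & 1))
			posix_fadvise(fd, (off_t) i * pagesize, pagesize,
						  POSIX_FADV_DONTNEED);
}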


I would like to see something like that implemented in the backend
sometime, and maybe even as a GUC of some sort; that way we could
actually use it for, say, a pg_dump run as well. I have seen the
response times of big boxes tank not because of the CPU and lock load
pg_dump imposes, but because of the way it can cause the OS buffer cache
to get spoiled with not-really-important data.




anyway, I agree that the (positive and/or negative) effect of something
like that needs to be measured, but this effect is not too easy to see in
very simple setups...



Stefan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-10 Thread Cédric Villemain
2011/1/10 Stefan Kaltenbrunner ste...@kaltenbrunner.cc:
 On 01/10/2011 08:13 PM, Cédric Villemain wrote:

 2011/1/10 Magnus Hagander mag...@hagander.net:

 On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
 cedric.villemain.deb...@gmail.com  wrote:

 2011/1/7 Magnus Hagander mag...@hagander.net:

 On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
 cedric.villemain.deb...@gmail.com  wrote:

 2011/1/5 Magnus Hagander mag...@hagander.net:

 On Wed, Jan 5, 2011 at 22:58, Dimitri
 Fontaine dimi...@2ndquadrant.fr  wrote:

 Magnus Hagander mag...@hagander.net  writes:

 * Stefan mentioned it might be useful to put some
 posix_fadvise(POSIX_FADV_DONTNEED)
   in the process that streams all the files out. Seems useful, as
 long as that
   doesn't kick them out of the cache *completely*, for other
 backends as well.
   Do we know if that is the case?

 Maybe have a look at pgfincore to only tag DONTNEED for blocks that
 are
 not already in SHM?

 I think that's way more complex than we want to go here.


 DONTNEED will remove the block from the OS buffer every time.

 Then we definitely don't want to use it - because some other backend
 might well want the file. Better leave it up to the standard logic in
 the kernel.

 Looking at the patch, it is (very) easy to add the support for that in
 basebackup.c
 That supposes allowing mincore(), and hence mmap(), and so probably switching
 the fopen() to an open() (or adding an open() just for the mmap
 requirement...)

 Let's go?

 Per above, I still don't think we *should* do this. We don't want to
 kick things out of the cache underneath other backends, and since we

 we are dropping stuff underneath other backends anyway, but I
 understand your point.

 can't control that. Either way, it shouldn't happen in the beginning,
 and if it does, should be backed with proper benchmarks.

 I agree.

 well I want to point out that the link I provided upthread actually provides
 a (linux centric) way to get the property of interest for this:

yes, it is exactly what we are talking about here:
mincore() and posix_fadvise().

FreeBSD should allow that later; at least it is on the todo list.
Windows may allow that too, with a different API.


 * if the datablocks are in the OS buffercache just leave them alone; if they
 are NOT, tell the OS that this current user is not interested in having them
 there

my experience is that posix_fadvise on a specific block behaves more
brutally than flagging a whole file. In the latter case it may not do
what you want if the kernel estimates it is not welcome (because of other
IO requests).

What Magnus points out is that other backends execute queries and
request blocks (and load them into PostgreSQL's shared buffers), and it
is *hard* to be sure we don't remove blocks just loaded by another
backend (the worst case being flushing prefetched blocks not yet in
shared buffers, cf. effective_io_concurrency).


 I would like to see something like that implemented in the backend sometime,
 and maybe even as a GUC of some sort; that way we actually could use it
 for, say, a pg_dump run as well. I have seen the response times of big boxes
 tank not because of the CPU and lock load pg_dump imposes, but because of the
 way that it can cause the OS buffercache to get spoiled with
 not-really-important data.

Glad to hear that; pgfincore is also a POC about those topics.
The best solution is to mmap in postgres, but that is not possible, so we
have to take a snapshot of the objects and restore them afterwards (again,
*it is* what Tobias does with his rsync). Side note: because of readahead,
inspecting block by block while you read the file provides bad results (or
you need to fadvise POSIX_FADV_RANDOM to remove the readahead behavior,
which is not good at all).
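
The restore half of such a snapshot could then look like this sketch
(assuming a residency vector taken earlier with mincore(); advising whole
runs at a time is precisely what avoids the brutal block-by-block behavior
mentioned above):

/*
 * Sketch: re-warm the OS cache from a residency vector previously taken
 * with mincore().  Advise whole runs, not single pages, so the kernel
 * can schedule large sequential reads.
 */
#include <fcntl.h>
#include <unistd.h>

static void
restore_cache_state(int fd, const unsigned char *vec, size_t npages)
{
	long		pagesize = sysconf(_SC_PAGESIZE);

	for (size_t i = 0; i < npages; i++)
	{
		size_t		j = i;

		if (!(vec[i] & 1))
			continue;			/* was not cached before: nothing to do */
		while (j < npages && (vec[j] & 1))
			j++;
		posix_fadvise(fd, (off_t) i * pagesize,
					  (off_t) (j - i) * pagesize, POSIX_FADV_WILLNEED);
		i = j;
	}
}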


 anyway I agree that the (positive and/or negative) effect of something like
 that needs to be measured, but this effect is not too easy to see in very
 simple setups...

Yes. And with pg_basebackup, copying 1GB over the network takes longer
than 2 seconds, so we will probably need to have a specific strategy.


-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-10 Thread Cédric Villemain
2011/1/10 Magnus Hagander mag...@hagander.net:
 On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
 cedric.villemain.deb...@gmail.com wrote:
 2011/1/7 Magnus Hagander mag...@hagander.net:
 On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
 cedric.villemain.deb...@gmail.com wrote:
 2011/1/5 Magnus Hagander mag...@hagander.net:
 On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine dimi...@2ndquadrant.fr 
 wrote:
 Magnus Hagander mag...@hagander.net writes:
 * Stefan mentioned it might be useful to put some
 posix_fadvise(POSIX_FADV_DONTNEED)
   in the process that streams all the files out. Seems useful, as long 
 as that
   doesn't kick them out of the cache *completely*, for other backends 
 as well.
   Do we know if that is the case?

 Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
 not already in SHM?

 I think that's way more complex than we want to go here.


 DONTNEED will remove the block from the OS buffer every time.

 Then we definitely don't want to use it - because some other backend
 might well want the file. Better leave it up to the standard logic in
 the kernel.

 Looking at the patch, it is (very) easy to add the support for that in
 basebackup.c
 That supposes allowing mincore(), and hence mmap(), and so probably switching
 the fopen() to an open() (or adding an open() just for the mmap
 requirement...)

 Let's go?

 Per above, I still don't think we *should* do this. We don't want to
 kick things out of the cache underneath other backends, and since we
 can't control that. Either way, it shouldn't happen in the beginning,
 and if it does, should be backed with proper benchmarks.

 I've committed the backend side of this, without that. Still working
 on the client, and on cleaning up Heikki's patch for grammar/parser
 support.

Attached is a small patch fixing -d basedir when it's called with an
absolute path.
Maybe we can use pg_mkdir_p() instead of mkdir?


 --
  Magnus Hagander
  Me: http://www.hagander.net/
  Work: http://www.redpill-linpro.com/




-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 098f330..149a2ff 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -257,11 +257,6 @@ ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res, int rownum)
 	 */
 	verify_dir_is_empty_or_create(current_path);
 
-	if (current_path[0] == '/')
-	{
-		current_path[0] = '_';
-	}
-
 	/*
 	 * Get the COPY data
 	 */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-09 Thread Hannu Krosing

On 7.1.2011 15:45, Magnus Hagander wrote:

On Fri, Jan 7, 2011 at 02:15, Simon Riggs si...@2ndquadrant.com  wrote:


One very useful feature will be some way of confirming the number and
size of files to transfer, so that the base backup client can find out
the progress.

The patch already does this. Or rather, as it's coded it does this
once per tablespace.

It'll give you an approximation only of course, that can change, but
it should be enough for the purposes of a progress indication.

In this case you actually could send exact numbers, as you need to only
transfer the files up to the size they were when starting the base backup.
The rest will be taken care of by WAL replay.



It would also be good to avoid writing a backup_label file at all on the
master, so there was no reason why multiple concurrent backups could not
be taken. The current coding allows for the idea that the start and stop
might be in different sessions, whereas here we know we are in one
session.

Yeah, I have that on the todo list suggested by Heikki. I consider it
a later phase though.





--

Hannu Krosing
Senior Consultant,
Infinite Scalability  Performance
http://www.2ndQuadrant.com/books/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-09 Thread Magnus Hagander
On Sun, Jan 9, 2011 at 09:55, Hannu Krosing ha...@2ndquadrant.com wrote:
 On 7.1.2011 15:45, Magnus Hagander wrote:

 On Fri, Jan 7, 2011 at 02:15, Simon Riggs si...@2ndquadrant.com  wrote:

 One very useful feature will be some way of confirming the number and
 size of files to transfer, so that the base backup client can find out
 the progress.

 The patch already does this. Or rather, as it's coded it does this
 once per tablespace.

 It'll give you an approximation only of course, that can change,

 In this case you actually could send exact numbers, as you need to only
 transfer the files up to the size they were when starting the base backup.
 The rest will be taken care of by WAL replay.

It will still be an estimate, because files can get smaller, and even
go away completely.

But we really don't need more than an estimate...


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-09 Thread Hannu Krosing

On 9.1.2011 10:44, Magnus Hagander wrote:

On Sun, Jan 9, 2011 at 09:55, Hannu Krosing ha...@2ndquadrant.com  wrote:

On 7.1.2011 15:45, Magnus Hagander wrote:

On Fri, Jan 7, 2011 at 02:15, Simon Riggs si...@2ndquadrant.com  wrote:


One very useful feature will be some way of confirming the number and
size of files to transfer, so that the base backup client can find out
the progress.

The patch already does this. Or rather, as it's coded it does this
once per tablespace.

It'll give you an approximation only of course, that can change,

In this case you actually could send exact numbers, as you need to only
transfer the files up to the size they were when starting the base backup.
The rest will be taken care of by WAL replay.

It will still be an estimate, because files can get smaller, and even
go away completely.
Sure. Just wanted to point out that you don't need to send the tail
part of the file which was added after the start of the backup.

And you can give the worst case estimate for space needed by base backup.

OTOH, streaming the WAL files in parallel can still fill up all 
available space :P



But we really don't need more than an estimate...


Agreed.

--

Hannu Krosing
Senior Consultant,
Infinite Scalability  Performance
http://www.2ndQuadrant.com/books/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-09 Thread Magnus Hagander
On Sun, Jan 9, 2011 at 12:05, Hannu Krosing ha...@2ndquadrant.com wrote:
 On 9.1.2011 10:44, Magnus Hagander wrote:

 On Sun, Jan 9, 2011 at 09:55, Hannu Krosing ha...@2ndquadrant.com  wrote:

 On 7.1.2011 15:45, Magnus Hagander wrote:

 On Fri, Jan 7, 2011 at 02:15, Simon Riggs si...@2ndquadrant.com
  wrote:

 One very useful feature will be some way of confirming the number and
 size of files to transfer, so that the base backup client can find out
 the progress.

 The patch already does this. Or rather, as it's coded it does this
 once per tablespace.

 It'll give you an approximation only of course, that can change,

 In this case you actually could send exact numbers, as you need to only
 transfer the files
  up to the size they were when starting the base backup. The rest will be
 taken care of by
  WAL replay

 It will still be an estimate, because files can get smaller, and even
 go away completely.

 Sure. Just wanted to point out that you don't need to send the tail
 part of the file which was added after the start of the backup.

True - but that's a PITA to keep track of. We do this if the file
changes during the transmission of that *file*, since otherwise the
tar header would specify an incorrect size, but not through the whole
backup.
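
In other words, per file the sender does roughly this (a sketch, not the
actual basebackup.c code; the emit() callback is hypothetical):

/*
 * Sketch: once a tar header has announced "announced" bytes for a file,
 * emit exactly that many -- pad with zeros if the file shrank, and stop
 * early if it grew.
 */
#include <stdio.h>
#include <string.h>

static void
send_file_body(FILE *in, size_t announced,
			   void (*emit) (const char *buf, size_t len))
{
	char		buf[32768];
	size_t		sent = 0;

	while (sent < announced)
	{
		size_t		want = announced - sent;
		size_t		n;

		if (want > sizeof(buf))
			want = sizeof(buf);
		n = fread(buf, 1, want, in);
		if (n == 0)				/* file shrank: pad the rest with zeros */
		{
			memset(buf, 0, want);
			n = want;
		}
		emit(buf, n);
		sent += n;
	}
	/* anything appended after the header was built is deliberately ignored */
}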


 And you can give the worst case estimate for space needed by base backup.

 OTOH, streaming the WAL files in parallel can still fill up all available
 space :P

Yeah. I don't think it's worth the extra complexity of having to
enumerate and keep records for every individual file before the
streaming starts.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-09 Thread Cédric Villemain
2011/1/7 Garick Hamlin gham...@isc.upenn.edu:
 On Thu, Jan 06, 2011 at 07:47:39PM -0500, Cédric Villemain wrote:
 2011/1/5 Magnus Hagander mag...@hagander.net:
  On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine dimi...@2ndquadrant.fr 
  wrote:
  Magnus Hagander mag...@hagander.net writes:
  * Stefan mentioned it might be useful to put some
  posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long 
  as that
    doesn't kick them out of the cache *completely*, for other backends as 
  well.
    Do we know if that is the case?
 
  Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
  not already in SHM?
 
  I think that's way more complex than we want to go here.
 

 DONTNEED will remove the block from the OS buffer every time.

 It should not be that hard to implement a snapshot (it needs mincore())
 and to restore previous state. I don't know how basebackup is
 performed exactly...so perhaps I am wrong.

 posix_fadvise support is already in postgresql core...we can start by
 just doing a snapshot of the files before starting, or at some point
 in the basebackup, it will need only 256kB per GB of data...

 It is actually possible to be more scalable than the simple solution you
 outline here (although that solution works pretty well).

Yes, I suggest something pretty simple to go with as a first shot.


 I've written a program that synchronizes the OS cache state using
 mmap()/mincore() between two computers.  I haven't actually tested its
 impact on performance yet, but I was surprised by how fast it actually runs
 and how compact cache maps can be.

 If one encodes the data so one remembers the number of zeros between 1s
 one, storage scale by the amount of memory in each size rather than the
 dataset size.  I actually played with doing that, then doing huffman
 encoding of that.  I get around 1.2-1.3 bits / page of _physical memory_
 on my tests.

 I don't have my notes handy, but here are some numbers from memory...

that is interesting; even if I didn't have issues with the size of the
maps so far, I thought that a simple zlib compression should be
enough.


 The obvious worst cases are 1 bit per page of _dataset_ or 19 bits per page
 of physical memory in the machine.  The latter limit gets better, however,
 since there are  1024 symbols possible for the encoder (since in this
 case symbols are spans of zeros that need to fit in a file that is 1 GB in
 size).  So the actual worst case is much closer to 1 bit per page of
 the dataset or ~10 bits per page of physical memory.  The real performance
 I see with huffman is more like 1.3 bits per page of physical memory.  All the
 encoding/decoding is actually very fast.  zlib would actually compress even
 better than huffman, but the huffman encoder/decoder is actually pretty good and
 very straightforward code.

pgfincore currently holds that information in a flat file. The ongoing
dev is simpler and provides the data as bits, so you can store it
in a table, restore it on your slave thanks to SR, and use it on
the slave.


 I would like to integrate something like this into PG or perhaps even into
 something like rsync, but it was written as a proof of concept and I haven't
 had time to work on it recently.

 Garick

 --
 Cédric Villemain               2ndQuadrant
 http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-09 Thread Cédric Villemain
2011/1/7 Magnus Hagander mag...@hagander.net:
 On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
 cedric.villemain.deb...@gmail.com wrote:
 2011/1/5 Magnus Hagander mag...@hagander.net:
 On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine dimi...@2ndquadrant.fr 
 wrote:
 Magnus Hagander mag...@hagander.net writes:
 * Stefan mentioned it might be useful to put some
 posix_fadvise(POSIX_FADV_DONTNEED)
   in the process that streams all the files out. Seems useful, as long as 
 that
   doesn't kick them out of the cache *completely*, for other backends as 
 well.
   Do we know if that is the case?

 Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
 not already in SHM?

 I think that's way more complex than we want to go here.


 DONTNEED will remove the block from the OS buffer every time.

 Then we definitely don't want to use it - because some other backend
 might well want the file. Better leave it up to the standard logic in
 the kernel.

Looking at the patch, it is (very) easy to add the support for that in
basebackup.c
That supposes allowing mincore(), and hence mmap(), and so probably switching
the fopen() to an open() (or adding an open() just for the mmap
requirement...)

Let's go?


 It should not be that hard to implement a snapshot (it needs mincore())
 and to restore previous state. I don't know how basebackup is
 performed exactly...so perhaps I am wrong.

 Uh, it just reads the files out of the filesystem. Just like you'd do
 today, except it's now integrated and streams the data across a
 regular libpq connection.

 --
  Magnus Hagander
  Me: http://www.hagander.net/
  Work: http://www.redpill-linpro.com/




-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-08 Thread Magnus Hagander
On Thu, Jan 6, 2011 at 23:57, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 Looks like pg_streamrecv creates the pg_xlog and pg_tblspc directories,
 because they're not included in the streamed tar. Wouldn't it be better to
 include them in the tar as empty directories at the server-side? Otherwise
 if you write the tar file to disk and untar it later, you have to manually
 create them.

Attached is an updated patch that does this.

It also collects all the header records as a single resultset at the
beginning. This made for cleaner code, but more importantly makes it
possible to get the total size of the backup even if there are
multiple tablespaces.

It also changes the tar members to use relative paths instead of
absolute ones - since we send the root of the directory in the header
anyway. That also takes away the ./ portion in all tar members.

git branch on github updated as well, of course.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
*** a/doc/src/sgml/protocol.sgml
--- b/doc/src/sgml/protocol.sgml
***
*** 1458,1463  The commands accepted in walsender mode are:
--- 1458,1555 
   </para>
  </listitem>
</varlistentry>
+ 
+   <varlistentry>
+ <term>BASE_BACKUP <replaceable>options</><literal>;</><replaceable>label</></term>
+ <listitem>
+  <para>
+   Instructs the server to start streaming a base backup.
+   The system will automatically be put in backup mode with the label
+   specified in <replaceable>label</> before the backup is started, and
+   taken out of it when the backup is complete. The following options
+   are accepted:
+   <variablelist>
+    <varlistentry>
+ <term><literal>PROGRESS</></term>
+ <listitem>
+  <para>
+   Request information required to generate a progress report. This will
+   send back an approximate size in the header of each tablespace, which
+   can be used to calculate how far along the stream is done. This is
+   calculated by enumerating all the file sizes once before the transfer
+   is even started, and may as such have a negative impact on the
+   performance - in particular it may take longer before the first data
+   is streamed. Since the database files can change during the backup,
+   the size is only approximate and may both grow and shrink between
+   the time of approximation and the sending of the actual files.
+  </para>
+ </listitem>
+</varlistentry>
+   </variablelist>
+  </para>
+  <para>
+   When the backup is started, the server will first send a header in
+   ordinary result set format, followed by one or more CopyResponse
+   results, one for PGDATA and one for each additional tablespace other
+   than <literal>pg_default</> and <literal>pg_global</>. The data in
+   the CopyResponse results will be a tar format (using ustar00
+   extensions) dump of the tablespace contents.
+  </para>
+  <para>
+   The header is an ordinary resultset with one row for each tablespace.
+   The fields in this row are:
+   <variablelist>
+    <varlistentry>
+ <term>spcoid</term>
+ <listitem>
+  <para>
+   The oid of the tablespace, or <literal>NULL</> if it's the base
+   directory.
+  </para>
+ </listitem>
+</varlistentry>
+<varlistentry>
+ <term>spclocation</term>
+ <listitem>
+  <para>
+   The full path of the tablespace directory, or <literal>NULL</>
+   if it's the base directory.
+  </para>
+ </listitem>
+</varlistentry>
+<varlistentry>
+ <term>size</term>
+ <listitem>
+  <para>
+   The approximate size of the tablespace, if a progress report has
+   been requested; otherwise it's <literal>NULL</>.
+  </para>
+ </listitem>
+</varlistentry>
+   </variablelist>
+  </para>
+  <para>
+   The tar archive for the data directory and each tablespace will contain
+   all files in the directories, regardless of whether they are
+   <productname>PostgreSQL</> files or other files added to the same
+   directory. The only excluded files are:
+   <itemizedlist spacing="compact" mark="bullet">
+<listitem>
+ <para>
+  <filename>postmaster.pid</>
+ </para>
+</listitem>
+<listitem>
+ <para>
+  <filename>pg_xlog</> (including subdirectories)
+ </para>
+</listitem>
+   </itemizedlist>
+   Owner, group and file mode are set if the underlying filesystem on
+   the server supports it.
+  </para>
+ </listitem>
+   </varlistentry>
  </variablelist>
  
  </para>
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 8308,8313  pg_start_backup(PG_FUNCTION_ARGS)
--- 8308,8328 
  	text	   *backupid = PG_GETARG_TEXT_P(0);
  	bool		fast = PG_GETARG_BOOL(1);
  	char	   

Re: [HACKERS] Streaming base backups

2011-01-07 Thread Magnus Hagander
On Fri, Jan 7, 2011 at 02:15, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, 2011-01-05 at 14:54 +0100, Magnus Hagander wrote:

 The basic implementation is: Add a new command to the replication mode called
 BASE_BACKUP, that will initiate a base backup, stream the contents (in tar
 compatible format) of the data directory and all tablespaces, and then end
 the base backup in a single operation.

 I'm a little dubious of the performance of that approach for some users,
 though it does seem a popular idea.

Well, it's of course only going to be an *option*. We should keep our
flexibility and allow the current ways as well.


 One very useful feature will be some way of confirming the number and
 size of files to transfer, so that the base backup client can find out
 the progress.

The patch already does this. Or rather, as it's coded it does this
once per tablespace.

It'll give you an approximation only of course, that can change, but
it should be enough for the purposes of a progress indication.
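
The per-tablespace approximation is essentially a single recursive walk
summing file sizes, along these lines (a simplified sketch, not the patch's
code):

/*
 * Sketch: approximate the total size of a directory tree by summing
 * st_size in a single pass.  The result is an estimate; files can grow,
 * shrink or disappear before they are actually sent.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

static long long
estimate_backup_size(const char *path)
{
	DIR		   *dir = opendir(path);
	struct dirent *de;
	long long	total = 0;

	if (dir == NULL)
		return 0;
	while ((de = readdir(dir)) != NULL)
	{
		char		sub[4096];
		struct stat st;

		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;
		snprintf(sub, sizeof(sub), "%s/%s", path, de->d_name);
		if (lstat(sub, &st) != 0)
			continue;
		if (S_ISDIR(st.st_mode))
			total += estimate_backup_size(sub);
		else if (S_ISREG(st.st_mode))
			total += st.st_size;
	}
	closedir(dir);
	return total;
}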


 It would also be good to avoid writing a backup_label file at all on the
 master, so there was no reason why multiple concurrent backups could not
 be taken. The current coding allows for the idea that the start and stop
 might be in different sessions, whereas here we know we are in one
 session.

Yeah, I have that on the todo list suggested by Heikki. I consider it
a later phase though.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-07 Thread Magnus Hagander
On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
cedric.villemain.deb...@gmail.com wrote:
 2011/1/5 Magnus Hagander mag...@hagander.net:
 On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine dimi...@2ndquadrant.fr 
 wrote:
 Magnus Hagander mag...@hagander.net writes:
 * Stefan mentioned it might be useful to put some
 posix_fadvise(POSIX_FADV_DONTNEED)
   in the process that streams all the files out. Seems useful, as long as 
 that
   doesn't kick them out of the cache *completely*, for other backends as 
 well.
   Do we know if that is the case?

 Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
 not already in SHM?

 I think that's way more complex than we want to go here.


 DONTNEED will remove the block from the OS buffer every time.

Then we definitely don't want to use it - because some other backend
might well want the file. Better leave it up to the standard logic in
the kernel.

 It should not be that hard to implement a snapshot (it needs mincore())
 and to restore previous state. I don't know how basebackup is
 performed exactly...so perhaps I am wrong.

Uh, it just reads the files out of the filesystem. Just like you'd do
today, except it's now integrated and streams the data across a
regular libpq connection.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-07 Thread Garick Hamlin
On Thu, Jan 06, 2011 at 07:47:39PM -0500, Cédric Villemain wrote:
 2011/1/5 Magnus Hagander mag...@hagander.net:
  On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine dimi...@2ndquadrant.fr 
  wrote:
  Magnus Hagander mag...@hagander.net writes:
  * Stefan mentioned it might be useful to put some
  posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long as 
  that
    doesn't kick them out of the cache *completely*, for other backends as 
  well.
    Do we know if that is the case?
 
  Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
  not already in SHM?
 
  I think that's way more complex than we want to go here.
 
 
 DONTNEED will remove the block from the OS buffer every time.
 
 It should not be that hard to implement a snapshot (it needs mincore())
 and to restore previous state. I don't know how basebackup is
 performed exactly...so perhaps I am wrong.
 
 posix_fadvise support is already in postgresql core...we can start by
 just doing a snapshot of the files before starting, or at some point
 in the basebackup, it will need only 256kB per GB of data...

It is actually possible to be more scalable than the simple solution you
outline here (although that solution works pretty well).  

I've written a program that synchronizes the OS cache state using
mmap()/mincore() between two computers.  I haven't actually tested its
impact on performance yet, but I was surprised by how fast it actually runs
impact on performance yet, but I was surprised by how fast it actually runs
and how compact cache maps can be.

If one encodes the data so one remembers the number of zeros between 1s 
one, storage scale by the amount of memory in each size rather than the 
dataset size.  I actually played with doing that, then doing huffman 
encoding of that.  I get around 1.2-1.3 bits / page of _physical memory_ 
on my tests.
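
To make that concrete, the first stage of such an encoder could look like
the sketch below (not Garick's actual program; the names are made up). It
stores one zero-run length per resident page, which is why, as corrected in
the follow-up, the storage scales with the number of resident pages rather
than with the total dataset size:

/*
 * Sketch: turn a mincore() residency vector into zero-run lengths, one
 * entry per resident page.  The counts would then be fed to a Huffman
 * or zlib encoder, as described above.
 */
#include <stddef.h>
#include <stdint.h>

static size_t
encode_zero_runs(const unsigned char *vec, size_t npages,
				 uint32_t *runs, size_t maxruns)
{
	size_t		nruns = 0;
	uint32_t	gap = 0;

	for (size_t i = 0; i < npages; i++)
	{
		if (vec[i] & 1)
		{
			if (nruns == maxruns)
				break;			/* output buffer full */
			runs[nruns++] = gap;	/* zeros seen since the previous 1 */
			gap = 0;
		}
		else
			gap++;
	}
	return nruns;
}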

I don't have my notes handy, but here are some numbers from memory...

The obvious worst cases are 1 bit per page of _dataset_ or 19 bits per page
of physical memory in the machine.  The latter limit gets better, however,
since there are  1024 symbols possible for the encoder (since in this
case symbols are spans of zeros that need to fit in a file that is 1 GB in
size).  So the actual worst case is much closer to 1 bit per page of
the dataset or ~10 bits per page of physical memory.  The real performance
I see with huffman is more like 1.3 bits per page of physical memory.  All the
encoding/decoding is actually very fast.  zlib would actually compress even
better than huffman, but the huffman encoder/decoder is actually pretty good and
very straightforward code.

I would like to integrate something like this into PG or perhaps even into
something like rsync, but it was written as a proof of concept and I haven't
had time to work on it recently.

Garick

 -- 
 Cédric Villemain               2ndQuadrant
 http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
 
 -- 
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-07 Thread Garick Hamlin
On Fri, Jan 07, 2011 at 10:26:29AM -0500, Garick Hamlin wrote:
 On Thu, Jan 06, 2011 at 07:47:39PM -0500, Cédric Villemain wrote:
  2011/1/5 Magnus Hagander mag...@hagander.net:
   On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine dimi...@2ndquadrant.fr 
   wrote:
   Magnus Hagander mag...@hagander.net writes:
   * Stefan mentioned it might be useful to put some
   posix_fadvise(POSIX_FADV_DONTNEED)
     in the process that streams all the files out. Seems useful, as long 
   as that
     doesn't kick them out of the cache *completely*, for other backends 
   as well.
     Do we know if that is the case?
  
   Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
   not already in SHM?
  
   I think that's way more complex than we want to go here.
  
  
  DONTNEED will remove the block from the OS buffer every time.
  
  It should not be that hard to implement a snapshot (it needs mincore())
  and to restore previous state. I don't know how basebackup is
  performed exactly...so perhaps I am wrong.
  
  posix_fadvise support is already in postgresql core...we can start by
  just doing a snapshot of the files before starting, or at some point
  in the basebackup, it will need only 256kB per GB of data...
 
 It is actually possible to be more scalable than the simple solution you
 outline here (although that solution works pretty well).  
 
 I've written a program that synchronizes the OS cache state using
 mmap()/mincore() between two computers.  I haven't actually tested its
 impact on performance yet, but I was surprised by how fast it actually runs
 and how compact cache maps can be.
 
 If one encodes the data so one remembers the number of zeros between 1s 
 one, storage scale by the amount of memory in each size rather than the 

Sorry for the typos, that should read:

the storage scales by the number of pages resident in memory rather than the 
total dataset size.

 dataset size.  I actually played with doing that, then doing huffman 
 encoding of that.  I get around 1.2-1.3 bits / page of _physical memory_ 
 on my tests.
 
 I don't have my notes handy, but here are some numbers from memory...
 
 The obvious worst cases are 1 bit per page of _dataset_ or 19 bits per page
 of physical memory in the machine.  The latter limit gets better, however,
 since there are  1024 symbols possible for the encoder (since in this
 case symbols are spans of zeros that need to fit in a file that is 1 GB in
 size).  So the actual worst case is much closer to 1 bit per page of
 the dataset or ~10 bits per page of physical memory.  The real performance
 I see with huffman is more like 1.3 bits per page of physical memory.  All the
 encoding/decoding is actually very fast.  zlib would actually compress even
 better than huffman, but the huffman encoder/decoder is actually pretty good and
 very straightforward code.
 
 I would like to integrate something like this into PG or perhaps even into
 something like rsync, but it was written as a proof of concept and I haven't
 had time to work on it recently.
 
 Garick
 
  -- 
  Cédric Villemain               2ndQuadrant
  http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
  
  -- 
  Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
  To make changes to your subscription:
  http://www.postgresql.org/mailpref/pgsql-hackers
 
 -- 
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-07 Thread Heikki Linnakangas

On 05.01.2011 15:54, Magnus Hagander wrote:

* Suggestion from Heikki: perhaps at some point we're going to need a full
   bison grammar for walsender commands.


Here's a patch for this (Also available at 
g...@github.com:hlinnaka/postgres.git, branch streaming_base). I 
thought I knew our bison/flex magic pretty well by now, but it turned 
out to take much longer than I thought. But here it is.


I'm not 100% sure if this is worth the trouble quite yet. It adds quite 
a lot of boilerplate code. OTOH, having a bison grammar file makes it 
easier to see what exactly the grammar is, and I like that. It's not too 
bad with three commands yet, but if it expands much further a bison 
grammar is a must.


At first I tried using the backend lexer for this, but it couldn't parse 
the xlog-start location in the START_REPLICATION 0/4700 command. 
In hindsight that may have been a badly chosen syntax. But as you 
pointed out on IM, the lexer needed to handle this limited set of 
commands is very small, so I wrote a dedicated flex lexer instead that 
can handle it.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** a/src/backend/replication/Makefile
--- b/src/backend/replication/Makefile
***
*** 12,17  subdir = src/backend/replication
  top_builddir = ../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = walsender.o walreceiverfuncs.o walreceiver.o basebackup.o
  
  include $(top_srcdir)/src/backend/common.mk
--- 12,40 
  top_builddir = ../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = walsender.o walreceiverfuncs.o walreceiver.o basebackup.o \
! 	repl_gram.o
  
  include $(top_srcdir)/src/backend/common.mk
+ 
+ # repl_scanner is compiled as part of repl_gram
+ repl_gram.o: repl_scanner.c
+ 
+ # See notes in src/backend/parser/Makefile about the following two rules
+ 
+ repl_gram.c: repl_gram.y
+ ifdef BISON
+ 	$(BISON) -d $(BISONFLAGS) -o $@ $<
+ else
+ 	@$(missing) bison $< $@
+ endif
+ 
+ repl_scanner.c: repl_scanner.l
+ ifdef FLEX
+ 	$(FLEX) $(FLEXFLAGS) -o'$@' $<
+ else
+ 	@$(missing) flex $< $@
+ endif
+ 
+ # repl_gram.c and repl_scanner.c are in the distribution tarball, so
+ # they are not cleaned here.
*** a/src/backend/replication/basebackup.c
--- b/src/backend/replication/basebackup.c
***
*** 56,81  base_backup_cleanup(int code, Datum arg)
   * CopyOut format.
   */
  void
! SendBaseBackup(const char *options)
  {
  	DIR 		   *dir;
  	struct dirent  *de;
- 	char   		   *backup_label = strchr(options, ';');
- 	bool			progress = false;
- 
- 	if (backup_label == NULL)
- 		ereport(FATAL,
- (errcode(ERRCODE_PROTOCOL_VIOLATION),
-  errmsg("invalid base backup options: %s", options)));
- 	backup_label++; /* Walk past the semicolon */
- 
- 	/* Currently the only option string supported is PROGRESS */
- 	if (strncmp(options, "PROGRESS", 8) == 0)
- 		progress = true;
- 	else if (options[0] != ';')
- 		ereport(FATAL,
- (errcode(ERRCODE_PROTOCOL_VIOLATION),
-  errmsg("invalid base backup options: %s", options)));
  
  	/* Make sure we can open the directory with tablespaces in it */
  	dir = AllocateDir("pg_tblspc");
--- 56,65 
   * CopyOut format.
   */
  void
! SendBaseBackup(const char *backup_label, bool progress)
  {
  	DIR 		   *dir;
  	struct dirent  *de;
  
  	/* Make sure we can open the directory with tablespaces in it */
  	dir = AllocateDir("pg_tblspc");
*** /dev/null
--- b/src/backend/replication/repl_gram.y
***
*** 0 
--- 1,135 
+ %{
+ /*-
+  *
+  * repl_gram.y- Parser for the replication commands
+  *
+  * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *	  src/backend/replication/repl_gram.y
+  *
+  *-
+  */
+ 
+ #include "postgres.h"
+ 
+ #include "nodes/nodes.h"
+ #include "nodes/replnodes.h"
+ #include "replication/walsender.h"
+ 
+ /* Result of the parsing is returned here */
+ Node *replication_parse_result;
+ 
+ /* Location tracking support --- simpler than bison's default */
+ #define YYLLOC_DEFAULT(Current, Rhs, N) \
+ 	do { \
+ 		if (N) \
+ 			(Current) = (Rhs)[1]; \
+ 		else \
+ 			(Current) = (Rhs)[0]; \
+ 	} while (0)
+ 
+ /*
+  * Bison doesn't allocate anything that needs to live across parser calls,
+  * so we can easily have it use palloc instead of malloc.  This prevents
+  * memory leaks if we error out during parsing.  Note this only works with
+  * bison >= 2.0.  However, in bison 1.875 the default is to use alloca()
+  * if possible, so there's not really much problem anyhow, at least if
+  * you're building with gcc.
+  */
+ #define YYMALLOC palloc
+ #define YYFREE   pfree
+ 
+ #define parser_yyerror(msg)  replication_yyerror(msg, yyscanner)
+ #define parser_errposition(pos)  

Re: [HACKERS] Streaming base backups

2011-01-07 Thread Heikki Linnakangas

On 05.01.2011 15:54, Magnus Hagander wrote:

I've implemented a frontend for this in pg_streamrecv, based on the assumption
that we wanted to include this in bin/ for 9.1 - and that it seems like a
reasonable place to put it. This can obviously be moved elsewhere if we want to.
That code needs a lot more cleanup, but I wanted to make sure I got the backend
patch out for review quickly. You can find the current WIP branch for
pg_streamrecv on my github page at https://github.com/mhagander/pg_streamrecv,
in the branch baserecv. I'll be posting that as a separate patch once it's
been a bit more cleaned up (it does work now if you want to test it, though).


One more thing, now that I've played a bit with pg_streamrecv:

I find it strange that the data directory must exist when you call 
pg_streamrecv in base-backup mode. I would expect it to work like 
initdb, and create the directory if it doesn't exist.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-06 Thread Heikki Linnakangas

On 06.01.2011 00:27, Dimitri Fontaine wrote:

Magnus Hagander mag...@hagander.net  writes:

What about pg_streamrecv | gzip > …, which has the big advantage of


That's part of what I meant with easier and more useful.


Well…


One thing to keep in mind is that if you do compression in libpq for the 
transfer, and gzip the tar file in the client, that's quite inefficient. 
You compress the data once in the server, decompress in the client, then 
compress it again in the client.  If you're going to write the backup to 
a compressed file, and you want to transfer it compressed to save 
bandwidth, you want to gzip it in the server to begin with.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-06 Thread Marti Raudsepp
On Wed, Jan 5, 2011 at 23:58, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:
 * Stefan mentioned it might be useful to put some
 posix_fadvise(POSIX_FADV_DONTNEED)
   in the process that streams all the files out. Seems useful, as long as 
 that
   doesn't kick them out of the cache *completely*, for other backends as 
 well.
   Do we know if that is the case?

 Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
 not already in SHM?

It's not much of an improvement. For pages that we already have in
shared memory, OS cache is mostly useless. OS cache matters for pages
that *aren't* in shared memory.

Regards,
Marti

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-06 Thread Magnus Hagander
On Wed, Jan 5, 2011 at 23:27, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:
 Magnus Hagander mag...@hagander.net writes:

 Well, I would guess that if you're streaming the WAL files in parallel
 while the base backup is taken, then you're able to have it all without
 archiving setup, and the server could still recycling them.

 Yes, this was mostly for the use-case of getting a single tarfile
 that you can actually use to restore from without needing the log
 archive at all.

 It also allows for a simpler kick-start procedure for preparing a
 standby, and allows to stop worrying too much about wal_keep_segments
 and archive servers.

 When does the standby launch its walreceiver? It would be extra-nice for
 the base backup tool to optionally continue streaming WALs until the
 standby starts doing it itself, so that wal_keep_segments is really
 deprecated.  No idea how feasible that is, though.

I think we're inventing a whole lot of complexity that may not
be necessary at all. Let's do it the simple way and see how far we can
get by with that one - we can always improve this for 9.2

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-06 Thread Heikki Linnakangas

On 05.01.2011 15:54, Magnus Hagander wrote:

Attached is an updated streaming base backup patch, based off the work
that Heikki started.
...
I've implemented a frontend for this in pg_streamrecv, based on the assumption
that we wanted to include this in bin/ for 9.1 - and that it seems like a
reasonable place to put it. This can obviously be moved elsewhere if we want to.


Hmm, is there any point in keeping the two functionalities in the same 
binary, taking the base backup and streaming WAL to an archive 
directory? Looks like the only common option between the two modes is 
passing the connection string, and the verbose flag. A separate 
pg_basebackup binary would probably make more sense.



That code needs a lot more cleanup, but I wanted to make sure I got the backend
patch out for review quickly. You can find the current WIP branch for
pg_streamrecv on my github page at https://github.com/mhagander/pg_streamrecv,
in the branch baserecv. I'll be posting that as a separate patch once it's
been a bit more cleaned up (it does work now if you want to test it, though).


Looks like pg_streamrecv creates the pg_xlog and pg_tblspc directories, 
because they're not included in the streamed tar. Wouldn't it be better 
to include them in the tar as empty directories at the server-side? 
Otherwise if you write the tar file to disk and untar it later, you have 
to manually create them.


It would be nice to have an option in pg_streamrecv to specify the 
backup label to use.


An option to stream the tar to stdout instead of a file would be very 
handy too, so that you could pipe it directly to gzip for example. I 
realize you get multiple tar files if tablespaces are used, but even if 
you just throw an error in that case, it would be handy.



* Suggestion from Heikki: perhaps at some point we're going to need a full
   bison grammar for walsender commands.


Maybe we should at least start using the lexer; we're not quite there to 
need a full-blown grammar yet, but even a lexer might help.



BTW, looking at the WAL-streaming side of pg_streamrecv, if you start it 
from scratch with an empty target directory, it needs to connect to 
the postgres database, to run pg_current_xlog_location(), and then 
reconnect in replication mode. That's a bit awkward, there might not be 
a postgres database, and even if there is, you might not have the 
permission to connect to it. It would be much better to have a variant 
of the START_REPLICATION command at the server-side that begins 
streaming from the current location. Maybe just by leaving out the 
start-location parameter.
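
For reference, the dance looks roughly like this with libpq (a sketch;
connection strings and error handling are simplified):

/*
 * Sketch of the two-step dance: ask a regular connection for the current
 * WAL location, then reconnect in replication mode and start streaming
 * from it.
 */
#include <stdio.h>
#include "libpq-fe.h"

static int
start_streaming_from_current(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	PGresult   *res;
	char		command[128];

	if (PQstatus(conn) != CONNECTION_OK)
		return -1;
	res = PQexec(conn, "SELECT pg_current_xlog_location()");
	if (PQresultStatus(res) != PGRES_TUPLES_OK)
		return -1;
	snprintf(command, sizeof(command),
			 "START_REPLICATION %s", PQgetvalue(res, 0, 0));
	PQclear(res);
	PQfinish(conn);

	/* second connection, this time as a walsender client */
	conn = PQconnectdb("dbname=replication replication=true");
	if (PQstatus(conn) != CONNECTION_OK)
		return -1;
	res = PQexec(conn, command);
	/* ... from here on, consume COPY data with PQgetCopyData() ... */
	PQclear(res);
	PQfinish(conn);
	return 0;
}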


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-06 Thread Magnus Hagander
On Thu, Jan 6, 2011 at 23:57, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 05.01.2011 15:54, Magnus Hagander wrote:

 Attached is an updated streaming base backup patch, based off the work
 that Heikki started.
 ...
 I've implemented a frontend for this in pg_streamrecv, based on the
 assumption
 that we wanted to include this in bin/ for 9.1 - and that it seems like a
 reasonable place to put it. This can obviously be moved elsewhere if we
 want to.

 Hmm, is there any point in keeping the two functionalities in the same
 binary, taking the base backup and streaming WAL to an archive directory?
 Looks like the only common option between the two modes is passing the
 connection string, and the verbose flag. A separate pg_basebackup binary
 would probably make more sense.

Yeah, once I broke things apart for better readability, I started
leaning in that direction as well.

However, if you consider the things that Dimitri mentioned about
streaming at the same time as downloading, having them in the same one
would make more sense. I don't think that's something for now,
though..


 That code needs a lot more cleanup, but I wanted to make sure I got the
 backend
 patch out for review quickly. You can find the current WIP branch for
 pg_streamrecv on my github page at
 https://github.com/mhagander/pg_streamrecv,
 in the branch baserecv. I'll be posting that as a separate patch once
 it's
 been a bit more cleaned up (it does work now if you want to test it,
 though).

 Looks like pg_streamrecv creates the pg_xlog and pg_tblspc directories,
 because they're not included in the streamed tar. Wouldn't it be better to
 include them in the tar as empty directories at the server-side? Otherwise
 if you write the tar file to disk and untar it later, you have to manually
 create them.

Yeah, good point. Originally, the tar code (your tar code, btw :P)
didn't create *any* directories, so I stuck it in there. I agree it
should be moved to the backend patch now.


 It would be nice to have an option in pg_streamrecv to specify the backup
 label to use.

Agreed.


 An option to stream the tar to stdout instead of a file would be very handy
 too, so that you could pipe it directly to gzip for example. I realize you
 get multiple tar files if tablespaces are used, but even if you just throw
 an error in that case, it would be handy.

Makes sense.


 * Suggestion from Heikki: perhaps at some point we're going to need a full
   bison grammar for walsender commands.

 Maybe we should at least start using the lexer; we're not quite there to
 need a full-blown grammar yet, but even a lexer might help.

Might. I don't speak flex very well, so I'm not really sure what that
would mean.


 BTW, looking at the WAL-streaming side of pg_streamrecv, if you start it
 from scratch with an empty target directory, it needs to connect to
 postgres database, to run pg_current_xlog_location(), and then reconnect
 in replication mode. That's a bit awkward, there might not be a postgres
 database, and even if there is, you might not have the permission to connect
 to it. It would be much better to have a variant of the START_REPLICATION
 command at the server-side that begins streaming from the current location.
 Maybe just by leaving out the start-location parameter.

Agreed. That part is unchanged from the one that runs against 9.0
though, where that wasn't a possibility. But adding something like
that to the walsender in 9.1 would be good.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-06 Thread Cédric Villemain
2011/1/5 Magnus Hagander mag...@hagander.net:
 On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:
 Magnus Hagander mag...@hagander.net writes:
 * Stefan mentioned it might be useful to put some
 posix_fadvise(POSIX_FADV_DONTNEED)
   in the process that streams all the files out. Seems useful, as long as 
 that
   doesn't kick them out of the cache *completely*, for other backends as 
 well.
   Do we know if that is the case?

 Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
 not already in SHM?

 I think that's way more complex than we want to go here.


DONTNEED will remove the block from the OS buffer every time.

It should not be that hard to implement a snapshot (it needs mincore())
and to restore previous state. I don't know how basebackup is
performed exactly...so perhaps I am wrong.

posix_fadvise support is already in postgresql core...we can start by
just doing a snapshot of the files before starting, or at some point
in the basebackup, it will need only 256kB per GB of data...
-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-06 Thread Simon Riggs
On Wed, 2011-01-05 at 14:54 +0100, Magnus Hagander wrote:

 The basic implementation is: Add a new command to the replication mode called
 BASE_BACKUP, that will initiate a base backup, stream the contents (in tar
 compatible format) of the data directory and all tablespaces, and then end
 the base backup in a single operation.

I'm a little dubious of the performance of that approach for some users,
though it does seem a popular idea.

One very useful feature will be some way of confirming the number and
size of files to transfer, so that the base backup client can find out
the progress.

It would also be good to avoid writing a backup_label file at all on the
master, so there was no reason why multiple concurrent backups could not
be taken. The current coding allows for the idea that the start and stop
might be in different sessions, whereas here we know we are in one
session.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-05 Thread Stefan Kaltenbrunner

On 01/05/2011 02:54 PM, Magnus Hagander wrote:
[..]

Some remaining thoughts and must-dos:

* Compression: Do we want to be able to compress the backups server-side? Or
   defer that to whenever we get compression in libpq? (you can still tunnel it
   through for example SSH to get compression if you want to) My thinking is
   defer it.
* Compression: We could still implement compression of the tar files in
   pg_streamrecv (probably easier, possibly more useful?)


hmm compression would be nice but I don't think it is required for this 
initial implementation.




* Windows support (need to implement readlink)
* Tar code is copied from pg_dump and modified. Should we try to factor it out
   into port/? There are changes in the middle of it so it can't be done with
   the current calling points, it would need a refactor. I think it's not worth
   it, given how simple it is.

Improvements I want to add, but that aren't required for basic operation:

* Stefan mentioned it might be useful to put some
posix_fadvise(POSIX_FADV_DONTNEED)
   in the process that streams all the files out. Seems useful, as long as that
   doesn't kick them out of the cache *completely*, for other backends as well.
   Do we know if that is the case?


well my main concern is that a basebackup done that way might blow up 
the buffercache of the OS, causing temporary performance issues.
This might be more serious with an in-core solution than with what 
people use now, because a number of backup tools (like some 
of the commercial backup solutions) employ various tricks to avoid that.

One interesting tidbit i found was:

http://insights.oetiker.ch/linux/fadvise/

which is very Linux specific but interesting nevertheless...




Stefan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-05 Thread Dimitri Fontaine
Magnus Hagander mag...@hagander.net writes:
 Attached is an updated streaming base backup patch, based off the work

Thanks! :)

 * Compression: Do we want to be able to compress the backups server-side? Or
   defer that to whenever we get compression in libpq? (you can still tunnel it
   through for example SSH to get compression if you want to) My thinking is
   defer it.

Compression in libpq would be a nice way to solve it, later.

 * Compression: We could still implement compression of the tar files in
   pg_streamrecv (probably easier, possibly more useful?)

What about pg_streamrecv | gzip > …, which has the big advantage of
being friendly to *any* compression command line tool, whatever patents
and licenses?

 * Stefan mentioned it might be useful to put some
 posix_fadvise(POSIX_FADV_DONTNEED)
   in the process that streams all the files out. Seems useful, as long as that
   doesn't kick them out of the cache *completely*, for other backends as well.
   Do we know if that is the case?

Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
not already in SHM?

 * include all the necessary WAL files in the backup. This way we could 
 generate
   a tar file that would work on its own - right now, you still need to set up
   log archiving (or use streaming repl) to get the remaining logfiles from the
   master. This is fine for replication setups, but not for backups.
   This would also require us to block recycling of WAL files during the 
 backup,
   of course.

Well, I would guess that if you're streaming the WAL files in parallel
while the base backup is taken, then you're able to have it all without
archiving setup, and the server could still recycling them.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-05 Thread Magnus Hagander
On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:
 Magnus Hagander mag...@hagander.net writes:
 Attached is an updated streaming base backup patch, based off the work

 Thanks! :)

 * Compression: Do we want to be able to compress the backups server-side?
   Or defer that to whenever we get compression in libpq? (you can still
   tunnel it through for example SSH to get compression if you want to)
   My thinking is defer it.

 Compression in libpq would be a nice way to solve it, later.

Yeah, I'm pretty much set on postponing that one.


 * Compression: We could still implement compression of the tar files in
   pg_streamrecv (probably easier, possibly more useful?)

 What about pg_streamrecv | gzip > …, which has the big advantage of
 being friendly to *any* command-line compression tool, whatever the
 patents and licenses?

That's part of what I meant by "easier" and "more useful".

Right now though, pg_streamrecv will output one tar file for each
tablespace, so you can't get it on stdout. But that can be changed of
course. The easiest step 1 is to just use gzopen() from zlib on the
files and use the same code as now :-)
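
For illustration, a minimal sketch of that "step 1" (assumed output file
name, and reading plain stdin; the real tool would be consuming the libpq
COPY stream instead):

    #include <stdio.h>
    #include <zlib.h>

    int
    main(void)
    {
        char    buf[8192];
        size_t  n;
        gzFile  out = gzopen("base.tar.gz", "wb9");  /* compression level 9 */

        if (out == NULL)
            return 1;

        /* Same write loop as before, just routed through zlib. */
        while ((n = fread(buf, 1, sizeof(buf), stdin)) > 0)
            gzwrite(out, buf, (unsigned) n);

        gzclose(out);
        return 0;
    }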


 * Stefan mentioned it might be useful to put some
   posix_fadvise(POSIX_FADV_DONTNEED) calls in the process that streams all
   the files out. Seems useful, as long as that doesn't kick them out of the
   cache *completely*, for other backends as well. Do we know if that is the
   case?

 Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
 not already in SHM?

I think that's way more complex than we want to go here.


 * include all the necessary WAL files in the backup. This way we could
   generate a tar file that would work on its own - right now, you still
   need to set up log archiving (or use streaming repl) to get the
   remaining logfiles from the master. This is fine for replication
   setups, but not for backups. This would also require us to block
   recycling of WAL files during the backup, of course.

 Well, I would guess that if you're streaming the WAL files in parallel
 while the base backup is taken, then you're able to have it all without
 an archiving setup, and the server could still recycle them.

Yes, this was mostly for the use-case of getting a single tarfile
that you can actually use to restore from without needing the log
archive at all.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] Streaming base backups

2011-01-05 Thread Dimitri Fontaine
Magnus Hagander mag...@hagander.net writes:
 Compression in libpq would be a nice way to solve it, later.

 Yeah, I'm pretty much set on postponing that one.

+1, in case it was not clear for whoever's counting the votes :)

 What about pg_streamrecv | gzip > …, which has the big advantage of

 That's part of what I meant with easier and more useful.

Well…

 Right now though, pg_streamrecv will output one tar file for each
 tablespace, so you can't get it on stdout. But that can be changed of
 course. The easiest step 1 is to just use gzopen() from zlib on the
 files and use the same code as now :-)

Oh if integrating it is easier :)

 Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
 not already in SHM?

 I think that's way more complex than we want to go here.

Yeah.

 Well, I would guess that if you're streaming the WAL files in parallel
 while the base backup is taken, then you're able to have it all without
 an archiving setup, and the server could still recycle them.

 Yes, this was mostly for the use-case of getting a single tarfile
 that you can actually use to restore from without needing the log
 archive at all.

It also allows for a simpler kick-start procedure for preparing a
standby, and lets you stop worrying too much about wal_keep_segments
and archive servers.

When does the standby launch its walreceiver? It would be extra-nice for
the base backup tool to optionally continue streaming WAL until the
standby starts doing it itself, so that wal_keep_segments could really be
deprecated.  No idea how feasible that is, though.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
