Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-06 Thread Marco Nenciarini
On 03/10/14 23:12, Andres Freund wrote:
 On 2014-10-03 17:31:45 +0200, Marco Nenciarini wrote:
 I've updated the wiki page
 https://wiki.postgresql.org/wiki/Incremental_backup following the result
 of discussion on hackers.

 Compared to first version, we switched from a timestamp+checksum based
 approach to one based on LSN.

 This patch adds an option to pg_basebackup and to replication protocol
 BASE_BACKUP command to generate a backup_profile file. It is almost
 useless by itself, but it is the foundation on which we will build the
 file based incremental backup (and hopefully a block based incremental
 backup after it).

 Any comment will be appreciated. In particular I'd appreciate comments
 on correctness of relnode files detection and LSN extraction code.
 
 Can you describe the algorithm you implemented in words?
 


Here is the relnode file detection algorithm:

I've added a has_relfiles parameter to the sendDir function. If
has_relfiles is true every file in the directory is tested against the
validateRelfilenodeName function. If the response is true, the maxLSN
value is computed for the file.

The sendDir function is called with has_relfiles=true by the sendTablespace
function, and by sendDir itself when it recurses into a subdirectory:

 * if has_relfiles is true
 * if we are recursing into a ./global or ./base directory
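
The recursion rule above can be written as a small predicate. This is only a sketch: sketch_child_has_relfiles is a hypothetical helper, not a function from the patch, and it assumes the child path is given relative to the data directory, as in "./base".

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): decide the has_relfiles
 * value to pass when sendDir recurses into child_path.
 */
static bool
sketch_child_has_relfiles(bool parent_has_relfiles, const char *child_path)
{
    /* Once inside a relation directory, every subdirectory inherits it. */
    if (parent_has_relfiles)
        return true;

    /* Otherwise only ./global and ./base switch the flag on. */
    return strcmp(child_path, "./global") == 0 ||
           strcmp(child_path, "./base") == 0;
}
```

With this rule, a directory such as ./pg_xlog is never scanned for relation files, while everything below ./base is.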

The validateRelfilenodeName function has been taken from the
pg_computemaxlsn patch.

It's short enough to be pasted here:

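static bool
validateRelfilenodename(char *name)
{
    int pos = 0;

    /* relfilenode: leading digits */
    while ((name[pos] >= '0') && (name[pos] <= '9'))
        pos++;

    /* optional fork suffix, e.g. "_fsm" or "_vm" */
    if (name[pos] == '_')
    {
        pos++;
        while ((name[pos] >= 'a') && (name[pos] <= 'z'))
            pos++;
    }
    /* optional segment number, e.g. ".1" */
    if (name[pos] == '.')
    {
        pos++;
        while ((name[pos] >= '0') && (name[pos] <= '9'))
            pos++;
    }

    /* valid only if the whole name has been consumed */
    if (name[pos] == 0)
        return true;
    return false;
}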
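
A quick way to review the name check is to run it against a handful of file names. The sketch below re-implements the same scan under a hypothetical name, sketch_is_relfile_name, so that it compiles on its own; the sample names mentioned afterwards are illustrative only.

```c
#include <stdbool.h>

/* Standalone copy of the scan: digits, optional _fork, optional .segment. */
static bool
sketch_is_relfile_name(const char *name)
{
    int pos = 0;

    while (name[pos] >= '0' && name[pos] <= '9')
        pos++;
    if (name[pos] == '_')
    {
        pos++;
        while (name[pos] >= 'a' && name[pos] <= 'z')
            pos++;
    }
    if (name[pos] == '.')
    {
        pos++;
        while (name[pos] >= '0' && name[pos] <= '9')
            pos++;
    }
    /* Accept only if nothing is left over. */
    return name[pos] == '\0';
}
```

Names such as "16384", "16384_fsm", "16384.1" and "16384_vm.2" are accepted, while "PG_VERSION" and "pg_internal.init" are rejected. Because every part of the pattern is optional, the scan also accepts an empty string and a bare suffix such as "_fsm"; whether that matters depends on which directories it is applied to.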


To compute the maxLSN for a file, I rely on the file being sent in
TAR_SEND_SIZE chunks (32 kB), a size that is always a multiple of the block
size. I've added the following code inside the send loop:


+   char *page;
+
+   /* Scan every page to find the max file LSN */
+   for (page = buf; page < buf + (off_t) cnt; page += (off_t) BLCKSZ) {
+       pagelsn = PageGetLSN(page);
+       if (filemaxlsn < pagelsn)
+           filemaxlsn = pagelsn;
+   }
+
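
For readers without the backend headers at hand, the same scan can be tried outside the server. The following is a standalone sketch, not the patch code: it assumes 8192-byte blocks and that the page LSN sits in the first 8 bytes of each page as two native-endian 32-bit halves (high half first). All sketch_ names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ 8192          /* assumed block size */

typedef uint64_t sketch_lsn;

/* Read the LSN from the start of a page (assumed layout, see lead-in). */
static sketch_lsn
sketch_page_lsn(const char *page)
{
    uint32_t hi, lo;

    memcpy(&hi, page, sizeof hi);
    memcpy(&lo, page + sizeof hi, sizeof lo);
    return ((sketch_lsn) hi << 32) | lo;
}

/*
 * Fold one chunk of cnt bytes into the running maximum.  Only whole
 * pages are examined; a trailing partial page is ignored.
 */
static sketch_lsn
sketch_chunk_max_lsn(const char *buf, size_t cnt, sketch_lsn maxlsn)
{
    size_t off;

    for (off = 0; off + SKETCH_BLCKSZ <= cnt; off += SKETCH_BLCKSZ)
    {
        sketch_lsn lsn = sketch_page_lsn(buf + off);

        if (maxlsn < lsn)
            maxlsn = lsn;
    }
    return maxlsn;
}
```

The caller would start with a maximum of 0 and pass the result of each chunk into the next call, which mirrors how filemaxlsn is carried across iterations of the send loop.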

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it





Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-06 Thread Marco Nenciarini
On 04/10/14 08:35, Michael Paquier wrote:
 On Sat, Oct 4, 2014 at 12:31 AM, Marco Nenciarini
 marco.nenciar...@2ndquadrant.it wrote:
 Compared to first version, we switched from a timestamp+checksum based
 approach to one based on LSN.
 Cool.
 
 This patch adds an option to pg_basebackup and to replication protocol
 BASE_BACKUP command to generate a backup_profile file. It is almost
 useless by itself, but it is the foundation on which we will build the
 file based incremental backup (and hopefully a block based incremental
 backup after it).
 Hm. I am not convinced by the backup profile file. What's wrong with
 having a client send only an LSN position to get a set of files (or
 partial files filled with blocks) newer than the position given, and
 have the client do all the rebuild analysis?
 

The main problem I see is the following: how can a client detect a
truncated or removed file?

Regards,
Marco



Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-06 Thread Robert Haas
On Mon, Oct 6, 2014 at 8:59 AM, Marco Nenciarini
marco.nenciar...@2ndquadrant.it wrote:
 On 04/10/14 08:35, Michael Paquier wrote:
 On Sat, Oct 4, 2014 at 12:31 AM, Marco Nenciarini
 marco.nenciar...@2ndquadrant.it wrote:
 Compared to first version, we switched from a timestamp+checksum based
 approach to one based on LSN.
 Cool.

 This patch adds an option to pg_basebackup and to replication protocol
 BASE_BACKUP command to generate a backup_profile file. It is almost
 useless by itself, but it is the foundation on which we will build the
 file based incremental backup (and hopefully a block based incremental
 backup after it).
 Hm. I am not convinced by the backup profile file. What's wrong with
 having a client send only an LSN position to get a set of files (or
 partial files filled with blocks) newer than the position given, and
 have the client do all the rebuild analysis?


 The main problem I see is the following: how can a client detect a
 truncated or removed file?

When you take a differential backup, the server needs to send some
piece of information about every file so that the client can compare
that list against what it already has.  But a full backup does not
need to include similar information.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-06 Thread Marco Nenciarini
On 03/10/14 22:47, Robert Haas wrote:
 On Fri, Oct 3, 2014 at 12:08 PM, Marco Nenciarini
 marco.nenciar...@2ndquadrant.it wrote:
 On 03/10/14 17:53, Heikki Linnakangas wrote:
 If we're going to need a profile file - and I'm not convinced of that -
 is there any reason to not always include it in the backup?

 The main reason is to have a centralized list of files that need to be
 present. Without a profile, you have to insert some sort of placeholder
 for skipped files.
 
 Why do you need to do that?  And where do you need to do that?
 
 It seems to me that there are three interesting operations:
 
 1. Take a full backup.  Basically, we already have this.  In the
 backup label file, make sure to note the newest LSN guaranteed to be
 present in the backup.

Don't we already have it in START WAL LOCATION?

 
 2. Take a differential backup.  In the backup label file, note the LSN
 of the full backup to which the differential backup is relative, and the
 newest LSN guaranteed to be present in the differential backup.  The
 actual backup can consist of a series of 20-byte buffer tags, those
 being the exact set of blocks newer than the base-backup's
 latest-guaranteed-to-be-present LSN.  Each buffer tag is followed by
 an 8kB block of data.  If a relfilenode is truncated or removed, you
 need some way to indicate that in the backup; e.g. include a buffertag
 with forknum = -(forknum + 1) and blocknum = the new number of blocks,
 or InvalidBlockNumber if removed entirely.
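
The record layout proposed above can be sketched as a 20-byte tag with five 4-byte fields. The struct below is an illustration under that assumption; the SketchBufferTag name, its field names and the helper functions are hypothetical, and the all-ones value is used here as the "invalid block number" sentinel.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_INVALID_BLOCK 0xFFFFFFFFu  /* assumed "removed entirely" value */

/* Assumed 20-byte tag: tablespace, database, relation, fork, block. */
typedef struct SketchBufferTag
{
    uint32_t spcNode;
    uint32_t dbNode;
    uint32_t relNode;
    int32_t  forkNum;
    uint32_t blockNum;
} SketchBufferTag;

/*
 * Build a truncation marker: forknum = -(forknum + 1), blocknum = the new
 * length.  Pass SKETCH_INVALID_BLOCK to mark the file as removed.
 */
static SketchBufferTag
sketch_truncation_marker(SketchBufferTag rel, uint32_t new_nblocks)
{
    rel.forkNum = -(rel.forkNum + 1);
    rel.blockNum = new_nblocks;
    return rel;
}

/* A negative fork number can never be a real fork, so it flags a marker. */
static bool
sketch_is_marker(const SketchBufferTag *tag)
{
    return tag->forkNum < 0;
}

/* Recover the original fork number from a marker. */
static int32_t
sketch_marker_fork(const SketchBufferTag *tag)
{
    return -tag->forkNum - 1;
}
```

The encoding is reversible because fork numbers start at 0: fork 0 becomes -1, fork 1 becomes -2, and so on, so ordinary tags and markers never collide.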

To have a working backup you need to ship each block which is newer than
latest-guaranteed-to-be-present in full backup and not newer than
latest-guaranteed-to-be-present in the current backup. Also, as a
further optimization, you can think about not sending the empty space in
the middle of each page.
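
The "empty space in the middle of each page" is the gap between the pd_lower and pd_upper pointers in the page header. Below is a sketch of locating it, assuming the standard header layout with pd_lower at byte offset 12 and pd_upper at byte offset 14, a 24-byte fixed header and 8192-byte blocks; sketch_page_hole is a hypothetical helper, not backend code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ        8192  /* assumed block size */
#define SKETCH_PAGE_HDR_SIZE 24    /* assumed size of the fixed page header */

/*
 * Report the unused region of a page.  Returns false, with a zero-length
 * hole, when the header pointers do not look sane (for example on an
 * all-zero page), in which case the whole page should be sent.
 */
static bool
sketch_page_hole(const char *page, uint16_t *hole_off, uint16_t *hole_len)
{
    uint16_t lower, upper;

    memcpy(&lower, page + 12, sizeof lower);  /* assumed offset of pd_lower */
    memcpy(&upper, page + 14, sizeof upper);  /* assumed offset of pd_upper */

    if (lower >= SKETCH_PAGE_HDR_SIZE && upper > lower &&
        upper <= SKETCH_BLCKSZ)
    {
        *hole_off = lower;
        *hole_len = (uint16_t) (upper - lower);
        return true;
    }
    *hole_off = 0;
    *hole_len = 0;
    return false;
}
```

A sender would then ship the bytes before hole_off and the bytes after hole_off + hole_len, and the receiver would zero-fill the gap when rebuilding the page.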

My main concern here is about how postgres can remember that a
relfilenode has been deleted, in order to send the appropriate deletion
tag.

IMHO the easiest way is to send the full list of files along with the backup
and leave to the client the task of deleting unneeded files. The backup
profile has this purpose.

Moreover, I do not like the idea of using only a stream of blocks as the
actual differential backup, for the following reasons:

* AFAIK, with the current infrastructure, you cannot do a backup with a
block stream only. To have a valid backup you need many files for which
the concept of LSN doesn't apply.

* I don't like to have all the data from the various
tablespace/db/whatever all mixed in the same stream. I'd prefer to have
the blocks saved on a per file basis.

 
 3. Apply a differential backup to a full backup to create an updated
 full backup.  This is just a matter of scanning the full backup and
 the differential backup and applying the changes in the differential
 backup to the full backup.

 You might want combinations of these, like something that does 2+3 as
 a single operation, for efficiency, or a way to copy a full backup and
 apply a differential backup to it as you go.  But that's it, right?
 What else do you need?
 

Nothing else. Once we agree on the definition of the files involved and
on the protocol formats, only the actual coding remains.

Regards,
Marco



Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-06 Thread Robert Haas
On Mon, Oct 6, 2014 at 11:33 AM, Marco Nenciarini
marco.nenciar...@2ndquadrant.it wrote:
 1. Take a full backup.  Basically, we already have this.  In the
 backup label file, make sure to note the newest LSN guaranteed to be
 present in the backup.

 Don't we already have it in START WAL LOCATION?

Yeah, probably.  I was too lazy to go look for it, but that sounds
like the right thing.

 2. Take a differential backup.  In the backup label file, note the LSN
 of the full backup to which the differential backup is relative, and the
 newest LSN guaranteed to be present in the differential backup.  The
 actual backup can consist of a series of 20-byte buffer tags, those
 being the exact set of blocks newer than the base-backup's
 latest-guaranteed-to-be-present LSN.  Each buffer tag is followed by
 an 8kB block of data.  If a relfilenode is truncated or removed, you
 need some way to indicate that in the backup; e.g. include a buffertag
 with forknum = -(forknum + 1) and blocknum = the new number of blocks,
 or InvalidBlockNumber if removed entirely.

 To have a working backup you need to ship each block which is newer than
 latest-guaranteed-to-be-present in full backup and not newer than
 latest-guaranteed-to-be-present in the current backup. Also, as a
 further optimization, you can think about not sending the empty space in
 the middle of each page.

Right.  Or compressing the data.

 My main concern here is about how postgres can remember that a
 relfilenode has been deleted, in order to send the appropriate deletion
 tag.

You also need to handle truncation.

 IMHO the easiest way is to send the full list of files along with the backup
 and leave to the client the task of deleting unneeded files. The backup
 profile has this purpose.

 Moreover, I do not like the idea of using only a stream of blocks as the
 actual differential backup, for the following reasons:

 * AFAIK, with the current infrastructure, you cannot do a backup with a
 block stream only. To have a valid backup you need many files for which
 the concept of LSN doesn't apply.

 * I don't like to have all the data from the various
 tablespace/db/whatever all mixed in the same stream. I'd prefer to have
 the blocks saved on a per file basis.

OK, that makes sense.  But you still only need the file list when
sending a differential backup, not when sending a full backup.  So
maybe a differential backup looks like this:

- Ship a table-of-contents file with a list of relation files currently
present and the length of each in blocks.
- For each block that's been modified since the original backup, ship
a file called delta_<original file name> which is of the form <block
number><changed block contents> [...].
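
Applying such a delta file is a simple loop over fixed-size records. The sketch below works on in-memory buffers and assumes each record is a native-endian 32-bit block number followed by one 8192-byte block; the record encoding and the sketch_apply_delta name are assumptions made for illustration, since the thread does not fix them.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ 8192   /* assumed block size */

/*
 * Overwrite blocks of image (nblocks long) with the records in delta.
 * Returns the number of blocks applied, or -1 if the delta is malformed
 * or refers to a block beyond the end of the image.  Blocks applied
 * before an error is detected are left in place.
 */
static int
sketch_apply_delta(char *image, uint32_t nblocks,
                   const char *delta, size_t delta_len)
{
    const size_t reclen = sizeof(uint32_t) + SKETCH_BLCKSZ;
    size_t       off;
    int          applied = 0;

    if (delta_len % reclen != 0)
        return -1;

    for (off = 0; off < delta_len; off += reclen)
    {
        uint32_t blkno;

        memcpy(&blkno, delta + off, sizeof blkno);
        if (blkno >= nblocks)
            return -1;
        memcpy(image + (size_t) blkno * SKETCH_BLCKSZ,
               delta + off + sizeof blkno, SKETCH_BLCKSZ);
        applied++;
    }
    return applied;
}
```

The table-of-contents file supplies nblocks, so the caller can grow or truncate the image to the right length before applying the records.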



Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-06 Thread Marco Nenciarini
On 06/10/14 16:51, Robert Haas wrote:
 On Mon, Oct 6, 2014 at 8:59 AM, Marco Nenciarini
 marco.nenciar...@2ndquadrant.it wrote:
 On 04/10/14 08:35, Michael Paquier wrote:
 On Sat, Oct 4, 2014 at 12:31 AM, Marco Nenciarini
 marco.nenciar...@2ndquadrant.it wrote:
 Compared to first version, we switched from a timestamp+checksum based
 approach to one based on LSN.
 Cool.

 This patch adds an option to pg_basebackup and to replication protocol
 BASE_BACKUP command to generate a backup_profile file. It is almost
 useless by itself, but it is the foundation on which we will build the
 file based incremental backup (and hopefully a block based incremental
 backup after it).
 Hm. I am not convinced by the backup profile file. What's wrong with
 having a client send only an LSN position to get a set of files (or
 partial files filled with blocks) newer than the position given, and
 have the client do all the rebuild analysis?


 The main problem I see is the following: how can a client detect a
 truncated or removed file?
 
 When you take a differential backup, the server needs to send some
 piece of information about every file so that the client can compare
 that list against what it already has.  But a full backup does not
 need to include similar information.
 

I agree that a full backup does not need to include a profile.

I've added the option to require the profile even for a full backup, as
it can be useful for backup software. We could remove the option and
build the profile only during incremental backups, if required. However,
I would like to avoid having to scan the whole backup to know the size of
the recovered data directory, hence the backup profile.

Regards,
Marco



Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-06 Thread Robert Haas
On Mon, Oct 6, 2014 at 11:51 AM, Marco Nenciarini
marco.nenciar...@2ndquadrant.it wrote:
 I agree that a full backup does not need to include a profile.

 I've added the option to require the profile even for a full backup, as
 it can be useful for backup software. We could remove the option and
 build the profile only during incremental backups, if required. However,
 I would like to avoid having to scan the whole backup to know the size of
 the recovered data directory, hence the backup profile.

That doesn't seem to be buying you much.  Calling stat() on every file
in a directory tree is a pretty cheap operation.



Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-06 Thread Gabriele Bartolini
Hello,

2014-10-06 17:51 GMT+02:00 Marco Nenciarini marco.nenciar...@2ndquadrant.it:

 I agree that a full backup does not need to include a profile.

 I've added the option to require the profile even for a full backup, as
 it can be useful for backup software. We could remove the option and
 build the profile only during incremental backups, if required. However,
 I would like to avoid having to scan the whole backup to know the size of
 the recovered data directory, hence the backup profile.


I really like this approach.

I think we should leave users the ability to ship a profile file even for
a full backup (disabled by default).

Thanks,
Gabriele


Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-06 Thread Marco Nenciarini
On 06/10/14 17:55, Robert Haas wrote:
 On Mon, Oct 6, 2014 at 11:51 AM, Marco Nenciarini
 marco.nenciar...@2ndquadrant.it wrote:
 I agree that a full backup does not need to include a profile.

 I've added the option to require the profile even for a full backup, as
 it can be useful for backup software. We could remove the option and
 build the profile only during incremental backups, if required. However,
 I would like to avoid having to scan the whole backup to know the size of
 the recovered data directory, hence the backup profile.
 
 That doesn't seem to be buying you much.  Calling stat() on every file
 in a directory tree is a pretty cheap operation.
 

For an incremental backup that is not true. You have to read the delta
file to know the final size. You can optimize this by putting the
information in the first few bytes, but with the compressed tar format
you will need to scan the whole archive.

Regards,
Marco



Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-06 Thread Heikki Linnakangas

On 10/06/2014 06:33 PM, Marco Nenciarini wrote:

On 03/10/14 22:47, Robert Haas wrote:

2. Take a differential backup.  In the backup label file, note the LSN
of the full backup to which the differential backup is relative, and the
newest LSN guaranteed to be present in the differential backup.  The
actual backup can consist of a series of 20-byte buffer tags, those
being the exact set of blocks newer than the base-backup's
latest-guaranteed-to-be-present LSN.  Each buffer tag is followed by
an 8kB block of data.  If a relfilenode is truncated or removed, you
need some way to indicate that in the backup; e.g. include a buffertag
with forknum = -(forknum + 1) and blocknum = the new number of blocks,
or InvalidBlockNumber if removed entirely.


To have a working backup you need to ship each block which is newer than
latest-guaranteed-to-be-present in full backup and not newer than
latest-guaranteed-to-be-present in the current backup. Also, as a
further optimization, you can think about not sending the empty space in
the middle of each page.

My main concern here is about how postgres can remember that a
relfilenode has been deleted, in order to send the appropriate deletion
tag.

IMHO the easiest way is to send the full list of files along with the backup
and leave to the client the task of deleting unneeded files. The backup
profile has this purpose.


Right, but the server doesn't need to send a separate backup profile 
file for that. Rather, anything that the server *didn't* send, should be 
deleted.
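
On the client side, "anything the server didn't send should be deleted" amounts to a set difference between the files already in the old backup and the files named in the new stream. A minimal sketch, assuming both lists are available as arrays of path strings; sketch_files_to_delete is a hypothetical helper, and the linear search is chosen only to keep it short.

```c
#include <stddef.h>
#include <string.h>

/*
 * Collect into to_delete every name in local[] that is absent from
 * sent[].  Returns the number of names collected; to_delete must have
 * room for nlocal entries.
 */
static size_t
sketch_files_to_delete(const char *const *local, size_t nlocal,
                       const char *const *sent, size_t nsent,
                       const char **to_delete)
{
    size_t i, j, n = 0;

    for (i = 0; i < nlocal; i++)
    {
        int found = 0;

        for (j = 0; j < nsent && !found; j++)
            found = (strcmp(local[i], sent[j]) == 0);
        if (!found)
            to_delete[n++] = local[i];
    }
    return n;
}
```

For large data directories a sorted list or a hash table would be the obvious replacement for the nested loop.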


I think the missing piece in this puzzle is that even for unmodified 
blocks, the server should send a note saying the blocks were present, 
but not modified. So for each file present in the server, the server 
sends a block stream. For each block, it sends either the full block 
contents, if it was modified, or a simple indicator that it was not 
modified.


There's a downside to this, though. The client has to read the whole 
stream, before it knows which files were present. So when applying a 
block stream directly over an old backup, the client cannot delete files 
until it has applied all the other changes. That needs more 
disk space. With a separate profile file that's sent *before* the rest 
of the backup, you could delete the obsolete files first. But that's not 
a very big deal. I would suggest that you leave out the profile file in 
the first version, and add it as an optimization later, if needed.



Moreover, I do not like the idea of using only a stream of blocks as the
actual differential backup, for the following reasons:

* AFAIK, with the current infrastructure, you cannot do a backup with a
block stream only. To have a valid backup you need many files for which
the concept of LSN doesn't apply.


Those should be sent in whole. At least in the first version. The 
non-relation files are small compared to relation files, so it's not too 
bad to just include them in full.



3. Apply a differential backup to a full backup to create an updated
full backup.  This is just a matter of scanning the full backup and
the differential backup and applying the changes in the differential
backup to the full backup.

You might want combinations of these, like something that does 2+3 as
a single operation, for efficiency, or a way to copy a full backup and
apply a differential backup to it as you go.  But that's it, right?
What else do you need?


Nothing else. Once we agree on the definition of the files involved and
on the protocol formats, only the actual coding remains.


BTW, regarding the protocol, I have an idea. Rather than invent a whole 
new file format to represent the modified blocks, can we reuse some 
existing binary diff file format? For example, the VCDIFF format (RFC 
3284). For each unmodified block, the server would send a vcdiff COPY 
instruction, to copy the block from the old backup, and for a modified 
block, the server would send an ADD instruction, with the new block 
contents. The VCDIFF file format is quite flexible, but we would only 
use a small subset of it. I believe that subset would be just as easy to 
generate in the backend as a custom file format, but you could then use 
an external tool (xdelta3, open-vcdiff) to apply the diff manually, in 
case of emergency. In essence, the server would send a tar stream as 
usual, but for each relation file, it would send a VCDIFF file with name 
<relfilenode>.vcdiff instead.
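
The COPY/ADD idea can be illustrated without going into the VCDIFF byte encoding. The sketch below only plans the logical instruction sequence for one relation file: a block whose LSN is newer than the base backup's LSN becomes part of an ADD, anything else part of a COPY, and adjacent blocks with the same decision are merged. It does not produce a real VCDIFF file, and every name in it is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { SK_COPY, SK_ADD } SketchOp;

/* One logical instruction covering nblocks blocks from first_block. */
typedef struct SketchInstr
{
    SketchOp op;
    uint32_t first_block;
    uint32_t nblocks;
} SketchInstr;

/*
 * Plan instructions for a file whose per-block LSNs are in page_lsn[].
 * out must have room for nblocks entries.  Returns the number of
 * instructions written.
 */
static size_t
sketch_plan(const uint64_t *page_lsn, uint32_t nblocks,
            uint64_t base_lsn, SketchInstr *out)
{
    size_t   n = 0;
    uint32_t b;

    for (b = 0; b < nblocks; b++)
    {
        SketchOp op = (page_lsn[b] > base_lsn) ? SK_ADD : SK_COPY;

        if (n > 0 && out[n - 1].op == op)
            out[n - 1].nblocks++;       /* extend the current run */
        else
        {
            out[n].op = op;
            out[n].first_block = b;
            out[n].nblocks = 1;
            n++;
        }
    }
    return n;
}
```

For a mostly unchanged file this yields a handful of long COPY runs with short ADD runs in between, which is what keeps the differential small.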


- Heikki





Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-06 Thread Marco Nenciarini
On 06/10/14 17:50, Robert Haas wrote:
 On Mon, Oct 6, 2014 at 11:33 AM, Marco Nenciarini
 marco.nenciar...@2ndquadrant.it wrote:
 2. Take a differential backup.  In the backup label file, note the LSN
 of the full backup to which the differential backup is relative, and the
 newest LSN guaranteed to be present in the differential backup.  The
 actual backup can consist of a series of 20-byte buffer tags, those
 being the exact set of blocks newer than the base-backup's
 latest-guaranteed-to-be-present LSN.  Each buffer tag is followed by
 an 8kB block of data.  If a relfilenode is truncated or removed, you
 need some way to indicate that in the backup; e.g. include a buffertag
 with forknum = -(forknum + 1) and blocknum = the new number of blocks,
 or InvalidBlockNumber if removed entirely.

 To have a working backup you need to ship each block which is newer than
 latest-guaranteed-to-be-present in full backup and not newer than
 latest-guaranteed-to-be-present in the current backup. Also, as a
 further optimization, you can think about not sending the empty space in
 the middle of each page.
 
 Right.  Or compressing the data.

If we want to introduce compression on server side, I think that
compressing the whole tar stream would be more effective.

 
 My main concern here is about how postgres can remember that a
 relfilenode has been deleted, in order to send the appropriate deletion
 tag.
 
 You also need to handle truncation.

Yes, of course. The current backup profile contains the file size, and
it can be used to truncate the file to the right size.
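
Handling truncation from the recorded size could look like the following on a POSIX system. This is a sketch under assumptions: the profile is taken to provide a size in bytes for each file, and sketch_shrink_to_profile_size is a hypothetical helper. A file that is shorter than the recorded size is left alone, since the missing blocks would have to come from the backup itself.

```c
#define _XOPEN_SOURCE 700   /* for truncate() on strict compilers */
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Shrink path to profile_size bytes if it is currently longer.
 * Returns 1 if the file was truncated, 0 if nothing had to be done,
 * and -1 on error.
 */
static int
sketch_shrink_to_profile_size(const char *path, off_t profile_size)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    if (st.st_size <= profile_size)
        return 0;
    return (truncate(path, profile_size) == 0) ? 1 : -1;
}
```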

 IMHO the easiest way is to send the full list of files along with the backup
 and leave to the client the task of deleting unneeded files. The backup
 profile has this purpose.

 Moreover, I do not like the idea of using only a stream of blocks as the
 actual differential backup, for the following reasons:

 * AFAIK, with the current infrastructure, you cannot do a backup with a
 block stream only. To have a valid backup you need many files for which
 the concept of LSN doesn't apply.

 * I don't like to have all the data from the various
 tablespace/db/whatever all mixed in the same stream. I'd prefer to have
 the blocks saved on a per file basis.
 
 OK, that makes sense.  But you still only need the file list when
 sending a differential backup, not when sending a full backup.  So
 maybe a differential backup looks like this:
 
 - Ship a table-of-contents file with a list of relation files currently
 present and the length of each in blocks.

Having the size in bytes allows you to use the same format for non-block
files. Am I missing any advantage of having the size in blocks over
having the size in bytes?

Regards,
Marco



Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-06 Thread Heikki Linnakangas

On 10/06/2014 07:06 PM, Marco Nenciarini wrote:

On 06/10/14 17:55, Robert Haas wrote:

On Mon, Oct 6, 2014 at 11:51 AM, Marco Nenciarini
marco.nenciar...@2ndquadrant.it wrote:

I agree that a full backup does not need to include a profile.

I've added the option to require the profile even for a full backup, as
it can be useful for backup software. We could remove the option and
build the profile only during incremental backups, if required. However,
I would like to avoid having to scan the whole backup to know the size of
the recovered data directory, hence the backup profile.


That doesn't seem to be buying you much.  Calling stat() on every file
in a directory tree is a pretty cheap operation.



For an incremental backup that is not true. You have to read the delta
file to know the final size. You can optimize this by putting the
information in the first few bytes, but with the compressed tar format
you will need to scan the whole archive.


I think you're pretty much screwed with the compressed tar format 
anyway. The files in the .tar can be in different order in the 'diff' 
and the base backup, so you need to do random access anyway when you try 
to apply the diff. And random access isn't very easy with uncompressed tar 
format either. I think it would be acceptable to only support 
incremental backups with the directory format.


In hindsight, our compressed tar format was not a very good choice, 
because it makes random access impossible.


- Heikki





Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-06 Thread Robert Haas
On Mon, Oct 6, 2014 at 12:06 PM, Marco Nenciarini
marco.nenciar...@2ndquadrant.it wrote:
 On 06/10/14 17:55, Robert Haas wrote:
 On Mon, Oct 6, 2014 at 11:51 AM, Marco Nenciarini
 marco.nenciar...@2ndquadrant.it wrote:
 I agree that a full backup does not need to include a profile.

 I've added the option to require the profile even for a full backup, as
 it can be useful for backup software. We could remove the option and
 build the profile only during incremental backups, if required. However,
 I would like to avoid having to scan the whole backup to know the size of
 the recovered data directory, hence the backup profile.

 That doesn't seem to be buying you much.  Calling stat() on every file
 in a directory tree is a pretty cheap operation.


 For an incremental backup that is not true. You have to read the delta
 file to know the final size. You can optimize this by putting the
 information in the first few bytes, but with the compressed tar format
 you will need to scan the whole archive.

Well, sure.  But I never objected to sending a profile in a
differential backup.  I'm just objecting to sending one in a full
backup.  At least not without a more compelling reason why we need it.



Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-06 Thread Robert Haas
On Mon, Oct 6, 2014 at 12:18 PM, Marco Nenciarini
marco.nenciar...@2ndquadrant.it wrote:
 - Ship a table-of-contents file with a list of relation files currently
 present and the length of each in blocks.

 Having the size in bytes allows you to use the same format for non-block
 files. Am I missing any advantage of having the size in blocks over
 having the size in bytes?

Size in bytes would be fine, too.



Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-06 Thread Heikki Linnakangas

On 10/06/2014 07:00 PM, Gabriele Bartolini wrote:

Hello,

2014-10-06 17:51 GMT+02:00 Marco Nenciarini marco.nenciar...@2ndquadrant.it:



I agree that a full backup does not need to include a profile.

I've added the option to require the profile even for a full backup, as
it can be useful for backup software. We could remove the option and
build the profile only during incremental backups, if required. However,
I would like to avoid having to scan the whole backup to know the size of
the recovered data directory, hence the backup profile.


I really like this approach.

I think we should leave users the ability to ship a profile file even for
a full backup (disabled by default).


I don't see the point of making the profile optional. Why burden the 
user with that decision? I'm not convinced we need it at all, but if 
we're going to have a profile file, it should always be included.


- Heikki





Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-06 Thread David Fetter
On Mon, Oct 06, 2014 at 07:24:32PM +0300, Heikki Linnakangas wrote:
 On 10/06/2014 07:00 PM, Gabriele Bartolini wrote:
 Hello,
 
 2014-10-06 17:51 GMT+02:00 Marco Nenciarini marco.nenciar...@2ndquadrant.it:
 
 I agree that a full backup does not need to include a profile.
 
 I've added the option to require the profile even for a full backup, as
 it can be useful for backup software. We could remove the option and
 build the profile only during incremental backups, if required. However,
 I would like to avoid having to scan the whole backup to know the size of
 the recovered data directory, hence the backup profile.
 
 I really like this approach.
 
 I think we should leave users the ability to ship a profile file even for
 a full backup (disabled by default).
 
 I don't see the point of making the profile optional. Why burden the user
 with that decision? I'm not convinced we need it at all, but if we're going
 to have a profile file, it should always be included.

+1 for fewer user decisions, especially with something light-weight in
resource consumption like the profile.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-04 Thread Michael Paquier
On Sat, Oct 4, 2014 at 12:31 AM, Marco Nenciarini
marco.nenciar...@2ndquadrant.it wrote:
 Compared to first version, we switched from a timestamp+checksum based
 approach to one based on LSN.
Cool.

 This patch adds an option to pg_basebackup and to replication protocol
 BASE_BACKUP command to generate a backup_profile file. It is almost
 useless by itself, but it is the foundation on which we will build the
 file based incremental backup (and hopefully a block based incremental
 backup after it).
Hm. I am not convinced by the backup profile file. What's wrong with
having a client send only an LSN position to get a set of files (or
partial files filled with blocks) newer than the position given, and
have the client do all the rebuild analysis?

 Any comment will be appreciated. In particular I'd appreciate comments
 on correctness of relnode files detection and LSN extraction code.
Please include some documentation with the patch once you consider
that this is worth adding to a commit fest. This is clearly still WIP, so
it does not matter much, but that's something not to forget.

Regards,
-- 
Michael




Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-03 Thread Heikki Linnakangas

On 10/03/2014 06:31 PM, Marco Nenciarini wrote:

Hi Hackers,

I've updated the wiki page
https://wiki.postgresql.org/wiki/Incremental_backup following the result
of discussion on hackers.

Compared to first version, we switched from a timestamp+checksum based
approach to one based on LSN.

This patch adds an option to pg_basebackup and to replication protocol
BASE_BACKUP command to generate a backup_profile file. It is almost
useless by itself, but it is the foundation on which we will build the
file based incremental backup (and hopefully a block based incremental
backup after it).


I'd suggest jumping straight to block-based incremental backup. It's not 
significantly more complicated to implement, and if you implement both 
separately, then we'll have to support both forever. If you really need 
to, you can implement file-level diff as a special case, where the 
server sends all blocks in the file, if any of them have an LSN > the 
cutoff point. But I'm not sure if there's a point in that, once you have 
block-level support.


If we're going to need a profile file - and I'm not convinced of that - 
is there any reason to not always include it in the backup?



Any comment will be appreciated. In particular I'd appreciate comments
on correctness of relnode files detection and LSN extraction code.


I didn't look at it in detail, but one future problem comes to mind: 
Once you implement the server-side code that only sends a file if its 
LSN is higher than the cutoff point that the client gave, you'll have to 
scan the whole file first, to see if there are any blocks with a higher 
LSN. At least until you find the first such block. So with a file-level 
implementation of this sort, you'll have to scan all files twice, in the 
worst case.
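
[Editor's sketch] The scan described above could look roughly like the following. This is a minimal illustration, not code from the patch: it assumes the default 8 kB block size, assumes the page LSN sits in the first 8 bytes of each page as two native-endian 32-bit halves (high half first), and the names page_lsn and file_has_block_newer_than are hypothetical. It stops at the first block whose LSN is above the cutoff, which is why only unchanged files get scanned in full.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192             /* assumed: default PostgreSQL block size */

/*
 * Extract the LSN stored at the start of a page.  Assumption: the first
 * 8 bytes hold two native-endian uint32 values, high half then low half.
 */
static uint64_t
page_lsn(const unsigned char *page)
{
    uint32_t    hi;
    uint32_t    lo;

    memcpy(&hi, page, sizeof(hi));
    memcpy(&lo, page + sizeof(hi), sizeof(lo));
    return ((uint64_t) hi << 32) | lo;
}

/*
 * Return true as soon as one block with an LSN above 'cutoff' is found.
 * A file with no such block is read to the end before returning false.
 */
static bool
file_has_block_newer_than(FILE *fp, uint64_t cutoff)
{
    unsigned char page[BLCKSZ];

    while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
    {
        if (page_lsn(page) > cutoff)
            return true;        /* early exit: the file must be sent */
    }
    return false;               /* every block is at or below the cutoff */
}
```

A file-level sender would call this first and then read the file again to send it, which is the double scan mentioned above; a block-level sender can test and send each block in the same pass.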


- Heikki





Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-03 Thread Marco Nenciarini
On 03/10/14 17:53, Heikki Linnakangas wrote:
 If we're going to need a profile file - and I'm not convinced of that -
 is there any reason to not always include it in the backup?
 

The main reason is to have a centralized list of files that need to be
present. Without a profile, you have to insert some sort of placeholder
for skipped files. Moreover, the profile allows you to quickly know the
size of the recovered backup (by simply summing the individual sizes).
Another use could be to 'validate' the presence of all required files in
a backup.
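
[Editor's sketch] As a rough illustration of those two uses, assume the profile has already been parsed into an array of entries. The on-disk profile format is not described in this message, so the ProfileEntry struct and both function names below are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical in-memory form of one backup profile line. */
typedef struct ProfileEntry
{
    const char *path;           /* path relative to the data directory */
    uint64_t    size;           /* file size in bytes */
} ProfileEntry;

/* Size of the recovered data directory: the sum of all entry sizes. */
static uint64_t
profile_total_size(const ProfileEntry *entries, size_t n)
{
    uint64_t    total = 0;

    for (size_t i = 0; i < n; i++)
        total += entries[i].size;
    return total;
}

/* Count the profile entries whose file cannot be opened for reading. */
static size_t
profile_count_missing(const ProfileEntry *entries, size_t n)
{
    size_t      missing = 0;

    for (size_t i = 0; i < n; i++)
    {
        FILE       *fp = fopen(entries[i].path, "rb");

        if (fp == NULL)
            missing++;
        else
            fclose(fp);
    }
    return missing;
}
```

Neither function needs to read the backed-up files themselves, which is the point being made: the total size and the list of expected files come from the profile alone.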

 Any comment will be appreciated. In particular I'd appreciate comments
 on correctness of relnode files detection and LSN extraction code.
 
 I didn't look at it in detail, but one future problem comes to mind:
 Once you implement the server-side code that only sends a file if its
 LSN is higher than the cutoff point that the client gave, you'll have to
 scan the whole file first, to see if there are any blocks with a higher
 LSN. At least until you find the first such block. So with a file-level
 implementation of this sort, you'll have to scan all files twice, in the
 worst case.
 

It's true. To solve this you have to keep a central maxLSN directory,
but I think it introduces more issues than it solves.

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it





Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-03 Thread Claudio Freire
On Fri, Oct 3, 2014 at 1:08 PM, Marco Nenciarini
marco.nenciar...@2ndquadrant.it wrote:
 Any comment will be appreciated. In particular I'd appreciate comments
 on correctness of relnode files detection and LSN extraction code.

 I didn't look at it in detail, but one future problem comes to mind:
 Once you implement the server-side code that only sends a file if its
 LSN is higher than the cutoff point that the client gave, you'll have to
 scan the whole file first, to see if there are any blocks with a higher
 LSN. At least until you find the first such block. So with a file-level
 implementation of this sort, you'll have to scan all files twice, in the
 worst case.


 It's true. To solve this you have to keep a central maxLSN directory,
 but I think it introduces more issues than it solves.

I see that as a worthy optimization on the server side, regardless of
whether file or block-level backups are used, since it allows
efficient skipping of untouched segments (common for append-only
tables).

Still, it would be something to do once it already works (i.e., it's an
optimization).
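
[Editor's sketch] One way to picture the skipping: if the server kept a maxLSN per relation segment, choosing what to send would reduce to a filter over that map, with no need to open untouched segments. How such a map would be stored and kept up to date is not discussed in this thread; the SegmentMaxLsn struct and segments_to_send function are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry of a central maxLSN map: one per relation segment. */
typedef struct SegmentMaxLsn
{
    uint32_t    relNode;        /* relfilenode of the relation */
    uint32_t    segNo;          /* segment number within the relation */
    uint64_t    maxLsn;         /* highest page LSN in that segment */
} SegmentMaxLsn;

/*
 * Copy into 'out' the entries whose maxLsn is above 'cutoff' and return
 * how many were copied.  'out' must have room for n entries.
 */
static size_t
segments_to_send(const SegmentMaxLsn *map, size_t n, uint64_t cutoff,
                 SegmentMaxLsn *out)
{
    size_t      count = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (map[i].maxLsn > cutoff)
            out[count++] = map[i];
    }
    return count;
}
```

For an append-only table, only the last segment would normally pass the filter, so the earlier segments are never read.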




Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-03 Thread Bruce Momjian
On Fri, Oct  3, 2014 at 06:08:47PM +0200, Marco Nenciarini wrote:
  Any comment will be appreciated. In particular I'd appreciate comments
  on correctness of relnode files detection and LSN extraction code.
  
  I didn't look at it in detail, but one future problem comes to mind:
  Once you implement the server-side code that only sends a file if its
  LSN is higher than the cutoff point that the client gave, you'll have to
  scan the whole file first, to see if there are any blocks with a higher
  LSN. At least until you find the first such block. So with a file-level
  implementation of this sort, you'll have to scan all files twice, in the
  worst case.
  
 
 It's true. To solve this you have to keep a central maxLSN directory,
 but I think it introduces more issues than it solves.

The central issue Heikki is pointing out is whether we should implement
a file-based system if we already know that a block-based system will be
superior in every way.  I agree with that and agree that implementing
just file-based isn't worth it as we would have to support it forever.

So, in summary, if you target just a file-based system, be prepared that
it might be rejected.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-03 Thread Robert Haas
On Fri, Oct 3, 2014 at 12:08 PM, Marco Nenciarini
marco.nenciar...@2ndquadrant.it wrote:
 On 03/10/14 17:53, Heikki Linnakangas wrote:
 If we're going to need a profile file - and I'm not convinced of that -
 is there any reason to not always include it in the backup?

 The main reason is to have a centralized list of files that need to be
 present. Without a profile, you have to insert some sort of placeholder
 for skipped files.

Why do you need to do that?  And where do you need to do that?

It seems to me that there are three interesting operations:

1. Take a full backup.  Basically, we already have this.  In the
backup label file, make sure to note the newest LSN guaranteed to be
present in the backup.

2. Take a differential backup.  In the backup label file, note the LSN
of the full backup to which the differential backup is relative, and the
newest LSN guaranteed to be present in the differential backup.  The
actual backup can consist of a series of 20-byte buffer tags, those
being the exact set of blocks newer than the base-backup's
latest-guaranteed-to-be-present LSN.  Each buffer tag is followed by
an 8kB block of data.  If a relfilenode is truncated or removed, you
need some way to indicate that in the backup; e.g. include a buffertag
with forknum = -(forknum + 1) and blocknum = the new number of blocks,
or InvalidBlockNumber if removed entirely.

3. Apply a differential backup to a full backup to create an updated
full backup.  This is just a matter of scanning the full backup and
the differential backup and applying the changes in the differential
backup to the full backup.

You might want combinations of these, like something that does 2+3 as
a single operation, for efficiency, or a way to copy a full backup and
apply a differential backup to it as you go.  But that's it, right?
What else do you need?
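
[Editor's sketch] The record layout from point 2 and the block placement needed for point 3 might be sketched as below. The 20-byte tag is modelled as five 32-bit fields (tablespace, database, relfilenode, fork, block), which is an assumption about what the buffer tag contains; the DiffBufferTag type, the helper names, and the use of 131072 blocks per 1 GB segment (the default build setting with 8 kB blocks) are likewise assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define DIFF_BLCKSZ          8192          /* assumed block size */
#define DIFF_RELSEG_SIZE     131072u       /* assumed blocks per 1 GB segment */
#define INVALID_BLOCK_NUMBER 0xFFFFFFFFu   /* stands in for InvalidBlockNumber */

/* Hypothetical 20-byte tag preceding each 8 kB block in the backup. */
typedef struct DiffBufferTag
{
    uint32_t    spcNode;        /* tablespace */
    uint32_t    dbNode;         /* database */
    uint32_t    relNode;        /* relfilenode */
    int32_t     forkNum;        /* fork; negative marks truncate/remove */
    uint32_t    blockNum;       /* block, or new length for a marker */
} DiffBufferTag;

_Static_assert(sizeof(DiffBufferTag) == 20, "tag must be 20 bytes");

/* Encode "truncated to new_nblocks" as forknum = -(forknum + 1). */
static DiffBufferTag
make_truncation_marker(DiffBufferTag tag, uint32_t new_nblocks)
{
    tag.forkNum = -(tag.forkNum + 1);
    tag.blockNum = new_nblocks; /* INVALID_BLOCK_NUMBER if removed */
    return tag;
}

static bool
is_truncation_marker(const DiffBufferTag *tag)
{
    return tag->forkNum < 0;
}

/* Recover the real fork number from either kind of tag. */
static int32_t
original_fork(const DiffBufferTag *tag)
{
    return tag->forkNum < 0 ? -(tag->forkNum + 1) : tag->forkNum;
}

/* Map a relation-wide block number to a segment file and byte offset. */
static void
locate_block(uint32_t blockNum, uint32_t *segNo, uint64_t *offset)
{
    *segNo = blockNum / DIFF_RELSEG_SIZE;
    *offset = (uint64_t) (blockNum % DIFF_RELSEG_SIZE) * DIFF_BLCKSZ;
}
```

Applying a differential backup then amounts to reading one tag at a time: a normal tag is followed by a block that overwrites the location given by locate_block, while a marker carries no block and instead tells the reader to truncate or delete the fork.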

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [RFC] Incremental backup v2: add backup profile to base backup

2014-10-03 Thread Andres Freund
On 2014-10-03 17:31:45 +0200, Marco Nenciarini wrote:
 I've updated the wiki page
 https://wiki.postgresql.org/wiki/Incremental_backup following the result
 of discussion on hackers.
 
 Compared to first version, we switched from a timestamp+checksum based
 approach to one based on LSN.
 
 This patch adds an option to pg_basebackup and to replication protocol
 BASE_BACKUP command to generate a backup_profile file. It is almost
 useless by itself, but it is the foundation on which we will build the
 file based incremental backup (and hopefully a block based incremental
 backup after it).
 
 Any comment will be appreciated. In particular I'd appreciate comments
 on correctness of relnode files detection and LSN extraction code.

Can you describe the algorithm you implemented in words?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

