Re: [HACKERS] Docs, backups, and MS VSS

2016-07-03 Thread Craig Ringer
On 2 July 2016 at 22:31, Craig Ringer  wrote:

>
> - Microsoft VSS is NOT safe, as it fails point 2. It is atomic only on a
> per-file level. You MUST use pg_start_backup() and pg_stop_backup() with
> WAL archiving or automated copy of the extra WAL if you use MS VSS. Most
> Windows backup products use MS VSS internally. You must ensure they have
> dedicated PostgreSQL backup support, using pg_basebackup,
> pg_dump/pg_restore, or pg_start_backup()/pg_stop_backup().
>

So, I read the manual [1] to confirm this, and I was wrong. VSS should be
fine if used correctly by the backup application, with a single snapshot
used for the entire PostgreSQL data directory. It is in fact block-level
CoW like LVM, or at least the default provider is.

It turns out that on Windows we could actually support safe snapshot based
backups across multiple volumes, too, by shipping a VSS Writer for
PostgreSQL that notifies PostgreSQL to perform a checkpoint and block
future XLogInsert()s until released, so we don't need a fully atomic
snapshot anymore. However, this would only work if the backup application
that the main datadir, xlog and any tablespace volumes be included in a
single VSS snapshot.

In the absence of a VSS Writer for PostgreSQL(provided by the backup app or
PostgreSQL install) and confirmation that the backup app snapshots all
volumes together, then like LVM it'll only be safe if the backup app does
pg_start_backup(), copies, pg_stop_backup() and copies the extra WAL. We
should probably provide a tool to make this easier by listing all the WAL
file names for a range of LSNs.

Anyway - still worth documenting safety requirements, but contrary to what
I remembered VSS looks to be fine.

It looks like a PostgreSQL VSS Writer could make the process a lot
smoother, though, by enumerating all volumes PostgreSQL uses to make sure
they get snapshotted, trying to flush in-progress xacts and pause new ones,
etc.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Docs, backups, and MS VSS

2016-07-02 Thread Craig Ringer
On 2 July 2016 at 22:42, Bruce Momjian  wrote:

>
> > I suspect, but cannot prove, that it is also safe to snapshot pg_xlog on
> a
> > separate filesystem if and only if you take the datadir snapshot before
> the
> > pg_xlog snapshot and you have wal_keep_segments high enough to ensure
> that WAL
> > needeed by the redo checkpoint in the datadir snapshot is not removed. I
> > wouldn't want to do this, and certainly not document it, since it's way
> saner
> > to use pg_start_backup() etc.
>
> Yes, I have wanted to document that WAL-at-the-end is sufficient for
> non-atomic snapshots assuming the needed WAL is there.  However, even if
> the WAL is backed up, it doesn't mean we are going to read it during
> crash recovery, i.e. we only read from the last checkpoint or something
> like that.  I have no idea how to tell people when this is safe.
>
> My simplistic idea would be to tell people to run a checkpoint right
> before all the snapshots are taken, but even that doesn't seem 100%
> safe.  This needs someone who understands the WAL and how to tell people
> a totally safe procedure.
>

The main thing is to provide an easy way to get the filenames of all the
archives that are required. pg_start_backup() and pg_stop_backup() provide
the range of LSNs required to restore, but you have to - correctly -
convert them into xlog file names and copy everything in that range not
just the start and end. There's no simple way to ask PostgreSQL for the
file-list via a query, since we lack arithmetic operators for pg_lsn or any
sort of pg_xlogfile_name_next function or similar. You can easily get the
start and end filenames using pg_xlogfile_name() and rely on lexical
comparision of filenames but that's way less convenient than just getting a
file-list you can feed into rsync / tar / whatever.

It's too complicated. Yes, most users should just use pg_basebackup, but
that doesn't play well with snapshots etc. A pg_xlogfile_name_range
function would be a real help I think, so you could psql -qAt a simple
query to get all the xlogs you must copy from the saved LSNs reported by
pg_start_backup() and pg_stop_backup(). Especially if the docs incorporated
a sample script, including a test that marks the backup aborted/failed if
there are any missing files.




-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Docs, backups, and MS VSS

2016-07-02 Thread Bruce Momjian
On Sat, Jul  2, 2016 at 10:31:32PM +0800, Craig Ringer wrote:
> - Microsoft VSS is NOT safe, as it fails point 2. It is atomic only on a
> per-file level. You MUST use pg_start_backup() and pg_stop_backup() with WAL
> archiving or automated copy of the extra WAL if you use MS VSS. Most Windows
> backup products use MS VSS internally. You must ensure they have dedicated
> PostgreSQL backup support, using pg_basebackup, pg_dump/pg_restore, or
> pg_start_backup()/pg_stop_backup().

Yes, it would be good to point out that per-file snapshots are insufficient.

> - LVM is safe
> 
> - BTRFS should be fine
> 
> - Most SAN snapshots are fine, but verify with your vendor
> 
> 
> I suspect, but cannot prove, that it is also safe to snapshot pg_xlog on a
> separate filesystem if and only if you take the datadir snapshot before the
> pg_xlog snapshot and you have wal_keep_segments high enough to ensure that WAL
> needeed by the redo checkpoint in the datadir snapshot is not removed. I
> wouldn't want to do this, and certainly not document it, since it's way saner
> to use pg_start_backup() etc.

Yes, I have wanted to document that WAL-at-the-end is sufficient for
non-atomic snapshots assuming the needed WAL is there.  However, even if
the WAL is backed up, it doesn't mean we are going to read it during
crash recovery, i.e. we only read from the last checkpoint or something
like that.  I have no idea how to tell people when this is safe.

My simplistic idea would be to tell people to run a checkpoint right
before all the snapshots are taken, but even that doesn't seem 100%
safe.  This needs someone who understands the WAL and how to tell people
a totally safe procedure.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Docs, backups, and MS VSS

2016-07-02 Thread Craig Ringer
Hi all

I just noticed that the Pg docs on backups don't discuss what kind of
snapshots are safe for use without a pg_start_backup() and pg_stop_backup()
then copying the extra WAL.

I'd like to remedy that. My understanding is that it's safe to use a
filesystem or block device level snapshot without a pg_start_backup() and
pg_stop_backup() if:

1. The snapshot includes the entire PostgreSQL data directory including all
tablespaces and pg_xlog, i.e. everything is on one filesystem or block
device;

2. The snapshot mechanism guarantees an atomic snapshot, such that every
part of the filesystem or block device is snapshotted consistently at the
same effective moment in time.

This allows PostgreSQL to treat recovery from a snapshot just like recovery
from a crash or hard reset.

I'd like to document these conditions, and note that:

- Microsoft VSS is NOT safe, as it fails point 2. It is atomic only on a
per-file level. You MUST use pg_start_backup() and pg_stop_backup() with
WAL archiving or automated copy of the extra WAL if you use MS VSS. Most
Windows backup products use MS VSS internally. You must ensure they have
dedicated PostgreSQL backup support, using pg_basebackup,
pg_dump/pg_restore, or pg_start_backup()/pg_stop_backup().

- LVM is safe

- BTRFS should be fine

- Most SAN snapshots are fine, but verify with your vendor


I suspect, but cannot prove, that it is also safe to snapshot pg_xlog on a
separate filesystem if and only if you take the datadir snapshot before the
pg_xlog snapshot and you have wal_keep_segments high enough to ensure that
WAL needeed by the redo checkpoint in the datadir snapshot is not removed.
I wouldn't want to do this, and certainly not document it, since it's way
saner to use pg_start_backup() etc.

Reasonable? Will write the SGML if there's broad agreement here that it's
desirable.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services