[ADMIN] Backup/disaster recovery and bandwidth (long)

2012-04-25 Thread Scott Whitney
Hello, everyone. I want to throw a scenario out there to see what y'all think. 

Soon, my cluster backups will be increasing in size inordinately. They're going 
to immediately go to 3x as large as they currently are with the potential to be 
about 20x within a year or so. 

My current setup uses a single PG 8.x server doing nightly dumps (not ideal but 
sufficient for the moment, and one of the main reasons to move to PG 9) which 
are then downloaded from my hosting center to our offices for DR purposes. Each 
night I pull down roughly 5GB of compressed pg_dump data. Dumping this takes 
about 1.5hrs. Downloading it at 15Mbps takes about an hour. Soon I'll be 
looking at somewhere around 7hrs for the dumps to complete and a 12GB download 
(which will take about 3hrs). Oh, and I'll have to pay for significant 
bandwidth overage, since I'm billed at the 95th percentile, and while an hour a 
day does NOT push my 95th-percentile usage up to 15Mbps, 3hrs per night 
certainly will, so there's a real cost associated with this strategy as well. 
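
(For concreteness, the nightly job is essentially something like the following; 
hostnames, paths and the dump format here are placeholders, not my exact script.)

# Nightly dumps on the database host (custom format compresses as it goes)
pg_dumpall --globals-only > /backups/globals.sql
for db in $(psql -At -c "SELECT datname FROM pg_database WHERE NOT datistemplate"); do
    pg_dump -Fc "$db" > "/backups/$db.dump"
done

# Then, from the office side, pull the night's dumps down over ssh
rsync -av --compress dbhost.example.com:/backups/ /office/dr-copies/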

While the time of the actual dumps is not a huge issue, the time of the 
download IS a large concern, especially since my support folks use that file 
daily to extract individual customer databases for restore in assisting 
customer support issues. 

So, while now I have my pg_dumps completed around 2AM and downloaded to my 
local network at about 3AM, with the increase in our database sizes, what will 
be happening is that my pg_dump will not be completed until around 7AM, and the 
download would not be completed until around 10AM, best-case scenario. Add into 
that support trying to restore a database...more on that in a moment. 

My _new_ setup will instead be 2 PG 9.x servers with hot-standby enabled (at my 
hosting center) and a 3rd PG 9.x server at my local office also replicating off 
of the master. Each of those servers will perform its own pg_dumps of the 
individual databases for backup/disaster-recovery purposes, and while the dumps 
might not be consistent with one another, each SERVER will have dumps 
consistent with itself, which is viable for our situation and does not require 
me to download 12GB (or more) each night with all of the associated 
nightmares, costs and other problems. 
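
(Roughly, the replication pieces I have in mind look like this on 9.0/9.1; the 
user, addresses and values below are illustrative, not tuned settings.)

# postgresql.conf on the master
cat >> $PGDATA/postgresql.conf <<'EOF'
wal_level = hot_standby
max_wal_senders = 5
wal_keep_segments = 256     # retain enough WAL for standbys that fall behind
EOF

# pg_hba.conf on the master: let the standbys connect for replication
echo "host replication repuser 10.0.0.0/24 md5" >> $PGDATA/pg_hba.conf

# recovery.conf on each standby (with hot_standby = on in its postgresql.conf)
cat > $PGDATA/recovery.conf <<'EOF'
standby_mode = 'on'
primary_conninfo = 'host=master.example.com port=5432 user=repuser password=secret'
EOF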

Alright, well, I've got that part all thought out, and it seems like a good way 
to do it to me, but I'm _still_ running into the situation that I've got to 
take 8hrs-ish to run the pg_dump no matter where it runs, and when my support 
folks need it (which they do daily), this basically means that if they have to 
have a customer database up NOW NOW NOW for support reasons, they simply cannot 
have it within an hour in many cases. Specifically, one database takes between 
2 and 7.5hrs to pg_dump depending on which format I use, so if they need a 
CURRENT copy, they're at least 4 hours out. Additionally, they can't directly 
use the replicating server at my local office, because they need to test the 
problems the customers are having which include pesky things like INSERT, 
UPDATE and DELETE, so they have to restore this data to another internal PG 
backend. 

Enter my outside-the-box thinking. 

I rather assume that you cannot do a start/stop backup on a hot-standby server. 
HOWEVER, what if 

I set up a 4th database server internally at my office. Each night I stop PG on 
my 3rd server (the local one replicating off of the master) and rsync my 
pg_data directory to this new 4th server. I bring up the 4th server NOT as a 
standby, but as a master. They would then have all customer data on an 
internal, usable PG system from the time of the rsync, and while it might not 
reflect the immediate state of the database, that's pretty well always true, 
and they're used to that, since whenever they clone a site, they're using the 
dumps done around midnight anyway. 
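
(In shell terms, the nightly procedure I'm picturing would be something like 
this; server names and paths are made up for illustration.)

# On server #3 (the local standby): stop cleanly so the data directory is quiescent
pg_ctl -D /var/lib/pgsql/data stop -m fast

# Push the whole data directory to server #4
rsync -a --delete /var/lib/pgsql/data/ server4:/var/lib/pgsql/data/

# On server #4: drop recovery.conf so it comes up standalone (as a master), then start
ssh server4 "rm -f /var/lib/pgsql/data/recovery.conf && pg_ctl -D /var/lib/pgsql/data start"

# Back on server #3: start it again as a standby; it reconnects and replays the missed WAL
pg_ctl -D /var/lib/pgsql/data start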

I believe, then, that when I restart server #3 (the standby who is 
replicating), he'll say oh, geez, I was down, let me catch up on all that crap 
that happened while I was out of the loop, he'll replay the WAL files that 
were written while he was down, and then he'll catch back up. 

Does this sound like a viable option? Or does someone have additional 
suggestions? 


Re: [ADMIN] Backup/disaster recovery and bandwidth (long)

2012-04-25 Thread amador alvarez

Hi Scott,
Why not replicate this master to the other location(s) using another method 
such as Bucardo? You can pick exactly the tables you want replicated there.
For the backups, switch to a hot backup (tar of $PGDATA) plus WAL archiving; 
it's easier, faster and more efficient than a logical copy with pg_dump.
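
Roughly, assuming archiving is already turned on (paths are just examples):

# Hot base backup of a running server (WAL archiving keeps running throughout)
psql -c "SELECT pg_start_backup('nightly', true);"
tar czf /backups/base-$(date +%F).tar.gz -C $PGDATA .
psql -c "SELECT pg_stop_backup();"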


A.A



Re: [ADMIN] Backup/disaster recovery and bandwidth (long)

2012-04-25 Thread Scott Whitney
Both good points, thanks, although I suspect that a direct network copy of the 
pg_data directory will be faster than a tar/untar event. 


Re: [ADMIN] Backup/disaster recovery and bandwidth (long)

2012-04-25 Thread Scott Ribe
On Apr 25, 2012, at 10:11 AM, Scott Whitney wrote:

> I believe, then, that when I restart server #3 (the standby who is 
> replicating), he'll say oh, geez, I was down, let me catch up on all that 
> crap that happened while I was out of the loop, he'll replay the WAL files 
> that were written while he was down, and then he'll catch back up.
> 
> Does this sound like a viable option? Or does someone have additional 
> suggestions?

Perfectly viable. However, the WAL files must exist for this to happen. So you 
need to set wal_keep_segments appropriately, or set up WAL archiving.
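
For example (the number is illustrative; size it to how much WAL you generate 
while the standby is down):

# postgresql.conf on the master: keep enough WAL (16MB per segment) on hand
# to cover the window in which the standby is shut down
cat >> $PGDATA/postgresql.conf <<'EOF'
wal_keep_segments = 512     # roughly 8GB of retained WAL; size to your write volume
EOF

# (Or set archive_mode/archive_command on the master and a matching
#  restore_command in the standby's recovery.conf, and the standby can
#  fetch older WAL from the archive instead.)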

Note that you could even provide a more up-to-date database for your people to 
work with. If the testbed is nearly up to date, then an rsync to update it 
would take very little time. So you could shut down the replica, rsync, and 
bring the replica back up.


-- 
Scott Ribe
scott_r...@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice







Re: [ADMIN] Backup/disaster recovery and bandwidth (long)

2012-04-25 Thread Steve Crawford

On 04/25/2012 09:11 AM, Scott Whitney wrote:

> ...
> My current setup uses a single PG 8.x...
> My _new_ setup will instead be 2 PG 9.x ...

It is best to specify the actual major version. While 8.0.x or 9.1.x is 
sufficient to discuss features and capabilities, 9.1 is a different 
major release than 9.0, not a minor update to version 9.


> I set up a 4th database server internally at my office. Each night I 
> stop PG on my 3rd server (the local one replicating off of the master) 
> and rsync my pg_data directory to this new 4th server. I bring up the 
> 4th server NOT as a standby, but as a master
> 
> Does this sound like a viable option? Or does someone have additional 
> suggestions?


And speaking of major versions, what you really want is PostgreSQL 
version 9.2 with cascading replication:

http://www.postgresql.org/docs/devel/static/warm-standby.html#CASCADING-REPLICATION

Unfortunately that version is currently in development/testing and will 
probably not be released till toward the end of the year (based on my 
totally uninformed guesstimate method).


With cascading replication you could maintain a constantly up-to-date 
local copy which you could cascade-replicate to other clusters as 
necessary. Whether you maintain one or more constantly updated local 
cascaded replicas off your primary local replica or just spin one off as 
necessary will be determined by how quickly you need to access the 
test/dev/debug replica. It's likely that you can come up with a pretty 
fast method of spinning off a cascaded replica as needed.
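
A cascaded standby is configured like any other standby, except its 
recovery.conf points at your local replica instead of the primary. Roughly 
(names and addresses are placeholders):

# On the local replica (the cascade source): allow it to serve WAL to downstream standbys
cat >> $PGDATA/postgresql.conf <<'EOF'
hot_standby = on
max_wal_senders = 3
EOF
echo "host replication repuser 192.168.1.0/24 md5" >> $PGDATA/pg_hba.conf

# On the cascaded test/dev replica: point recovery.conf at the local replica, not the primary
cat > $PGDATA/recovery.conf <<'EOF'
standby_mode = 'on'
primary_conninfo = 'host=local-replica.example.com port=5432 user=repuser'
EOF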


Note that there is no requirement for the various replicas to reside on 
different servers. You can run multiple clusters on a single machine and 
one cluster can replicate to one or more others. Not good as a backup 
strategy, obviously, but might be just the ticket for your needs - 
especially since the data copies necessary to bring up a replica are all 
on your local disks - no network transfer required.
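
For instance, spinning a disposable copy off the local replica onto the same 
box might look like this (port, user and directories are arbitrary examples, 
and taking the base backup from a standby requires 9.2):

# Clone the local replica into a second data directory on the same machine
pg_basebackup -h localhost -p 5432 -U repuser -D /pgdata/testcopy -x -P

# Give the copy its own port and start it standalone (no recovery.conf),
# so support can INSERT/UPDATE/DELETE against it freely
echo "port = 5433" >> /pgdata/testcopy/postgresql.conf
pg_ctl -D /pgdata/testcopy start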


You may want to experiment with a current development copy of 9.2 to see 
how it works (and provide bug reports to the developers). Perhaps 9.2 
final will be released before your database grows too big for current 
solutions.


Cheers,
Steve



Re: [ADMIN] Backup/disaster recovery and bandwidth (long)

2012-04-25 Thread amador alvarez
I meant Bucardo (though there are other tools like it) just for the 
replication piece, and the hot database backup only for the backup piece. 
Only one bounce is needed to turn archiving on; you do not need to shut 
anything down during the backup.
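
Turning archiving on is just a couple of settings plus that one bounce (the 
archive location is an example):

cat >> $PGDATA/postgresql.conf <<'EOF'
wal_level = archive          # or hot_standby if you are also streaming to replicas
archive_mode = on
archive_command = 'cp %p /wal_archive/%f'
EOF
pg_ctl -D $PGDATA restart    # the one "bounce"; after this, backups never stop the server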


A.A
