Re: [GENERAL] Backup advice

2013-04-16 Thread Eduardo Morras
On Mon, 15 Apr 2013 19:54:15 -0700
Jeff Janes jeff.ja...@gmail.com wrote:

 On Tue, Apr 9, 2013 at 3:05 AM, Eduardo Morras
 emorr...@yahoo.es wrote:
 
  On Mon, 8 Apr 2013 10:40:16 -0500
  Shaun Thomas stho...@optionshouse.com wrote:
 
  
   Anyone else?
  
 
  If his db has low inserts/updates/deletes he can use diff between pg_dumps
  (with default -Fp) before compressing.
 
 
 Most diff implementations will read the entirety of both files into
 memory, so may not work well with 200GB of data, unless it is broken into a
 large number of much smaller files.
 
 open-vcdiff only reads one of the files into memory, but I couldn't really
 figure out what happens memory-wise when you try to undo the resulting
 patch, the documentation is a bit mysterious.
 
 xdelta3 will work on streamed files of unlimited size, but it doesn't
 work very well unless the files fit in memory, or have the analogous data
 in the same order between the two files.

For my 12-13 GB dump files I use:

git diff -p 1.sql 2.sql > diff.patch


It uses 4 MB in the first phase and up to 140 MB in the last one, and produces
a patch file that can be applied with:

patch -o 2.sql 1.sql diff.patch

or using git apply.
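
As a rough sketch of how this can be wired into a daily job (database name,
paths and file names below are only placeholders):

#!/bin/bash
# Sketch only: keep one full plain-format dump plus a daily patch against it.
DB=mydb
DIR=/var/backups/pg
TODAY="$DIR/$(date +%F).sql"

pg_dump -Fp "$DB" > "$TODAY"

if [ -f "$DIR/base.sql" ]; then
    # keep only the difference against the base dump, compressed
    git diff --no-index -p "$DIR/base.sql" "$TODAY" > "$TODAY.patch"
    gzip -f "$TODAY.patch"
    rm "$TODAY"
else
    # first run: this dump becomes the base everything else is diffed against
    mv "$TODAY" "$DIR/base.sql"
fi

# Rebuilding a given day later:
#   gunzip 2013-04-16.sql.patch.gz
#   patch -o 2013-04-16.sql base.sql 2013-04-16.sql.patch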

 A while ago I did some attempts to co-compress dump files, based on the
 notion that the pg_dump text format does not have \n within records so it
 is sortable as ordinary text, and that usually tables have their stable
 columns, like a pk, near the beginning of the table and volatile columns
 near the end, so that sorting the lines of several dump files together will
 gather replicate or near-replicate lines together where ordinary
 compression algorithms can work their magic.  So if you tag each line with
 its line number and which file it originally came from, then sort the lines
 (skipping the tag), you get much better compression.  But not nearly as
 good as open-vcdiff, assuming you have the RAM to spare.

 Using two dumps taken months apart on a slowly-changing database, it worked
 fairly well:
 
 cat 1.sql | pigz |wc -c
 329833147
 
 cat 2.sql | pigz |wc -c
 353716759
 
 cat 1.sql 2.sql | pigz |wc -c
 683548147
 
 sort -k2 <(perl -lne 'print "${.}a\t$_"' 1.sql) <(perl -lne 'print "${.}b\t$_"' 2.sql) | pigz | wc -c
 436350774
 
 A certain file could be recovered by, for example:
 
 zcat group_compressed.gz | sort -n | perl -lne 's/^(\d+b\t)// and print' > 2.sql2

Be careful: some z* utilities (zdiff, for example) decompress the whole file into /tmp.

 There are all kinds of shortcomings here, of course; it was just a quick and
 dirty proof of concept.

A nice one!

 For now I think storage is cheap enough for what I need to do that this is
 not worth fleshing out any more.
 
 Cheers,
 
 Jeff


---   ---
Eduardo Morras emorr...@yahoo.es




[GENERAL] Backup advice

2013-04-15 Thread Jeff Janes
On Tue, Apr 9, 2013 at 3:05 AM, Eduardo Morras
emorr...@yahoo.es wrote:

 On Mon, 8 Apr 2013 10:40:16 -0500
 Shaun Thomas stho...@optionshouse.com wrote:

 
  Anyone else?
 

 If his db has low inserts/updates/deletes he can use diff between pg_dumps
 (with default -Fp) before compressing.


Most diff implementations will read the entirety of both files into
memory, so may not work well with 200GB of data, unless it is broken into a
large number of much smaller files.

open-vcdiff only reads one of the files into memory, but I couldn't really
figure out what happens memory-wise when you try to undo the resulting
patch, the documentation is a bit mysterious.

xdelta3 will work on streamed files of unlimited size, but it doesn't
work very well unless the files fit in memory, or have the analogous data
in the same order between the two files.
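
For reference, the basic xdelta3 round trip looks roughly like this (file
names are only examples):

# encode: store only the delta needed to turn 1.sql into 2.sql
xdelta3 -e -s 1.sql 2.sql 2.sql.xd3

# decode: rebuild 2.sql from 1.sql plus the delta
xdelta3 -d -s 1.sql 2.sql.xd3 2.sql.rebuilt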

A while ago I did some attempts to co-compress dump files, based on the
notion that the pg_dump text format does not have \n within records so it
is sortable as ordinary text, and that usually tables have their stable
columns, like a pk, near the beginning of the table and volatile columns
near the end, so that sorting the lines of several dump files together will
gather replicate or near-replicate lines together where ordinary
compression algorithms can work their magic.  So if you tag each line with
its line number and which file it originally came from, then sort the lines
(skipping the tag), you get much better compression.  But not nearly as
good as open-vcdiff, assuming you have the RAM to spare.

Using two dumps taken months apart on a slowly-changing database, it worked
fairly well:

cat 1.sql | pigz |wc -c
329833147

cat 2.sql | pigz |wc -c
353716759

cat 1.sql 2.sql | pigz |wc -c
683548147

sort -k2 <(perl -lne 'print "${.}a\t$_"' 1.sql) <(perl -lne 'print "${.}b\t$_"' 2.sql) | pigz | wc -c
436350774

A certain file could be recovered by, for example:

zcat group_compressed.gz | sort -n | perl -lne 's/^(\d+b\t)// and print' > 2.sql2

There are all kinds of shortcomings here, of course; it was just a quick and
dirty proof of concept.
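
Restated as a small script with the quoting spelled out (still only a sketch,
same example file names as above):

#!/bin/bash
# tag each line with "<lineno><file letter>\t", sort on the payload, compress
sort -k2 \
    <(awk '{print NR "a\t" $0}' 1.sql) \
    <(awk '{print NR "b\t" $0}' 2.sql) \
  | pigz > group_compressed.gz

# recover 2.sql: numeric sort restores line order, then strip the "...b" tags
zcat group_compressed.gz \
  | sort -n \
  | perl -lne 's/^\d+b\t// and print' > 2.sql.rebuilt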

For now I think storage is cheap enough for what I need to do that this is
not worth fleshing out any more.

Cheers,

Jeff


Re: [GENERAL] Backup advice

2013-04-10 Thread Johann Spies
Thanks everybody for your valuable comments.

I can specify how many versions of the files should be kept on Tivoli.

The database will regularly get new data and there is a continuous process
of data cleaning. It is a database mainly for research purposes and a few
researchers are using it.

I will explore the options mentioned. After a first read it looks like
continuing with pg_dump is not a bad idea, maybe with some optimization
(like using diffs).

Regards
Johann

On 9 April 2013 12:05, Eduardo Morras emorr...@yahoo.es wrote:

 On Mon, 8 Apr 2013 10:40:16 -0500
 Shaun Thomas stho...@optionshouse.com wrote:

 
  Anyone else?
 

 If his db has low inserts/updates/deletes he can use diff between pg_dumps
 (with default -Fp) before compressing.

 ---   ---
 Eduardo Morras emorr...@yahoo.es






-- 
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)


Re: [GENERAL] Backup advice

2013-04-10 Thread Gabriele Bartolini

Hi Johann,

On Wed, 10 Apr 2013 09:58:05 +0200, Johann Spies 
johann.sp...@gmail.com wrote:

I can specify how many versions of the files should be kept on
Tivoli.


Another option you can evaluate is Barman (www.pgbarman.org), with its backup 
catalogues, retention policies and archiving. We use it in some contexts with 
Tivoli (currently only through file-system backup, but I guess that's what you 
do anyway).


However, the approach is totally different and is based on physical 
backups and continuous archiving, allowing you to perform point in time 
recovery as well.


Maybe it is worth evaluating it.
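
To give a rough idea of the day-to-day commands (the server name and paths 
below are examples; the actual server definition lives in barman.conf):

# take a new physical backup of the server defined as "main"
barman backup main

# list what is in the catalogue
barman list-backup main

# restore the latest backup, optionally to a point in time
barman recover main latest /var/lib/pgsql/9.2/restore
barman recover --target-time "2013-04-10 09:00:00" main latest /var/lib/pgsql/9.2/restore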

Cheers,
Gabriele
--
 Gabriele Bartolini - 2ndQuadrant Italia
 PostgreSQL Training, Services and Support
 gabriele.bartol...@2ndquadrant.it - www.2ndQuadrant.it




Re: [GENERAL] Backup advice

2013-04-09 Thread Eduardo Morras
On Mon, 8 Apr 2013 10:40:16 -0500
Shaun Thomas stho...@optionshouse.com wrote:

 
 Anyone else?
 

If his db has low inserts/updates/deletes he can use diff between pg_dumps 
(with default -Fp) before compressing.
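
A sketch of that idea (database and file names are only placeholders):

# yesterday's full plain-format dump, kept compressed
pg_dump -Fp mydb > dump_2013-04-08.sql

# today: dump again, keep only the (compressed) difference
pg_dump -Fp mydb > dump_2013-04-09.sql
diff -u dump_2013-04-08.sql dump_2013-04-09.sql > 2013-04-09.diff
gzip dump_2013-04-08.sql 2013-04-09.diff
rm dump_2013-04-09.sql

# today's dump can be rebuilt later with:
#   gunzip dump_2013-04-08.sql.gz 2013-04-09.diff.gz
#   patch -o dump_2013-04-09.sql dump_2013-04-08.sql 2013-04-09.diff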

---   ---
Eduardo Morras emorr...@yahoo.es




[GENERAL] Backup advice

2013-04-08 Thread Johann Spies
I would appreciate some advice from the experts on this list about the best
backup strategy for my database.

The setup:

Size: might be about 200 GB.
The server uses a Tivoli backup client with daily backups.
At the moment there are pg_dumps for each database on the server on a
daily, weekly and monthly basis.  All this gets backed up to the Tivoli
server.

I would like to reduce the size of the backups as far as possible as we
have to pay for space on the backup server and I do not want any downtime
on the database server.

I do not want replication, but want to be able to restore the database as
recent as possible (at least as it was within the past 24 hours) as quickly
as possible.

I have read about using pg_basebackup in an article from Shaun Thomas'
booklet from Packt Publishing (I will probably buy the booklet).  That
seems to be a possible solution.

I am considering dropping the pg_dumps in favour of the pg_basebackup
method.  Will that be wise?

Regards
Johann
-- 
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)


Re: [GENERAL] Backup advice

2013-04-08 Thread Birta Levente

On 08/04/2013 16:14, Johann Spies wrote:

I would appreciate some advice from the experts on this list about the
best backup strategy for my database.

The setup:

Size: might be about 200Gb
The server uses a Tivoli backup client with daily backup
At the moment There are pg_dumps for each database on the server on a
daily, weekly and monthly basis.  All this gets backed up to the Tivoli
server.

I would like to reduce the size of the backups as far as possible as we
have to pay for space on the backup server and I do not want any
downtime on the database server.

I do not want replication, but want to be able to restore the database
as recent as possible (at least as it was within the past 24 hours) as
quickly as possible.

I have read about using pg_basebackup in an article from Shaun Thomas'
booklet from Packt Publishing (I will probably buy the booklet). That
seems to be a possible solution.

I am considering dropping the pg_dumps in favour of the pg_basebackup
method.  Will that be wise?

Regards
Johann
--
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)



You should read this:
http://www.postgresql.org/docs/9.2/static/continuous-archiving.html


And you have to decide which method fits best. I dropped the dump method 
because of the dump size: in my case, the weekly base backup plus the WAL 
archives are much smaller than the dumps.


But with PITR you can only restore the whole cluster; you cannot restore 
a single database.


And hey ... with PITR you can restore the cluster to any specified moment 
or transaction, not only to the time of a dump or base backup.
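
Roughly what that setup looks like on 9.2 (paths and the target time are 
only examples):

# postgresql.conf -- turn on WAL archiving
wal_level = archive
archive_mode = on
archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'

# weekly base backup of the whole cluster (tar format, gzip-compressed)
pg_basebackup -D /backup/base/$(date +%F) -Ft -z -P

# recovery.conf -- replay archived WAL up to a chosen point in time
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2013-04-08 08:00:00'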


Levi









Re: [GENERAL] Backup advice

2013-04-08 Thread Ian Lawrence Barwick
2013/4/8 Johann Spies johann.sp...@gmail.com:
 I would appreciate some advice from the experts on this list about the best
 backup strategy for my database.

(...)

 I have read about using pg_basebackup in an article from Shaun Thomas'
 booklet on Packt Publishers (I will probably buy the booklet).

Get the booklet, it's worth it.

Regards

Ian Barwick




Re: [GENERAL] Backup advice

2013-04-08 Thread Shaun Thomas

On 04/08/2013 08:14 AM, Johann Spies wrote:


Size: might be about 200Gb
The server uses a Tivoli backup client with daily backup
At the moment There are pg_dumps for each database on the server on a
daily, weekly and monthly basis.  All this gets backed up to the Tivoli
server.


Ok. So far, so good.


I have read about using pg_basebackup in an article from Shaun
Thomas' booklet from Packt Publishing (I will probably buy the
booklet). That seems to be a possible solution.


Ok, with pg_basebackup, you'll get a binary backup of the actual data 
files involved in your database cluster. This will, in almost every 
case, be larger than pg_dump, and take about the same amount of time to 
produce. You also won't be able to get the described method you're using 
in your Tivoli software, since pg_basebackup works on the entire install 
instead of each individual database.
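
For what it's worth, the whole-install backup itself is a single command 
along these lines (host, user and target directory are placeholders):

# tar-format, gzip-compressed copy of the whole cluster, with progress output
pg_basebackup -h dbhost -U postgres -D /backup/base/20130408 -Ft -z -P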


One benefit, as Birta pointed out, is that you could use this backup as 
a base, and apply WAL / transaction logs instead, and those are 
generally smaller if your database doesn't see a lot of daily overhead. 
Unfortunately if you do a weekly base, and need to recover far into the 
week, that can be a time-consuming process.


Then again, so can restoring a pg_dump of a 200GB cluster, thanks to the 
index creation points.


One thing you might want to consider, is that 9.3 will presumably have 
parallel pg_dump to complement parallel pg_restore. This would greatly 
reduce the amount of time a full dump-restore cycle requires.
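
With 9.3 that would presumably look something like this (directory format is 
required for the parallel dump; the job count is only an example):

# parallel dump into directory format with 4 worker jobs
pg_dump -Fd -j 4 -f /backup/mydb.dir mydb

# parallel restore with 4 worker jobs
pg_restore -j 4 -d mydb /backup/mydb.dir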


Aside from that, you're already likely using the smallest backup you can 
have. Backing up a 600GB database takes about 120GB for us using tar, 
which is close to what pg_basebackup would give you in terms of size. A 
compressed pg_dump of the same data is around 50GB.


Still, we use the binaries, because we need uptime more than size, and 
they are much faster to restore. We have daily binary backups going back 
over four years, because the backups are probably the most important 
thing in the company. Lose those under bad circumstances, and we'd 
literally be out of business.


If size really is that important, you might want to check what kind of 
compression is going on with Tivoli. If you can produce a pg_dump and 
compress it with bzip2, or a utility such as lrzip that makes use of 
better compression algorithms like LZMA, you may be able to back up much 
smaller files than your current process.


I'll warn you, though. Even parallel compression methods like lbzip2 are 
much slower than something like parallel gzip (pigz). You'll get a 
20-30% smaller file at the cost of a 4-8x slower backup process. Since 
your data is likely to grow from its current size of 200GB, it's 
something to consider.
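
As a rough sketch of those pipelines (database name is a placeholder; the 
actual ratios will vary):

# fast parallel gzip: largest file, shortest run time
pg_dump -Fp mydb | pigz > mydb.sql.gz

# parallel bzip2: roughly 20-30% smaller, but several times slower
pg_dump -Fp mydb | lbzip2 > mydb.sql.bz2

# LZMA (xz): smaller still, and slower again
pg_dump -Fp mydb | xz > mydb.sql.xz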


Anyone else?

I'm so used to using basic utilities, I know I haven't covered things 
like deduplication backup solutions. It may be that Tivoli isn't right 
for this, but I'm not familiar with that software.


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com





Re: [GENERAL] Backup advice

2013-04-08 Thread Jeff Janes
On Mon, Apr 8, 2013 at 6:14 AM, Johann Spies johann.sp...@gmail.com wrote:

 I would appreciate some advice from the experts on this list about the
 best backup strategy for my database.

 The setup:

 Size: might be about 200Gb
 The server uses a Tivoli backup client with daily backup
 At the moment There are pg_dumps for each database on the server on a
 daily, weekly and monthly basis.  All this gets backed up to the Tivoli
 server.


Hi Johann,

This backup pattern means that after a year of operation you will have
about 23 pg_dump files (roughly 7 daily + 4 weekly + 12 monthly), 90+% of
which have nothing to do with your requirement to restore the database to
within the last 24 hours.

In other words, your storage needs are being driven by your historical
retention policy, not your immediate restoration policy.  So, can you
elaborate on your historical retention policy?  For example, if you need to
restore your database to the state it was in 9 months ago, how fast do you
need to be able to do that?  If you had a 12 month old pg_basebackup and 3
months of log files, how long would it take to replay those and how big
would that many log files be?  (Both of those questions depend on your
specific usage, so it is hard to make general guesses about those--they are
questions that can only be answered empirically.)

pg_basebackup will almost surely be larger than pg_dumps.  For one thing,
it contains all the index data; for another, it contains any obsolete rows
which have not yet been vacuumed and reused.  So switching to that will save
you space only if you need to keep fewer of them than you do of the pg_dumps.

Cheers,

Jeff