Re: [GENERAL] Backup advice

2013-04-16 Thread Eduardo Morras
On Mon, 15 Apr 2013 19:54:15 -0700
Jeff Janes  wrote:

> On Tue, Apr 9, 2013 at 3:05 AM, Eduardo Morras wrote:
> 
> > On Mon, 8 Apr 2013 10:40:16 -0500
> > Shaun Thomas wrote:
> >
> > >
> > > Anyone else?
> > >
> >
> > If his db has low inserts/updates/deletes he can use diff between pg_dumps
> > (with default -Fp) before compressing.
> >
> 
> Most "diff" implementations will read the entirety of both files into
> memory, so may not work well with 200GB of data, unless it is broken into a
> large number of much smaller files.
> 
> open-vcdiff only reads one of the files into memory, but I couldn't really
> figure out what happens memory-wise when you try to undo the resulting
> patch, the documentation is a bit mysterious.
> 
> xdelta3 will "work" on streamed files of unlimited size, but it doesn't
> work very well unless the files fit in memory, or have the analogous data
> in the same order between the two files.

For my 12-13 GB dump files I use:

git diff -p 1.sql 2.sql > diff.patch


It uses 4 MB in the first phase and up to 140 MB in the last one, and produces a patch
file from which 2.sql can be recovered with:

patch -o 2.sql 1.sql < diff.patch

or using git apply.
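
For comparison, the xdelta3 route mentioned above could be tried roughly like this
(a sketch with placeholder file names; I have not benchmarked it on dumps of this size):

xdelta3 -e -s 1.sql 2.sql diff.xd3           # encode 2.sql as a delta against source 1.sql
xdelta3 -d -s 1.sql diff.xd3 2.sql.restored  # rebuild 2.sql from 1.sql plus the delta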

> A while ago I did some attempts to "co-compress" dump files, based on the
> notion that the pg_dump text format does not have \n within records so it
> is sortable as ordinary text, and that usually tables have their "stable"
> columns, like a pk, near the beginning of the table and volatile columns
> near the end, so that sorting the lines of several dump files together will
> gather duplicate or near-duplicate lines together where ordinary
> compression algorithms can work their magic.  So if you tag each line with
> its line number and which file it originally came from, then sort the lines
> (skipping the tag), you get much better compression.  But not nearly as
> good as open-vcdiff, assuming you have the RAM to spare.
>
> Using two dumps taken months apart on a slowly-changing database, it worked
> fairly well:
> 
> cat 1.sql | pigz |wc -c
> 329833147
> 
> cat 2.sql | pigz |wc -c
> 353716759
> 
> cat 1.sql 2.sql | pigz |wc -c
> 683548147
> 
> sort -k2 <(perl -lne 'print "${.}a\t$_"' 1.sql) <(perl -lne 'print
> "${.}b\t$_"' 2.sql) | pigz |wc -c
> 436350774
> 
> A certain file could be recovered by, for example:
> 
> zcat group_compressed.gz |sort -n|perl -lne 's/^(\d+b\t)// and print' >
> 2.sql2

Be careful: some z* utilities (zdiff, for example) decompress the whole file to /tmp.

> There all kinds of short-comings here, of course, it was just a quick and
> dirty proof of concept.

A nice one!

> For now I think storage is cheap enough for what I need to do to make this
> not worth fleshing it out any more.
> 
> Cheers,
> 
> Jeff


---   ---
Eduardo Morras 




Re: [GENERAL] Backup advice

2013-04-10 Thread Gabriele Bartolini

Hi Johann,

On Wed, 10 Apr 2013 09:58:05 +0200, Johann Spies wrote:

I can specify how many versions of the files should be kept on
Tivoli.


Another option you can evaluate is the use of Barman's backup catalogues, 
retention policies and archiving (www.pgbarman.org). We use it in some 
contexts with Tivoli (currently only through file-system backup, but I guess 
that's what you do anyway).


However, the approach is totally different and is based on physical 
backups and continuous archiving, allowing you to perform point in time 
recovery as well.


Maybe it is worth evaluating it.
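
A minimal sketch of day-to-day use, assuming a server named "main" has already 
been configured in barman.conf (the restore path is just a placeholder):

barman backup main                            # take a new base backup of server "main"
barman list-backup main                       # inspect the backup catalogue
barman recover main latest /path/to/restore   # restore the latest backup to a directory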

Cheers,
Gabriele
--
 Gabriele Bartolini - 2ndQuadrant Italia
 PostgreSQL Training, Services and Support
 gabriele.bartol...@2ndquadrant.it - www.2ndQuadrant.it




Re: [GENERAL] Backup advice

2013-04-10 Thread Johann Spies
Thanks everybody for your valuable comments.

I can specify how many versions of the files should be kept on Tivoli.

The database will regularly get new data and there is a continuous process
of data cleaning. It is a database mainly for research purposes and a few
researchers are using it.

I will explore the options mentioned. After a first read it looks like
continuing with pg_dump is not a bad idea, maybe with some optimization (like
using diffs).

Regards
Johann

On 9 April 2013 12:05, Eduardo Morras  wrote:

> On Mon, 8 Apr 2013 10:40:16 -0500
> Shaun Thomas  wrote:
>
> >
> > Anyone else?
> >
>
> If his db has low inserts/updates/deletes he can use diff between pg_dumps
> (with default -Fp) before compressing.
>
> ---   ---
> Eduardo Morras 
>
>



-- 
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)


Re: [GENERAL] Backup advice

2013-04-09 Thread Eduardo Morras
On Mon, 8 Apr 2013 10:40:16 -0500
Shaun Thomas  wrote:

> 
> Anyone else?
> 

If his db has low inserts/updates/deletes he can use diff between pg_dumps 
(with default -Fp) before compressing.
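
A minimal sketch of that approach, with placeholder names (dbname, yesterday.sql):

pg_dump -Fp dbname > today.sql              # plain-format dump
diff yesterday.sql today.sql > today.diff   # keep only the changes since yesterday's dump
gzip today.diff                             # compress just the delta
# later, rebuild today's dump from yesterday's copy plus the delta:
# patch -o today.sql.rebuilt yesterday.sql < today.diff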

---   ---
Eduardo Morras 




Re: [GENERAL] Backup advice

2013-04-08 Thread Jeff Janes
On Mon, Apr 8, 2013 at 6:14 AM, Johann Spies  wrote:

> I would appreciate some advice from the experts on this list about the
> best backup strategy for my database.
>
> The setup:
>
> Size: might be about 200Gb
> The server uses a Tivoli backup client with daily backup
> At the moment there are pg_dumps for each database on the server on a
> daily, weekly and monthly basis.  All this gets backed up to the Tivoli
> server.
>

Hi Johann,

This backup pattern means that after a year of operation you will have
about 23 pg_dump files, 90+% of which have nothing to do with your
requirement to restore the database to within the last 24 hours.

In other words, your storage needs are being driven by your historical
retention policy, not your immediate restoration policy.  So, can you
elaborate on your historical retention policy?  For example, if you need to
restore your database to the state it was in 9 months ago, how fast do you
need to be able to do that?  If you had a 12 month old pg_basebackup and 3
months of log files, how long would it take to replay those and how big
would that many log files be?  (Both of those questions depend on your
specific usage, so it is hard to make general guesses about those--they are
questions that can only be answered empirically.)
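
As a rough way to measure the replay time empirically, you could restore the old
base backup and let recovery replay the archived WAL up to a target time, e.g. with
a 9.2-era recovery.conf along these lines (the archive path and timestamp are
placeholders):

# recovery.conf in the restored data directory
restore_command = 'cp /archive/%f "%p"'         # fetch archived WAL segments
recovery_target_time = '2013-01-15 00:00:00'    # stop replay at the chosen point in time

Timing the startup until the server reaches a consistent state gives you the replay cost.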

pg_basebackup will almost surely be larger than pg_dumps.  For one thing,
it contains all the index data; for another, it contains any obsolete rows
which have not yet been vacuumed and reused.  So switching to it will save
you space only if you need to keep fewer of them than you do of the pg_dumps.

Cheers,

Jeff


Re: [GENERAL] Backup advice

2013-04-08 Thread Shaun Thomas

On 04/08/2013 08:14 AM, Johann Spies wrote:


Size: might be about 200Gb
The server uses a Tivoli backup client with daily backup
At the moment there are pg_dumps for each database on the server on a
daily, weekly and monthly basis.  All this gets backed up to the Tivoli
server.


Ok. So far, so good.


I have read about using pg_basebackup in an article from Shaun
Thomas' booklet on Packt Publishers (I will probably buy the
booklet). That seems to be a possible solution.


Ok, with pg_basebackup, you'll get a binary backup of the actual data 
files involved in your database cluster. This will, in almost every 
case, be larger than pg_dump, and take about the same amount of time to 
produce. You also won't be able to keep the per-database scheme you're using 
in your Tivoli software, since pg_basebackup works on the entire installation 
rather than on each individual database.


One benefit, as Birta pointed out, is that you could use this backup as 
a base, and apply WAL / transaction logs instead, and those are 
generally smaller if your database doesn't see a lot of daily overhead. 
Unfortunately if you do a weekly base, and need to recover far into the 
week, that can be a time-consuming process.


Then again, so can restoring a pg_dump of a 200GB cluster, thanks to the 
index creation steps.


One thing you might want to consider is that 9.3 will presumably have 
parallel pg_dump to complement parallel pg_restore. This would greatly 
reduce the amount of time a full dump->restore cycle requires.
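
For example, with the directory format the parallel options look roughly like this 
("dbname" and the paths are placeholders; pg_dump only gains -j in 9.3, pg_restore 
has had it for a while):

pg_dump -Fd -j 4 -f /backup/dbname.dump dbname   # directory-format dump, 4 parallel jobs
pg_restore -j 4 -d dbname /backup/dbname.dump    # parallel restore from the same directory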


Aside from that, you're already likely using the smallest backup you can 
have. Backing up a 600GB database takes about 120GB for us using tar, 
which is close to what pg_basebackup would give you in terms of size. A 
compressed pg_dump of the same data is around 50GB.


Still, we use the binaries, because we need uptime more than size, and 
they are much faster to restore. We have daily binary backups going back 
over four years, because the backups are probably the most important 
thing in the company. Lose those under bad circumstances, and we'd 
literally be out of business.


If size really is that important, you might want to check what kind of 
compression is going on with Tivoli. If you can produce a pg_dump and 
compress it with bzip2, or a utility such as lrzip that makes use of 
better compression algorithms like LZMA, you may be able to produce much 
smaller backup files than your current process does.


I'll warn you, though. Even parallel compression methods like lbzip2 are 
much slower than something like parallel gzip (pigz). You'll get a 
20-30% smaller file at the cost of a 4-8x slower backup process. Since 
your data is likely to grow from its current size of 200GB, it's 
something to consider.
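
As a concrete illustration, the pipelines differ only in the compressor 
("dbname" is a placeholder):

pg_dump dbname | pigz   > dbname.sql.gz    # parallel gzip: fast, larger file
pg_dump dbname | lbzip2 > dbname.sql.bz2   # parallel bzip2: slower, roughly 20-30% smaller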


Anyone else?

I'm so used to using basic utilities that I know I haven't covered things 
like deduplication backup solutions. It may be that Tivoli isn't right 
for this, but I'm not familiar with that software.


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com





Re: [GENERAL] Backup advice

2013-04-08 Thread Ian Lawrence Barwick
2013/4/8 Johann Spies:
> I would appreciate some advice from the experts on this list about the best
> backup strategy for my database.

(...)
>
> I have read about using pg_basebackup in an article from Shaun Thomas'
> booklet on Packt Publishers (I will probably buy the booklet).

Get the booklet, it's worth it.

Regards

Ian Barwick




Re: [GENERAL] Backup advice

2013-04-08 Thread Birta Levente

On 08/04/2013 16:14, Johann Spies wrote:

I would appreciate some advice from the experts on this list about the
best backup strategy for my database.

The setup:

Size: might be about 200Gb
The server uses a Tivoli backup client with daily backup
At the moment there are pg_dumps for each database on the server on a
daily, weekly and monthly basis.  All this gets backed up to the Tivoli
server.

I would like to reduce the size of the backups as far as possible as we
have to pay for space on the backup server and I do not want any
downtime on the database server.

I do not want replication, but want to be able to restore the database
as recent as possible (at least as it was within the past 24 hours) as
quickly as possible.

I have read about using pg_basebackup in an article from Shaun Thomas'
booklet on Packt Publishers (I will probably buy the booklet). That
seems to be a possible solution.

I am considering dropping the pg_dumps in favour of the pg_basebackup
method.  Will that be wise?

Regards
Johann
--
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)



You should read this:
http://www.postgresql.org/docs/9.2/static/continuous-archiving.html


And you have to decide which method fits best. I dropped the dump method 
because of the dumps' size: in my case, a weekly basebackup plus the WAL 
archives is much smaller than the dumps.


But with PITR you can only restore the whole cluster; you cannot 
restore just one database.


And hey, with PITR you can restore the cluster to any specified 
moment or transaction, not only to the time of a dump or basebackup.
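
A minimal sketch of that setup with 9.2-era parameters (the archive path and 
the weekly schedule are just examples):

# postgresql.conf
wal_level = archive                      # enough WAL detail for archiving/PITR
archive_mode = on
archive_command = 'cp %p /archive/%f'    # copy each completed WAL segment away

# weekly, e.g. from cron: compressed tar-format base backup
pg_basebackup -D /backup/base_$(date +%Y%m%d) -Ft -z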


Levi






