Re: Backup scripts - recycling old backup directories (Kevin Korb)

2014-09-15 Thread Robert Bell

Kevin,

Thanks for the reply and interest in this topic.

Comments below.

Regards

Rob.


> I did consider that but rejected it for 2 reasons...
>
> 1. Backup run time.  We have a 4 hour window to run backups at night.
> Using recycled directories significantly extended the backup run
> time.  The deletion time is eliminated but frankly, we have the other
> 20 hours of the day to do deletions.  We had to give up using
> --link-dest when the deletions started to actually take that long even
> though the backups still ran in under 4 hours.


For us, the recycling of old directories significantly shortened the time
to do backups, since the recycled backups have typically 95% of the
files/directories correct (with daily backups and Tower of Hanoi, 
half of our recycled backups are only 5 to 6 days old).


I've just done some tests with a fairly pathological case, all on one
host.

I set up a source tree 's' with 1 sub-directories and 1 files,
and then two destinations:
  cp -a s d1
  cp -afl d1 d2

I then did the first test:
  # rsync to a new directory, followed by a remove of an old directory.
  time rsync -a --link-dest=../d2 s/ d3
  time /bin/rm -rf d1


I then scrubbed the lot, set it up again, and did the second test:
  mv d1 d3
  # rsync to a recycled directory
  time rsync -a --link-dest=../d2 --delete s/ d3

I hope I got this right!  I've made no effort to circumvent caching.
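
For anyone wanting to repeat the comparison, the two tests above could be wrapped up roughly as follows. This is only a sketch: the function names are hypothetical, the tree size passed to make_tree and the one-file-per-directory layout are assumptions (use whatever resembles your own data), and it expects rsync and a cp that supports -a and -l (such as GNU cp).

```shell
#!/bin/sh
# Sketch only: function names, tree size and tree layout are assumptions.

# Build a source tree 's' with N sub-directories, one small file in each.
make_tree() (
    n=$1
    i=1
    while [ "$i" -le "$n" ]; do
        mkdir -p "s/dir$i"
        echo "$i" > "s/dir$i/file$i"
        i=$((i + 1))
    done
)

# Two destinations: d1 is a full copy of s, d2 is a hard-linked copy of d1.
setup_destinations() {
    rm -rf d1 d2 d3
    cp -a s d1
    cp -afl d1 d2
}

# Test 1: rsync to a new directory, followed by a remove of an old directory.
test_new_directory() {
    setup_destinations
    time rsync -a --link-dest=../d2 s/ d3
    time /bin/rm -rf d1
}

# Test 2: rsync to a recycled directory.
test_recycled_directory() {
    setup_destinations
    mv d1 d3
    time rsync -a --link-dest=../d2 --delete s/ d3
}
```

Run make_tree once in an empty scratch directory, then call each test function several times and average the timings.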

Anyway, here is a table of the average times (seconds) over 5 runs of each test.

          Real     User     Sys      (User+Sys)
test 1    2.454s   0.150s   2.196s   2.346s
test 2    0.392s   0.100s   0.572s   0.672s
ratio     6.3      1.5      3.8      3.5

(The User+Sys time is pretty much invariant, even though in earlier tests
the real time suffered major blowouts owing to contention.)

So, the big difference is that in test 1, the 1 sub-directories and
1 files were created in the destination d3, and then the same
numbers were deleted from the old directory d1.  In test 2, rsync does
none of that, but only has to check for differences.  ~40,000 metadata
operations avoided on the filesystem in this case.




> 2. Metadata history.  If there is an existing file in the target dir
> that differs only by metadata (permissions, ownership, timestamp) then
> rsync will simply change that metadata.  That change affects all
> instances of that file.  Of course this is better for storage space as
> the alternative is storing another copy of the file with the different
> metadata but we decided it was better to have that information saved.

Yes.
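
The effect is easy to see outside rsync. A minimal demonstration, using throwaway file names in a temporary directory (the names are made up for illustration):

```shell
#!/bin/sh
# Two names for one inode, as two --link-dest backups of an unchanged file would be.
demo_dir=$(mktemp -d)
echo "payload" > "$demo_dir/yesterday"
chmod 644 "$demo_dir/yesterday"
ln "$demo_dir/yesterday" "$demo_dir/today"

# Change the permissions through one name only.
chmod 600 "$demo_dir/today"

# Both names now report the new mode; the old 644 is not recorded anywhere.
mode_yesterday=$(ls -l "$demo_dir/yesterday" | cut -c1-10)
mode_today=$(ls -l "$demo_dir/today" | cut -c1-10)
echo "yesterday: $mode_yesterday"
echo "today:     $mode_today"
```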


I would love to see someone make a patched version of rsync to allow
callers to select a different behaviour in this case!

So, if a file has identical content on source and destination but
different metadata, and --link-dest is in use and the link count on
the destination is > 1, then take a new copy from source rather than
just updating the metadata.  (The file could be copied on the destination
and then the copy updated with the new metadata and the old version
removed, but this would not be essential - just perhaps an efficiency
gain.)
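
The logic can be illustrated with a small helper outside rsync. This is a sketch of the idea only, not of anything rsync does today: the function name is hypothetical, it deals with permissions only, and it reads the link count from the second field of `ls -ld`.

```shell
#!/bin/sh
# Hypothetical helper: set a new mode on a file without touching any other
# hard-linked instances of it (for example in older backups).
chmod_preserving_history() (
    mode=$1
    file=$2
    links=$(ls -ld "$file" | awk '{print $2}')
    if [ "$links" -gt 1 ]; then
        # Other names share this inode: give this name its own copy first.
        tmp="$file.tmp.$$"
        cp -p "$file" "$tmp" && mv -f "$tmp" "$file" || exit 1
    fi
    chmod "$mode" "$file"
)
```

The copy costs extra space for that one file, which is the trade-off described above: storage against keeping the older metadata.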

Thanks in anticipation!


Dr Robert C. Bell
HPC National Partnerships | Scientific Computing
Information Management and Technology
CSIRO
T +61 3 9669 8102 Alt +61 3 8601 3810 Mob +61 428 108 333
robert.b...@csiro.au | www.csiro.au | wiki.csiro.au/display/ASC/
Street: CSIRO ASC Level 11, 700 Collins Street, Docklands Vic 3008, Australia
Postal: CSIRO ASC Level 11, GPO Box 1289, Melbourne Vic 3001, Australia

PLEASE NOTE
The information contained in this email may be confidential or privileged.
Any unauthorised use or disclosure is prohibited.  If you have received
this email in error, please delete it immediately and notify the sender by
return email. Thank you.  To the extent permitted by law, CSIRO does not
represent, warrant and/or guarantee that the integrity of this
communication has been maintained or that the communication is free of
errors, virus, interception or interference.

Please consider the environment before printing this email.
--
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html


Re: Backup scripts - recycling old backup directories (Kevin Korb)

2014-09-15 Thread Kevin Korb

I would never operate in a manner that only has 5-6 days of old
backups.  The backups that I am deleting are more than a year old.

-- 
~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~
Kevin Korb              Phone:    (407) 252-6853
Systems Administrator   Internet:
FutureQuest, Inc.       ke...@futurequest.net  (work)
Orlando, Florida        k...@sanitarium.net (personal)
Web page:               http://www.sanitarium.net/
PGP public key available on web site.
~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~


Re: Backup scripts - recycling old backup directories (Kevin Korb)

2014-09-15 Thread Paul Slootman
On Mon 15 Sep 2014, Kevin Korb wrote:
 
> I would never operate in a manner that only has 5-6 days of old
> backups.  The backups that I am deleting are more than a year old.

I keep the Sunday backups for a month, the 1st of the month backups for
a year.

The other daily backups are expired after 10-20 days depending on
importance etc.



Paul


Recycling and keeping backups - Tower of Hanoi management of backups using rsync

2014-09-15 Thread Robert Bell

Thanks to Kevin and Paul for responses.


We use a modified Tower of Hanoi scheme (on top of rsync and --link-dest
and recycling) for deciding which backups to keep.

Here is a sample of our holdings for one area:

home.2024.seq.0   set  0
home.20130512.seq.512 set 10
home.20140203.seq.768 set  9
home.20140414.seq.832 set  7
home.20140708.seq.896 set  8
home.20140815.seq.928 set  6
home.20140831.seq.944 set  5
home.20140904.seq.948 set  3
home.20140908.seq.950 set  2
home.20140909.seq.951 set  1
home.20140910.seq.952 set  4
home.20140911.seq.953 set  1
home.20140912.seq.954 set  2
Found 13 backups as expected up to sequence number 954
Marking for recycling home.20140908.seq.950, set number 2

The coverage matches the likelihood of restorations being required - the
coverage tails off exponentially over time.

I found the key to running a Tower of Hanoi scheme is to assign a
sequence number to each backup, from which you can derive a set number.
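
Judging from the sample listing above, the set number is one more than the number of times the sequence number divides evenly by two, with sequence 0 kept as set 0. A minimal sketch under that assumption (the function name is hypothetical, and the modified scheme actually in use may differ in details):

```shell
#!/bin/sh
# Derive a Tower of Hanoi set number from a backup sequence number.
# Assumption inferred from the listing: set = (trailing zero bits) + 1,
# and sequence 0 is the base set 0.
set_number() (
    n=$1
    if [ "$n" -eq 0 ]; then
        echo 0
        exit 0
    fi
    setnum=1
    while [ $((n % 2)) -eq 0 ]; do
        n=$((n / 2))
        setnum=$((setnum + 1))
    done
    echo "$setnum"
)

# A few sequence numbers taken from the listing.
for s in 512 832 952 953 954; do
    echo "seq $s -> set $(set_number "$s")"
done
```

Because the set depends only on the sequence number, the script never needs to parse the date part of the directory name.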

The dates are for humans - scripts don't have to deal with days of the week,
days of the month, etc.


We keep one of each set number, except that we keep two of set 1, and a
'set 0' as a base set.
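
The retention rule can be sketched as a filter over a listing like the one above: walk the holdings from newest to oldest, count how many of each set have been seen, and flag any backup beyond the quota. The function name is hypothetical and the way the quota is applied here is an assumption, though on the sample holdings it flags the same directory as the log line above.

```shell
#!/bin/sh
# Read lines of the form "<name> set <n>", oldest first, and print the
# backups that exceed the quota: two of set 1, one of every other set.
# Set 0 (the base set) is never flagged.
surplus_backups() {
    awk '
        { name[NR] = $1; setno[NR] = $3 + 0 }
        END {
            # Newest to oldest, so that the newest of each set is kept.
            for (i = NR; i >= 1; i--) {
                s = setno[i]
                if (s == 0) continue
                quota = (s == 1) ? 2 : 1
                if (++seen[s] > quota) print name[i]
            }
        }
    '
}

# The most recent holdings from the sample listing.
surplus_backups <<'EOF'
home.20140904.seq.948 set  3
home.20140908.seq.950 set  2
home.20140909.seq.951 set  1
home.20140910.seq.952 set  4
home.20140911.seq.953 set  1
home.20140912.seq.954 set  2
EOF
```

The directory printed is the candidate for recycling as the target of the next backup.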


With daily backups, any file that has existed for a period of more than
about 1.5*n days in the last 2*n days will be covered, with better
coverage than that for recent files.

A strategy to keep two of every set will easily provide cover for every
file that existed for n days in the last 2*n days.

Kevin: you could use Tower of Hanoi for managing your snapshots...  :-)


Regards

Rob.
