Re: Fwd: rsync --link-dest and --files-from lead by a change list from some file system audit tool (Was: Re: cut-off time for rsync ?)

2015-07-16 Thread Andrew Gideon
On Mon, 13 Jul 2015 17:38:35 -0400, Selva Nair wrote:

 As with any dedup solution, performance does take a hit and it's often
 not worth it unless you have a lot of duplication in the data.

This is so only in some volumes in our case, but it appears that zfs 
permits this to be enabled/disabled on a per-volume basis.  That would 
work for us.
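
That matches how ZFS exposes it: dedup is an ordinary per-dataset property, so it can be toggled independently per dataset. A rough sketch, with hypothetical pool/dataset names:

```shell
# dedup is a per-dataset ZFS property; enable it only where the data
# actually duplicates, and leave sibling datasets alone.
# (pool/dataset names here are made up for illustration)
zfs set dedup=on tank/backups
zfs set dedup=off tank/homes
zfs get dedup tank/backups tank/homes   # verify the per-dataset settings
```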

Is there a way to save cycles by offering zfs a hint as to where a 
previous copy of a file's blocks may be found?

- Andrew

-- 
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html


Re: Fwd: rsync --link-dest and --files-from lead by a change list from some file system audit tool (Was: Re: cut-off time for rsync ?)

2015-07-16 Thread Ken Chase

yeah, i read somewhere that zfs DOES have separate tuning for metadata 
and data cache, but i need to read up on that more.

as for heavy block duplication: daily backups of the whole system = a lot of 
dupe.

/kc


On Thu, Jul 16, 2015 at 05:42:32PM +, Andrew Gideon said:
  On Mon, 13 Jul 2015 17:38:35 -0400, Selva Nair wrote:
  
   As with any dedup solution, performance does take a hit and its often
   not worth it unless you have a lot of duplication in the data.
  
  This is so only in some volumes in our case, but it appears that zfs 
  permits this to be enabled/disabled on a per-volume basis.  That would 
  work for us.
  
  Is there a way to save cycles by offering zfs a hint as to where a 
  previous copy of a file's blocks may be found?
  
   - Andrew
  

-- 
Ken Chase - k...@heavycomputing.ca skype:kenchase23 +1 416 897 6284 Toronto 
Canada
Heavy Computing - Clued bandwidth, colocation and managed linux VPS @151 Front 
St. W.



Re: rsync --link-dest and --files-from lead by a change list from some file system audit tool (Was: Re: cut-off time for rsync ?)

2015-07-16 Thread Andrew Gideon
On Tue, 14 Jul 2015 08:59:25 +0200, Paul Slootman wrote:

 btrfs has support for this: you make a backup, then create a btrfs
 snapshot of the filesystem (or directory), then the next time you make a
 new backup with rsync, use --inplace so that just changed parts of the
 file are written to the same blocks and btrfs will take care of the
 copy-on-write part.

That's interesting.  I'd considered doing something similar with LVM 
snapshots.  I chose not to do so because of a particular failure mode: if 
the space allocated to a snapshot filled (as a result of changes to the 
live data), the snapshot would fail.  For my purposes, I'd want the new 
write to fail instead.  Destroying snapshots holding backup data didn't 
seem a reasonable choice.

How does btrfs deal with such issues?

- Andrew



Re: rsync --link-dest and --files-from lead by a change list from some file system audit tool (Was: Re: cut-off time for rsync ?)

2015-07-16 Thread Simon Hobson
Andrew Gideon c182driv...@gideon.org wrote (quoting Paul Slootman):

 btrfs has support for this: you make a backup, then create a btrfs
 snapshot of the filesystem (or directory), then the next time you make a
 new backup with rsync, use --inplace so that just changed parts of the
 file are written to the same blocks and btrfs will take care of the
 copy-on-write part.
 
 That's interesting.  I'd considered doing something similar with LVM 
 snapshots.  I chose not to do so because of a particular failure mode: if 
 the space allocated to a snapshot filled (as a result of changes to the 
 live data), the snapshot would fail.  For my purposes, I'd want the new 
 write to fail instead.  Destroying snapshots holding backup data didn't 
 seem a reasonable choice.
 
 How does btrfs deal with such issues?

I'd have expected the live write to fail. The snapshot doesn't take any space 
(well only some for filesystem data) at the point of making the snapshot.

Once the snapshot is made, then any further changes just don't change the 
snapshotted data. If you overwrite the file, then new blocks are allocated to 
it from the free pool, and the metadata updated to point to it. I believe ZFS 
works in the same way.
The only difference in fact is that without the snapshot, after the new file 
has been written, the old version is freed and the space returned to the free 
pool.


Andrew Gideon c182driv...@gideon.org wrote:

 Is there a way to save cycles by offering zfs a hint as to where a 
 previous copy of a file's blocks may be found?

I would assume (and note that it is an assumption) that rsync will only 
write the blocks it needs to. It checksums the file chunk by chunk and 
only transfers changed chunks, and I assume that if you use the in-place 
option it shouldn't need to re-write the whole file.

So say you have a file with 5 blocks, stored in blocks ABCDE on the disk. You 
snapshot the volume, and update block 3 of the file - you should now have a 
snapshot file in blocks ABCDE, and a live file in blocks ABFDE, with blocks 
ABDE shared.

The caveat is that I've not really studied this - I've only read a little and 
listened to presentations - but I would really hope that both filesystems work 
that way.




Re: rsync --link-dest and --files-from lead by a change list from some file system audit tool (Was: Re: cut-off time for rsync ?)

2015-07-14 Thread Paul Slootman
On Mon 13 Jul 2015, Andrew Gideon wrote:
 
 On the other hand, I do confess that I am sometimes miffed at the waste 
 involved in a small change to a very large file.  Rsync is smart about 
 moving minimal data, but it still stores an entire new copy of the file.
 
 What's needed is a file system that can do what hard links do, but at the 
 file page level.  I imagine that this would work using the same Copy On 
 Write logic used in managing memory pages after a fork().

btrfs has support for this: you make a backup, then create a btrfs
snapshot of the filesystem (or directory), then the next time you make a
new backup with rsync, use --inplace so that just changed parts of the
file are written to the same blocks and btrfs will take care of the
copy-on-write part.
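
As a rough sketch (made-up paths; /backup is assumed to be a btrfs subvolume):

```shell
# Cycle described above: rsync --inplace into a working subvolume, then
# freeze each run with a read-only btrfs snapshot.  Note that rsync
# defaults to --whole-file for local copies, so --no-whole-file is added
# here to force the delta algorithm to rewrite only changed blocks.
rsync -a --inplace --no-whole-file /data/ /backup/current/
btrfs subvolume snapshot -r /backup/current /backup/snap-$(date +%F)

# Next run: only changed blocks of changed files are rewritten, and
# btrfs copy-on-write keeps the snapshot's old blocks intact.
rsync -a --inplace --no-whole-file /data/ /backup/current/
```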


Paul



Re: rsync --link-dest and --files-from lead by a change list from some file system audit tool (Was: Re: cut-off time for rsync ?)

2015-07-14 Thread Ken Chase
And what's performance like? I've heard the performance of lots of COW systems
drops through the floor when there are many snapshots.

/kc


On Tue, Jul 14, 2015 at 08:59:25AM +0200, Paul Slootman said:
  On Mon 13 Jul 2015, Andrew Gideon wrote:
   
   On the other hand, I do confess that I am sometimes miffed at the waste 
   involved in a small change to a very large file.  Rsync is smart about 
   moving minimal data, but it still stores an entire new copy of the file.
   
   What's needed is a file system that can do what hard links do, but at the 
   file page level.  I imagine that this would work using the same Copy On 
   Write logic used in managing memory pages after a fork().
  
  btrfs has support for this: you make a backup, then create a btrfs
  snapshot of the filesystem (or directory), then the next time you make a
  new backup with rsync, use --inplace so that just changed parts of the
  file are written to the same blocks and btrfs will take care of the
  copy-on-write part.
  
  
  Paul
  



Re: rsync --link-dest and --files-from lead by a change list from some file system audit tool (Was: Re: cut-off time for rsync ?)

2015-07-14 Thread Simon Hobson
Ken Chase rsync-list-m...@sizone.org wrote:

 And what's performance like? I've heard lots of COW systems performance
 drops through the floor when there's many snapshots.

For BTRFS I'd suspect the performance penalty to be fairly small. Snapshots can 
be done in different ways, and the way BTRFS and (I think) ZFS do it is 
actually quite elegant.

Some systems keep a current state, plus separate files for the snapshots 
(effectively a list of the differences from the current version). The 
performance hit comes when you update the current state: before writing a 
chunk, the previous current version of the chunk must be read and added to the 
snapshot(s) that include it.

I believe the way BTRFS and ZFS do it is far more elegant. When you write a 
file out, you stuff the data into a number of disk blocks, and write an entry 
into the filesystem structures to say where that data is stored.
In BTRFS, when you do a snapshot, it just notes that you've done it and at 
that point very little happens.
When you then modify a file, instead of writing the data to the same blocks on 
disk, it's written to empty space, the old version is left in place, and the 
filesystem structures are updated to account for there now being two versions. 
If you only write some blocks of the file, I'd assume that only those new 
blocks would get the COW treatment.
So the only overhead is in allocating new space to the file, and keeping two 
versions of the file allocation map.
When you delete a snapshot, all it does is delete the snapshotted versions of 
the filesystem state data and mark any freed space as free.

The only downside I see of the BTRFS way of doing it is that you'll get more 
file fragmentation. But TBH, does fragmentation really make that much 
difference on most real systems these days?




rsync --link-dest and --files-from lead by a change list from some file system audit tool (Was: Re: cut-off time for rsync ?)

2015-07-13 Thread Andrew Gideon
On Mon, 13 Jul 2015 02:19:23 +, Andrew Gideon wrote:

 Look at tools like inotifywait, auditd, or kfsmd to see what's easily
 available to you and what best fits your needs.
 
 [Though I'd also be surprised if nobody has fed audit information into
 rsync before; your need doesn't seem all that unusual given ever-growing
 disk storage.]

I wanted to take this a bit further.  I've thought, on and off, about 
this for a while and I always get stuck.

I use rsync with --link-dest as a backup tool.  For various reasons, this 
is not something I want to give up.  But, esp. for some very large file 
systems, doing something that avoids the scan would be desirable.

I should also add that I mistrust time-stamp, and even time-stamp+file-
size, mechanisms for detecting changes.  Checksums, on the other hand, are 
prohibitively expensive for backup of large file systems.

These both bring me to the idea of using some file system auditing 
mechanism to drive - perhaps with an --include-from or --files-from - 
what rsync moves.

Where I get stuck is that I cannot envision how I can provide rsync with 
a limited list of files to move that doesn't deny the benefit of --link-
dest: a complete snapshot of the old file system via [hard] links into a 
prior snapshot for those files that are unchanged.

Has anyone done something of this sort?  I'd thought of preceding the 
rsync with a cp -Rl on the destination from the old snapshot to the new 
snapshot, but I still think that this will break in the face of hard 
links (to a file not in the --files-from list) or a change to file 
attributes (i.e. a chmod would affect the copy of a file in the old 
snapshot).
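
The chmod concern can be demonstrated without rsync at all: after a cp -Rl "snapshot", a permissions change on the new tree reaches back through the shared hard link into the old snapshot. Paths here are throwaway examples:

```shell
#!/bin/sh
# Shows that a hard-link copy shares inodes, so a chmod on the "new"
# snapshot also alters the "old" one.
set -e
tmp=$(mktemp -d)
mkdir "$tmp/snap.0"
echo data > "$tmp/snap.0/file"
chmod 644 "$tmp/snap.0/file"

cp -Rl "$tmp/snap.0" "$tmp/snap.1"   # hard-link copy of the previous snapshot

chmod 600 "$tmp/snap.1/file"         # latest backup only changes mode bits

stat -c '%a' "$tmp/snap.0/file"      # prints 600: the old snapshot changed too
rm -rf "$tmp"
```

Mode bits and ownership live in the inode, which every hard link shares; this is why --link-dest only links files it has verified as unchanged.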

Thanks...

Andrew



Re: rsync --link-dest and --files-from lead by a change list from some file system audit tool (Was: Re: cut-off time for rsync ?)

2015-07-13 Thread Simon Hobson
Andrew Gideon c182driv...@gideon.org wrote:

 These both bring me to the idea of using some file system auditing 
 mechanism to drive - perhaps with an --include-from or --files-from - 
 what rsync moves.
 
 Where I get stuck is that I cannot envision how I can provide rsync with 
 a limited list of files to move that doesn't deny the benefit of --link-
 dest: a complete snapshot of the old file system via [hard] links into a 
 prior snapshot for those files that are unchanged.

The thing here is that you are into backup tools rather than the general 
purpose tool that rsync is intended to be.

storebackup does some elements of what you talk about in that it keeps a 
catalogue of existing files in the backup with a hash/checksum for each. I'm 
not sure how it goes about picking changed files - I suspect it uses 
time+size as a primary filter, but on the other hand I know for a fact you 
can touch a file and that change won't appear in the destination*.
But for remote backups, the primary server can generate a changes list which is 
then copied to the remote server which then adds the new/changed files and 
hard-links the unchanged ones according to the list it's been given.
If you turn off the file splitting and compression options, the backup is a 
series of hard-linked directories which you can look into and pull files 
directly.
* But if you do alter the timestamp on a file without changing the contents, 
that will not appear in the file structure in the backup - later copies of 
the file retain the earlier timestamp. It does keep this information, and if 
you use the corresponding restore tool then you get back the correct timestamp.


In a completely different setup, I also use Retrospect. Recent versions have an 
option (Instant Scan) to allow the client to keep an audit of changes, which 
avoids the full client scan and massive compare that's needed with this option 
turned off.




Re: rsync --link-dest and --files-from lead by a change list from some file system audit tool (Was: Re: cut-off time for rsync ?)

2015-07-13 Thread Ken Chase
inotifywatch or equiv, there's FSM stuff (filesystem monitor) as well.

constantData had a product we used years ago - a kernel module that dumped
out a list of any changed files via some /proc or /dev/* device, and they
had a whole toolset that ate the list (into some db) and played it out
as it constantly tried to keep up with replication to a target (kinda like
drbd but async). They got eaten by some large backup company and the product
was later priced at 5x what we had paid for it (in the mid $x000s/y).

This 2003-4 technology is certainly available in some format now.

If you only copy the changes, you're likely saving a lot of time.

/kc


On Mon, Jul 13, 2015 at 01:53:43PM +, Andrew Gideon said:
  On Mon, 13 Jul 2015 02:19:23 +, Andrew Gideon wrote:
  
   Look at tools like inotifywait, auditd, or kfsmd to see what's easily
   available to you and what best fits your needs.
   
   [Though I'd also be surprised if nobody has fed audit information into
   rsync before; your need doesn't seem all that unusual given ever-growing
   disk storage.]
  
  I wanted to take this a bit further.  I've thought, on and off, about 
  this for a while and I always get stuck.
  
  I use rsync with --link-dest as a backup tool.  For various reasons, this 
  is not something I want to give up.  But, esp. for some very large file 
  systems, doing something that avoids the scan would be desirable.
  
  I should also add that I mistrust time-stamp, and even time-stamp+file-
  size, mechanism for detecting changes.  Checksums, on the other hand, are 
  prohibitively expensive for backup of large file systems.
  
  These both bring me to the idea of using some file system auditing 
  mechanism to drive - perhaps with an --include-from or --files-from - 
  what rsync moves.
  
  Where I get stuck is that I cannot envision how I can provide rsync with 
  a limited list of files to move that doesn't deny the benefit of --link-
  dest: a complete snapshot of the old file system via [hard] links into a 
  prior snapshot for those files that are unchanged.
  
  Has anyone done something of this sort?  I'd thought of preceding the 
  rsync with a cp -Rl on the destination from the old snapshot to the new 
  snapshot, but I still think that this will break in the face of hard 
  links (to a file not in the --files-from list) or a change to file 
  attributes (ie. a chmod would effect the copy of a file in the old 
  snapshot).
  
  Thanks...
  
   Andrew
  



Re: rsync --link-dest and --files-from lead by a change list from some file system audit tool (Was: Re: cut-off time for rsync ?)

2015-07-13 Thread Simon Hobson
Andrew Gideon c182driv...@gideon.org wrote:

 However, you've made me a little 
 apprehensive about storebackup.  I like the lack of a need for a restore 
 tool.  This permits all the standard UNIX tools to be applied to 
 whatever I might want to do over the backup, which is often *very* 
 convenient.

Well if you don't use the file splitting and compression options, you can still 
do that with storebackup - just be aware that some files may have different 
timestamps (but not contents) to the original. Specifically, consider this 
sequence:
- Create a file, perform a backup
- touch the file to change its modification timestamp, perform another backup
rsync will (I think) see the new file with a different timestamp and create a new 
file rather than linking to the old one.
storebackup will link the files (so taking (almost) zero extra space) - but the 
second backup will show the file with the timestamp from the first file. If you 
just cp -p the file then it'll have the earlier timestamp; if you restore it 
with the storebackup tools then it'll come out with the later timestamp.

 On the other hand, I do confess that I am sometimes miffed at the waste 
 involved in a small change to a very large file.  Rsync is smart about 
 moving minimal data, but it still stores an entire new copy of the file.

I'm not sure as I've not used it, but storebackup has the option of splitting 
large files (threshold user definable). You'd need to look and see if it 
compares file parts (hard-linking unchanged parts) or the whole file (creates 
all new parts).

 What's needed is a file system that can do what hard links do, but at the 
 file page level.  I imagine that this would work using the same Copy On 
 Write logic used in managing memory pages after a fork().

Well some (all?) enterprise grade storage boxes support de-dup - usually at 
the block level. So it does exist, at a price!




Fwd: rsync --link-dest and --files-from lead by a change list from some file system audit tool (Was: Re: cut-off time for rsync ?)

2015-07-13 Thread Selva Nair
On Mon, Jul 13, 2015 at 5:19 PM, Simon Hobson li...@thehobsons.co.uk
wrote:

  What's needed is a file system that can do what hard links do, but at the
  file page level.  I imagine that this would work using the same Copy On
  Write logic used in managing memory pages after a fork().

 Well some (all ?) enterprise grade storage boxes support de-dup - usually
 at the block level. So it does exist, at a price !



zfs is free and has de-dup. It takes more RAM to support it well, but not
prohibitively so unless your data is more than a few TB. As with any dedup
solution, performance does take a hit and it's often not worth it unless you
have a lot of duplication in the data.

Selva

Re: cut-off time for rsync ?

2015-07-12 Thread Andrew Gideon
On Thu, 02 Jul 2015 20:57:06 +1200, Mark wrote:

 You could use find to build a filter to use with rsync, then update the
 filter every few days if it takes too long to create.

If you're going to do something of that sort, you might want instead to 
consider truly tracking changes.  This catches operations that find will 
miss, such as deletes, renames, copies preserving timestamp (cp -p ...), 
and probably other operations not coming to mind at the moment.

Look at tools like inotifywait, auditd, or kfsmd to see what's easily 
available to you and what best fits your needs.
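
As a rough sketch of the audit-feeds-rsync idea (assumes inotify-tools is installed; paths and the backup host are made up):

```shell
# Record every change under /data across the backup interval, then hand
# the deduplicated list to rsync via --files-from.
inotifywait -m -r -e modify,create,delete,move --format '%w%f' /data \
  > /var/tmp/changes.raw &

# ...later, at backup time: strip the source prefix so the paths are
# relative to the transfer root, and de-duplicate.
sort -u /var/tmp/changes.raw | sed 's#^/data/##' > /var/tmp/changes.list
rsync -a -r --files-from=/var/tmp/changes.list /data/ backup:/backup/data/
```

Note that deletions still need separate handling: a path that no longer exists on the source is skipped by --files-from rather than removed on the destination.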

[Though I'd also be surprised if nobody has fed audit information into 
rsync before; your need doesn't seem all that unusual given ever-growing 
disk storage.]

In addition to catching operations that a find would miss, this also 
avoids the cost of scanning file systems which is the immediate need 
being discussed.  On the other hand, this isn't free either.  I imagine 
that there's some crossover point on one side of which scanning is better 
and on the other auditing is better.

- Andrew


Re: cut-off time for rsync ?

2015-07-03 Thread Simon Hobson
Ken Chase rsync-list-m...@sizone.org wrote:

 You have NO IDEA how long it takes to scan 100M files
 on a 7200 rpm disk.

Actually I do have some idea !

 Additionally, I dont know if linux (or freebsd or any unix) can be told to 
 cache
 metadata more aggressively than data

That had gone through my mind - how much RAM do you have in the backup system? 
Also, what other options do you use? I've found some of them (especially 
hard-links) can have a significant impact on performance.

Otherwise, have you looked at StoreBackup ?
It's probably somewhat more than you are after, *but* it does have a mode 
specifically for efficient transfer of backups from one system to another. I've 
been using it for a few years for my backups (keeping multiple backups etc) but 
haven't used the remote transfer bit yet.



Re: cut-off time for rsync ?

2015-07-02 Thread Dirk van Deun
 What is taking time, scanning inodes on the destination, or recopying the 
 entire
 backup because of either source read speed, target write speed or a slow 
 interconnect
 between them?

It takes hours to traverse all these directories with loads of small
files on the backup server.  That is the limiting factor.  Not
even copying: just checking the timestamp and size of the old copies.

The source server is the actual live system, which has fast disks,
so I can afford to move the burden to the source side, using the find
utility to select homes that have been touched recently and using
rsync only on these.

But it would be nice if a clever invocation of rsync could remove the
extra burden entirely.

Dirk van Deun
-- 
Ceterum censeo Redmond delendum


Re: cut-off time for rsync ?

2015-07-02 Thread Mark
You could use find to build a filter to use with rsync, then update the 
filter every few days if it takes too long to create.


I have used a script to build a filter on the source server to exclude 
anything over 5 days old, invoked when the sync starts, but it only 
parses around 2000 files per run.


Mark.


On 2/07/2015 2:34 a.m., Ken Chase wrote:

What is taking time, scanning inodes on the destination, or recopying the entire
backup because of either source read speed, target write speed or a slow 
interconnect
between them?

Do you keep a full new backup every day, or are you just overwriting the target
directory?

/kc


On Wed, Jul 01, 2015 at 10:06:57AM +0200, Dirk van Deun said:
If your goal is to reduce storage, and scanning inodes doesnt matter,
use --link-dest for targets. However, that'll keep a backup for every
time that you run it, by link-desting yesterday's copy.
   
   The goal was not to reduce storage, it was to reduce work.  A full
   rsync takes more than the whole night, and the destination server is
   almost unusable for anything else when it is doing its rsyncs.  I
   am sorry if this was unclear.  I just want to give rsync a hint that
   comparing files and directories that are older than one week on
   the source side is a waste of time and effort, as the rsync is done
   every day, so they can safely be assumed to be in sync already.
   
   Dirk van Deun
   --
   Ceterum censeo Redmond delendum





Re: cut-off time for rsync ?

2015-07-02 Thread Ken Chase
Yes, if rsync could keep a 'last state file' that'd be great, which would
require the target be unchanged by any other process/usage - this is however
the case with many of our uses here - as a backup-only target.

Then it could just load the target statefile, and only scan the source
for changes vs the last-state file. 

Can't think of any way around this issue with rsync alone without some external
parsing of previous logs, etc. 

This is unfortunately why I never use 5400/5900 rpm disks on my backup targets,
and use raid 10 not 5, for speed. Little more $ in the end, but necessary
to scan 50-80M inodes per night in my ~6hr backup window.

/kc


On Thu, Jul 02, 2015 at 11:43:37AM +0200, Dirk van Deun said:
   What is taking time, scanning inodes on the destination, or recopying the 
entire
   backup because of either source read speed, target write speed or a slow 
interconnect
   between them?
  
  It takes hours to traverse all these directories with loads of small
  files on the backup server.  That is the limiting factor.  Not
  even copying: just checking the timestamp and size of the old copies.
  
  The source server is the actual live system, which has fast disks,
  so I can afford to move the burden to the source side, using the find
  utility to select homes that have been touched recently and using
  rsync only on these.
  
  But it would be nice if a clever invocation of rsync could remove the
  extra burden entirely.
  
  Dirk van Deun
  -- 
  Ceterum censeo Redmond delendum



Re: cut-off time for rsync ?

2015-07-02 Thread Ken Chase
On Wed, Jul 01, 2015 at 02:05:50PM +0100, Simon Hobson said:

  As I read this, the default is to look at the file size/timestamp and if
  they match then do nothing as they are assumed to be identical. So unless
  you have specified this, then files which have already been copied should be
  ignored - the check should be quite low in CPU, at least compared to the
  cost of generating a file checksum etc.

This reflects how many rsync users aren't sufficiently abusing rsync to do
backups like us idiots do! :) You have NO IDEA how long it takes to scan 100M
files on a 7200 rpm disk. It becomes the dominant issue - CPU isn't the issue at
all. (Additionally, I would think that metadata scanning could max out only 2
cores anyway - 1 for rsync's userland, plus another core of kernel running the
fs, scanning inodes.)

This is why throwing away all that metadata seems silly. Keeping detailed logs
and parsing them before the copy would be good, but requires an external
selection script before rsync starts, handing rsync a list of files to copy
directly. Unfortunate, because rsync's scan method is quite advanced but doesn't
avoid this pitfall.

Additionally, I don't know if linux (or freebsd or any unix) can be told to cache
metadata more aggressively than data - not much point for the latter on a backup
server. The former would be great. I don't know how big metadata is in RAM either
for typical OS's, per inode.

/kc


Re: cut-off time for rsync ?

2015-07-01 Thread Ken Chase
What is taking time, scanning inodes on the destination, or recopying the entire
backup because of either source read speed, target write speed or a slow 
interconnect
between them?

Do you keep a full new backup every day, or are you just overwriting the target
directory?

/kc


On Wed, Jul 01, 2015 at 10:06:57AM +0200, Dirk van Deun said:
   If your goal is to reduce storage, and scanning inodes doesnt matter,
   use --link-dest for targets. However, that'll keep a backup for every
   time that you run it, by link-desting yesterday's copy.
   
  The goal was not to reduce storage, it was to reduce work.  A full
  rsync takes more than the whole night, and the destination server is
  almost unusable for anything else when it is doing its rsyncs.  I
  am sorry if this was unclear.  I just want to give rsync a hint that
  comparing files and directories that are older than one week on
  the source side is a waste of time and effort, as the rsync is done
  every day, so they can safely be assumed to be in sync already.
  
  Dirk van Deun
  -- 
  Ceterum censeo Redmond delendum



Re: cut-off time for rsync ?

2015-07-01 Thread Dirk van Deun
 I used to rsync a /home with thousands of home directories every
 night, although only a hundred or so would be used on a typical day,
 and many of them have not been used for ages.  This became too large a
 burden on the poor old destination server, so I switched to a script
 that uses find -ctime -7 on the source to select recently used homes
 first, and then rsyncs only those.  (A week being a more than good
 enough safety margin in case something goes wrong occasionally.)
 
 Doing it this way you can't delete files that have disappeared or been
 renamed.
 
 Is there a smarter way to do this, using rsync only ?  I would like to
 use rsync with a cut-off time, saying if a file is older than this,
 don't even bother checking it on the destination server (and the same
 for directories -- but without ending a recursive traversal).  Now
 I am traversing some directories twice on the source server to lighten
 the burden on the destination server (first find, then rsync).
 
 I would split up the tree into several sub trees and sync them
 normally, like /home/a* etc. You can then distribute the calls
 over several days. If that is still too much then maybe do the
 find call, but then sync the whole user's home instead of just
 the found files.

As I did say in my original mail, but apparently did not emphasize
sufficiently: rsyncing complete homes when anything changed in them is
actually what I do, so files that have been deleted or renamed are
handled correctly.  Anyway, the first paragraph was just to provide
some context.  My real question is: can you specify a cut-off time
using rsync only, meaning that files are ignored and directories are
considered up to date on the destination server if they have not
been touched for x days on the source?

Dirk van Deun
-- 
Ceterum censeo Redmond delendum


Re: cut-off time for rsync ?

2015-07-01 Thread Dirk van Deun
 If your goal is to reduce storage, and scanning inodes doesnt matter,
 use --link-dest for targets. However, that'll keep a backup for every
 time that you run it, by link-desting yesterday's copy.
 
The goal was not to reduce storage, it was to reduce work.  A full
rsync takes more than the whole night, and the destination server is
almost unusable for anything else when it is doing its rsyncs.  I
am sorry if this was unclear.  I just want to give rsync a hint that
comparing files and directories that are older than one week on
the source side is a waste of time and effort, as the rsync is done
every day, so they can safely be assumed to be in sync already.

Dirk van Deun
-- 
Ceterum censeo Redmond delendum


Re: cut-off time for rsync ?

2015-06-30 Thread Fabian Cenedese
At 10:32 30.06.2015, Dirk van Deun wrote:
Hi,

I used to rsync a /home with thousands of home directories every
night, although only a hundred or so would be used on a typical day,
and many of them have not been used for ages.  This became too large a
burden on the poor old destination server, so I switched to a script
that uses find -ctime -7 on the source to select recently used homes
first, and then rsyncs only those.  (A week being a more than good
enough safety margin in case something goes wrong occasionally.)

Doing it this way you can't delete files that have disappeared or been
renamed.

Is there a smarter way to do this, using rsync only ?  I would like to
use rsync with a cut-off time, saying if a file is older than this,
don't even bother checking it on the destination server (and the same
for directories -- but without ending a recursive traversal).  Now
I am traversing some directories twice on the source server to lighten
the burden on the destination server (first find, then rsync).

I would split up the tree into several sub trees and sync them
normally, like /home/a* etc. You can then distribute the calls
over several days. If that is still too much then maybe do the
find call, but then sync the whole user's home instead of just
the found files.

bye  Fabi



Re: cut-off time for rsync ?

2015-06-30 Thread Ken Chase
If your goal is to reduce storage, and scanning inodes doesn't matter,
use --link-dest for targets. However, that'll keep a backup for every
time that you run it, by link-desting yesterday's copy.

You end up with a backup tree dir per day, with files hardlinked against
all other backup dirs. My solution (and that of many others here) is to

mv $ancientbackup $today; rsync --del --link-dest=$yest source:$dirs $today

creating gaps in the ancient sequence of days of backups - so I end up
keeping (very roughly) backups that are 1,2,3,4,7,10,15,21,30,45,60,90,120,180
days old. (Of course this isn't exactly how it works; there's some binary
counting going on in there, so the elimination isn't exactly like that - every
day, each of those gets a day older. There are some Tower-of-Hanoi-like
schemes for automating this rotation.)

This means something twice as old has half as many backups for the same time
range, meaning I keep the same frequency*age value for each backup timerange
into the past.

The result is a set of dirs dated (in my case) e.g. 20150630, each of which
looks exactly like the source tree I backed up, but only takes up the space of
files changed since yesterday. (Caveat: it's hardlinked against all the other
backups, thus using no more space on disk. HOWEVER, some server software such
as postfix doesn't like hardlinked files in its spool, due to security
concerns - so if you boot/use the backup itself without making a plain copy
(which is recommended), 1) postfix et al will yell, and 2) you will be
modifying the whole set of dirs that share the inode you just booted/used.)

My solution avoids scanning the source twice (which, in my case of backing up
5x 10M files off servers daily, is a huge cost), important because the scan
time takes longer than the backup/transfer time (gigE network; a mere 20,000
changed files per 10M seems average per box of 5). Also it's production gear -
spending as little time as possible thrashing the box (and its poor metadata
cache) is important for performance. Getting the backups done during the night
lull is therefore required. I don't have time to delete (nor the disk RMA-cycle
patience for) 10M files on the receiving side just to spend 5 hours recreating
them; 20,000 seems better to me.

You could also use --backup and --backup-dir, but I don't do it that way.

/kc


On Tue, Jun 30, 2015 at 10:32:31AM +0200, Dirk van Deun said:
  Hi,
  
  I used to rsync a /home with thousands of home directories every
  night, although only a hundred or so would be used on a typical day,
  and many of them have not been used for ages.  This became too large a
  burden on the poor old destination server, so I switched to a script
  that uses find -ctime -7 on the source to select recently used homes
  first, and then rsyncs only those.  (A week being a more than good
  enough safety margin in case something goes wrong occasionally.)
  
  Is there a smarter way to do this, using rsync only ?  I would like to
  use rsync with a cut-off time, saying if a file is older than this,
  don't even bother checking it on the destination server (and the same
  for directories -- but without ending a recursive traversal).  Now
  I am traversing some directories twice on the source server to lighten
  the burden on the destination server (first find, then rsync).
  
  Best,
  
  Dirk van Deun
  -- 
  Ceterum censeo Redmond delendum
