On Tue, Jan 11, 2011 at 2:59 PM, Aaron McCaleb <[email protected]> wrote:
> Background:
> -----------
> One of my tasks this year is to produce a recommendation on backup
> strategy for one department. This is for non-financial,
> non-correspondence data that should not be subject to SOX
> requirements, at least as far as I am aware. The possible solutions
> range from an automated network backup solution and library all the
> way down to simply issuing the users half-terabyte USB disks.
>
> I have already noted the benefits that an automated backup solution
> would provide (automation, for one; plus a history of prior versions
> of a file, should the "latest" version be corrupt; leveraging offsite
> storage arrangements that already exist for other data backups; etc.).
>
> What I am looking for, to satisfy additional information requested by
> management, is suggestions on how to quantify the amount of change in
> these filesystems from one time period (perhaps daily) to the next. If
> much of this data is immutable, then that drastically reduces the
> backup capacity required, and USB disks, or even optical media, aren't
> necessarily a bad choice. If it is changing often, particularly when
> problems may not be noticed until after many additional changes have
> also been applied, then that tilts heavily in favor of a robust backup
> solution with a sufficient restorable backup horizon.
>
> Normally the answer as to which is required is "obvious", due to
> regulatory requirements or corporate policy, if nothing else. But in
> this circumstance, it isn't.
>
> My idea thus far (Linux filesystems) is essentially to crunch ls -lR
> output (with appropriate options) from the relevant filesystems at
> regular intervals and store the stat() data for each file and
> directory in a database, though only if it has changed since the most
> recent run.
>
> For a quick, "back of the napkin" evaluation, just comparing the
> number of records returned by some queries (organized by filesystem,
> by date/time of scan, etc.) might provide some guidance. But I hope
> to be able to delve deeper, if it seems warranted.
>
> So my first question:
> ---------------------
> Are there already tools, preferably OSS, that exist to help provide
> this sort of information? I don't really need to know the total
> amount of data that is going to be subject to a backup policy. That
> much is fairly straightforward to forecast. But I do need to quantify
> how much of it is changing in a given time period, and how often.
> What I really need are the data usage and data modification patterns.
> (Kudos if it can also catch whether the data is actually being
> accessed, though the users in question are savvy enough to know how
> to update the atime of all of their files, if they want to manipulate
> the statistics.)
>
> My second question:
> -------------------
> In designing something to try to collect this sort of data, what
> would you try to capture? Size? ctime, mtime, and atime? Owner,
> group, and permissions? (Links to past journal articles, mailing list
> discussions, etc. are gladly accepted, provided they aren't
> subscription-only and/or out of print.)
>
> I have already discounted using a checksum or MD5 digest, only
> because I don't want to impact the atime if atime actually proves
> useful, and computing checksums for the volume of data in question
> will probably be much too expensive. If I want to capture the
> possibility that duplicate files might be scattered about, I can
> probably work out the likelihood from the other information.
>
> Also, assuming you have data from a baseline scan and the new stat()
> data for files and directories that have changed in subsequent scans,
> what statistics would you believe to be most useful?
>
> (Note: If I do create a solution for this, I am doubtful corporate
> policy will allow me to release it under any open license. But I can
> certainly summarize any suggestions that are received.)
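For what it's worth, the scan-and-diff idea you describe can be sketched in fairly little code. The following is a minimal, illustrative Python version (table and column names are my own invention, not from any existing tool): each run records size, mtime, ctime, owner, group, and mode for every file into SQLite, and a second function counts added/removed/modified paths between the two most recent runs. Reading stat() metadata this way doesn't update atime, so it avoids the problem you noted with checksumming.

```python
#!/usr/bin/env python
"""Snapshot stat() metadata for a tree into SQLite on each run, then
count how many entries changed since the previous run. Illustrative
sketch only; schema and function names are assumptions."""

import os
import sqlite3
import time

DB_PATH = "scan_history.db"  # assumed location; keep it OUTSIDE the scanned tree


def scan(root, db_path=DB_PATH):
    """Record size/mtime/ctime/uid/gid/mode for every file under root."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS scans (
               scan_time INTEGER, path TEXT, size INTEGER,
               mtime INTEGER, ctime INTEGER,
               uid INTEGER, gid INTEGER, mode INTEGER)"""
    )
    now = int(time.time())  # one-second granularity; fine for daily scans
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            p = os.path.join(dirpath, name)
            try:
                st = os.lstat(p)  # lstat: don't follow symlinks, doesn't touch atime
            except OSError:
                continue  # file vanished mid-scan
            conn.execute(
                "INSERT INTO scans VALUES (?,?,?,?,?,?,?,?)",
                (now, p, st.st_size, int(st.st_mtime), int(st.st_ctime),
                 st.st_uid, st.st_gid, st.st_mode))
    conn.commit()
    conn.close()
    return now


def churn(db_path=DB_PATH):
    """Compare the two most recent scans: how many paths are new, gone,
    or have different metadata."""
    conn = sqlite3.connect(db_path)
    times = [r[0] for r in conn.execute(
        "SELECT DISTINCT scan_time FROM scans "
        "ORDER BY scan_time DESC LIMIT 2")]
    if len(times) < 2:
        conn.close()
        return None  # need a baseline plus at least one later scan

    def snap(t):
        return {r[0]: r[1:] for r in conn.execute(
            "SELECT path, size, mtime, ctime, uid, gid, mode "
            "FROM scans WHERE scan_time = ?", (t,))}

    newer, older = times
    a, b = snap(older), snap(newer)
    conn.close()
    return {
        "added": len(b.keys() - a.keys()),
        "removed": len(a.keys() - b.keys()),
        "modified": sum(1 for p in a.keys() & b.keys() if a[p] != b[p]),
    }
```

Storing every scan (rather than only changed rows, as you proposed) keeps the diff logic trivial at the cost of space; dropping unchanged rows is an easy optimization once the queries are settled. Adding atime as another column would be a one-line change if it turns out to be useful.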
You should take a look at BackupPC. It's open source, and I think it does a lot of what you're asking for. It runs automatically and does a lot of deduplication on the back end. I looked at it for myself, and it turned out it didn't meet my needs due to too much reliance on rsync for include/exclude processing (but that's only an issue for overly complex excludes). If you have a simple folder structure, it would work just fine.

_______________________________________________
Tech mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
http://lopsa.org/
