For backup-focused churn, you're mostly going to want size and mtime, since those are what will most affect your backup window.
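Size and mtime are enough for a first-cut number. A minimal sketch (the path and window are placeholders; adjust to your environment) that sums the bytes in files touched within a recent window:

```python
#!/usr/bin/env python3
"""Rough churn estimate: total bytes in regular files modified within a window.

Hypothetical sketch -- not a finished tool; the root path and window
length are assumptions you would tune for your own filesystems.
"""
import os
import sys
import time

def churn_bytes(root, window_days=1):
    """Sum sizes of regular files whose mtime falls inside the window."""
    cutoff = time.time() - window_days * 86400
    total = count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: don't follow symlinks
            except OSError:
                continue  # file vanished mid-walk, or permission denied
            if st.st_mtime >= cutoff:
                total += st.st_size
                count += 1
    return total, count

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    total, count = churn_bytes(root)
    print(f"{count} files changed, {total} bytes, in the last day under {root}")
```

Run daily from cron against each filesystem and you have a first approximation of the metric below without storing any state between runs.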
I'd expect that the metric you'd probably want to start with is bytes of files changed per day/week/month. Unless you're using a block-based backup system that can do incrementals at the sub-file level, a changed file means backing up the entire file again, so just add up the sizes of all of the files changed at least once each day/week/month, and you have your metric.

As for code to do this, you could certainly build something yourself, or just run something like tripwire and calculate a churn level off of the changed-file report that it generates each day.

--Ted

On 1/11/2011 11:59 AM, Aaron McCaleb wrote:
> Background:
> ------------
> One of my tasks this year is to produce a recommendation of backup
> strategy for one department. This is for non-financial,
> non-correspondence data that should not be subject to SOX
> requirements, at least as far as I am aware. The possible solutions
> range from an automated network backup solution and library, all the
> way down to simply issuing the users half-terabyte USB disks.
>
> I have already noted the benefits that an automated backup solution
> would provide (automation, for one...plus a history of prior versions
> of a file, should the "latest" version be corrupt...leveraging offsite
> storage arrangements that already exist for other data backups, etc.)
>
> What I am looking for, to satisfy additional information requested by
> management, is suggestions on how to quantify the amount of change in
> these filesystems from one time period (perhaps daily) to the next.
> If much of this data is immutable, then it drastically reduces the
> backup capacity required, and USB disks...or even optical media,
> aren't necessarily a bad choice. If it is changing often,
> particularly when problems may not be noticed until after many
> additional changes are also applied, then that tilts heavily in favor
> of a robust backup solution, with a sufficient restorable backup
> horizon.
>
> Normally the answer as to which is required is "obvious", due to
> regulatory requirements or corporate policy, if nothing else. But in
> this circumstance, it isn't.
>
> My idea, thus far (Linux filesystems), is to essentially crunch
> ls -lR output (with appropriate options) from the relevant
> filesystems at regular intervals and store the stat() data for each
> file and directory in a database...though only if it has changed from
> the most recent run. For a quick, "back of the napkin" evaluation,
> just comparing the number of records returned by some queries
> (organized by filesystem, by date/time of scan, etc.) might provide
> some guidance. But I hope to be able to delve deeper, if it seems
> warranted.
>
> So my first question:
> ---------------------
> Are there already tools, preferably OSS, that exist to help provide
> this sort of information? I don't really need to know the total
> amount of data that is going to be subject to a backup policy. That
> much is fairly straightforward to forecast. But I do need to quantify
> how much of it is changing in a given time period, and how often.
> What I really need are the data usage and data modification patterns.
> (Kudos if it can also catch whether the data is actually being
> accessed...though the users in question are savvy enough to know how
> to update the atime of all of their files, if they want to manipulate
> the statistics.)
>
> My second question:
> -------------------
> In designing something to try to collect this sort of data, what
> would you try to capture? size? ctime, mtime and atime? owner, group
> and permissions? (Links to past journal articles, mailing list
> discussions, etc., are gladly accepted, provided they aren't
> subscription-only and/or out-of-print.)
>
> I have already discounted using a checksum or md5 digest.
> ...only because I don't want to impact the atime, if atime actually
> proves useful, and computing checksums for the volume of data in
> question will probably be much too expensive. If I want to capture
> the possibility that duplicate files might be scattered about, I can
> probably work out the likelihood from the other information.
>
> Also, assuming you have data from a baseline scan and the new stat()
> data for files and directories that have changed in subsequent scans,
> what statistics would you believe to be most useful?
>
> (Note: If I do create a solution for this, I am doubtful corporate
> policy will allow me to release it under any open license. But I can
> certainly summarize any suggestions that are received.)
> _______________________________________________
> Tech mailing list
> [email protected]
> https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
> This list provided by the League of Professional System Administrators
> http://lopsa.org/
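P.S. The stat()-snapshot idea above can be prototyped without a database at all. A minimal sketch (function names are mine, not an existing tool) that records (size, mtime) per file and diffs two scans -- in real use you'd persist each snapshot between runs, e.g. to sqlite or a pickle:

```python
#!/usr/bin/env python3
"""Two-scan diff: record (size, mtime) per file, then report what changed.

Hypothetical sketch of the snapshot-and-compare approach; persisting the
snapshots between runs is left out for brevity.
"""
import os

def snapshot(root):
    """Map relative path -> (size, mtime) for every file under root."""
    snap = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: don't follow symlinks
            except OSError:
                continue  # skip files that vanish mid-walk
            snap[os.path.relpath(path, root)] = (st.st_size, st.st_mtime)
    return snap

def diff(old, new):
    """Return (added, removed, changed) path sets between two snapshots."""
    added = set(new) - set(old)
    removed = set(old) - set(new)
    changed = {p for p in set(old) & set(new) if old[p] != new[p]}
    return added, removed, changed
```

Summing the sizes of the added and changed sets each run gives the bytes-changed-per-period metric directly, and the record counts give the quick "back of the napkin" numbers without any SQL.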
