Just thinking out loud here more than anything. I expect that either find or rsync would give you the easiest path to the change data you're after, but as long as we're thinking outside the box, have you considered enabling Copy-on-Write snapshots to get you the same information?
-s

On Tue, Jan 11, 2011 at 11:59 AM, Aaron McCaleb <[email protected]> wrote:
> Background:
> ------------
> One of my tasks this year is to produce a recommendation of backup
> strategy for one department. This is for non-financial,
> non-correspondence data that should not be subject to SOX
> requirements, at least as far as I am aware. The possible solutions
> range from an automated network backup solution and library, all the
> way down to simply issuing the users half-terabyte USB disks.
>
> I have already noted the benefits that an automated backup solution
> would provide (automation, for one... plus a history of prior versions
> of a file, should the "latest" version be corrupt... leveraging offsite
> storage arrangements that already exist for other data backups, etc.).
>
> What I am looking for, to satisfy additional information requested by
> management, is suggestions on how to quantify the amount of change in
> these filesystems from one time period (perhaps daily) to the next. If
> much of this data is immutable, then it drastically reduces the backup
> capacity required, and USB disks... or even optical media... aren't
> necessarily a bad choice. If it is changing often, particularly when
> problems may not be noticed until after many additional changes have
> also been applied, then that tilts heavily in favor of a robust backup
> solution with a sufficient restorable backup horizon.
>
> Normally the answer as to which is required is "obvious", due to
> regulatory requirements or corporate policy, if nothing else. But in
> this circumstance, it isn't.
>
> My idea thus far (Linux filesystems) is to essentially crunch ls -lR
> output (with appropriate options) from the relevant filesystems at
> regular intervals and store the stat() data for each file and
> directory in a database... though only if it had changed from the most
> recent run.
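The scan-and-compare idea quoted above could be sketched with find(1) instead of ls -lR, since find emits machine-readable stat() fields directly. This is only an illustration under assumed GNU find; the function names and file layout are mine, not the poster's:

```shell
#!/bin/sh
# scan DIR OUTFILE: one sorted line of stat() fields per object.
#   %y type, %s size, %T@ mtime, %C@ ctime, %u owner, %g group,
#   %m mode (octal), %p path -- all GNU find -printf directives.
scan() {
    find "$1" -printf '%y %s %T@ %C@ %u %g %m %p\n' | sort > "$2"
}

# churn OLD NEW: count of objects new or changed since the OLD scan.
# Diff lines starting with '>' are entries present/different in NEW.
churn() {
    diff "$1" "$2" | grep -c '^>'
}
```

A daily cron job could then call, say, `scan /srv/dept /var/tmp/scans/$(date +%Y%m%d).txt` (paths hypothetical) and run `churn` against the previous day's file; the counts per filesystem are exactly the "back of the napkin" numbers the post asks about, without needing a database for the first pass.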
> For a quick, "back of the napkin" evaluation, just
> comparing the number of records returned by some queries (organized by
> filesystem, by date/time of scan, etc.) might provide some guidance.
> But I hope to be able to delve deeper, if it seems warranted.
>
> So my first question:
> ---------------------
> Are there already tools, preferably OSS, that exist to help provide
> this sort of information? I don't really need to know the total
> amount of data that is going to be subject to a backup policy. That
> much is fairly straightforward to forecast. But I do need to quantify
> how much of it is changing in a given time period, and how often. What
> I really need are the data usage and data modification patterns.
> (Kudos if it can also catch whether the data is actually being
> accessed... though the users in question are savvy enough to know how
> to update the atime of all of their files, if they want to manipulate
> the statistics.)
>
> My second question:
> -------------------
> In designing something to collect this sort of data, what would
> you try to capture? Size? ctime, mtime, and atime? Owner, group, and
> permissions? (Links to past journal articles, mailing list
> discussions, etc., are gladly accepted, provided they aren't
> subscription-only and/or out of print.)
>
> I have already discounted using a checksum or MD5 digest... only
> because I don't want to impact the atime, if atime actually proves
> useful, and computing checksums for the volume of data in question
> will probably be much too expensive. If I want to capture the
> possibility that duplicate files might be scattered about, I can
> probably work out the likelihood from the other information.
>
> Also, assuming you have data from a baseline scan and the new stat()
> data for files and directories that have changed in subsequent scans,
> what statistics would you believe to be most useful?
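On the second question, one point worth making concrete: reading *metadata* via stat() never updates atime — only reading file contents (as a checksum pass would) does — so all of the fields listed can be captured without disturbing the access-time signal. A sketch with GNU stat(1); the helper name is mine:

```shell
#!/bin/sh
# meta PATH: size, ctime, mtime, atime (epoch seconds), owner, group,
# permission bits (octal), and name -- the fields the question lists.
# stat(2)/stat(1) read only the inode, so atime is left untouched.
meta() {
    stat -c '%s %Z %Y %X %U %G %a %n' "$1"
}
```

Called as `meta somefile`, this yields one space-separated record per path, which feeds naturally into the database (or even a flat file plus diff) described earlier in the thread.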
>
> (Note: If I do create a solution for this, I am doubtful corporate
> policy will allow me to release it under any open license. But I can
> certainly summarize any suggestions that are received.)
_______________________________________________
Tech mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
http://lopsa.org/
