Background:
------------
One of my tasks this year is to produce a backup-strategy
recommendation for one department.  This is for non-financial,
non-correspondence data that should not be subject to SOX
requirements, at least as far as I am aware.  The possible solutions
range from an automated network backup solution and library all the
way down to simply issuing the users half-terabyte USB disks.

I have already noted the benefits that an automated backup solution
would provide (automation, for one...plus a history of prior versions
of a file, should the "latest" version be corrupt...plus leveraging
offsite storage arrangements that already exist for other data
backups, etc.).

What I am looking for, to satisfy additional information requested by
management, is suggestions on how to quantify the amount of change in
these filesystems from one time period (perhaps daily) to the next.
If much of this data is immutable, then it drastically reduces the
backup capacity required, and USB disks...or even optical media...
aren't necessarily a bad choice.  If it is changing often,
particularly when problems may not be noticed until after many
additional changes have also been applied, then that tilts heavily in
favor of a robust backup solution with a sufficiently long restorable
backup horizon.

Normally the answer as to which is required is "obvious", due to
regulatory requirements or corporate policy, if nothing else.  But in
this circumstance, it isn't.

My idea thus far (Linux filesystems) is essentially to crunch ls -lR
output (with appropriate options) from the relevant filesystems at
regular intervals and store the stat() data for each file and
directory in a database...though only if it has changed since the
most recent run.  For a quick, "back of the napkin" evaluation, just
comparing the number of records returned by some queries (organized
by filesystem, by date/time of scan, etc.) might provide some
guidance.  But I hope to be able to delve deeper, if it seems
warranted.
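The scan-and-store-only-changes idea above can be sketched in a few
lines of Python.  This is a minimal illustration, not a spec: the
table name (fsstat), the database file, and the choice of stat()
fields are all my assumptions, and os.lstat() is used deliberately so
that symlinks aren't followed and file atimes aren't disturbed.

```python
#!/usr/bin/env python3
"""Sketch: record per-path stat() data in SQLite, inserting a row only
when something differs from the most recent recorded scan."""
import os
import sqlite3
import time

def scan(root, db_path="scans.db"):
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS fsstat (
        path TEXT, scan_time REAL, size INTEGER,
        mtime REAL, ctime REAL, uid INTEGER, gid INTEGER, mode INTEGER)""")
    now = time.time()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames + dirnames:
            p = os.path.join(dirpath, name)
            try:
                # lstat: don't follow symlinks, don't touch the file's atime
                st = os.lstat(p)
            except OSError:
                continue  # vanished mid-scan, unreadable, etc.
            # Most recent recorded row for this path, if any.
            prev = db.execute(
                "SELECT size, mtime, ctime, uid, gid, mode FROM fsstat "
                "WHERE path=? ORDER BY scan_time DESC LIMIT 1", (p,)).fetchone()
            cur = (st.st_size, st.st_mtime, st.st_ctime,
                   st.st_uid, st.st_gid, st.st_mode)
            if prev is None or tuple(prev) != cur:
                db.execute("INSERT INTO fsstat VALUES (?,?,?,?,?,?,?,?)",
                           (p, now) + cur)
    db.commit()
    db.close()
```

The per-path SELECT is slow for large trees; a real version would load
the previous scan into memory or use an indexed upsert.  But it shows
the shape: row count per scan_time is exactly the "records changed"
number for the back-of-the-napkin comparison.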


So my first question:
---------------------
Are there already tools, preferably OSS, that exist to help provide
this sort of information?  I don't really need to know the total
amount of data that is going to be subject to a backup policy.  That
much is fairly straightforward to forecast.  But I do need to
quantify how much of it is changing in a given time period, and how
often; what I really need are the data usage and data modification
patterns.  (Kudos if it can also catch whether the data is actually
being accessed...though the users in question are savvy enough to
know how to update the atime of all of their files, if they want to
manipulate the statistics.)

My second question:
-------------------
In designing something to try to collect this sort of data, what would
you try to capture?  size?  ctime, mtime and atime?  owner, group and
permissions?  (Links to past journal articles, mail list discussions,
etc., are gladly accepted, provided they aren't subscription-only
and/or out-of-print)

I have already discounted using a checksum or MD5 digest...only
because I don't want to impact the atime, if atime actually proves
useful, and computing checksums for the volume of data in question
would probably be much too expensive.  If I want to capture the
possibility that duplicate files might be scattered about, I can
probably work out the likelihood from the other information.
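On working out duplicate likelihood without checksums: one cheap
heuristic is to group files by size, since identical files must have
identical sizes.  A sketch, under the assumption that lstat()-only
scanning is acceptable (content is never read, so atime is untouched);
same-size files are duplicate *candidates*, not confirmed duplicates.

```python
import os
from collections import defaultdict

def possible_duplicates(root):
    """Map size -> list of paths, for sizes shared by 2+ files.
    Uses lstat only: no file contents are read, atimes are preserved."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            p = os.path.join(dirpath, name)
            try:
                by_size[os.lstat(p).st_size].append(p)
            except OSError:
                continue
    return {size: paths for size, paths in by_size.items()
            if len(paths) > 1}
```

If a candidate group ever needs confirming, hashing just those few
files (off-hours, or after noting their atimes) is far cheaper than
hashing the whole tree.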

Also, assuming you have data from a baseline scan and the new stat()
data for files and directories that have changed in subsequent scans,
what statistics would you believe to be most useful?
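As one concrete example of the statistics I have in mind: assuming
the scans accumulate in a SQLite table of per-path stat() records
(columns like path, scan_time, size...the exact schema here is my
assumption), a daily churn summary falls out of a single GROUP BY.

```python
import sqlite3

def churn_by_day(db_path):
    """Per scan day: how many paths changed, and how many bytes those
    changed paths now occupy.  Assumes a table
    fsstat(path, scan_time, size, mtime, ...) holding one row per path
    per scan in which that path differed from the prior scan."""
    db = sqlite3.connect(db_path)
    rows = db.execute("""
        SELECT date(scan_time, 'unixepoch') AS day,
               COUNT(*)                     AS paths_changed,
               SUM(size)                    AS bytes_touched
        FROM fsstat
        GROUP BY day ORDER BY day""").fetchall()
    db.close()
    return rows
```

The two columns answer the two sizing questions directly: paths_changed
trends toward zero for mostly-immutable data (USB/optical territory),
while a steady bytes_touched figure sizes the incremental-backup window
a real solution would need.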

(Note:  If I do create a solution for this, I am doubtful corporate
policy will allow me to release it under any open license.  But I can
certainly summarize any suggestions that are received.)
_______________________________________________
Tech mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
 http://lopsa.org/
