On Tue, Jan 11, 2011 at 2:59 PM, Aaron McCaleb <[email protected]> wrote:
> Background:
> ------------
> One of my tasks this year is to produce a recommendation of a backup
> strategy for one department.  This is for non-financial,
> non-correspondence data that should not be subject to SOX
> requirements, at least as far as I am aware.  The possible solutions
> range from an automated network backup solution and library, all the
> way down to simply issuing the users half-terabyte USB disks.
>
> I have already noted the benefits that an automated backup solution
> would provide (automation, for one...plus a history of prior versions
> of a file, should the "latest" version be corrupt...leveraging offsite
> storage arrangements that already exist for other data backups, etc.)
>
> What I am looking for, to satisfy additional information requested by
> management, is suggestions on how to quantify the amount of change in
> these filesystems from one time period (perhaps daily) to the next. If
> much of this data is immutable, then it drastically reduces the backup
> capacity required, and USB disks...or even optical media aren't
> necessarily a bad choice.  If it is changing often, particularly when
> problems may not be noticed until after many additional changes are
> also applied, then that tilts heavily in favor of a robust backup
> solution, with sufficient restorable backup horizon.
>
> Normally the answer as to which is required is "obvious", due to
> regulatory requirements or corporate policy, if nothing else.  But in
> this circumstance, it isn't.
>
> My idea, thus far (linux filesystems), is to essentially crunch ls -lR
> output (with appropriate options) from the relevant filesystems at
> regular intervals and store the stat() data for each file and
> directory in a database...though only if it had changed from the most
> recent run.  For a quick, "back of the napkin" evaluation, just
> comparing the number of records returned to some queries (organized by
> filesystem, by date/time of scan, etc.) might provide some guidance.
> But I hope to be able to delve deeper, if it seems warranted.
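
A rough sketch of that scan idea, in case it helps.  This uses
os.lstat(), which reads the inode without updating atime, so the scan
itself won't pollute your access-time data.  The sqlite table name and
schema here are just placeholders, not anything standard:

```python
#!/usr/bin/env python3
# Sketch: walk a tree and record stat() metadata per scan into sqlite.
# os.lstat() reads inode data without touching atime.
import os
import sqlite3
import time

def scan(root, db_path="scans.db"):
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS scans (
        scan_time REAL, path TEXT, size INTEGER,
        mtime REAL, ctime REAL, atime REAL,
        uid INTEGER, gid INTEGER, mode INTEGER)""")
    now = time.time()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames + dirnames:
            p = os.path.join(dirpath, name)
            try:
                st = os.lstat(p)
            except OSError:
                continue  # file vanished between readdir and lstat
            con.execute("INSERT INTO scans VALUES (?,?,?,?,?,?,?,?,?)",
                        (now, p, st.st_size, st.st_mtime, st.st_ctime,
                         st.st_atime, st.st_uid, st.st_gid, st.st_mode))
    con.commit()
    con.close()

if __name__ == "__main__":
    import sys
    scan(sys.argv[1] if len(sys.argv) > 1 else ".")
```

This stores every row each run; the "only if changed since the last
scan" filtering is probably easier to do afterwards in SQL than during
the walk.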
>
>
> So my first question:
> ---------------------
> Are there already tools, preferably OSS, that exist to help provide
> this sort of information?  I don't really need to know the total
> amount of data that is going to be subject to a backup policy.  That
> much is fairly straightforward to forecast.  What I do need to
> quantify is how much of it is changing in a given time period, and
> how often: the data usage and data modification patterns.
> (Kudos if it can also catch whether the data is actually being
> accessed...though the users in question are savvy enough to know how
> to update the atime of all of their files, if they want to manipulate
> the statistics.)
>
> My second question:
> -------------------
> In designing something to try to collect this sort of data, what would
> you try to capture?  size?  ctime, mtime and atime?  owner, group and
> permissions?  (Links to past journal articles, mail list discussions,
> etc., are gladly accepted, provided they aren't subscription-only
> and/or out-of-print)
>
> I have already discounted using a checksum or md5 digest...only
> because I don't want to impact the atime if atime actually proves
> useful, and computing checksums for the volume of data in question
> will probably be much too expensive.  If I want to capture the
> possibility that duplicate files might be scattered about, I can
> probably work out the likelihood from the other information.
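
Working out that likelihood without hashing can be as simple as
grouping paths by size; files sharing an exact byte count are only
*candidates* (so this overestimates real duplication), but it costs
nothing beyond the metadata you're already collecting.  A sketch:

```python
# Sketch: find candidate duplicates from stat() metadata alone,
# by grouping on exact size.  No file contents are read, so atime
# is untouched.  This overestimates true duplication.
from collections import defaultdict

def duplicate_candidates(sizes):
    """sizes: dict mapping path -> size in bytes.
    Returns {size: [paths]} for sizes shared by 2+ files."""
    by_size = defaultdict(list)
    for path, size in sizes.items():
        by_size[size].append(path)
    return {sz: paths for sz, paths in by_size.items() if len(paths) > 1}
```

You could tighten the grouping key to (size, mtime) if your users tend
to copy files around wholesale.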
>
> Also, assuming you have data from a baseline scan and the new stat()
> data for files and directories that have changed in subsequent scans,
> what statistics would you believe to be most useful?
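
For the statistics, the numbers I'd want first are added/deleted/
modified counts and a churn ratio (bytes that would need re-backup as
a fraction of the current total), since that ratio directly answers
the "USB disks vs. real backup system" question.  A minimal sketch,
assuming each scan has been reduced to a dict of path -> (size, mtime):

```python
# Sketch: churn statistics between two scans, each a dict mapping
# path -> (size, mtime).  Field choice is illustrative; use whatever
# your scanner actually records.

def churn_stats(baseline, current):
    base_paths, cur_paths = set(baseline), set(current)
    added = cur_paths - base_paths
    deleted = base_paths - cur_paths
    common = base_paths & cur_paths
    modified = {p for p in common if baseline[p] != current[p]}
    # Bytes that a differential backup of `current` would have to copy.
    changed_bytes = sum(current[p][0] for p in modified | added)
    total_bytes = sum(sz for sz, _ in current.values())
    return {
        "added": len(added),
        "deleted": len(deleted),
        "modified": len(modified),
        "unchanged": len(common - modified),
        "changed_bytes": changed_bytes,
        "churn_ratio": changed_bytes / total_bytes if total_bytes else 0.0,
    }
```

Trending churn_ratio per filesystem per day over a few weeks should
give management the picture they're asking for.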
>
> (Note:  If I do create a solution for this, I am doubtful corporate
> policy will allow me to release it under any open license.  But I can
> certainly summarize any suggestions that are received.)


You should take a look at BackupPC.  It's open source and I think it
does a lot of what you're asking for.  It runs automatically and does
a lot of deduplication on the back end.  I looked at it for myself,
and it turned out not to meet my needs due to too much reliance on
rsync for include/exclude processing (but that's only an issue for
overly complex excludes).  If you have a simple folder structure, it
would work just fine.
_______________________________________________
Tech mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
 http://lopsa.org/
