RE: File::Redundant

2002-04-29 Thread Cahill, Earl

 Interesting ... not sure if implementing this in this fashion would be
 worth the overhead.  If such a need exists I would imagine one would have
 chosen a more appropriate OS-level solution.  Think OpenAFS.

It is always nice to use stuff that has IBM backing and likely has at least
a professor or two and some grad students helping out on it.  I had never
heard of OpenAFS before your email.  I will have to look into it a bit.  My
stuff would hopefully be useful if you didn't want to change your OS, or
if you just wanted to make File::Redundant a small part of a much larger
overall system.

The biggest overhead I have seen is having to do readlinks.  Maybe I could
get around them somehow.  I will have to draw up some UML or something to
show how my whole system works.
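
Just to show what I mean by the readlink cost: every access has to figure out
where the thing's symlink currently points before it can touch the real file.
Roughly (the layout here is made up, just for illustration):

  # resolve a thing's symlink to whichever real dir it currently points at
  my $thing = 'mailbox_12345';
  my $link  = "/var/file_redundant/things/$thing";

  my $real_dir = readlink($link)
      or die "can't readlink $link: $!";

  # every open/stat/unlink pays this lookup before doing any real work
  open my $fh, '<', "$real_dir/data" or die "open: $!";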

Earl



RE: File::Redundant

2002-04-29 Thread Cahill, Earl

 I would think it could be useful in non-mod_perl applications as well
 - you give an example of a user's mailbox.  With scp it might be even
 more fun to have around :)  (/me is thinking of config files and
 such)

mod_perl works very well with the system for keeping track of which boxes are
down, how full the partitions are, and the like.  However, a simple daemon
would do about the same thing for, say, non-web-based mail stuff.  When I
release I will likely have a daemon version as well as the mod_perl version,
just using Net::Server.
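
The daemon would probably be nothing more than a small Net::Server subclass
answering questions about which dirs are reachable and how full they are.
Something like this rough sketch (the one-line protocol is just invented to
show the shape of it):

  package File::Redundant::StatusDaemon;
  use strict;
  use base 'Net::Server';

  # one request per line: the client names a dir, we report whether it is
  # reachable and how full its partition is
  sub process_request {
      my $self = shift;
      while (my $line = <STDIN>) {
          chomp $line;
          last if $line =~ /^quit$/i;
          if (-d $line) {
              my @df  = `df -P $line`;      # percent full via df
              my $pct = (split ' ', $df[-1])[4];
              print "up $pct\n";
          }
          else {
              print "down\n";
          }
      }
  }

  __PACKAGE__->run(port => 8765) unless caller;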

 What's a `very large amount of data' ?

We use it for tens of thousands of files, but most of those are small, and
they are certainly all small compared to the 3 GB range.  That is sort of the
model for dirsync, I think: lots of small files in lots of different directories.

 Our NIS maps are on the order
 of 3 GB per file (64k users).

Man, that is one big file.  Guess dropping a note to this list sorta lets
you know what you have to really scale to.  Sounds like dirsync could use
rsync if Rob makes a couple changes.  Can't believe the file couldn't be
broken up into smaller files.  3 GB for 64k users doesn't scale so hot for
say a million users, but I have no idea about NIS maps, so there you go.

Earl



Re: File::Redundant

2002-04-29 Thread darren chamberlain

This is OT for mod_perl, sorry...

* Cahill, Earl [EMAIL PROTECTED] [2002-04-29 13:55]:
  Our NIS maps are on the order
  of 3 GB per file (64k users).
 
 Man, that is one big file.  Guess dropping a note to this list sorta
 lets you know what you have to really scale to.  Sounds like dirsync
 could use rsync if Rob makes a couple changes.  Can't believe the file
 couldn't be broken up into smaller files.  3 GB for 64k users doesn't
 scale so hot for say a million users, but I have no idea about NIS
 maps, so there you go.

I haven't been following the conversation, for the most part, but this
part caught my eye.  It is possible to split a NIS map up into many
small source files, as long as, when you change one of them, you rebuild
the map in question as a whole.

I've seen places with large NIS maps (although not 3 GB) split the map up
into smaller files, where each letter of the alphabet has its own file
in a designated subdirectory and a UID generator is used to get the next
UID.  When the NIS maps have to be rebuilt, the main map file is
regenerated using something like:

  (cat passwd.files/[a-z]*) > passwd; make passwd

which, of course, could be added to the Makefile as part of the passwd
target.

(darren)

-- 
OCCAM'S ERASER:
The philosophical principle that even the simplest solution is bound
to have something wrong with it.



File::Redundant

2002-04-25 Thread Cahill, Earl

Just putting out a little feeler about this package I started writing last
night.  Wondering about its usefulness, current availability, and just
overall interest.  Designed for mod_perl use.  Doesn't make much sense
otherwise.

Don't want to go into too many details here, but File::Redundant takes some
unique word (hopefully guaranteed unique through a database: a mailbox, a
username, a website, etc.), which I call a thing, a pool of dirs, and how many $copies
you would like to maintain.  From the pool of dirs, $copies good dirs are
chosen, ordered by percent full on the given partition.
When you open a file with my open method (along with close, this is the only
override method I have written so far), you get a file handle.  Do what you
like on the file handle.  When you close the file handle, with my close
method, I CORE::close the file and use Rob Brown's File::DirSync to sync to
all the directories.  DirSync uses time stamps to very quickly sync changes
between directory trees.
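
To give a feel for it, usage would look something like the following (none of
these names are final, and the paths are made up):

  use File::Redundant;

  # a thing, a pool of candidate dirs, and how many copies to keep
  my $fr = File::Redundant->new(
      thing  => 'mailbox_12345',
      dirs   => [ '/mnt/box1/pool', '/mnt/box2/pool', '/mnt/box3/pool' ],
      copies => 2,
  );

  # my open hands back a normal file handle in one of the chosen good dirs
  my $fh = $fr->open('Maildir/new/msg.1', '>') or die "open failed";
  print $fh "do what you like here\n";

  # my close does a CORE::close and then dirsyncs the change out
  # to the other chosen dirs
  $fr->close($fh);
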
When a dir can't be reached (box is down or what have you), $copies good
dirs are re-chosen and the dirsync happens from good old data to the new
good dirs.  If too much stuff goes down, you're sorta outta luck, but you
would have been without my system anyway.
I would write methods for everything (within reason) you do to a file, open,
close, unlink, rename, stat, etc.
So who cares?  Well, using this system would make it quite easy to keep
track of really an arbitrarily large amount of data.  The pool of dirs could
be mounts from any number of boxes, located remotely or otherwise, and you
could sync accordingly.  If File::DirSync gets to the point where you can
use ftp or scp, all the better.
There are race conditions all over the place, and I plan on
transactionalizing where I can.  The whole system depends on how long the
dirsync takes.  In my experience, dirsync is very fast.  Likely I would have
dirsync'ing daemon(s), dirsync'ing as fast as they can.  At worst, the most
data that would ever get lost would be whatever changed during a single
dirsync (which usually takes less than a second for even very large amounts of
data), and the loss would only happen if you were making changes on a dir as
the dir went down.  I would try to deal with boxes coming back up and
keeping everything clean as best I could.
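
If I am remembering Rob's File::DirSync interface right, the dirsync'ing
daemon would be little more than a dumb loop like this (paths made up):

  use File::DirSync;

  # keep pushing changes from one chosen dir to another, as fast as we can
  my $ds = File::DirSync->new({ nocache => 1, verbose => 0 });
  $ds->src('/mnt/box1/pool/mailbox_12345');
  $ds->dst('/mnt/box2/pool/mailbox_12345');

  while (1) {
      $ds->dirsync();   # propagate whatever changed since the last pass
      sleep 1;
  }
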
So, it would be a work in progress, and hopefully get better as I went, but
I would at least like to give it a shot.
Earl



Re: File::Redundant

2002-04-25 Thread James G Smith

Cahill, Earl [EMAIL PROTECTED] wrote:
 Just putting out a little feeler about this package I started writing last
 night.  Wondering about its usefulness, current availability, and just
 overall interest.  Designed for mod_perl use.  Doesn't make much sense
 otherwise.

I would think it could be useful in non-mod_perl applications as well
- you give an example of a user's mailbox.  With scp it might be even
more fun to have around :)  (/me is thinking of config files and
such)

 transactionalizing where I can.  The whole system depends on how long the
 dirsync takes.  In my experience, dirsync is very fast.  Likely I would have
 dirsync'ing daemon(s), dirsync'ing as fast as they can.  At worst, the most
 data that would ever get lost would be whatever changed during a single
 dirsync (which usually takes less than a second for even very large amounts of
 data), and the loss would only happen if you were making changes on a dir as
 the dir went down.  I would try to deal with boxes coming back up and
 keeping everything clean as best I could.

What's a `very large amount of data' ?  Our NIS maps are on the order
of 3 GB per file (64k users).  Over a gigabit ethernet link, this
still takes half a minute or so to copy to a remote system, at least
(for NIS master-slave copies) -- this is just an example of a very
large amount of data being sync'd over a network.  I don't see how
transferring at least 3 GB of data can be avoided (even with diffs,
the bits being diff'd have to be present in the same CPU at the same
time).  If any of the directories being considered by your module are
NFS mounted, this will be an issue.

Personally, I see NFS mounting as a real possibility since that
allows relatively easy maintenance of a remote copy for backup if
nothing else.
-- 
James Smith [EMAIL PROTECTED], 979-862-3725
Texas A&M CIS Operating Systems Group, Unix



Re: File::Redundant

2002-04-25 Thread Andrew McNaughton



On Thu, 25 Apr 2002, James G Smith wrote:

 What's a `very large amount of data' ?  Our NIS maps are on the order
 of 3 GB per file (64k users).  Over a gigabit ethernet link, this
 still takes half a minute or so to copy to a remote system, at least
 (for NIS master-slave copies) -- this is just an example of a very
 large amount of data being sync'd over a network.  I don't see how
 transferring at least 3 GB of data can be avoided (even with diffs,
 the bits being diff'd have to be present in the same CPU at the same

rsync solves this problem by sending diffs between machines using a
rolling checksum algorithm.  It runs over rsh or ssh transport, and
compresses the data in transfer.
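
From Perl it usually comes down to a single system() call, something along
these lines (host and paths invented):

  # -a preserves permissions/times, -z compresses on the wire,
  # -e ssh picks the transport, --delete drops files gone from the source
  my @cmd = ('rsync', '-az', '--delete', '-e', 'ssh',
             '/data/pool1/', 'backuphost:/data/pool1/');
  system(@cmd) == 0 or warn "rsync exited with status " . ($? >> 8);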

I'd be very interested to hear how well it works with a file of that size.

rsync has almost entirely replaced my use of scp.  It's even replaced a
fair portion of the times where I would have used cp, because of its
capability to define exclusion lists when doing a recursive copy of a
directory.

Andrew McNaughton




Re: File::Redundant

2002-04-25 Thread D. Hageman


Interesting ... not sure if implementing this in this fashion would be
worth the overhead.  If such a need exists I would imagine one would have
chosen a more appropriate OS-level solution.  Think OpenAFS.

On Thu, 25 Apr 2002, Cahill, Earl wrote:

 Just putting out a little feeler about this package I started writing last
 night.  Wondering about its usefulness, current availability, and just
 overall interest.  Designed for mod_perl use.  Doesn't make much sense
 otherwise.
 
 Don't want to go into too many details here, but File::Redundant takes some
 unique word (hopefully guaranteed unique through a database: a mailbox, a
 username, a website, etc.), which I call a thing, a pool of dirs, and how many $copies
 you would like to maintain.  From the pool of dirs, $copies good dirs are
 chosen, ordered by percent full on the given partition.
 When you open a file with my open method (along with close, this is the only
 override method I have written so far), you get a file handle.  Do what you
 like on the file handle.  When you close the file handle, with my close
 method, I CORE::close the file and use Rob Brown's File::DirSync to sync to
 all the directories.  DirSync uses time stamps to very quickly sync changes
 between directory trees.
 When a dir can't be reached (box is down or what have you), $copies good
 dirs are re-chosen and the dirsync happens from good old data to the new
 good dirs.  If too much stuff goes down, you're sorta outta luck, but you
 would have been without my system anyway.
 I would write methods for everything (within reason) you do to a file, open,
 close, unlink, rename, stat, etc.
 So who cares?  Well, using this system would make it quite easy to keep
 track of really an arbitrarily large amount of data.  The pool of dirs could
 be mounts from any number of boxes, located remotely or otherwise, and you
 could sync accordingly.  If File::DirSync gets to the point where you can
 use ftp or scp, all the better.
 There are race conditions all over the place, and I plan on
 transactionalizing where I can.  The whole system depends on how long the
 dirsync takes.  In my experience, dirsync is very fast.  Likely I would have
 dirsync'ing daemon(s), dirsync'ing as fast as they can.  At worst, the most
 data that would ever get lost would be whatever changed during a single
 dirsync (which usually takes less than a second for even very large amounts of
 data), and the loss would only happen if you were making changes on a dir as
 the dir went down.  I would try to deal with boxes coming back up and
 keeping everything clean as best I could.
 So, it would be a work in progress, and hopefully get better as I went, but
 I would at least like to give it a shot.
 Earl
 

-- 
//\\
||  D. Hageman[EMAIL PROTECTED]  ||
\\//




Re: File::Redundant

2002-04-25 Thread James G Smith

Andrew McNaughton [EMAIL PROTECTED] wrote:


 On Thu, 25 Apr 2002, James G Smith wrote:

  What's a `very large amount of data' ?  Our NIS maps are on the order
  of 3 GB per file (64k users).  Over a gigabit ethernet link, this
  still takes half a minute or so to copy to a remote system, at least
  (for NIS master-slave copies) -- this is just an example of a very
  large amount of data being sync'd over a network.  I don't see how
  transferring at least 3 GB of data can be avoided (even with diffs,
  the bits being diff'd have to be present in the same CPU at the same

 rsync solves this problem by sending diffs between machines using a
 rolling checksum algorithm.  It runs over rsh or ssh transport, and
 compresses the data in transfer.

Yes - I forgot about that - it's been a year or so since I read the
rsync docs :/  but I do remember the docs mentioning that now.
-- 
James Smith [EMAIL PROTECTED], 979-862-3725
Texas A&M CIS Operating Systems Group, Unix



RE: File::Redundant (OT: AFS)

2002-04-25 Thread Les Mikesell

 From: D. Hageman [mailto:[EMAIL PROTECTED]]
 Subject: Re: File::Redundant
 
 Interesting ... not sure if implementing this in this fashion would be
 worth the overhead.  If such a need exists I would imagine one would have
 chosen a more appropriate OS-level solution.  Think OpenAFS.

This is off-topic of course, but you often don't get
unbiased opinions from the specific list.  Does anyone
have success or horror stories about AFS in a distributed
production site?  Oddly enough the idea of using it
just came up in my company a few days ago to publish
some large data sets that change once daily to several
locations.  I'm pushing a lot of stuff around now with
rsync which works and is very efficient, but the ability
to move the source volumes around transparently and keep
backup snapshots is attractive. 

  Les Mikesell
   [EMAIL PROTECTED]





RE: File::Redundant (OT: AFS)

2002-04-25 Thread D. Hageman

On Thu, 25 Apr 2002, Les Mikesell wrote:

  From: D. Hageman [mailto:[EMAIL PROTECTED]]
  Subject: Re: File::Redundant
  
  Interesting ... not sure if implementing this in this fashion would be
  worth the overhead.  If such a need exists I would imagine one would have
  chosen a more appropriate OS-level solution.  Think OpenAFS.
 
 This is off-topic of course, but you often don't get
 unbiased opinions from the specific list.  Does anyone
 have success or horror stories about AFS in a distributed
 production site?  Oddly enough the idea of using it
 just came up in my company a few days ago to publish
 some large data sets that change once daily to several
 locations.  I'm pushing a lot of stuff around now with
 rsync which works and is very efficient, but the ability
 to move the source volumes around transparently and keep
 backup snapshots is attractive. 

I haven't personally used AFS on a large scale.  I have set up several
small test beds with it to test the feasibility of using it at my job.  I
work for the EECS Department at the University of Kansas, so we have a
fairly large heterogeneous computer environment.  My tests showed that at
the time, support for Windows wasn't quite up to par yet.  The *nix code
base performed quite well.  I say at the time because, since then, the
OpenAFS project has pushed out several more versions of the code base, so
support might be better.  I did have the pleasure of talking with a guy
from the University of Missouri who was telling me they have AFS deployed
on a very large scale there and were very pleased with it (I think they
were using the commercial version to support the Windows side).  AFS
definitely has some promise, and if it weren't for the heterogeneity issues
(and a few non-technical issues) we would be using it here.

To avoid being completely off topic - I should point out that AFS modules
exist for Perl and a mod_afs exists for Apache. ;-)

-- 
//\\
||  D. Hageman[EMAIL PROTECTED]  ||
\\//