Re: directory replication between two servers

2002-07-03 Thread Eric Ziegast

 I have two Linux servers with the rsync server running on both.  Now I am
 replicating directories between the two servers with the command rsync -avz ...
 My requirement is: if I make any changes on the first server, say server
 A, I want to see the changes on the second server immediately... something
 similar to MySQL database replication... how can I do that?

... a vague question.  It depends on the application.

In high-availability environments it's best to do the replication in the
application so that the application can deal with or work around any
failure conditions.  In the case of a database, database replication
methods work better than depending on the filesystem.  The filesystem does
not know the state of transactions within the database.

Imagine this: Instead of having your client application write to one
filesystem, have it write to two filesystems before saying the write
was completed or committed.  If one system fails, the other is updated
just as well as the failed filesystem (caveat: I'm ignoring race
conditions!).
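
A minimal sketch of that idea, assuming both copies are locally mounted
paths and ignoring error recovery (names are hypothetical):

    #!/bin/sh
    # write_both: commit a file to two filesystems, and report success
    # only if both writes complete.
    write_both () {
        file=$1
        cp "$file" /mnt/copy1/ && cp "$file" /mnt/copy2/ && sync
    }

    write_both important.dat || echo "write not committed on both copies" >&2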


If you need read-write access on both the local and remote servers and have
partitioned data sets (i.e., you don't need to depend on block-level locking),
consider having both servers use a dedicated high-availability
network-attached storage server (an HA solution).  Alternatively, both can
access a plain NFS server, or the second server can mount the filesystem
from the first server (neither of which is an HA solution).
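
A minimal sketch of the shared-storage case (hostnames and paths are
hypothetical): both servers mount the same export, so there is a single
copy of the data and nothing to replicate.

    # Run on server A and on server B
    mount -t nfs nas-server:/export/data /data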


If you need read-write access on one server and need to replicate data
to a read-only server _and_ if the replication process can be asynchronous,
doing multiple rsyncs can work:

while true
do
    rsync -avz source destination
    if [ $? != 0 ]; then
        echo "rsync failed -- get help" >&2   # alert an operator or log the failure
    fi
done

If you know where your applications are doing writes, you might limit
your replication to the subdirectory or files on which writes are
performed to help speed up the search process.  Note, though, that
rsync-based replication methods are not efficient on the disks or
filesystems, just the network traffic.  Imagine reading _all_ of your
data over and over and over and over again when only a few blocks might
change periodically.
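
For example, if the application only ever writes under one subtree, a
sketch of that idea (paths and hostname are hypothetical) would replicate
just that subtree instead of walking the whole filesystem:

    # Only the directory that actually receives writes is scanned and copied
    rsync -avz /data/app/incoming/ serverB:/data/app/incoming/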


If you need read-write access on one server and need to replicate data
to a read-only server and need synchronous operation (i.e.: the
write must be completed on the remote server before returning to the
local server), then you need operating-system-level or storage-level
replication products.

Veritas:
It's not available on Linux yet, but Volume Replicator performs
block-level incremental copies to keep two OS-level filesystems
in sync.  $$

File Replicator is based (interestingly enough) on rsync, and
runs under a virtual filesystem layer.  It is only as reliable
as a network-wide NFS mount, though.  (I haven't seen it used
much on a WAN.)  $$

Andrew File System (AFS)
This advanced filesystem has replication methods built in,
but they have a high learning curve before they work well.
I don't see support for Linux, though. $

Distributed File System (DFS)
Works a lot like AFS; built for DCE clusters, commercially
supported (for Linux, too).  $$$

NetApp, Procom (et al.):
Several network-attached-storage providers have replication
methods built into their products.  The remote side is kept
up to date, but integrity of the remote data depends on the
application's use of snapshots.  $$$

EMC, Compaq, Hitachi (et al.):
Storage companies have replication methods and best practices
built into their block-level storage products.   


Another alternative (cheaper, too) is to just use a database, period.
People who worry about data storage, data integrity, failover, and
replication have put a lot of thought into their database products.
If you can modify your application to depend on a database and not
a filesystem, you may be better off in the long run.  Lazy people use
filesystems as their database.  It works just fine up to the point
where you need to worry about real-time replication.

Again, it really depends on the application.

If others know of other replication methods or distributed filesystem
work, feel free to chime in.

--
Eric Ziegast




Re: -c Option

2002-05-30 Thread Eric Ziegast

 Quick question... can anyone explain to me when the data in a file
 might change without changing the mtime, ctime, or size?  I'm not
 sure I've ever come across that before.  An example might help me
 determine if I can safely remove -c.

It's possible on Unix systems, but not practical.

An example script:

  #!/bin/sh
  # Run on a BSD Unix system; your touch(1) arguments may vary
  echo foo > File
  touch -t 200205300800 File
  ls -l File
  echo foo > File
  ls -l File
  touch -t 200205300800 File
  ls -l File

Output:

  -rw-r--r--  1 ziegast  ziegast  4 May 30 08:00 File
  -rw-r--r--  1 ziegast  ziegast  4 May 30 11:19 File
  -rw-r--r--  1 ziegast  ziegast  4 May 30 08:00 File


Here's one example where I might use -c:

  Hackers are not known for being practical.  They like to cover their
  tracks as best they can by setting the owner, group, permissions, size,
  mtime, and ctime of their fake programs to be the same as the original
  programs.  If you're distributing system software using rsync or rdist,
  you may want to force checksum comparisons just to be sure.

  One might also use rsync with -n -c to help compare a gold copy
  of OS files with an active system.  Then again, tripwire was designed
  to do this better.
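
  As a sketch, a dry-run checksum comparison of a gold copy against a
  live host might look like this (paths and hostname are hypothetical):

      # -n: report differences without transferring anything
      # -c: compare by checksum instead of size/mtime
      rsync -n -c -av /gold/usr/bin/ root@host:/usr/bin/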


Another example is when some network-based filesystems delay updating
their metadata even though the content has changed.  I remember once
using a client machine to update a file on a busy NFS server.  It took
several seconds for the change to be seen by another client machine.
If I were using rsync on the second client machine, its view of the
file might be inconsistent.  Then again, best practices would dictate
my wanting to run rsync on the NFS server and not on the clients that
might be inconsistent with the server (to keep network traffic down
and reduce overhead).

--
Eric Ziegast




Re: Rsync dies

2002-05-17 Thread Eric Ziegast

 In my humble opinion, this problem with rsync growing a huge memory
 footprint when large numbers of files are involved should be #1 on
 the list of things to fix.

I think many would agree.  If it were trivial, it'd probably be
done by now.

Fix #1 (what most people do):

Split the files/paths to limit the size of each job.

What someone could/should do here is at least edit the
BUGS section of the manual to talk about the memory
restrictions.
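
A minimal sketch of Fix #1 (paths and hostname are hypothetical): run
one rsync per top-level subdirectory, so each job's file list, and
therefore its memory footprint, stays small.

    # Top-level files are ignored in this sketch; only subdirectories
    # are replicated, one rsync invocation per directory.
    cd /data/source
    for dir in */ ; do
        rsync -av "$dir" destserver:/data/dest/"$dir"
    done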

Fix #2 (IMHO, what should be done to rsync):

File caching of results (or using a file-based database of
some sort) is the way to go.  Instead of maintaining a
data structure entirely in memory, open a (g)dbm file or add
hooks into the db(3) libraries to store file metadata and
checksums.

It'll be slower than an all-memory implementation, but large
jobs will at least finish predictably.

Fix #3 (what I did):

If you really really need to efficiently transfer large
numbers of files, come up with your own custom process.

I used to run a large web site with thousands of files and
directories that needed to be distributed to dozens of
servers atomically.  Using rsync, I'd run into memory
problems and worked around them with Fix #1.  Another
problem was running rsync in parallel.  The source directory
was scanned O(N) times when it needed to be scanned only
once.  The source content server was pummeled by the
multiple simultaneous instances.  I resorted to writing my
own single-threaded rsync-like program in Perl that behaves
more like Fix #2 and runs very efficiently.

I've spent some time cleaning up this program so that
I can publish it, but priorities (*) are getting in the
way.  When I get some time, you'll see it posted here.

--
Eric Ziegast

(*) Looking for a full-time job is a full-time job.  :^(
Will consult for food.




Re: Does any rsync-based diff, rmdup, cvs software exist?

2002-05-16 Thread Eric Ziegast

 I'd like to be able to run GNU-diff type comparisons,
 but use R-sync technology to make it efficient to see what's 
 different without transmitting all that data.

Rsync is great at synchronizing data between a source and destination.
For diff-like comparisons, perhaps something like CVS might be more
appropriate.

 Another thing I like to do using rsync protocol, 
 is what I call rmdup -- remove duplicates.
 This would allow me to recursively (like diff -r) compare files in
 two (!!MUST BE!!) different directories and remove one (or the other)
 of the duplicates.

A shell script that does something similar to what you want, without
using rsync:

  #!/bin/sh

  # Our md5 checksum program (rsync uses md4, but the concept is the same)
  MD5=md5sum   # On RedHat 7.1
  #MD5=md5     # On *BSD

  # Source and destination directories (hypothetical; set to taste)
  SOURCE_DIR=$1
  DESTINATION_DIR=$2

  # Inventory the source directory
  cd $SOURCE_DIR
  src=/var/tmp/find.$$.src
  find . -xdev -type f -print | xargs $MD5 | awk '{print $2, $1}' | sort > $src

  # Inventory the destination directory
  cd $DESTINATION_DIR
  dst=/var/tmp/find.$$.dst
  find . -xdev -type f -print | xargs $MD5 | awk '{print $2, $1}' | sort > $dst

  # Remove duplicates in the destination directory
  cd $DESTINATION_DIR
  comm -12 $src $dst | sed -e 's/ .*//' | xargs rm -i

  # rm $src $dst

Note: comm -12 does a line-by-line comparison of the two checksum
      lists.  The output is the lines common to both files.  If a
      filename/checksum pair matches for both the source and destination
      directories, the file in the destination directory is the
      duplicate (per the definition in the e-mail) and is piped
      to xargs rm for removal.

Note: Configuring this for a source or destination directory on
      a remote host would involve the strategic use of rsh or ssh.
      The good news is that because only a list of checksums is
      needed for comparison, the bandwidth needed between servers
      is minimized (like rsync).
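
As a sketch of that remote case (user, host, and path are hypothetical,
and md5sum is assumed to exist on the remote host), the destination
inventory can be built over ssh so that only the checksum list crosses
the network:

  dst=/var/tmp/find.$$.dst
  ssh user@remotehost \
      'cd /path/to/destination && find . -xdev -type f -print | xargs md5sum' \
      | awk '{print $2, $1}' | sort > $dst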

 Again, the rsync protocol could be useful in configuration management,
 for computing the deltas that must be stored.

CVS (or even RCS) is more useful for configuration management and
updates of text files.  It also archives changes over time.

As far as I'm aware (without looking at source code), rsync does
block-level comparisons, not line-by-line.

--
Eric Ziegast




Re: rsynch related question

2002-05-16 Thread Eric Ziegast

Uma asks:
  I have a question on rsync'ing; please help me with this.
  I have a scenario where I need to sync two software trees that
  sit across the network from each other, on different cells
  (AFS cells).
  For ex: first1 - /afs/tr/software , second1 - /afs/ddc/software
  Both trees are the same; the first1 cell will be constantly
  updated, and I need to sync it to second1.  In this scenario,
  what command should I use?

There are many ways to do it based on your needs, and from where
you want to drive the process.


 Push using local filesystems

If both AFS trees are on the same LAN with low latency and high
bandwidth available, you can just access them directly:

# On any server...
cd /afs/tr/software
rsync -ax . /afs/ddc/software


 Push to remote server using rsh/ssh

If the AFS trees exist in different locations with significant
delay between them or not much bandwidth, then it is more efficient
to use rsync between servers at both locations to minimize the
bandwidth needed between locations.  Each server (e.g., TR-SERVER and
DDC-SERVER) would scan the directory trees locally and transmit
only inventory information and changes to files over the WAN.

# On TR-SERVER...
cd /afs/tr/software
rsync -ax . USER@DDC-SERVER:/afs/ddc/software

The above pushes files out using rsh.  If you want to use ssh
or a Kerberized rsh, consider -e ssh or -e 'rsh -K'.

If the content is usually compressible, consider using -z
to save more bandwidth.


 Suck from remote server with rsyncd

You can also suck files by setting up an rsync server on a machine
at the /afs/tr site and having rsync clients on the net connect
to it to suck down their files.

I haven't used rsyncd before, but the syntax might look something
like this:

# On DDC-SERVER...
cd /afs/ddc/software
rsync -ax USER@TR-SERVER::software .


# In /etc/rsyncd.conf on TR-SERVER...
[software]
path=/afs/tr/software
... other options based on access/security ...

See the rsync(1) man page for more information about syntax with
an rsync server.  See rsyncd.conf(5) for more info about configuring
rsync servers.  There are examples of how to set up an rsync server
here:
http://everythinglinux.org/rsync/
http://www.freeos.com/articles/4042/
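
As a sketch of the "other options" above, a slightly fuller module
definition (hostnames are placeholders; see rsyncd.conf(5) for the
full list of options) might be:

    # In /etc/rsyncd.conf on TR-SERVER
    [software]
        path = /afs/tr/software
        read only = yes
        uid = nobody
        hosts allow = DDC-SERVER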


 Suck from remote server with rsh/ssh

Another, simpler way to suck files over the network with rsync, using
rsh (or ssh), is:

# On DDC-SERVER...
cd /afs/ddc/software
rsync -ax [-e ssh] USER@TR-SERVER:/afs/tr/software .



There are many other command options you might consider, but they are
based more on the content than connectivity.

--
Eric Ziegast
