Re: faster rsync of huge directories

2010-04-14 Thread Shachar Shemesh

guy keren wrote:


as well as sysadmins/kernel developers - the initrd file on (some?) 
Linux distributions is a gzipped cpio file (at least on RHEL 5.X)


Initrd images come in one of two formats. The first is an image of some 
(any) file system, usually a read-only file system, most common of 
which is cramfs. In that case the image is called initrd, and 
has been available since the 2.4 kernels. The other option is to put the 
files inside a cpio archive. In that case the image is called 
initramfs, and is the newer method (i.e. - 2.6).


Initramfs is the preferred method of creating initrd images, and so you 
can say that cpio is making a comeback... :-)


Shachar

--
Shachar Shemesh
Lingnu Open Source Consulting Ltd.
http://www.lingnu.com

___
Linux-il mailing list
Linux-il@cs.huji.ac.il
http://mailman.cs.huji.ac.il/mailman/listinfo/linux-il


Re: faster rsync of huge directories

2010-04-13 Thread Tom Rosenfeld
On Mon, Apr 12, 2010 at 5:02 PM, Nadav Har'El n...@math.technion.ac.il wrote:

 On Mon, Apr 12, 2010, Tom Rosenfeld wrote about Re: faster rsync of huge
 directories:
  I realized that in my case I did not really need rsync since it is a
 local
  disk to disk copy. I could have used a tar and pipe, but I like cpio:
 

 Is this quicker?


I can't tell, because it is still running, and will be for a few days, but
at least it has started copying instead of just building an index.


 If it is, then the reason for rsync's extreme slowness which you described
 was *not* filesystem speed. It has to be something else. Maybe rsync
 simply uses tons of memory, and starts thrashing? (but this is just a
 guess, I didn't look at its code). If this is the case then the
 copy-while-building-the-list that Shachar described might indeed be a big win.

find $FROMDIR -depth -print | cpio -pdma $TODIR
 
  By default cpio also will not overwrite files if the source is not newer.

 I recommend you use the -print0 option to find instead of -print, and
 add the -0 option to cpio. These are GNU extensions to find and cpio (and
 a bunch of other commands as well) that use NUL bytes, instead of newlines,
 to separate the file names. This allows newline characters in filenames
 (these aren't common, but nevertheless are legal...).

 By the way, while cpio -p is indeed a good historic tool, nowadays there
 is little reason to use it, because GNU's cp makes it easy to do almost
 everything that cpio -p did: The -a option to cp is recursive and copies
 links, modes, timestamps and so on, and the -u option will only copy if the
 source is newer than the destination (or the destination is missing). So,

cp -au $FROMDIR $TODIR

 is shorter and easier to remember than find | cpio -p. But please note I
 didn't test this command, so don't use it on your important data without
 thinking first!

Thanks for the tip, Nadav (and everyone else).

While we are on the topic, I use cpio because I am also historic :-) In
the past I had to do similar copies on different versions of *NIX (even before
rsync was invented!)
and after much testing of issues of hard links, sym links, timestamps, etc. I
found cpio to be the most portable tool. I guess when I get a chance I will
test 'cp -au'.

Thanks,
-tom


Re: faster rsync of huge directories

2010-04-13 Thread Nadav Har'El
On Tue, Apr 13, 2010, Tom Rosenfeld wrote about Re: faster rsync of huge 
directories:
  By the way, while cpio -p is indeed a good historic tool, nowadays there
  is little reason to use it, because GNU's cp makes it easy to do almost
  everything that cpio -p did: The -a option to cp is recursive and copies
...
 While we are on the topic, I use cpio because I am also historic :-) In
 the past I had to do similar  copies on diff versions of *NIX (even before
 rsync was invented!)

That's ok, because I am also historic :-) which explains why I even heard
of cpio (nowadays the only people who are likely to have even heard this
name are developers of RPM tools...). In the late 80's, I used cpio
extensively for transferring files across the Atlantic on... diskettes.
I even remember one day when I arrived with a corrupt diskette, and had
to modify the cpio source code to skip over errors in the file. That day
I learned three lessons: 1. that open source rules, 2. that cpio sucks
as a backup format (because it has no error recovery capabilities), and
3. that there must be a better file transfer protocol than diskettes ;-)

Soon afterwards, I learned about tar and GNU cp. I haven't used cpio since...

 and after much testing of issues of hard links, sym links, timestamps, etc I
 found cpio to be the most portable tool. I guess when I get a chance I will
 test 'cp -au'

I just checked, and cp -a does seem to copy hard-linked files in a directory
correctly (i.e., the destination files are also hard-linked). I never checked,
though, if it can do more complicated hard-link copying. Frankly, I don't
really care - I stopped using hard links around the same time I stopped using
cpio. They were really important before the advent of symbolic links (in
System V release 4, if I recall correctly), but nowadays they are more often
confusing than useful - at least in my opinion.
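Nadav's hard-link check is easy to reproduce. Below is a minimal sketch using scratch paths of my choosing (they are not from the thread); it assumes GNU cp and GNU stat.

```shell
# Does cp -a preserve hard links inside the copied tree?
rm -rf /tmp/hl-src /tmp/hl-dst
mkdir -p /tmp/hl-src
echo data > /tmp/hl-src/a
ln /tmp/hl-src/a /tmp/hl-src/b     # a and b now share one inode
cp -au /tmp/hl-src /tmp/hl-dst
# If the link was preserved, the two copies share an inode as well:
stat -c %i /tmp/hl-dst/a /tmp/hl-dst/b
```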

-- 
Nadav Har'El|  Tuesday, Apr 13 2010, 30 Nisan 5770
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |You do not need a parachute to skydive.
http://nadav.harel.org.il   |You only need one to skydive twice.



Re: faster rsync of huge directories

2010-04-13 Thread guy keren

Nadav Har'El wrote:

On Tue, Apr 13, 2010, Tom Rosenfeld wrote about Re: faster rsync of huge 
directories:

By the way, while cpio -p is indeed a good historic tool, nowadays there
is little reason to use it, because GNU's cp makes it easy to do almost
everything that cpio -p did: The -a option to cp is recursive and copies

...
While we are on the topic, I use cpio because I am also historic :-) In
the past I had to do similar  copies on diff versions of *NIX (even before
rsync was invented!)


That's ok, because I am also historic :-) which explains why I even heard
of cpio (nowadays the only people who are likely to have even heard this
name are developers of RPM tools...).


as well as sysadmins/kernel developers - the initrd file on (some?) 
Linux distributions is a gzipped cpio file (at least on RHEL 5.X)


--guy




Re: faster rsync of huge directories

2010-04-12 Thread Vitaly
2010/4/12 Tom Rosenfeld tro...@bezeqint.net

 Hi,

 I am a great fan of rsync for copying filesystems. However I now have a 
 filesystem which is several hundred gigabytes and apparently has a lot of 
 small files. I have been running rsync all night and it still did not start 
 copying as it is still building the file list.
 Is there any way to get it to start copying as it goes. Or do any of you have 
 a better tool?


Are both servers on the same LAN? IMHO, your problem is network bandwidth
between the source and destination.
I have ~4M files, ~800GB - rsync is very fast on the same LAN (1Gb),
and slow to a remote destination.

Regards,
Vitaly



Re: faster rsync of huge directories

2010-04-12 Thread Shachar Shemesh

Tom Rosenfeld wrote:

Hi,

I am a great fan of rsync for copying filesystems. However I now have 
a filesystem which is several hundred gigabytes and apparently has a 
lot of small files. I have been running rsync all night and it still 
did not start copying as it is still building the file list.
Is there any way to get it to start copying as it goes. Or do any of 
you have a better tool?

Yes, there is a better tool.

Upgrade both ends to rsync version 3 or later. That version starts the 
transfer even before the file list is completely built.


Shachar

--
Shachar Shemesh
Lingnu Open Source Consulting Ltd.
http://www.lingnu.com



Re: faster rsync of huge directories

2010-04-12 Thread Nadav Har'El
On Mon, Apr 12, 2010, Shachar Shemesh wrote about Re: faster rsync of huge 
directories:
 Upgrade both ends to rsync version 3 or later. That version starts the 
 transfer even before the file list is completely built.

Maybe I'm missing something, but how does this help?

It may find the first file to copy a little quicker, but finishing the
rsync will take exactly the same time, won't it?
Also, if nothing has changed, it will take it exactly the same time to
figure this out, won't it?

I'm not sure what his problem is, though. Is it the fact that the remote
rsync takes a very long time to walk the huge directory tree, or the fact
that sending the whole list over the network is slow?
If it's the first problem, then maybe switching to a different filesystem,
or reorganizing your directory structure (e.g., not to have more than a few
hundred files per directory) will help.
If it's the second problem, then maybe rsync improvements are due - i.e., to
use rsync's delta protocol not only on the individual files, but also on the
file list.

-- 
Nadav Har'El|   Monday, Apr 12 2010, 28 Nisan 5770
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Fame: when your name is in everything but
http://nadav.harel.org.il   |the phone book.



Re: faster rsync of huge directories

2010-04-12 Thread Shachar Shemesh

Nadav Har'El wrote:

On Mon, Apr 12, 2010, Shachar Shemesh wrote about Re: faster rsync of huge 
directories:
  
Upgrade both ends to rsync version 3 or later. That version starts the 
transfer even before the file list is completely built.



Maybe I'm missing something, but how does this help?

It may find the first file to copy a little quicker, but finishing the
rsync will take exactly the same time, won't it?
  
Not at all. If the two are done serially, then only after the entire 
directory tree is scanned will the first transfer *begin*. The total 
transfer time is tree scan time + transfer time for older rsyncs, 
but the two phases overlap for newer ones. Exactly how much time that 
would save depends on how long the second phase takes (i.e. - how 
much data you need to actually transfer).
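The saving can be put in rough numbers. This is an idealized back-of-the-envelope model with invented figures, not measurements from the thread:

```shell
# Idealized model of the overlap saving, in hours (figures invented
# purely for illustration).
scan=18        # time to walk the tree and build the file list
transfer=30    # time to actually move the data
serial_total=$((scan + transfer))                      # rsync < 3: phases run back to back
best_overlap=$(( scan > transfer ? scan : transfer ))  # rsync >= 3, ideal full overlap
echo "$serial_total $best_overlap"
```

In this made-up case the best-case overlap cuts a 48-hour job to 30 hours; if almost nothing changed, the scan dominates both and the gain shrinks, which matches Shachar's point below.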

Also, if nothing has changed, it will take it exactly the same time to
figure this out, won't it?
  
Yes. You might still save some time, but this is definitely where newer 
rsyncs have the least advantage over older ones.

I'm not sure what his problem is, though. Is it the fact that the remote
rsync takes a very long time to walk the huge directory tree, or the fact
that sending the whole list over the network is slow?
  

From my experience, it's mostly the former.

If it's the first problem, then maybe switching to a different filesystem,
  
At the time, we tested ext3, jfs and xfs, and found no significant 
differences between them. It was not, however, a scientific test.

or reorganizing your directory structure (e.g., not to have more than a few
hundred files per directory) will help.
  
That is likely to actually help (and, shameless plug, is why rsyncrypto 
has the --ne-nesting option when encrypting file names), but is not 
always a viable option.

If it's the second problem, then maybe rsync improvements are due - i.e., to
use rsync's delta protocol not only on the individual files, but also on the
file list.
  

It's not the second, typically.

Shachar

--
Shachar Shemesh
Lingnu Open Source Consulting Ltd.
http://www.lingnu.com



Re: faster rsync of huge directories

2010-04-12 Thread Tom Rosenfeld
On Mon, Apr 12, 2010 at 9:41 AM, Tom Rosenfeld tro...@bezeqint.net wrote:

 Hi,

 I am a great fan of rsync for copying filesystems. However I now have a
 filesystem which is several hundred gigabytes and apparently has a lot of
 small files. I have been running rsync all night and it still did not start
 copying as it is still building the file list.
 Is there any way to get it to start copying as it goes. Or do any of you
 have a better tool?

 Thanks,
 -tom



Thanks for all the suggestions!

I realized that in my case I did not really need rsync since it is a local
disk to disk copy. I could have used a tar and pipe, but I like cpio:

  find $FROMDIR -depth -print | cpio -pdma $TODIR

By default cpio also will not overwrite files if the source is not newer.

It was also pointed out that ver 3 of rsync now does start to copy before it
indexes all the files. Unfortunately, it is not available on CentOS 5.

-tom


Re: faster rsync of huge directories

2010-04-12 Thread Shachar Shemesh

Tom Rosenfeld wrote:



On Mon, Apr 12, 2010 at 9:41 AM, Tom Rosenfeld tro...@bezeqint.net 
mailto:tro...@bezeqint.net wrote:


Hi,

I am a great fan of rsync for copying filesystems. However I now
have a filesystem which is several hundred gigabytes and
apparently has a lot of small files. I have been running rsync all
night and it still did not start copying as it is still building
the file list.
Is there any way to get it to start copying as it goes. Or do any
of you have a better tool?

Thanks,
-tom



Thanks for all the suggestions!

I realized that in my case I did not really need rsync since it is a 
local disk to disk copy.
Please note that rsync from local to local is just a glorified cp. It 
does not use the delta-transfer algorithm at all.
It was also pointed out that ver 3 of rsync now does start to copy 
before it indexes all the files. Unfortunately, it is not available on 
CentOS 5.



wget http://samba.anu.edu.au/ftp/rsync/src/rsync-3.0.7.tar.gz
tar xvzf rsync-3.0.7.tar.gz
cd rsync-3.0.7
./configure
make
su
make install

Shachar

--
Shachar Shemesh
Lingnu Open Source Consulting Ltd.
http://www.lingnu.com



Re: faster rsync of huge directories

2010-04-12 Thread Nadav Har'El
On Mon, Apr 12, 2010, Tom Rosenfeld wrote about Re: faster rsync of huge 
directories:
 I realized that in my case I did not really need rsync since it is a local
 disk to disk copy. I could have used a tar and pipe, but I like cpio:
 

Is this quicker?
If it is, then the reason for rsync's extreme slowness which you described
was *not* filesystem speed. It has to be something else. Maybe rsync
simply uses tons of memory, and starts thrashing? (but this is just a guess,
I didn't look at its code). If this is the case then the copy-while-building-
the-list that Shachar described might indeed be a big win.

  find $FROMDIR -depth -print | cpio -pdma $TODIR
 
 By default cpio also will not overwrite files if the source is not newer.

I recommend you use the -print0 option to find instead of -print, and
add the -0 option to cpio. These are GNU extensions to find and cpio (and
a bunch of other commands as well) that use NUL bytes, instead of newlines,
to separate the file names. This allows newline characters in filenames
(these aren't common, but nevertheless are legal...).

By the way, while cpio -p is indeed a good historic tool, nowadays there
is little reason to use it, because GNU's cp makes it easy to do almost
everything that cpio -p did: The -a option to cp is recursive and copies
links, modes, timestamps and so on, and the -u option will only copy if the
source is newer than the destination (or the destination is missing). So,

cp -au $FROMDIR $TODIR

is shorter and easier to remember than find | cpio -p. But please note I
didn't test this command, so don't use it on your important data without
thinking first!

-- 
Nadav Har'El|   Monday, Apr 12 2010, 28 Nisan 5770
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |I have a watch cat! If someone breaks in,
http://nadav.harel.org.il   |she'll watch.



Re: faster rsync of huge directories

2010-04-12 Thread Constantine Shulyupin
 By default cpio also will not overwrite files if the source is not newer.

Consider cp -ur

rsync can also --delete extraneous files from destination dirs

-- 
Constantine Shulyupin
Embedded Linux Expert
TI DaVinci Expert
Tel-Aviv Israel
http://www.LinuxDriver.co.il/



RE: faster rsync of huge directories

2010-04-12 Thread Boaz Yahav(berber)
Check out Repliweb

 

From: linux-il-boun...@cs.huji.ac.il [mailto:linux-il-boun...@cs.huji.ac.il]
On Behalf Of Tom Rosenfeld
Sent: Monday, April 12, 2010 9:41 AM
To: linux-il@cs.huji.ac.il
Subject: faster rsync of huge directories

 

Hi,

I am a great fan of rsync for copying filesystems. However I now have a
filesystem which is several hundred gigabytes and apparently has a lot of
small files. I have been running rsync all night and it still did not start
copying as it is still building the file list.
Is there any way to get it to start copying as it goes. Or do any of you
have a better tool?

Thanks,
-tom





Re: faster rsync of huge directories

2010-04-12 Thread Tom Rosenfeld
On Mon, Apr 12, 2010 at 10:04 AM, Vitaly li...@karasik.org wrote:

 2010/4/12 Tom Rosenfeld tro...@bezeqint.net
 
  Hi,
 
  I am a great fan of rsync for copying filesystems. However I now have a
 filesystem which is several hundred gigabytes and apparently has a lot of
 small files. I have been running rsync all night and it still did not start
 copying as it is still building the file list.
  Is there any way to get it to start copying as it goes. Or do any of you
 have a better tool?
 

 Are both servers on the same LAN? IMHO, your problem is network bandwidth
 between the source and destination.
 I have ~4M files, ~800GB - rsync is very fast on the same LAN (1Gb),
 and slow to a remote destination.

 Regards,
 Vitaly


I am not even using a lan. It is disk to disk. I have ~16M files ~900GB.
rsync has been running about 18 hours and has indexed over 8 million files,
but still did not copy even one.

Thanks,
-tom