Re: Storage compression patch for Rsync (unfinished)

2003-01-26 Thread Dave Dykstra
Is there any reason why caching programs would need to set the
value, rather than it just being a fixed value?  I think it is hard
to describe what this is for and what it should be set to.  Maybe a
--fixed-checksum-seed option would make some sense, or for a caching
mechanism to be built in to rsync if it is shown to be very valuable.
I don't think I'll include the option in 2.5.6.  I know people have
proposed some caching mechanisms in the past and they've been rejected
for one reason or another.

- Dave

On Fri, Jan 17, 2003 at 11:19:35PM -0800, Craig Barratt wrote:
  While the idea of rsyncing with compression is mildly
  attractive i can't say i care for the new compression
  format.  It would be better just to use the standard gzip or
  other format.  If you are going to create a new file type
  you could at least discuss storing the blocksums in it so
  that the receiver wouldn't have to generate them.
 
 Yes!  Caching the block checksums and file checksums could yield a large
 improvement for the receiver.  However, an integer checksum seed is used
 in each block and file MD4 checksum. The default value is unix time() on
 the server, sent to the client at startup.
 
 So currently you can't cache block and file checksums (technically it is
 possible for block checksums since the checksum seed is appended at the
 end of each block, so you could cache the MD4 state prior to the checksum
 seed being added; for files you can't since the checksum seed is at the
 start).
 
 Enter a new option, --checksum-seed=NUM, that allows the checksum seed to
 be fixed.  I've attached a patch below against 2.5.6pre1.
 
 The motivation for this is that BackupPC (http://backuppc.sourceforge.net)
 will shortly release rsync support, and I plan to support caching
 block and file checksums (in addition to the existing compression,
 hardlinking among any identical files etc).  So it would be really
 great if this patch, or something similar, could make it into 2.5.6
 or at a minimum the contributed patch area in 2.5.6.
 
 [Also, this option is convenient for debugging because it makes the
 rsync traffic identical between runs, assuming the file states at
 each end are the same too.]
 
 Thanks,
 Craig
 

Re: Storage compression patch for Rsync (unfinished)

2003-01-26 Thread Craig Barratt
 Is there any reason why caching programs would need to set the
 value, rather than it just being a fixed value?
 I think it is hard to describe what this is for and what it should be
 set to.  Maybe a --fixed-checksum-seed option would make some sense,
 or for a caching mechanism to be built in to rsync if it is shown to
 be very valuable.

A fixed value would be perfectly ok; the same magic value that batch
mode uses (32761) would make sense.

 I know people have proposed some caching mechanisms in the past and
 they've been rejected for one reason or another.

One difficulty is that additional files, or new file formats, are needed
for storing the checksums, and that moves rsync further away from its
core purpose.

 I don't think I'll include the option in 2.5.6.

If I submitted a new patch with --fixed-checksum-seed, would you be
willing to at least add it to the patches directory for 2.5.6?

I will be adding block and file checksum caching to BackupPC, and
that needs --fixed-checksum-seed.  This will save me from providing
a customized rsync (or rsync patches) as part of BackupPC; I would
much rather tell people to get a vanilla 2.5.6 rsync release and
apply the specific patch that comes with the release.

Craig
-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: Storage compression patch for Rsync (unfinished)

2003-01-26 Thread jw schultz
On Sun, Jan 26, 2003 at 02:46:43PM -0800, Craig Barratt wrote:
  Is there any reason why caching programs would need to set the
  value, rather than it just being a fixed value?
  I think it is hard to describe what this is for and what it should be
  set to.  Maybe a --fixed-checksum-seed option would make some sense,
  or for a caching mechanism to be built in to rsync if it is shown to
  be very valuable.
 
 A fixed value would be perfectly ok; the same magic value that batch
 mode uses (32761) would make sense.
 
  I know people have proposed some caching mechanisms in the past and
  they've been rejected for one reason or another.
 
 One difficulty is that additional files, or new file formats, are needed
 for storing the checksums, and that moves rsync further away from its
 core purpose.
 
  I don't think I'll include the option in 2.5.6.
 
 If I submitted a new patch with --fixed-checksum-seed, would you be
 willing to at least add it to the patches directory for 2.5.6?
 
 I will be adding block and file checksum caching to BackupPC, and
 that needs --fixed-checksum-seed.  This will save me from providing
 a customized rsync (or rsync patches) as part of BackupPC; I would
 much rather tell people to get a vanilla 2.5.6 rsync release and
 apply the specific patch that comes with the release.

Block checksums come from the receiver so cached block
checksums are only useful when sending to a server which had
better know it has block checksums cached.  It should be
relatively easy to add a test prior to setup_protocol()
to determine if block checksums are cached.  Given those
circumstances it shouldn't be necessary to add any
command-line option for this.  Further, that test could
set the checksum_seed so setup_protocol could test
checksum_seed to see if it is already set and not alter it,
eliminating the need for a checksum_seed_set.

In fact doing as above and moving checksum_seed =
FIXED_CHECKSUM to the places in options.c where read_batch
and write_batch are set would allow reducing the
checksum_seed portion of setup_protocol like so:

     if (remote_version >= 12) {
         if (am_server) {
-            if (read_batch || write_batch) /* dw */
-                checksum_seed = 32761;
-            else
-                checksum_seed = time(NULL);
+            if(!checksum_seed) checksum_seed = time(NULL);
             write_int(f_out,checksum_seed);
         } else {
             checksum_seed = read_int(f_in);
         }
     }

Not only simplifying the code but i think rendering it more
understandable.

To save someone from looking, checksum_seed is initialized
to 0 as part of the declaration in checksum.c 
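
The simplified seed selection can be sketched as a standalone C function.
This is an illustration under assumptions, not the rsync source:
server_checksum_seed is a hypothetical helper, and FIXED_CHECKSUM_SEED only
gives a name to the 32761 value that batch mode uses.

```c
#include <time.h>

/* Name for the magic value batch mode uses (32761 in the thread). */
#define FIXED_CHECKSUM_SEED 32761

/* Hypothetical helper: returns the seed the server would announce.
 * preset == 0 means "nothing has chosen a seed yet", mirroring the
 * zero initialisation of checksum_seed. */
static int server_checksum_seed(int preset)
{
    if (preset)
        return preset;          /* option handling already fixed the seed */
    return (int)time(NULL);     /* default: seed from the wall clock */
}
```

One consequence of this shape is that 0 doubles as "not set", so a seed of 0
could not be requested explicitly.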


-- 

J.W. Schultz            Pegasystems Technologies
email address:  [EMAIL PROTECTED]

Remember Cernan and Schmitt



Re: Storage compression patch for Rsync (unfinished)

2003-01-26 Thread Craig Barratt
 Block checksums come from the receiver so cached block
 checksums are only useful when sending to a server which had
 better know it has block checksums cached.

The first statement is true (block checksums come from the receiver),
but the second doesn't follow.  I need to cover the case where the
client is the receiver and the client is caching the checksums. That
needs a command-line switch, since the server would otherwise use
time(NULL) as the checksum seed, which is then sent from the server
to the client at protocol startup.

I agree with your changes though: the command-line handling code can set
checksum_seed if any of write-batch, read-batch, or fixed-checksum-seed
are specified, avoiding the additional variable.
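
A minimal sketch of that command-line side, in C. The struct seed_opts and
the function seed_from_options are hypothetical names used only for
illustration; the real option parsing in options.c goes through popt and is
not reproduced here.

```c
/* Name for the magic value batch mode uses (32761 in the thread). */
#define FIXED_CHECKSUM_SEED 32761

/* Hypothetical flags, one per option discussed in the thread. */
struct seed_opts {
    int read_batch;
    int write_batch;
    int fixed_checksum_seed;
};

/* Returns the seed that option handling would store in checksum_seed.
 * 0 means "leave it unset", so protocol setup falls back to time(NULL). */
static int seed_from_options(const struct seed_opts *o)
{
    if (o->read_batch || o->write_batch || o->fixed_checksum_seed)
        return FIXED_CHECKSUM_SEED;
    return 0;
}
```

With the seed decided here, no separate checksum_seed_set flag is needed.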

Craig



Re: Storage compression patch for Rsync (unfinished)

2003-01-26 Thread jw schultz
On Sun, Jan 26, 2003 at 06:04:52PM -0800, Craig Barratt wrote:
  Block checksums come from the receiver so cached block
  checksums are only useful when sending to a server which had
  better know it has block checksums cached.
 
 The first statement is true (block checksums come from the receiver),
 but the second doesn't follow.  I need to cover the case where the
 client is the receiver and the client is caching the checksums. That
 needs a command-line switch, since the server would otherwise use
 time(NULL) as the checksum seed, which is then sent from the server
 to the client at protocol startup.

OK.  I'll buy that as a possibility worth allowing for.
This can be another option like --server, --daemon and
--sender, neither discussed on the rsync manpage with normal
options nor listed in USAGE.  That will eliminate user
confusion.

Someday we should probably get a writeup that covers these
protocol-oriented command-line options and succinctly
discusses the issues of what the sender and receiver do.
The whitepaper (i reviewed it recently) is nice for
describing the theory and all the math but the pertinent
implementation issues are neglected.

-- 

J.W. Schultz            Pegasystems Technologies
email address:  [EMAIL PROTECTED]

Remember Cernan and Schmitt



Re: Storage compression patch for Rsync (unfinished)

2003-01-26 Thread Dave Dykstra
On Sun, Jan 26, 2003 at 02:46:43PM -0800, Craig Barratt wrote:
  Is there any reason why caching programs would need to set the
  value, rather than it just being a fixed value?
  I think it is hard to describe what this is for and what it should be
  set to.  Maybe a --fixed-checksum-seed option would make some sense,
  or for a caching mechanism to be built in to rsync if it is shown to
  be very valuable.
 
 A fixed value would be perfectly ok; the same magic value that batch
 mode uses (32761) would make sense.
 
  I know people have proposed some caching mechanisms in the past and
  they've been rejected for one reason or another.
 
 One difficulty is that additional files, or new file formats, are needed
 for storing the checksums, and that moves rsync further away from its
 core purpose.
 
  I don't think I'll include the option in 2.5.6.
 
 If I submitted a new patch with --fixed-checksum-seed, would you be
 willing to at least add it to the patches directory for 2.5.6?
 
 I will be adding block and file checksum caching to BackupPC, and
 that needs --fixed-checksum-seed.  This will save me from providing
 a customized rsync (or rsync patches) as part of BackupPC; I would
 much rather tell people to get a vanilla 2.5.6 rsync release and
 apply the specific patch that comes with the release.


Sorry, but I don't think it would be good to do even that until we've all
had a chance to look at what's involved in the caching and whether or not
it would make better sense to have it be a modification to rsync rather
than mostly external.  I'm concerned that people might misuse the option
without understanding the consequences.  You could always keep the patch
on the BackupPC web site in the meantime.

- Dave



Re: Storage compression patch for Rsync (unfinished)

2003-01-17 Thread Craig Barratt
 While the idea of rsyncing with compression is mildly
 attractive i can't say i care for the new compression
 format.  It would be better just to use the standard gzip or
 other format.  If you are going to create a new file type
 you could at least discuss storing the blocksums in it so
 that the receiver wouldn't have to generate them.

Yes!  Caching the block checksums and file checksums could yield a large
improvement for the receiver.  However, an integer checksum seed is used
in each block and file MD4 checksum. The default value is unix time() on
the server, sent to the client at startup.

So currently you can't cache block and file checksums (technically it is
possible for block checksums since the checksum seed is appended at the
end of each block, so you could cache the MD4 state prior to the checksum
seed being added; for files you can't since the checksum seed is at the
start).
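
To illustrate the ordering argument, here is a minimal sketch in C. It uses
a toy incremental hash (32-bit FNV-1a) as a stand-in for MD4, feeds the seed
in native byte order, and all names (toy_ctx, block_state,
block_sum_from_state, file_sum) are hypothetical rather than rsync
functions. With the seed appended last, the state after the block data can
be cached and finished with whatever seed a later run uses. With the seed
first, nothing computed from the data alone can be reused.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy incremental hash state (FNV-1a); a stand-in for the MD4 context. */
typedef struct { uint32_t h; } toy_ctx;

static void toy_begin(toy_ctx *c) { c->h = 2166136261u; }

static void toy_update(toy_ctx *c, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i++) {
        c->h ^= p[i];
        c->h *= 16777619u;
    }
}

/* Block style: data first, seed appended at the end. */
static uint32_t block_sum(const void *data, size_t len, int32_t seed)
{
    toy_ctx c;
    toy_begin(&c);
    toy_update(&c, data, len);
    toy_update(&c, &seed, sizeof seed);
    return c.h;
}

/* Cacheable part: the state after the data, before any seed. */
static toy_ctx block_state(const void *data, size_t len)
{
    toy_ctx c;
    toy_begin(&c);
    toy_update(&c, data, len);
    return c;
}

/* Finish a cached state with this run's seed, without rereading data. */
static uint32_t block_sum_from_state(toy_ctx cached, int32_t seed)
{
    toy_update(&cached, &seed, sizeof seed);
    return cached.h;
}

/* File style: seed first, so the state depends on the seed from byte 0. */
static uint32_t file_sum(const void *data, size_t len, int32_t seed)
{
    toy_ctx c;
    toy_begin(&c);
    toy_update(&c, &seed, sizeof seed);
    toy_update(&c, data, len);
    return c.h;
}
```

A fixed seed sidesteps the difference: both kinds of checksum then depend
only on the data, so the finished digests themselves can be stored.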

Enter a new option, --checksum-seed=NUM, that allows the checksum seed to
be fixed.  I've attached a patch below against 2.5.6pre1.

The motivation for this is that BackupPC (http://backuppc.sourceforge.net)
will shortly release rsync support, and I plan to support caching
block and file checksums (in addition to the existing compression,
hardlinking among any identical files etc).  So it would be really
great if this patch, or something similar, could make it into 2.5.6
or at a minimum the contributed patch area in 2.5.6.

[Also, this option is convenient for debugging because it makes the
rsync traffic identical between runs, assuming the file states at
each end are the same too.]

Thanks,
Craig

###
diff -bur rsync-2.5.6pre1/checksum.c rsync-2.5.6pre1-csum/checksum.c
--- rsync-2.5.6pre1/checksum.c  Mon Apr  8 01:31:57 2002
+++ rsync-2.5.6pre1-csum/checksum.c Thu Jan 16 23:38:47 2003
@@ -23,7 +23,7 @@
 
 #define CSUM_CHUNK 64
 
-int checksum_seed = 0;
+extern int checksum_seed;
 extern int remote_version;
 
 /*
diff -bur rsync-2.5.6pre1/compat.c rsync-2.5.6pre1-csum/compat.c
--- rsync-2.5.6pre1/compat.c    Sun Apr  7 20:50:13 2002
+++ rsync-2.5.6pre1-csum/compat.c   Fri Jan 17 21:18:35 2003
@@ -35,7 +35,7 @@
 extern int preserve_times;
 extern int always_checksum;
 extern int checksum_seed;
-
+extern int checksum_seed_set;
 
 extern int remote_version;
 extern int verbose;
@@ -64,11 +64,14 @@
 
     if (remote_version >= 12) {
         if (am_server) {
-            if (read_batch || write_batch) /* dw */
+            if (read_batch || write_batch) { /* dw */
+                if ( !checksum_seed_set )
                 checksum_seed = 32761;
-            else
+            } else {
+                if ( !checksum_seed_set )
                 checksum_seed = time(NULL);
                 write_int(f_out,checksum_seed);
+            }
         } else {
             checksum_seed = read_int(f_in);
         }
diff -bur rsync-2.5.6pre1/options.c rsync-2.5.6pre1-csum/options.c
--- rsync-2.5.6pre1/options.c   Fri Jan 10 17:30:11 2003
+++ rsync-2.5.6pre1-csum/options.c  Thu Jan 16 23:39:17 2003
@@ -116,6 +116,8 @@
 char *backup_dir = NULL;
 int rsync_port = RSYNC_PORT;
 int link_dest = 0;
+int checksum_seed = 0;
+int checksum_seed_set;
 
 int verbose = 0;
 int quiet = 0;
@@ -274,6 +276,7 @@
   rprintf(F,"     --bwlimit=KBPS          limit I/O bandwidth, KBytes per second\n");
   rprintf(F,"     --write-batch=PREFIX    write batch fileset starting with PREFIX\n");
   rprintf(F,"     --read-batch=PREFIX     read batch fileset starting with PREFIX\n");
+  rprintf(F,"     --checksum-seed=NUM     set MD4 checksum seed\n");
   rprintf(F," -h, --help                  show this help screen\n");
 #ifdef INET6
   rprintf(F," -4                          prefer IPv4\n");
@@ -293,7 +296,7 @@
   OPT_COPY_UNSAFE_LINKS, OPT_SAFE_LINKS, OPT_COMPARE_DEST, OPT_LINK_DEST,
   OPT_LOG_FORMAT, OPT_PASSWORD_FILE, OPT_SIZE_ONLY, OPT_ADDRESS,
   OPT_DELETE_AFTER, OPT_EXISTING, OPT_MAX_DELETE, OPT_BACKUP_DIR, 
-  OPT_IGNORE_ERRORS, OPT_BWLIMIT, OPT_BLOCKING_IO,
+  OPT_IGNORE_ERRORS, OPT_BWLIMIT, OPT_BLOCKING_IO, OPT_CHECKSUM_SEED,
   OPT_NO_BLOCKING_IO, OPT_WHOLE_FILE, OPT_NO_WHOLE_FILE,
   OPT_MODIFY_WINDOW, OPT_READ_BATCH, OPT_WRITE_BATCH, OPT_IGNORE_EXISTING};
 
@@ -306,6 +309,7 @@
   {"ignore-times",    'I', POPT_ARG_NONE,   &ignore_times , 0, 0, 0 },
   {"size-only",        0,  POPT_ARG_NONE,   &size_only , 0, 0, 0 },
   {"modify-window",    0,  POPT_ARG_INT,    &modify_window, OPT_MODIFY_WINDOW, 0, 0 },
+  {"checksum-seed",    0,  POPT_ARG_INT,    &checksum_seed, OPT_CHECKSUM_SEED, 0, 0 },
   {"one-file-system", 'x', POPT_ARG_NONE,   &one_file_system , 0, 0, 0 },
   {"delete",           0,  POPT_ARG_NONE,   &delete_mode , 0, 0, 0 },
   {"existing",         0,  POPT_ARG_NONE,   &only_existing , 0, 0, 0 },
@@ -489,6 +493,13 @@

Re: Storage compression patch for Rsync (unfinished)

2003-01-15 Thread jw schultz
On Wed, Jan 15, 2003 at 11:50:27AM +0100, Harald Fielker wrote:
 Hi,
 
 i am using Rsync for making backups of a MySQL database. The MySQL files can 
 be compressed about 1:10 and i want to make use of this fact.
 
 Rsync currently doesn't support saving files in a compressed state. I 
 personally think this should be a feature for the filesystem (in the sense of 
 synchronised files) but currently there is no such filesystem for Linux 
 available.

e2compr is not dead.  See http://www.alizt.com/

 Here my idea:
 
 We will have two new options:
 
 -X : this will specify a compression program (e.g. gzip, bzip2...) - the
 default compressor is gzip
 -Z : this will activate storage file compression.

Why two options?  Just specify the compressor and that
enables compression.

 If -Z is enabled, every name (files, directories, links, ...) gets an
 extension called .rsc.

And .rsc stands for what, rsync?  Even windows has overcome
the three letter extension limit.

 If we have a true file, there is a header section and a data section. The
 header section will store the following attributes:
 
 - magic number
 - unpacked size
 - packed size
 - compression program (e.g. gzip, bzip2,  ...)
 - magic number
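
One way to picture the proposed header is the C sketch below. The message
does not specify field widths, byte order, or the magic value, so everything
concrete here is an assumption: RSC_MAGIC, the 36-byte layout, the 12-byte
compressor name, and the helpers rsc_pack and rsc_unpack are all
hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define RSC_MAGIC      0x52534331u  /* hypothetical value, ASCII "RSC1" */
#define RSC_HEADER_LEN 36           /* 4 + 8 + 8 + 12 + 4 bytes */

struct rsc_header {
    uint64_t unpacked_size;
    uint64_t packed_size;
    char     compressor[12];        /* NUL-padded name, e.g. "gzip" */
};

/* Store the low n bytes of v at p, most significant byte first. */
static void put_be(unsigned char *p, uint64_t v, int n)
{
    for (int i = n - 1; i >= 0; i--) {
        p[i] = (unsigned char)(v & 0xff);
        v >>= 8;
    }
}

/* Read n big-endian bytes from p. */
static uint64_t get_be(const unsigned char *p, int n)
{
    uint64_t v = 0;
    for (int i = 0; i < n; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Serialise: magic, unpacked size, packed size, compressor, magic. */
static void rsc_pack(const struct rsc_header *h, unsigned char *out)
{
    put_be(out, RSC_MAGIC, 4);
    put_be(out + 4, h->unpacked_size, 8);
    put_be(out + 12, h->packed_size, 8);
    memcpy(out + 20, h->compressor, 12);
    put_be(out + 32, RSC_MAGIC, 4);
}

/* Parse a header; returns -1 if either magic number is wrong. */
static int rsc_unpack(const unsigned char *in, struct rsc_header *h)
{
    if (get_be(in, 4) != RSC_MAGIC || get_be(in + 32, 4) != RSC_MAGIC)
        return -1;
    h->unpacked_size = get_be(in + 4, 8);
    h->packed_size = get_be(in + 12, 8);
    memcpy(h->compressor, in + 20, 12);
    h->compressor[11] = '\0';       /* guarantee termination */
    return 0;
}
```

Reading only these 36 bytes would be enough to report the unpacked size for
the size checks discussed below, without decompressing the data section.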

So you add yet another compressed file format.  There's
something the world is crying out for.

 After the header section we will have the compressed file using the program
 the user gave us with -X
 
 Every action in rsync will work, with some exceptions:
 
 1) Every file object has the extension .rsc.
 2) Doing simple checks (size, etc.) on files. the filesize needs evaluation 
 for the .rsc header.
 3) The local file needs to be decompressed when it is accessed for reading.
 4) The local file needs to be compressed after it was modified or created. A 
 header section needs to be added.
 5) The file stats (atime/ctime/mtime) will be applied to the .rsc file. In 
 normal way.
 
 Problems/ideas:
 
 1) On Unix this will allow us only files with names up to
 255 - strlen(".rsc") characters ... but this should be a very rare case;
 we will disable compression for that single file.

Rsync already has issues with tempfile names.  This is
shorter.
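
The length limit works out to 255 - 4 = 251 characters. A small C sketch of
the check follows; NAME_MAX_BYTES takes the 255 figure from the message, and
fits_with_suffix is a hypothetical helper, not part of the submitted patch.

```c
#include <string.h>

#define NAME_MAX_BYTES 255      /* per-name limit cited in the message */
#define RSC_SUFFIX     ".rsc"

/* Returns nonzero if name plus ".rsc" still fits in one directory entry. */
static int fits_with_suffix(const char *name)
{
    return strlen(name) + strlen(RSC_SUFFIX) <= NAME_MAX_BYTES;
}
```

A name that fails the check would be stored uncompressed, as the proposal
suggests.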

 2) Rsync will need a new option for decompressing and stating the .rsc file 
 tree. (single file, recursive)
 
 We should also offer options for validating .rsc files and converting a tree 
 to a .rsc filetree.
 
 I am sending some compressor patches. I am very new to the rsync source, so 
 here a list of what i did:
 
 options.c
 - added -X and -Z options (-Z is passed through to a server when using
 [EMAIL PROTECTED]:/directory)
 
 flist.c:
 extension .rsc is added to every file/directory (in -Z mode)
 
 rsync.c:
 finish_transfer() now does the compression when in -Z mode before stating the 
 file. That means the compressed file has the same stat as the uncompressed 
 file.
 
 receiver.c:
 I added two new functions: 
 - storage_decompress: this will decompress an .rsc file to a tmp file, e.g. 
 for calculating sums (note: a delete function is missing!)
 
 - storage_decompress_update_stats: this will update a given stat structure 
 with the decompressed filesize of the rsc file.
 
 
 Currently transferring new files and compressing works. But the receiver
 doesn't make use of the stats that storage_decompress_update_stats updates.
 I don't know if i am calling it at the right place. I also don't know if the
 sum is always calculated for a file. If this is the case we need to store
 the md4 sum in the .rsc header.

While the idea of rsyncing with compression is mildly
attractive i can't say i care for the new compression
format.  It would be better just to use the standard gzip or
other format.  If you are going to create a new file type
you could at least discuss storing the blocksums in it so
that the receiver wouldn't have to generate them.

Finally, i didn't even look at your patch because it was not
text/plain.  Unless absolutely necessary, patches should be
either inline or text/plain attachments.


-- 

J.W. Schultz            Pegasystems Technologies
email address:  [EMAIL PROTECTED]

Remember Cernan and Schmitt