Re: Storage compression patch for Rsync (unfinished)
Is there any reason why caching programs would need to set the value, rather than it just being a fixed value? I think it is hard to describe what this is for and what it should be set to. Maybe a --fixed-checksum-seed option would make some sense, or for a caching mechanism to be built in to rsync if it is shown to be very valuable.

I don't think I'll include the option in 2.5.6. I know people have proposed some caching mechanisms in the past and they've been rejected for one reason or another.

- Dave

On Fri, Jan 17, 2003 at 11:19:35PM -0800, Craig Barratt wrote:
> > While the idea of rsyncing with compression is mildly attractive I can't say I care for the new compression format. It would be better just to use the standard gzip or other format. If you are going to create a new file type you could at least discuss storing the blocksums in it so that the receiver wouldn't have to generate them.
>
> Yes! Caching the block checksums and file checksums could yield a large improvement for the receiver. However, an integer checksum seed is used in each block and file MD4 checksum. The default value is unix time() on the server, sent to the client at startup.
>
> So currently you can't cache block and file checksums (technically it is possible for block checksums since the checksum seed is appended at the end of each block, so you could cache the MD4 state prior to the checksum seed being added; for files you can't since the checksum seed is at the start).
>
> Enter a new option, --checksum-seed=NUM, that allows the checksum seed to be fixed. I've attached a patch below against 2.5.6pre1.
>
> The motivation for this is that BackupPC (http://backuppc.sourceforge.net) will shortly release rsync support, and I plan to support caching block and file checksums (in addition to the existing compression, hardlinking among any identical files etc). So it would be really great if this patch, or something similar, could make it into 2.5.6 or at a minimum the contributed patch area in 2.5.6.
Re: Storage compression patch for Rsync (unfinished)
> Is there any reason why caching programs would need to set the value, rather than it just being a fixed value? I think it is hard to describe what this is for and what it should be set to. Maybe a --fixed-checksum-seed option would make some sense, or for a caching mechanism to be built in to rsync if it is shown to be very valuable.

A fixed value would be perfectly ok; the same magic value that batch mode uses (32761) would make sense.

> I know people have proposed some caching mechanisms in the past and they've been rejected for one reason or another.

One difficulty is that additional files, or new file formats, are needed for storing the checksums, and that moves rsync further away from its core purpose.

> I don't think I'll include the option in 2.5.6.

If I submitted a new patch with --fixed-checksum-seed, would you be willing to at least add it to the patches directory for 2.5.6?

I will be adding block and file checksum caching to BackupPC, and that needs --fixed-checksum-seed. This will save me from providing a customized rsync (or rsync patches) as part of BackupPC; I would much rather tell people to get a vanilla 2.5.6 rsync release and apply the specific patch that comes with the release.

Craig

--
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
Re: Storage compression patch for Rsync (unfinished)
On Sun, Jan 26, 2003 at 02:46:43PM -0800, Craig Barratt wrote:
> > Is there any reason why caching programs would need to set the value, rather than it just being a fixed value? I think it is hard to describe what this is for and what it should be set to. Maybe a --fixed-checksum-seed option would make some sense, or for a caching mechanism to be built in to rsync if it is shown to be very valuable.
>
> A fixed value would be perfectly ok; the same magic value that batch mode uses (32761) would make sense.
>
> > I know people have proposed some caching mechanisms in the past and they've been rejected for one reason or another.
>
> One difficulty is that additional files, or new file formats, are needed for storing the checksums, and that moves rsync further away from its core purpose.
>
> > I don't think I'll include the option in 2.5.6.
>
> If I submitted a new patch with --fixed-checksum-seed, would you be willing to at least add it to the patches directory for 2.5.6?
>
> I will be adding block and file checksum caching to BackupPC, and that needs --fixed-checksum-seed. This will save me from providing a customized rsync (or rsync patches) as part of BackupPC; I would much rather tell people to get a vanilla 2.5.6 rsync release and apply the specific patch that comes with the release.

Block checksums come from the receiver, so cached block checksums are only useful when sending to a server, which had better know it has block checksums cached. It should be relatively easy to add a test prior to setup_protocol() to determine if block checksums are cached. Given those circumstances it shouldn't be necessary to add any command-line option for this.

Further, that test could set checksum_seed, so setup_protocol() could test checksum_seed to see if it is already set and not alter it, eliminating the need for a checksum_seed_set.

In fact, doing as above and moving checksum_seed = FIXED_CHECKSUM to the places in options.c where read_batch and write_batch are set would allow reducing the checksum_seed portion of setup_protocol() like so:

     if (remote_version >= 12) {
         if (am_server) {
-            if (read_batch || write_batch) /* dw */
-                checksum_seed = 32761;
-            else
-                checksum_seed = time(NULL);
+            if(!checksum_seed) checksum_seed = time(NULL);
             write_int(f_out,checksum_seed);
         } else {
             checksum_seed = read_int(f_in);
         }
     }

That not only simplifies the code but, I think, renders it more understandable. To save someone from looking, checksum_seed is initialized to 0 as part of the declaration in checksum.c.

--
J.W. Schultz            Pegasystems Technologies
email address: [EMAIL PROTECTED]
Remember Cernan and Schmitt
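The simplified server-side logic proposed above can be sketched in isolation. This is only an illustration under stated assumptions: pick_server_seed is a hypothetical helper, not a function in rsync, and the clock value is passed in as a parameter so the behaviour is easy to check. The convention that 0 means "not set" comes from the message above.

```c
#include <assert.h>
#include <time.h>

/* Value the thread quotes for batch mode; the macro name is illustrative. */
#define FIXED_CHECKSUM_SEED 32761

/*
 * Hypothetical helper (not part of rsync): return the seed a server would
 * announce. preset is whatever option handling left in checksum_seed;
 * 0 is treated as "not set", in which case the current time is used.
 */
int pick_server_seed(int preset, time_t now)
{
    assert(now != (time_t)-1);   /* time() reports failure as (time_t)-1 */
    if (!preset)
        return (int)now;
    return preset;
}
```

One consequence of using 0 as the "unset" marker is that a seed of exactly 0 can never be requested explicitly; with a single fixed value such as 32761 that limitation does not matter.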
Re: Storage compression patch for Rsync (unfinished)
> Block checksums come from the receiver so cached block checksums are only useful when sending to a server which had better know it has block checksums cached.

The first statement is true (block checksums come from the receiver), but the second doesn't follow. I need to cover the case where the client is the receiver and the client is caching the checksums. That needs a command-line switch, since the server would otherwise use time(NULL) as the checksum seed, which is then sent from the server to the client at protocol startup.

I agree with your changes though: the command-line handling code can set checksum_seed if any of write-batch, read-batch, or fixed-checksum-seed are specified, avoiding the additional variable.

Craig
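The option-handling side that Craig agrees to here might look roughly like the following. It is a sketch under assumptions: struct seed_opts and seed_from_options are made-up names for illustration, and rsync's real option parsing goes through popt rather than a struct like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Value the thread quotes for batch mode; the macro name is illustrative. */
#define FIXED_CHECKSUM_SEED 32761

/* Hypothetical summary of the three options named in the message. */
struct seed_opts {
    bool read_batch;
    bool write_batch;
    bool fixed_checksum_seed;
};

/*
 * Return the seed that command-line handling would store in checksum_seed:
 * the fixed value if any of the three options is given, otherwise 0 so that
 * the server later falls back to time(NULL).
 */
int seed_from_options(const struct seed_opts *o)
{
    assert(o != NULL);
    if (o->read_batch || o->write_batch || o->fixed_checksum_seed)
        return FIXED_CHECKSUM_SEED;
    return 0;
}
```

Because the decision is made once during option handling, no separate checksum_seed_set flag is needed, which is the simplification both posters settle on.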
Re: Storage compression patch for Rsync (unfinished)
On Sun, Jan 26, 2003 at 06:04:52PM -0800, Craig Barratt wrote:
> > Block checksums come from the receiver so cached block checksums are only useful when sending to a server which had better know it has block checksums cached.
>
> The first statement is true (block checksums come from the receiver), but the second doesn't follow. I need to cover the case where the client is the receiver and the client is caching the checksums. That needs a command-line switch, since the server would otherwise use time(NULL) as the checksum seed, which is then sent from the server to the client at protocol startup.

OK. I'll buy that as a possibility worth allowing for. This can be another option like --server, --daemon and --sender, neither discussed on the rsync manpage with normal options nor listed in USAGE. That will eliminate user confusion.

Someday we should probably get a writeup that covers these protocol-oriented command-line options and succinctly discusses the issues of what the sender and receiver do. The whitepaper (I reviewed it recently) is nice for describing the theory and all the math, but the pertinent implementation issues are neglected.

--
J.W. Schultz            Pegasystems Technologies
email address: [EMAIL PROTECTED]
Remember Cernan and Schmitt
Re: Storage compression patch for Rsync (unfinished)
On Sun, Jan 26, 2003 at 02:46:43PM -0800, Craig Barratt wrote:
> > Is there any reason why caching programs would need to set the value, rather than it just being a fixed value? I think it is hard to describe what this is for and what it should be set to. Maybe a --fixed-checksum-seed option would make some sense, or for a caching mechanism to be built in to rsync if it is shown to be very valuable.
>
> A fixed value would be perfectly ok; the same magic value that batch mode uses (32761) would make sense.
>
> > I know people have proposed some caching mechanisms in the past and they've been rejected for one reason or another.
>
> One difficulty is that additional files, or new file formats, are needed for storing the checksums, and that moves rsync further away from its core purpose.
>
> > I don't think I'll include the option in 2.5.6.
>
> If I submitted a new patch with --fixed-checksum-seed, would you be willing to at least add it to the patches directory for 2.5.6?
>
> I will be adding block and file checksum caching to BackupPC, and that needs --fixed-checksum-seed. This will save me from providing a customized rsync (or rsync patches) as part of BackupPC; I would much rather tell people to get a vanilla 2.5.6 rsync release and apply the specific patch that comes with the release.

Sorry, but I don't think it would be good to do even that until we've all had a chance to look at what's involved in the caching and whether or not it would make better sense to have it be a modification to rsync rather than mostly external. I'm concerned that people might misuse the option without understanding the consequences.

You could always keep the patch on the BackupPC web site in the meantime.

- Dave
Re: Storage compression patch for Rsync (unfinished)
> While the idea of rsyncing with compression is mildly attractive I can't say I care for the new compression format. It would be better just to use the standard gzip or other format. If you are going to create a new file type you could at least discuss storing the blocksums in it so that the receiver wouldn't have to generate them.

Yes! Caching the block checksums and file checksums could yield a large improvement for the receiver. However, an integer checksum seed is used in each block and file MD4 checksum. The default value is unix time() on the server, sent to the client at startup.

So currently you can't cache block and file checksums (technically it is possible for block checksums since the checksum seed is appended at the end of each block, so you could cache the MD4 state prior to the checksum seed being added; for files you can't since the checksum seed is at the start).

Enter a new option, --checksum-seed=NUM, that allows the checksum seed to be fixed. I've attached a patch below against 2.5.6pre1.

The motivation for this is that BackupPC (http://backuppc.sourceforge.net) will shortly release rsync support, and I plan to support caching block and file checksums (in addition to the existing compression, hardlinking among any identical files etc). So it would be really great if this patch, or something similar, could make it into 2.5.6 or at a minimum the contributed patch area in 2.5.6.

[Also, this option is convenient for debugging because it makes the rsync traffic identical between runs, assuming the file states at each end are the same too.]

Thanks,
Craig

###
diff -bur rsync-2.5.6pre1/checksum.c rsync-2.5.6pre1-csum/checksum.c
--- rsync-2.5.6pre1/checksum.c	Mon Apr 8 01:31:57 2002
+++ rsync-2.5.6pre1-csum/checksum.c	Thu Jan 16 23:38:47 2003
@@ -23,7 +23,7 @@
 
 #define CSUM_CHUNK 64
 
-int checksum_seed = 0;
+extern int checksum_seed;
 extern int remote_version;
 
 /*
diff -bur rsync-2.5.6pre1/compat.c rsync-2.5.6pre1-csum/compat.c
--- rsync-2.5.6pre1/compat.c	Sun Apr 7 20:50:13 2002
+++ rsync-2.5.6pre1-csum/compat.c	Fri Jan 17 21:18:35 2003
@@ -35,7 +35,7 @@
 extern int preserve_times;
 extern int always_checksum;
 extern int checksum_seed;
-
+extern int checksum_seed_set;
 extern int remote_version;
 extern int verbose;
 
@@ -64,11 +64,14 @@
 
     if (remote_version >= 12) {
         if (am_server) {
-            if (read_batch || write_batch) /* dw */
+            if (read_batch || write_batch) { /* dw */
+                if ( !checksum_seed_set )
                     checksum_seed = 32761;
-            else
+            } else {
+                if ( !checksum_seed_set )
                     checksum_seed = time(NULL);
                 write_int(f_out,checksum_seed);
+            }
         } else {
             checksum_seed = read_int(f_in);
         }
diff -bur rsync-2.5.6pre1/options.c rsync-2.5.6pre1-csum/options.c
--- rsync-2.5.6pre1/options.c	Fri Jan 10 17:30:11 2003
+++ rsync-2.5.6pre1-csum/options.c	Thu Jan 16 23:39:17 2003
@@ -116,6 +116,8 @@
 char *backup_dir = NULL;
 int rsync_port = RSYNC_PORT;
 int link_dest = 0;
+int checksum_seed = 0;
+int checksum_seed_set;
 
 int verbose = 0;
 int quiet = 0;
@@ -274,6 +276,7 @@
   rprintf(F,"     --bwlimit=KBPS          limit I/O bandwidth, KBytes per second\n");
   rprintf(F,"     --write-batch=PREFIX    write batch fileset starting with PREFIX\n");
   rprintf(F,"     --read-batch=PREFIX     read batch fileset starting with PREFIX\n");
+  rprintf(F,"     --checksum-seed=NUM     set MD4 checksum seed\n");
   rprintf(F," -h, --help                  show this help screen\n");
 #ifdef INET6
   rprintf(F," -4                          prefer IPv4\n");
@@ -293,7 +296,7 @@
       OPT_COPY_UNSAFE_LINKS, OPT_SAFE_LINKS, OPT_COMPARE_DEST, OPT_LINK_DEST,
       OPT_LOG_FORMAT, OPT_PASSWORD_FILE, OPT_SIZE_ONLY, OPT_ADDRESS,
       OPT_DELETE_AFTER, OPT_EXISTING, OPT_MAX_DELETE, OPT_BACKUP_DIR,
-      OPT_IGNORE_ERRORS, OPT_BWLIMIT, OPT_BLOCKING_IO,
+      OPT_IGNORE_ERRORS, OPT_BWLIMIT, OPT_BLOCKING_IO, OPT_CHECKSUM_SEED,
       OPT_NO_BLOCKING_IO, OPT_WHOLE_FILE, OPT_NO_WHOLE_FILE,
       OPT_MODIFY_WINDOW, OPT_READ_BATCH, OPT_WRITE_BATCH, OPT_IGNORE_EXISTING};
 
@@ -306,6 +309,7 @@
   {"ignore-times",    'I', POPT_ARG_NONE, &ignore_times,    0, 0, 0 },
   {"size-only",        0,  POPT_ARG_NONE, &size_only,       0, 0, 0 },
   {"modify-window",    0,  POPT_ARG_INT,  &modify_window,   OPT_MODIFY_WINDOW, 0, 0 },
+  {"checksum-seed",    0,  POPT_ARG_INT,  &checksum_seed,   OPT_CHECKSUM_SEED, 0, 0 },
   {"one-file-system", 'x', POPT_ARG_NONE, &one_file_system, 0, 0, 0 },
   {"delete",           0,  POPT_ARG_NONE, &delete_mode,     0, 0, 0 },
   {"existing",         0,  POPT_ARG_NONE, &only_existing,   0, 0, 0 },
@@ -489,6 +493,13 @@
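Craig's point about where the seed enters each checksum can be illustrated with a small stand-in. The sketch below does not implement MD4: it uses FNV-1a as a toy hash purely because its running state is a single word, and every name in it (toy_state, block_cache, block_finish, file_sum) is hypothetical. With a real MD4 the cached "state" would be the whole hash context, including any partially filled input buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for an MD4 context: FNV-1a keeps its state in one word. */
typedef struct { uint32_t h; } toy_state;

static void toy_init(toy_state *s) { s->h = 2166136261u; }

static void toy_update(toy_state *s, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        s->h ^= p[i];
        s->h *= 16777619u;
    }
}

/* Feed the seed as 4 little-endian bytes (byte order is an assumption). */
static void toy_update_seed(toy_state *s, int32_t seed)
{
    unsigned char b[4];
    uint32_t u = (uint32_t)seed;
    for (int i = 0; i < 4; i++)
        b[i] = (unsigned char)(u >> (8 * i));
    toy_update(s, b, sizeof b);
}

/* Block style: data first, seed appended at the end. */
uint32_t block_sum(const void *data, size_t len, int32_t seed)
{
    toy_state s;
    toy_init(&s);
    toy_update(&s, data, len);
    toy_update_seed(&s, seed);
    return s.h;
}

/* The state after the data is seed-independent, so it can be cached... */
toy_state block_cache(const void *data, size_t len)
{
    toy_state s;
    toy_init(&s);
    toy_update(&s, data, len);
    return s;
}

/* ...and finished later with whatever seed the server announces. */
uint32_t block_finish(toy_state cached, int32_t seed)
{
    toy_update_seed(&cached, seed);
    return cached.h;
}

/* File style: seed first, so every later state already depends on it. */
uint32_t file_sum(const void *data, size_t len, int32_t seed)
{
    toy_state s;
    toy_init(&s);
    toy_update_seed(&s, seed);
    toy_update(&s, data, len);
    return s.h;
}
```

In the block case the expensive pass over the data happens once in block_cache and only the four seed bytes are hashed per run. In the file case there is no seed-independent prefix to reuse, which is why a fixed seed is needed before file checksums can be cached at all.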
Re: Storage compression patch for Rsync (unfinished)
On Wed, Jan 15, 2003 at 11:50:27AM +0100, Harald Fielker wrote:
> Hi,
>
> I am using rsync for making backups of a MySQL database. The MySQL files can be compressed about 1:10 and I want to make use of this fact. Rsync currently doesn't support saving files in a compressed state.
>
> I personally think this should be a feature for the filesystem (in the sense of synchronised files) but currently there is no such filesystem for Linux available.

e2compr is not dead. See http://www.alizt.com/

> Here is my idea:
>
> We will have two new options:
> -X : this will specify a compress program (e.g. gzip, bzip...) - the default compressor is gzip
> -Z : this will activate storage file compression.

Why two options? Just specify the compressor and that enables compression.

> If -Z is enabled, every name (files, directories, links, ...) gets an extension called .rsc.

And .rsc stands for what, rsync? Even Windows has overcome the three-letter extension limit.

> If we have a true file, there is a header section and a data section. The header section will store the following attributes:
> - magic number
> - unpacked size
> - packed size
> - compress program (e.g. gzip, bzip2, ...)
> - magic number

So you add yet another compressed file format. There's something the world is crying out for.

> After the header section we will have the compressed file using the program the user gave us with -X.
>
> Every action in rsync will work - we will have some exceptions:
> 1) Every file object has the extension .rsc.
> 2) Doing simple checks (size, etc.) on files: the file size needs evaluation of the .rsc header.
> 3) The local file needs to be decompressed when it is accessed for reading.
> 4) The local file needs to be compressed after it was modified or created. A header section needs to be added.
> 5) The file stats (atime/ctime/mtime) will be applied to the .rsc file in the normal way.
>
> Problems/ideas:
> 1) On Unix this will allow us only files with names 255 - strlen(".rsc") ... but this might be a very, very rare case; we will disable compression for this single file.

Rsync already has issues with tempfile names. This is shorter.

> 2) Rsync will need a new option for decompressing and stating the .rsc file tree (single file, recursive). We should also offer options for validating .rsc files and converting a tree to a .rsc filetree.
>
> I am sending some compressor patches. I am very new to the rsync source, so here is a list of what I did:
>
> options.c: added -X and -Z options (-Z is passed through a server when using [EMAIL PROTECTED]:/directory)
>
> flist.c: extension .rsc is added to every file/directory (in -Z mode)
>
> rsync.c: finish_transfer() now does the compression when in -Z mode before stating the file. That means the compressed file has the same stat as the uncompressed file.
>
> receiver.c: I added two new functions:
> - storage_decompress: this will decompress an .rsc file to a tmp file, e.g. for calculating sums (note: a delete function is missing!)
> - storage_decompress_update_stats: this will update a given stat structure with the decompressed file size of the .rsc file.
>
> Currently transferring new files and compressing works. But the receiver doesn't make use of the stats that storage_decompress_update_stats produces. I don't know if I am calling it at the right place.
>
> I also don't know if the sum is always calculated for a file. If this is the case we need to store the md4 sum in the .rsc header.

While the idea of rsyncing with compression is mildly attractive I can't say I care for the new compression format. It would be better just to use the standard gzip or other format. If you are going to create a new file type you could at least discuss storing the blocksums in it so that the receiver wouldn't have to generate them.

Finally, I didn't even look at your patch because it was not text/plain. Unless absolutely necessary, patches should be either inline or text/plain attachments.

--
J.W. Schultz            Pegasystems Technologies
email address: [EMAIL PROTECTED]
Remember Cernan and Schmitt
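For readers trying to picture the proposed .rsc container, the header fields Harald lists could be laid out roughly as follows. Everything concrete here is an assumption made for illustration: the field widths, the little-endian byte order, the RSC_MAGIC value, the compressor codes, and the reading of the two "magic number" entries as a leading and a trailing marker. None of it comes from the actual patch, which was not posted inline. The last helper mirrors the 255 - strlen(".rsc") file name limit from the message.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RSC_MAGIC      0x52534331u  /* made-up value for illustration */
#define RSC_EXT        ".rsc"
#define RSC_NAME_MAX   255          /* name length limit cited in the thread */
#define RSC_HEADER_LEN 28           /* 4 + 8 + 8 + 4 + 4 bytes */

enum rsc_compressor { RSC_GZIP = 1, RSC_BZIP2 = 2 };  /* hypothetical codes */

struct rsc_header {
    uint64_t unpacked_size;
    uint64_t packed_size;
    uint32_t compressor;
};

/* Store the low n bytes of v in little-endian order. */
static void put_le(unsigned char *p, uint64_t v, int n)
{
    for (int i = 0; i < n; i++)
        p[i] = (unsigned char)(v >> (8 * i));
}

/* Read an n-byte little-endian value. */
static uint64_t get_le(const unsigned char *p, int n)
{
    uint64_t v = 0;
    for (int i = 0; i < n; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Layout: magic, unpacked size, packed size, compressor, magic. */
void rsc_encode(const struct rsc_header *h, unsigned char out[RSC_HEADER_LEN])
{
    assert(h != NULL && out != NULL);
    put_le(out,      RSC_MAGIC,        4);
    put_le(out + 4,  h->unpacked_size, 8);
    put_le(out + 12, h->packed_size,   8);
    put_le(out + 20, h->compressor,    4);
    put_le(out + 24, RSC_MAGIC,        4);
}

/* Returns 0 on success, -1 if either magic number does not match. */
int rsc_decode(const unsigned char in[RSC_HEADER_LEN], struct rsc_header *h)
{
    assert(in != NULL && h != NULL);
    if (get_le(in, 4) != RSC_MAGIC || get_le(in + 24, 4) != RSC_MAGIC)
        return -1;
    h->unpacked_size = get_le(in + 4, 8);
    h->packed_size   = get_le(in + 12, 8);
    h->compressor    = (uint32_t)get_le(in + 20, 4);
    return 0;
}

/* Nonzero if name plus the .rsc extension still fits in RSC_NAME_MAX bytes. */
int rsc_name_fits(const char *name)
{
    return strlen(name) + strlen(RSC_EXT) <= RSC_NAME_MAX;
}
```

Keeping the unpacked size in the header is what would let size-only checks run without decompressing the file, which is the purpose of storage_decompress_update_stats in the proposal.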