Re: rsync filename heuristics

2005-01-04 Thread Wayne Davison
On Wed, Jan 05, 2005 at 03:23:59PM +1100, Martin Pool wrote:
> On  5 Jan 2005, Rusty Russell <[EMAIL PROTECTED]> wrote:
> > On Tue, 2005-01-04 at 18:24 +0100, Robert Lemmen wrote:
> > > i read on some webpage about rsync and debian that you wrote a patch to
> > > rsync that let's it uses heuristics when deciding which local file to
> > > use. could you tell me whether this is planned to be included in a rsync
> > > release? could i have that patch?
> > 
> > Hmm, good question.  This is from 2.5.4, and can't remember how well it
> > worked.  Good luck!
> 
> I'm not the rsync maintainer anymore, but I think it would be cool if
> this were merged, if the current team feels OK about it.

The current version of the fuzzy.diff patch is in the "patches" dir (see
CVS or any recent tar file).  I mentioned earlier on the list that this
is one of the features that I'd like to see merged for the next release,
especially now that the generator tells the receiver when it resorts to
a non-default basis file (I disliked the duplicated fuzzy search in the
earlier patch).

However, I was thinking about improving it a bit more before finally
committing it:

  - It would be nice to make it not re-scan the current directory for
    each missing file.

  - Perhaps it should support finding a fuzzy match in a separate
    directory (or hierarchy) instead of just the missing file's
    current directory.

  - Perhaps also look into improving the fuzzy matching algorithm.  E.g.
    it currently requires that the suffix be identical between the
    files, and for some things (such as logs, or backups, etc.) the
    suffix may be different.
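
To make the last point concrete, here is a minimal sketch in C contrasting
two ways of scoring a candidate name. strict_score() follows the rule
described above (the final extension must be identical, then the score is
the length of the shared prefix). relaxed_score() is purely hypothetical and
is not part of fuzzy.diff: it adds the shared prefix and shared suffix
lengths, so a differing suffix lowers the score instead of ruling the file
out. Both function names, and the choice to require at least one shared
leading character, are assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Length of the longest common prefix of a and b. */
static size_t common_prefix(const char *a, const char *b)
{
	size_t n = 0;

	while (a[n] != '\0' && a[n] == b[n])
		n++;
	return n;
}

/* Length of the longest common suffix of a and b, capped at `limit`
 * so it cannot overlap characters already counted in the prefix. */
static size_t common_suffix(const char *a, const char *b, size_t limit)
{
	size_t alen = strlen(a), blen = strlen(b), n = 0;

	while (n < limit && n < alen && n < blen
	    && a[alen - 1 - n] == b[blen - 1 - n])
		n++;
	return n;
}

/* Strict rule: the final extension of `wanted` (e.g. ".gz") must also
 * end `name`; the score is then the shared prefix length. */
static unsigned int strict_score(const char *name, const char *wanted)
{
	const char *ext;
	size_t namelen = strlen(name), extlen;

	if (wanted[0] == '\0')
		return 0;
	/* Skip the first character so a leading dot is never the extension. */
	ext = strrchr(wanted + 1, '.');
	if (!ext)
		ext = wanted + strlen(wanted);	/* ext = "" */
	extlen = strlen(ext);

	if (namelen <= extlen || strcmp(name + namelen - extlen, ext) != 0)
		return 0;
	return (unsigned int)common_prefix(name, wanted);
}

/* Hypothetical relaxed rule: shared prefix plus shared suffix.
 * A candidate with no shared leading character still scores 0. */
static unsigned int relaxed_score(const char *name, const char *wanted)
{
	size_t namelen = strlen(name), wantlen = strlen(wanted);
	size_t shorter = namelen < wantlen ? namelen : wantlen;
	size_t pre = common_prefix(name, wanted);

	if (pre == 0)
		return 0;
	return (unsigned int)(pre + common_suffix(name, wanted, shorter - pre));
}
```

With the wanted file "access.log.2" and an existing "access.log.1", the
strict rule scores 0 because ".1" is not ".2", while the relaxed rule scores
11 for the shared "access.log." prefix. Whether a prefix-plus-suffix sum is
the right weighting is an open question; it is only one simple option.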

I like the idea too, so I expect it will make it into the next
release.  Some folks have been using rsync 2.6.3 with the released
fuzzy.diff applied, so I know that the patch is still in good shape
if you'd like to try it out.

..wayne..


Re: rsync filename heuristics

2005-01-04 Thread Martin Pool
On  5 Jan 2005, Rusty Russell <[EMAIL PROTECTED]> wrote:
> On Tue, 2005-01-04 at 18:24 +0100, Robert Lemmen wrote:
> > hi rusty,
> > 
> > i read on some webpage about rsync and debian that you wrote a patch to
> > rsync that let's it uses heuristics when deciding which local file to
> > use. could you tell me whether this is planned to be included in a rsync
> > release? could i have that patch?
> 
> Hmm, good question.  This is from 2.5.4, and can't remember how well it
> worked.  Good luck!

I'm not the rsync maintainer anymore, but I think it would be cool if
this were merged, if the current team feels OK about it.


> 
> Rusty.
> 
> diff -urN rsync-2.5.4/Makefile.in rsync-2.5.4-fuzzy/Makefile.in
> --- rsync-2.5.4/Makefile.in   2002-02-26 05:48:25.0 +1100
> +++ rsync-2.5.4-fuzzy/Makefile.in 2002-04-03 16:35:55.0 +1000
> @@ -28,7 +28,7 @@
>  ZLIBOBJ=zlib/deflate.o zlib/infblock.o zlib/infcodes.o zlib/inffast.o \
>   zlib/inflate.o zlib/inftrees.o zlib/infutil.o zlib/trees.o \
>   zlib/zutil.o zlib/adler32.o 
> -OBJS1=rsync.o generator.o receiver.o cleanup.o sender.o exclude.o util.o main.o checksum.o match.o syscall.o log.o backup.o
> +OBJS1=rsync.o generator.o receiver.o cleanup.o sender.o exclude.o util.o main.o checksum.o match.o syscall.o log.o backup.o alternate.o
>  OBJS2=options.o flist.o io.o compat.o hlink.o token.o uidlist.o socket.o fileio.o batch.o \
>  	clientname.o
>  DAEMON_OBJ = params.o loadparm.o clientserver.o access.o connection.o authenticate.o
> diff -urN rsync-2.5.4/alternate.c rsync-2.5.4-fuzzy/alternate.c
> --- rsync-2.5.4/alternate.c   1970-01-01 10:00:00.0 +1000
> +++ rsync-2.5.4-fuzzy/alternate.c 2002-04-03 17:04:15.0 +1000
> @@ -0,0 +1,117 @@
> +#include "rsync.h"
> +
> +extern char *compare_dest;
> +extern int verbose;
> +
> +/* Alternate methods for opening files, if local doesn't exist */
> +/* Sanity check that we are about to open regular file */
> +int do_open_regular(char *fname)
> +{
> +	STRUCT_STAT st;
> +
> +	if (do_stat(fname, &st) == 0 && S_ISREG(st.st_mode))
> +		return do_open(fname, O_RDONLY, 0);
> +
> +	return -1;
> +}
> +
> +static void split_names(char *fname, char **dirname, char **basename)
> +{
> +	char *slash;
> +
> +	slash = strrchr(fname, '/');
> +	if (slash) {
> +		*dirname = fname;
> +		*slash = '\0';
> +		*basename = slash+1;
> +	} else {
> +		*basename = fname;
> +		*dirname = ".";
> +	}
> +}
> +
> +static unsigned int measure_name(const char *name,
> +				 const char *basename,
> +				 const char *ext)
> +{
> +	int namelen = strlen(name);
> +	int extlen = strlen(ext);
> +	unsigned int score = 0;
> +
> +	/* Extensions must match */
> +	if (namelen <= extlen || strcmp(name+namelen-extlen, ext) != 0)
> +		return 0;
> +
> +	/* Now score depends on similarity of prefix */
> +	for (; *name==*basename && *name; name++, basename++)
> +		score++;
> +	return score;
> +}
> +
> +int open_alternate_base_fuzzy(const char *fname)
> +{
> +	DIR *d;
> +	struct dirent *di;
> +	char *basename, *dirname;
> +	char mangled_name[MAXPATHLEN];
> +	char bestname[MAXPATHLEN];
> +	unsigned int bestscore = 0;
> +	const char *ext;
> +
> +	/* FIXME: can we assume fname fits here? */
> +	strcpy(mangled_name, fname);
> +
> +	split_names(mangled_name, &dirname, &basename);
> +	d = opendir(dirname);
> +	if (!d) {
> +		rprintf(FERROR,"recv_generator opendir(%s): %s\n",
> +			dirname,strerror(errno));
> +		return -1;
> +	}
> +
> +	/* Get final extension, eg. .gz; never full basename though. */
> +	ext = strrchr(basename + 1, '.');
> +	if (!ext)
> +		ext = basename + strlen(basename); /* ext = "" */
> +
> +	while ((di = readdir(d)) != NULL) {
> +		const char *dname = d_name(di);
> +		unsigned int score;
> +
> +		if (strcmp(dname,".")==0 ||
> +		    strcmp(dname,"..")==0)
> +			continue;
> +
> +		score = measure_name(dname, basename, ext);
> +		if (verbose > 4)
> +			rprintf(FINFO,"fuzzy score for %s = %u\n",
> +				dname, score);
> +		if (score > bestscore) {
> +			strcpy(bestname, dname);
> +			bestscore = score;
> +		}
> +	}
> +	closedir(d);
> +
> +	/* Found a candidate. */
> +	if (bestscore != 0) {
> +		char fuzzyname[MAXPATHLEN];
> +
> +		snprintf(fuzzyname,MAXPATHLEN,"%s/%s", dirname, bestname);
> +		if (verbose > 2)
> +			rprintf(FINFO,"fuzzy match %s->%s\n",
> +				fname, fuzzyname);
> +		return do_open_r