I've put some thought into adding DB support to rsync (in a future release). This would allow it to maintain some extra information about files and to look up information rapidly. This would support things like caching of checksum information, finding files to hard-link with, and saving file attributes separately from the files (allowing non-root preservation of full file attributes as well as multiple attributes per inode).

I imagine adding a single option that specifies a DB config file. This option would be in a daemon's config too (with no ability for the remote user to affect a daemon) and would probably also have an environment variable equivalent (to allow all rsync commands to be affected). The config file would contain info on what DB accessor to use, connect info, and what sort of information you wish to store in the DB.

I was thinking about the following table structure:

-----

TABLE: disk

    disk_id     int32         auto_increment
    devno       int64
    comment     varchar(64)

    PRIMARY KEY: disk_id (unique)
    KEY: devno (non-unique)

This table is auto-populated with devno information, as needed. This extra indirection allows someone to unmount a disk, set the unmounted disk's devno to 0 in the table, mount a new disk, and either update the devno of a disk that was mounted before, or let the new disk get an auto-generated disk_id (even if it ends up with the same devno as the just-unmounted disk had before). We might also want an option to not allow auto-generated disk_ids, to avoid a mount race condition (having the DB routines sleep and look up the devno again).

-----

TABLE: inode_map

    disk_id     int32
    ino         int64
    size        int64
    mtime       int64
    ctime       int64
    md4         byte(16)      NULL-OK
    md5         byte(16)      NULL-OK

    PRIMARY KEY: disk_id + ino (unique)
    KEY: size + md4 (non-unique)
    KEY: size + md5 (non-unique)
    KEY: size + mtime (non-unique)

This table facilitates the caching of extra info by inode. It can also be used to look up an inode matching certain requirements. This allows a link-by-hash algorithm, as well as the finding of alternate basis files. The checksum keys are not unique because there may be identical files that aren't hard-linked together (depending on options and hard-link limitations).

-----

TABLE: name_map

    name_md5    byte(16)      (DB-specific)
    name        text
    disk_id     int32
    ino         int64
    mtime       int64
    ctime       int64
    mode        int16
    uid         int32
    gid         int32
    acls_id     int64         NULL-OK (omit?)
    xattr_id    int64         NULL-OK (omit?)

    PRIMARY KEY: name_md5 (or name) (unique)
    KEY: disk_id + ino (unique)

This table allows the caching of file information based on name, allowing an inode to have multiple instances with differing file attributes (which is why some of the data duplicates info in the inode_map table).

The use of a name_md5 field will be DB-specific, depending on whether the database can handle a primary key on a really long name efficiently. If not, the DB accessor routine will create an MD5 checksum of the name and use that as the primary key. A database implementation may even choose to store the name in a separate table with a unique id if that is more efficient for it.

If ACL and extended attribute information is included, it will be stored as an ID reference to separate tables.

-----

Imagined calls that rsync would use:

db_open(CONFIG_FILENAME_PTR, CHROOT_PATH_PTR, FLAGS);

    # CHROOT_PATH_PTR: can be NULL.
    # FLAGS: active-checksum-type, incl-acl-info, incl-xattr-info, etc.

The chroot path modifies incoming filenames into a global DB context and strips the returned filenames down to work in a chroot (it also ensures that no filenames outside the chroot will be returned).

db_stat(FILENAME_PTR, STATX_STRUCT_PTR, CHKSUM_PTR, FLAGS);

    # CHKSUM_PTR: can be NULL. Will be returned if enabled in db_open().
    # FLAGS: lstat/stat, use-checksum-for-stat

The stat info is used during the lookup, and then updated. Stat would try to handle renamed files by using both filename and inode info, checking it for accuracy, and updating the DB if a rename had occurred. (It would not be able to handle a renamed file that had been modified.)

FILENAME_PTR = db_find(PATH_PTR, CHKSUM_PTR, FLAGS, STATX_STRUCT_PTR);

    # PATH_PTR: can be NULL, or can specify a desired path prefix.
    # CHKSUM_PTR: can be NULL. Type matches db_open() flags.
    # FLAGS: find-any-match, find-a-match-for-hard-linking, require-prefix.

The stat info is used to find a good match, and then updated. E.g., this could be used by an inc_recurse transfer to find an existing hard-link somewhere in the destination hierarchy. It could also be used to try to find a decent basis file or a renamed file. We may want some kind of a fuzzy matching option.

db_update(FILENAME_PTR, CHKSUM_PTR, FLAGS, STATX_STRUCT_PTR);

    # CHKSUM_PTR: can be NULL if doing MD4 checksum w/o --checksum.

db_delete(FILENAME_PTR);

Removes a name from the DB. I assume that inode information would be pruned when no names remain that reference the inode. Deletions would also happen internally when the code discovered that a file it was looking up no longer exists.

db_close();

-----

The routines would need to be resilient enough to handle cases where the DB information is out of date with the filesystem information, checking as needed, and updating appropriately.

Thoughts?

..wayne..
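
To make the "out of date" check above a bit more concrete, here is a rough C sketch of how a cached inode_map row might be validated against a fresh stat() before its checksum is trusted. Everything in it is an assumption for illustration: the struct cached_inode layout and the function names cached_inode_is_fresh() and cached_md5_if_valid() are hypothetical and are not existing rsync code. It also assumes the disk_id/devno mapping has already been resolved by the caller, so only the per-inode fields are compared.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical in-memory copy of one inode_map row (illustrative only). */
struct cached_inode {
    int64_t ino;
    int64_t size;
    int64_t mtime;
    int64_t ctime;
    bool have_md5;            /* false when the md5 column is NULL */
    unsigned char md5[16];
};

/* True if the cached row still matches what stat() just reported.
 * Any difference in ino, size, mtime, or ctime marks the row as stale. */
bool cached_inode_is_fresh(const struct cached_inode *c, const struct stat *st)
{
    return c->ino == (int64_t)st->st_ino
        && c->size == (int64_t)st->st_size
        && c->mtime == (int64_t)st->st_mtime
        && c->ctime == (int64_t)st->st_ctime;
}

/* Return the cached MD5, or NULL when the caller has to recompute it
 * (no checksum stored, or the row no longer matches the file). */
const unsigned char *cached_md5_if_valid(const struct cached_inode *c,
                                         const struct stat *st)
{
    if (!c->have_md5 || !cached_inode_is_fresh(c, st))
        return NULL;
    return c->md5;
}
```

A caller along the lines of db_stat() would presumably run this after its own lstat()/stat() call, and on a NULL return would recompute the checksum and write the refreshed size, mtime, ctime, and checksum back to the DB.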