Re: superlifter design notes (was Re: Latest rZync release: 0.06)
On Fri, Jul 26, 2002 at 09:03:32AM -0400, Bennett Todd wrote:
> 2002-07-26-03:37:51 jw schultz:
> > All that matters is that we can represent the timestamps in a way
> > that allows consistent comparison, restoration and transfer.
>
> A very good statement indeed. There are complications, though. Some
> time representations used by computer systems have ambiguities: two
> different times that are represented with the same number, or two
> different representations (created at different times) that actually
> end up representing the same time.
> [...]
> > we can pick as an epoch any time in recorded human history. I don't
> > feel qualified to impose any epoch myself. I would be inclined to
> > stick with the UNIX epoch for the sake of convenience.
>
> Which Unix epoch? 1970-01-01 00:00:00? 1970-01-01 00:00:10, and
> changing every time they issue a new leap second?

Hey, the whole leap second issue is a matter for the libraries. If the library is wrong then the system time might be off by 10 seconds or so to compensate. We need not care. It doesn't matter as long as on the same platform it is the same for converting back and forth.

We aren't determining whether a file on one machine or filesystem is newer or older than the corresponding file at the other end. We are determining whether it is the same or not (modulo precision). Newer or older are immaterial unless the system clocks are and have always been in perfect sync. Hey, some systems will be running with a timezone offset of 0 and the clock set to localtime.

Conversion with any other time representation should be a matter of t * scale + offset. The trick is that offset.
> Given the different timekeeping systems in use, you can't correctly
> translate from one to another over a range of dates extending over
> years unless you either have a leap-second table of your own and
> convert to an absolute time format, or else you choose something like
> ISO 8601, and use local routines on each platform to convert to and
> from YYYY-MM-DD HH:MM:SS in UTC, recognizing that SS can exceed 59
> when there are leap-seconds, and that sometimes, converting back to a
> machine's internal representation, you may have to fudge for that if
> the local conversion routines don't know about leap seconds.

We only need to deal with YMD... time if that is what the system uses. POSIX platforms do not. We are talking about conversion and comparison between binary values where leap-seconds don't matter.

> TAI has the advantage that while various platforms have troubles
> getting to and from it, those have often been solved by other people
> (djb for Unix systems), and once you get to TAI you know where you're
> at :-)

I don't wish to disparage TAI but i've yet to see any pragmatic reason why we should use it in this context. If you are aware of one please tell us. Forget the advertising; tell us the technical details of TAI and why, for the purposes of file tree synchronization, TAI is preferable to something more closely related to the most-common native form. I have better things to do than spelunk some obscure library implementing a time format not native to any platform.

What is the actual format of TAI? The docs you point to talk of a structure and two packed formats but do not define these. TAI may be wonderful but not suitable for this purpose.

-- 
J.W. Schultz  Pegasystems Technologies
email address: [EMAIL PROTECTED]
	Remember Cernan and Schmitt

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
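The "t * scale + offset" conversion jw describes can be sketched in C. The function names and the millisecond wire format below are illustrative only; the 2208988800-second offset is one real example of such an offset (it is the difference between the NTP epoch of 1900 and the Unix epoch of 1970).

```c
#include <stdint.h>

/* Illustration of the "t * scale + offset" idea: converting a Unix
   time (seconds since 1970-01-01) to other linear representations.
   Function names and the "wire millisecond" format are hypothetical. */

/* Seconds between the NTP epoch (1900-01-01) and the Unix epoch. */
#define NTP_UNIX_OFFSET 2208988800LL

static int64_t unix_to_ntp_seconds(int64_t unix_sec)
{
    return unix_sec + NTP_UNIX_OFFSET;   /* scale = 1, offset = 2208988800 */
}

static int64_t unix_to_wire_millis(int64_t unix_sec)
{
    return unix_sec * 1000;              /* scale = 1000, offset = 0 */
}

static int64_t wire_millis_to_unix(int64_t millis)
{
    return millis / 1000;                /* inverse conversion truncates */
}
```

As jw says, the scale is trivial; the offset is where platform-specific epochs (and any leap-second fudging the local libraries do) come in.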
Re: superlifter design notes (was Re: ...
From: jw schultz [EMAIL PROTECTED]

> On Fri, Jul 26, 2002 at 09:03:32AM -0400, Bennett Todd wrote:
> > 2002-07-26-03:37:51 jw schultz:
> > > All that matters is that we can represent the timestamps in a way
> > > that allows consistent comparison, restoration and transfer.
> >
> > A very good statement indeed. There are complications, though. Some
> > time representations used by computer systems have ambiguities: two
> > different times that are represented with the same number, or two
> > different representations (created at different times) that actually
> > end up representing the same time.

There is potential loss of precision in converting timestamps.

A program serving source files for distribution does not need to be that concerned with preserving exact file attributes, but may need to track suggested file attributes for the various client platforms.

A program that is replicating for backup purposes must not have any loss of data, including any operating-system-specific file attributes.

That is why I posted previously that they should be designed as two separate but related programs. Each application has unique requirements that needlessly complicate an application that does both.

-John
[EMAIL PROTECTED]
Personal Opinion Only
Re: superlifter design notes (was Re: ...
On 27 Jul 2002, John E. Malmberg [EMAIL PROTECTED] wrote:
> A program serving source files for distribution does not need to be
> that concerned with preserving exact file attributes, but may need to
> track suggested file attributes for the various client platforms.
>
> A program that is replicating for backup purposes must not have any
> loss of data, including any operating-system-specific file attributes.
>
> That is why I posted previously that they should be designed as two
> separate but related programs.

I'm not sure that the application space for rsync really divides neatly into two parts like that. Can you expand a bit more on how you think they would be used?

-- 
Martin
Re: superlifter design notes (was Re: Latest rZync release: 0.06)
I'm inclined to agree with jw that truthfully representing time and leap seconds is a problem for the operating system, not for us. We just need to be able to accurately represent whatever it tells us, without thinking very much about the meaning.

Somebody previously pointed out that timestamp precision is not a property of the kernel, but rather of the filesystem on which the files are stored. In general there may be no easy way to determine it ahead of time: you can (if you squint) imagine a network filesystem with nanosecond resolution that's served by something with rather less. I suspect the only way to know may be to set the time and then read it back.

You can also imagine that in the next few years some platform may change to a format that accurately represents leap seconds, whether by TAI or something else. (I'm not sure if I'd put money on it.) Presumably that machine's POSIX interface will do a lossy conversion back to regular Unix time to support old apps. If we used only that lossy interface, then when replicating between two such machines, files whose mtime happened to fall on a leap second would be inaccurate. That would contradict our goal of preserving precision as much as possible, even if we can't tell whether it is accurate. Ideally, we would use the native interface so as to be able to get the machine's full precision, and that would imply something like TAI internally.

Whether this is worth doing depends on whether you reckon any platform will actually move to a filesystem that can represent leap seconds. As jw says, practically all machines have clocks with more than one second of inaccuracy, so handling leap seconds is not practically important. Certainly they might use it within their ntp code, but I don't know if they'll expose it to applications.

> What is the actual format of TAI?

64-bit signed seconds-since-1970, plus optionally nanoseconds, plus optionally attoseconds. (There's something rather fascinating about using attoseconds.)
To be fair, it seems that TAI is an international standard, and djb just made up libtai, not the whole thing. (Mind you, from some standards I've seen, that would be a good reason to walk briskly away.)

One drawback, which is not really djb's fault, is that if you inadvertently use a TAI value as a Unix value it will be about 10 seconds off -- almost, but not quite, correct. I'd hate to have bugs like that, but presumably they can be avoided by using the interface correctly.

On the other hand, signed 32-bit Unix time is clearly running out, and if we have to use something, perhaps it might as well be TAI. I would kind of prefer just a single 64-bit quantity measured in (say) nanoseconds, and compromise by not being able to time the end of the universe, but I don't think I care enough to invent a new standard.

-- 
Martin
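The TAI timestamp layout Martin describes (64-bit signed seconds, optionally refined by nanoseconds and attoseconds) might look roughly like this in C. This is a sketch, not libtai's actual API; the struct and function names here are made up for illustration.

```c
#include <stdint.h>

/* Sketch of a TAI-style timestamp as described above: a 64-bit signed
   second count since 1970, with optional nanosecond and attosecond
   refinements. Illustrative only; not libtai's real structures. */
struct tai_stamp {
    int64_t  sec;   /* TAI seconds relative to the 1970 epoch */
    uint32_t nano;  /* 0 .. 999999999, optional refinement */
    uint32_t atto;  /* 0 .. 999999999, optional further refinement */
};

/* Compare two stamps; returns <0, 0, >0 like strcmp. */
static int tai_cmp(struct tai_stamp a, struct tai_stamp b)
{
    if (a.sec  != b.sec)  return a.sec  < b.sec  ? -1 : 1;
    if (a.nano != b.nano) return a.nano < b.nano ? -1 : 1;
    if (a.atto != b.atto) return a.atto < b.atto ? -1 : 1;
    return 0;
}
```

One nicety of this layout is that "same or not (modulo precision)" comparisons, as jw framed them earlier in the thread, reduce to comparing only as many fields as both sides can actually represent.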
Re: superlifter design notes (was Re: ...
Martin Pool wrote:
> On 27 Jul 2002, John E. Malmberg [EMAIL PROTECTED] wrote:
> > A program serving source files for distribution does not need to be
> > that concerned with preserving exact file attributes, but may need
> > to track suggested file attributes for the various client platforms.
> >
> > A program that is replicating for backup purposes must not have any
> > loss of data, including any operating-system-specific file
> > attributes.
> >
> > That is why I posted previously that they should be designed as two
> > separate but related programs.
>
> I'm not sure that the application space for rsync really divides
> neatly into two parts like that. Can you expand a bit more on how you
> think they would be used?

Well remember, I am on the outside looking in, and of course I could be missing things. :-) I did post this previously, but the message apparently got buried in the large number of messages posted that day.

The two uses for rsync that I am seeing discussed on this list are:

Backup: A low-overhead and possibly long-distance backup of disks or directories. In the case of a backup, usually it is the same platform, or one that is very close to being the same. Also it is important that security information and file attributes all be properly maintained. The mapping of security information is platform specific, so this is going to be an ongoing problem. It is also critical that timestamps be maintained. Since this is usually the same or a closely similar platform, a VFS layer can be used to store and retrieve attributes. No special attribute files or host-based translations should be needed. The downsides are that, as far as I can see, there are no portable standard APIs to retrieve the security information, and as more variants are discovered, it may be hard to work them in for backward compatibility. Because you are distributing an arbitrary set of directories, it is usually not permitted to add files to assist in the transfer. This also seems to be an addition to rsync's original mission.

Also, using something like rsync for backup of binary files has the potential for undetected corruption. While the checksumming algorithm is good, it is not guaranteed to be perfect. And no, I do not want to recycle the old arguments about this. With a text file, the set of possible values is restricted enough that it is unlikely that the checksum method would fail, and if it did, the resulting corruption is more easily detected.

File Distribution: A low-overhead method of keeping local source directory trees synchronized with remote distributions. In this case, strict binary preservation of time stamps is not needed and maintaining security attributes is usually not desired. So that is two problems eliminated. What rsync does not do now is differentiate between text files and binary files. A client that uses a different internal format for text files than binary files needs to do extra work. And unless the server tells it what type of file is coming, it must guess based on the filename. But you are specifically distributing a special tree of files in this case, not an arbitrary directory. That gives you the ability to add special attribute files to assist in the transfer.

So while the two uses have a lot in common, there are significant differences, and having one program attempt to do both can lead to greater complexity.

-John
[EMAIL PROTECTED]
Personal Opinion Only
Re: superlifter design notes (was Re: Latest rZync release: 0.06)
2002-07-21-04:12:55 jw schultz:
> On Thu, Jul 11, 2002 at 07:06:29PM +1000, Martin Pool wrote:
> > 6. No arbitrary limits: this is related to scalability. Filesizes
> > and times should be 64-bit; names should be arbitrarily long.
>
> File sizes, yes. Times, no. Unsigned 32-bit integers will last us for
> another 90 years. I suspect that by the time we need 64-bit timestamps
> the units will be milliseconds. I just don't see the need to waste an
> extra 4 bytes per timestamp per file.

If bandwidth is of any interest at all, compress; any compression algorithm will have no trouble making hay with bulky, redundant timestamp formats. Rather than trying to optimize the protocol for bandwidth without compression, wouldn't it be better to try to future-proof it in the face of changing time representations across systems?

If I were designing a protocol at this level, I'd be using TAI; there's 64-bit time with 1-second resolution covering pretty much all time (more or less, depending on the whimsies of cosmologists :-); there are also longer variations with finer resolution. TAI, with appropriately fine resolution, should be able to represent any time that any other representation can, closer than anyone could care. TAI can be converted to other formats with more or less pain, depending on how demented the other formats are; djb's libtai is a reasonable starting point. <URL:http://cr.yp.to/time.html> has links to some pages discussing time formats.

In short, though, "time since the epoch" has a complication: leap-seconds. Either you end up having to move the epoch every time you bump into a leap-second, thereby redefining all times before that; or else you have duplicate times, where two different seconds have the same representation in seconds-since-the-epoch. Well, there's a third possibility: you could also let the current time drift further and further from what everybody else is using, but nobody seems to go for that one.

-Bennett
Re: superlifter design notes (was Re: Latest rZync release: 0.06)
On 21 Jul 2002, jw schultz [EMAIL PROTECTED] wrote: .From what i can see rsync is very clever. The biggest problems i see with its inability to scale for large trees, a little bit of accumulated cruft and featuritis, and excessively tight integration. Yes, I think that's basically the problem. One question that may (or may not) be worth considering is to what degree you want to be able to implement new features by changing only the client. So with NFS (I'm not proposing we use it, only an example), you can implement any kind of VM or database or whatever on the client, and the server doesn't have to care. The current protocol is just about the opposite: the two halves have to be quite intimately involved, so adding rename detection would require not just small additions but major surgery on the server. What i am seeing is a Multi-stage pipeline. Instead of one side driving the other with comand and response codes each side (client/server) would set up a pipeline containing those components that are needed with the appropriate plumbing. Each stage would largly look like a simple utility reading from input; doing one thing; writing to output, error and log. The output of each stage is sent to the next uni-directionally with no handshake required. So it's like a Unix pipeline? (I realize you're proposing pipelines as a design idea, rather than as an implementation.) So, we could in fact prototype it using plain Unix pipelines? That could be interesting. Choose some files: find ~ | lifter-makedirectory /tmp/local.dir Do an rdiff transfer of the remote directory to here: rdiff sig /tmp/local.dir /tmp/local.dir.sig scp /tmp/local.dir.sig othermachine:/tmp ssh othermachine 'find ~ | lifter-makedirectory | rdiff delta /tmp/local.dir.sig - ' /tmp/remote.dir.delta rdiff patch /tmp/local.dir /tmp/remote.dir.delta /tmp/remote.dir For each of those files, do whatever for file in lifter-dirdiff /tmp/local.dir /tmp/remote.dir do ... 
done Of course the commands I've sketched there don't fix one of the key problems, which is that of traversing the whole directory up front, but you could equally well write them as a pipeline that is gradually consumed as it finds different files. Imagine lifter-find-different-files /home/mbp/ othermachine:/home/mbp/ | \ xargs -n1 lifter-move-file (I'm just making up the commands as I go along; don't take them too seriously.) That could be very nice indeed. I am just a little concerned that a complicated use of pipelines in both directions will make us prone to deadlock. It's possible to cause local deadlocks if e.g. you have a child process with both stdin and stdout connected to its parent by pipes. It gets potentially more hairy when all the pipes are run through a single TCP connection. I don't think that concern rules this design out by any means, but we need to think about it. One of the design criteria I'd like to add is that it should preferably be obvious by inspection that deadlocks are not possible. timestamps should be represented as seconds from Epoch (SuS) as unsigned 32 int. It will be 90 years before we exceed this by which time the protocol will be extended to use uint64 for milliseconds. I think we should go to milliseconds straight away: if I remember correctly, NTFS already stores files with sub-second precision, and some Linux filesystems are going the same way. A second is a long time in modern computing! (For example, it's possible for a command started by Make to complete in less than a second, and therefore apparently not change a timestamp.) I think there will be increasing pressure for sub-second precision in much less than 90 years, and it would be sensible for us to support it from the beginning. The Java file APIs, for example, already work in nanoseconds(?). Transmitting the precision of the file sounds good. I think by default user and groups only be handled numerically. 
I think by default we should use names, because that will be least surprising to most people. I agree we need to support both. Names are not universally unique, and need to be qualified, by a NIS domain or NT domain, or some other means. I want to be able to say: map MAPOOL2@ASIAPAC - [EMAIL PROTECTED] - [EMAIL PROTECTED] when transferring across machines. We probably cannot assume UIDs are any particular length; on NT they correspond to SIDs (?) which are 128-bit(?) things, typically represented by strings like S1-212-123-2323-232323 So on the whole I think I would suggest following NFSv4 and just using strings, with the intreptation of them up to the implementation, possibly with guidance from the admin. When textual names are used a special chunk in the datastream would specify a node+ID - name equivalency immediately before the first use of that number. It seems like in general there is a need to have
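The "transmit the precision along with the timestamp" idea discussed above could be sketched as follows. The struct and field names are hypothetical; the point is just that a receiver can compare timestamps "modulo precision", as jw put it earlier in the thread.

```c
#include <stdint.h>

/* Hypothetical wire representation for a sub-second timestamp plus
   the granularity the originating filesystem can actually represent,
   so the receiver can compare timestamps at the coarser precision. */
struct wire_mtime {
    int64_t  millis;     /* milliseconds since the Unix epoch */
    uint32_t precision;  /* granularity in ms: 1000 = whole seconds */
};

/* Two timestamps match if they agree at the coarser of the two
   precisions; a 1 ms filesystem and a 1 s filesystem can then agree
   that the "same" file has the same mtime. */
static int mtime_equal(struct wire_mtime a, struct wire_mtime b)
{
    uint32_t p = a.precision > b.precision ? a.precision : b.precision;
    return a.millis / p == b.millis / p;
}
```

For example, an NTFS-style sub-second mtime of 1.9 s compared against a whole-second mtime of 1 s would be treated as equal, while 2.1 s would not.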
Re: superlifter design notes (was Re: Latest rZync release: 0.06)
On Mon, Jul 22, 2002 at 02:00:21PM +1000, Martin Pool wrote:
> On 21 Jul 2002, jw schultz [EMAIL PROTECTED] wrote:
> > From what i can see rsync is very clever. The biggest problems i
> > see with it are its inability to scale for large trees, a little
> > bit of accumulated cruft and featuritis, and excessively tight
> > integration.
>
> Yes, I think that's basically the problem.
>
> One question that may (or may not) be worth considering is to what
> degree you want to be able to implement new features by changing only
> the client. So with NFS (I'm not proposing we use it, only an
> example), you can implement any kind of VM or database or whatever on
> the client, and the server doesn't have to care. The current protocol
> is just about the opposite: the two halves have to be quite
> intimately involved, so adding rename detection would require not
> just small additions but major surgery on the server.
>
> > What i am seeing is a multi-stage pipeline. Instead of one side
> > driving the other with command and response codes, each side
> > (client/server) would set up a pipeline containing those components
> > that are needed with the appropriate plumbing. Each stage would
> > largely look like a simple utility reading from input; doing one
> > thing; writing to output, error and log. The output of each stage
> > is sent to the next uni-directionally with no handshake required.
>
> So it's like a Unix pipeline? (I realize you're proposing pipelines
> as a design idea, rather than as an implementation.)

I'm kinda, sorta proposing both. What i'm looking at is to keep each stage as simple as possible without sharing data structures with other stages. And that it should be possible to break/intercept the pipeline at any point.

> So, we could in fact prototype it using plain Unix pipelines?

For local-to-local, yes.

> That could be interesting.
>
> Choose some files:
>
>   find ~ | lifter-makedirectory > /tmp/local.dir
>
> Do an rdiff transfer of the remote directory to here:
>
>   rdiff sig /tmp/local.dir /tmp/local.dir.sig
>   scp /tmp/local.dir.sig othermachine:/tmp
>   ssh othermachine 'find ~ | lifter-makedirectory | rdiff delta /tmp/local.dir.sig -' > /tmp/remote.dir.delta
>   rdiff patch /tmp/local.dir /tmp/remote.dir.delta /tmp/remote.dir
>
> For each of those files, do whatever:
>
>   for file in `lifter-dirdiff /tmp/local.dir /tmp/remote.dir`
>   do
>       ...
>   done
>
> Of course the commands I've sketched there don't fix one of the key
> problems, which is that of traversing the whole directory up front,
> but you could equally well write them as a pipeline that is gradually
> consumed as it finds different files. Imagine:
>
>   lifter-find-different-files /home/mbp/ othermachine:/home/mbp/ | \
>       xargs -n1 lifter-move-file
>
> (I'm just making up the commands as I go along; don't take them too
> seriously.)
>
> That could be very nice indeed.

I'm not seriously suggesting that each stage be a separate utility, but there would be times when being able to treat them as such would be advantageous.

> I am just a little concerned that a complicated use of pipelines in
> both directions will make us prone to deadlock. It's possible to
> cause local deadlocks if e.g. you have a child process with both
> stdin and stdout connected to its parent by pipes. It gets
> potentially more hairy when all the pipes are run through a single
> TCP connection.

Where in+out are connected to the same parent (multiplexing TCP) that parent would have to use poll or select. In the ssh case it might be possible to use the port forwarding features of ssh or borrow the code from there. We should plagiarise where sensible. One key advantage of the looser coupling and of stages is that they are immune to changes in the plumbing.

> I don't think that concern rules this design out by any means, but we
> need to think about it.

Absolutely!

> One of the design criteria I'd like to add is that it should
> preferably be obvious by inspection that deadlocks are not possible.
>
> > timestamps should be represented as seconds from Epoch (SuS) as
> > unsigned 32 int. It will be 90 years before we exceed this by which
> > time the protocol will be extended to use uint64 for milliseconds.
>
> I think we should go to milliseconds straight away: if I remember
> correctly, NTFS already stores files with sub-second precision, and
> some Linux filesystems are going the same way. A second is a long
> time in modern computing! (For example, it's possible for a command
> started by Make to complete in less than a second, and therefore
> apparently not change a timestamp.) I think there will be increasing
> pressure for sub-second precision in much less than 90 years, and it
> would be sensible for us to support it from the beginning. The Java
> file APIs, for example, already work in nanoseconds(?). Transmitting
> the precision of the file sounds good.
>
> > I think by default user and groups only be handled numerically.
>
> I think by default
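The poll()-based multiplexing jw mentions, where a parent holds both the read and write ends of a child's plumbing and must not block on either one, can be sketched minimally. `ready_fd` is a made-up helper name, and the example demonstrates the mechanism on a single pipe rather than a real child process.

```c
#include <poll.h>
#include <unistd.h>

/* Sketch of poll()-based multiplexing: a parent connected to a child
   by two pipes must wait for whichever descriptor is ready rather
   than blocking on one of them (which is exactly how pipe deadlocks
   happen). Returns the descriptor that is ready, or -1 if none is. */
static int ready_fd(int rfd, int wfd)
{
    struct pollfd fds[2] = {
        { .fd = rfd, .events = POLLIN  },   /* child's output: readable? */
        { .fd = wfd, .events = POLLOUT },   /* child's input: writable? */
    };
    if (poll(fds, 2, 0) < 0)                /* timeout 0: just probe */
        return -1;
    if (fds[0].revents & POLLIN)  return rfd;  /* prefer draining output */
    if (fds[1].revents & POLLOUT) return wfd;
    return -1;                              /* nothing ready right now */
}
```

Draining the child's output before feeding it more input is the usual way to keep a full pipe buffer from wedging both processes.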
Re: superlifter design notes (was Re: Latest rZync release: 0.06)
People have proposed network-endianness, ASCII fields, etc. Here's a straw-man proposal on handling this for people to criticize, ignite, feed to horses, etc. I don't have any specific numbers to back it up, so take it with a grain of salt. Experiments would be pretty straightforward.

Swabbing to/from network endianness is very cheap. On 486s and higher it is a single inlined instruction, and I think it takes about one cycle. On non-x86 it is free. The cost is barely worth considering: if you are flipping words as fast as you can you will almost certainly be limited by memory bandwidth, not by the work of swapping them.

BER-style variable-length fields, on the other hand, are very intensive, because you need to look at the top bit, mask it, shift, continue. If you're going to use a protocol that difficult, I think you might as well use ASCII hex or decimal numbers.

All other things being equal, having a readable protocol is good. A little redundancy in the protocol can help make it readable and also help detect errors. For example, distcc's 4-char commands make it easy for humans to visually parse a packet, and they make errors in transmission almost always immediately cause an error. At the same time they're cheap to process -- it's just a uint32 compare.

Arguably we should use x86-endianness because it's the most common architecture at the moment, but I don't think the performance justifies using something non-standard. Anyhow, I would hope that if it gets off the ground, this protocol might still be in use in ten years, by which time x86 may no longer be dominant. Big-endian also has the minor advantage that it's easier to read in packet dumps.

Negotiated protocols are a bad idea because they needlessly multiply the test domain. Samba has to deal with Microsoft protocols which are in theory negotiated-endian, but in practice of course Microsoft never test anything but Intel, so BE support is broken and people writing non-x86 servers need to negotiate Intel endianness. Even assuming we're smarter than they are, I don't think we need to make our lives difficult in this way.

Lempel-Ziv is ideal for the exact case of compressing 0x0001 into a couple of bits. Even a very cheap compressor such as lzo (about half the speed of memcpy) will do well on that kind of case; presumably numbers like uint64 0, 1, 2, etc. will occur often in packet headers and get tightly compressed. I think it will probably deal with filenames for us too.

So, as a straw man:

 - use XDR-like network-endian 32 and 64 bit fields
 - keep all fields 4-byte aligned
 - make strings int32 length-preceded, and padded to a 4-byte boundary
 - don't worry about interning or compressing filenames, just send them
   as plain UTF-8 relative to a working directory
 - send things like usernames as strings too
 - make operation names (or whatever) be human-readable, either
   variable-length strings or 4-byte tokens that happen to be readable
   as ascii

-- 
Martin
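The straw-man string encoding above (an int32 length in network byte order, followed by the bytes, zero-padded to a 4-byte boundary) might look like this; `encode_string` is an illustrative name, not an API from any of the projects discussed.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>  /* htonl */

/* Encode a string as in the straw man: network-endian int32 length,
   then the bytes, zero-padded to a 4-byte boundary. Returns the
   number of bytes written, or -1 if the buffer is too small. */
static int encode_string(unsigned char *buf, size_t buflen, const char *s)
{
    uint32_t len = (uint32_t)strlen(s);
    size_t padded = (len + 3) & ~(size_t)3;   /* round up to multiple of 4 */
    if (buflen < 4 + padded)
        return -1;
    uint32_t be = htonl(len);                 /* network byte order */
    memcpy(buf, &be, 4);
    memcpy(buf + 4, s, len);
    memset(buf + 4 + len, 0, padded - len);   /* zero padding */
    return (int)(4 + padded);
}
```

Since every field stays 4-byte aligned and big-endian, a packet dump stays human-readable, which fits the readability goal above.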
superlifter design notes (was Re: Latest rZync release: 0.06)
I've put a cleaned-up version of my design notes up here:

  http://samba.org/~mbp/superlifter/design-notes.html

It's very early days, but (gentle :-) feedback would be welcome.

It has some comments on Wayne's rzync design, which on the whole looks pretty clever.

I don't have any worthwhile code specifically towards this yet, but I have been experimenting with the protocol ideas in distcc:

  http://distcc.samba.org/

I like the way it has worked out there: the protocol is simple and easy to understand, the bugs more or less found themselves, and it feels like I'm using TCP in a natural way -- all of these much more so than rsync at the moment. (Of course, the rsync problem is much more complicated.)

-- 
Martin