I hate this email client. Hate, hate, hate. Thank you, Microsoft, for making my life that little bit worse.

Anyway, you can't rely on the media files being stored in a filesystem. They could be stored in a database or in object storage, so *sync is not available. I don't know how the media files are backed up. If you only want the originals, that's a lot less than 12TB (or whatever the current number for thumbs+origs is).

If you just want to fetch a tarball, wget or curl will automatically restart a dropped connection and supply a Range header if the server supports it. If you want a ready-to-use format, then you're going to need a client which can write individual files. But it's not particularly efficient to stream 120B files over separate TCP connections; you'd have to have a client which can do TCP session reuse. No matter how you cut it, you're looking at a custom client. But there's no need to invent a new download protocol or stream format. That's why I suggest tarball and Range. Standards ... they're not just for breakfast.
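To make that concrete, here's a rough sketch of such a client in Python. It's untested, and the snapshot URL and output path are made up, but it shows the idea: track your offset in the tar bytestream, reconnect with a Range request when the connection drops, and split files out to disk as they arrive.

import http.client
import tarfile
import urllib.request

class ResumingStream:
    """File-like reader that resumes over HTTP Range after a dropped connection."""

    def __init__(self, url):
        self.url = url
        self.offset = 0          # how far into the overall bytestream we are
        self.resp = None

    def _connect(self):
        # Ask the server to restart the (snapshot-stable) tarball at our offset.
        req = urllib.request.Request(self.url,
                                     headers={"Range": "bytes=%d-" % self.offset})
        self.resp = urllib.request.urlopen(req)

    def read(self, size=-1):
        while True:
            if self.resp is None:
                self._connect()
            try:
                data = self.resp.read(size)
            except (OSError, http.client.HTTPException):
                self.resp = None  # connection died; reconnect from self.offset
                continue
            self.offset += len(data)
            return data

    def close(self):
        if self.resp is not None:
            self.resp.close()

# Hypothetical endpoint serving one tar stream built from a frozen snapshot.
stream = ResumingStream("https://example.org/media-snapshot.tar")
with tarfile.open(fileobj=stream, mode="r|") as tar:  # "r|" = pure streaming, no seek
    for member in tar:                     # members arrive in archive order
        tar.extract(member, path="media")  # write each file out as it arrives

The only server-side requirement is that the tarball is generated deterministically from the frozen snapshot, so that byte N means the same thing on every request.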
________________________________________
From: [email protected] [[email protected]] on behalf of Peter Gervai [[email protected]]
Sent: Monday, August 15, 2011 5:40 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] forking media files

On Mon, Aug 15, 2011 at 18:40, Russell N. Nelson - rnnelson <[email protected]> wrote:
> The problem is that 1) the files are bulky,

That's expected. :-)

> 2) there are many of them, 3) they are in constant flux,

That is not really a problem: since there are many of them, statistically most of them are not in flux.

> and 4) it's likely that your connection would close for whatever reason
> part-way through the download.

I don't believe I forgot to mention zsync/rsync. ;-)

> Even taking a snapshot of the filenames is dicey. By the time you finish,
> it's likely that there will be new ones, and possible that some will be
> deleted. Probably the best way to make this work is to 1) make a snapshot of
> files periodically,

Since I've been told they're backed up, such a snapshot should naturally already exist.

> 2) create an API which returns a tarball using the snapshot of files that
> also implements Range requests.

I would much prefer a ready-to-use format to a tarball, not to mention that it's pretty resource-consuming to create a tarball just for that.

> Of course, this would result in a 12-terabyte file on the recipient's host.
> That wouldn't work very well. I'm pretty sure that the recipient would need
> an http client which would 1) keep track of the place in the bytestream and
> 2) split out files and write them to disk as separate files. It's possible
> that a program like getbot already implements this.

I'd make a snapshot without tar, especially because partial transfers aren't possible with tar.

--
byte-byte,
    grin

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
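For comparison, the "ready-to-use format" route Peter prefers means fetching files individually, which is only tolerable with connection reuse. A rough Python sketch, with a made-up host and a hard-coded list standing in for a real snapshot listing:

import http.client
import os

HOST = "upload.example.org"                       # stand-in for the real media host
paths = ["/a/ab/Example.jpg", "/c/cd/Other.png"]  # would come from a snapshot listing

conn = http.client.HTTPSConnection(HOST)  # one TCP session, reused via keep-alive
for path in paths:
    conn.request("GET", path)
    resp = conn.getresponse()
    body = resp.read()        # drain the response so the connection can be reused
    if resp.status == 200:
        dest = os.path.join("media", path.lstrip("/"))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        with open(dest, "wb") as f:
            f.write(body)
    # real code would also reconnect if the server closes the connection
conn.close()

Even with keep-alive there is per-request overhead on every small file, which is the efficiency argument for tarball-plus-Range above.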
