I hate this email client. Hate, hate, hate. Thank you, Microsoft, for making my life that little bit worse.

Anyway, you can't rely on the media files being stored in a filesystem. They could be stored in a database or in object storage, so *sync is not available. I don't know how the media files are backed up. If you only want the originals, that's a lot less than 12TB (or whatever the current number for thumbs+origs is).

If you just want to fetch a tarball, wget or curl will automatically restart a dropped connection and supply a Range header if the server supports it. If you want a ready-to-use format, then you're going to need a client which can write individual files. But it's not particularly efficient to stream 120B files over separate TCP connections; you'd have to have a client which can do TCP session reuse. No matter how you cut it, you're looking at a custom client. But there's no need to invent a new download protocol or stream format. That's why I suggest tarball and Range. Standards ... they're not just for breakfast.
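To make that concrete, here's a rough sketch of such a client in Python. It's untested, and the snapshot URL and output path are made up, but it shows the idea: track your offset in the tar bytestream, reconnect with a Range request when the connection drops, and split files out to disk as they arrive.

import http.client
import tarfile
import urllib.request

class ResumingStream:
    """File-like reader that resumes over HTTP Range after a dropped connection."""

    def __init__(self, url):
        self.url = url
        self.offset = 0          # how far into the overall bytestream we are
        self.resp = None

    def _connect(self):
        # Ask the server to restart the (snapshot-stable) tarball at our offset.
        req = urllib.request.Request(self.url,
                                     headers={"Range": "bytes=%d-" % self.offset})
        self.resp = urllib.request.urlopen(req)

    def read(self, size=-1):
        while True:
            if self.resp is None:
                self._connect()
            try:
                data = self.resp.read(size)
            except (OSError, http.client.HTTPException):
                self.resp = None  # connection died; reconnect from self.offset
                continue
            self.offset += len(data)
            return data

    def close(self):
        if self.resp is not None:
            self.resp.close()

# Hypothetical endpoint serving one tar stream built from a frozen snapshot.
stream = ResumingStream("https://example.org/media-snapshot.tar")
with tarfile.open(fileobj=stream, mode="r|") as tar:  # "r|" = pure streaming, no seek
    for member in tar:                     # members arrive in archive order
        tar.extract(member, path="media")  # write each file out as it arrives

The only server-side requirement is that the tarball is generated deterministically from the frozen snapshot, so that byte N means the same thing on every request.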
________________________________________
From: [email protected] [[email protected]] on behalf of Peter Gervai [[email protected]]
Sent: Monday, August 15, 2011 5:40 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] forking media files

On Mon, Aug 15, 2011 at 18:40, Russell N. Nelson - rnnelson <[email protected]> wrote:
> The problem is that 1) the files are bulky,

That's expected. :-)

> 2) there are many of them, 3) they are in constant flux,

That is not really a problem: since there are many of them, statistically most of them are not in flux.

> and 4) it's likely that your connection would close for whatever reason
> part-way through the download.

I don't believe I forgot to mention zsync/rsync. ;-)

> Even taking a snapshot of the filenames is dicey. By the time you finish,
> it's likely that there will be new ones, and possible that some will be
> deleted. Probably the best way to make this work is to 1) make a snapshot of
> files periodically,

Since I've been told they're backed up, such a snapshot should naturally already exist.

> 2) create an API which returns a tarball using the snapshot of files that
> also implements Range requests.

I would much prefer a ready-to-use format to a tarball, not to mention that it's pretty resource-consuming to create a tarball just for that.

> Of course, this would result in a 12-terabyte file on the recipient's host.
> That wouldn't work very well. I'm pretty sure that the recipient would need
> an http client which would 1) keep track of the place in the bytestream and
> 2) split out files and write them to disk as separate files. It's possible
> that a program like getbot already implements this.

I'd make a snapshot without tar, especially because partial transfers aren't possible with tar.

--
byte-byte,
    grin

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
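For comparison, the "ready-to-use format" route Peter prefers means fetching files individually, which is only tolerable with connection reuse. A rough Python sketch, with a made-up host and a hard-coded list standing in for a real snapshot listing:

import http.client
import os

HOST = "upload.example.org"                       # stand-in for the real media host
paths = ["/a/ab/Example.jpg", "/c/cd/Other.png"]  # would come from a snapshot listing

conn = http.client.HTTPSConnection(HOST)  # one TCP session, reused via keep-alive
for path in paths:
    conn.request("GET", path)
    resp = conn.getresponse()
    body = resp.read()        # drain the response so the connection can be reused
    if resp.status == 200:
        dest = os.path.join("media", path.lstrip("/"))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        with open(dest, "wb") as f:
            f.write(body)
    # real code would also reconnect if the server closes the connection
conn.close()

Even with keep-alive there is per-request overhead on every small file, which is the efficiency argument for tarball-plus-Range above.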
