On 5/12/10 9:44 AM, Jason Wood wrote:
> Hi,

Welcome!
> Suppose I have a file of 100GB and 2 storage nodes each with 75GB
> available, will I be able to store the file or does it have to fit
> within the realms of a single node?

I think Tahoe does what you want here. What matters is the size of the
shares that Tahoe generates, not the size of the original file, and you
have control over those shares. Whether the file can be stored depends
upon how you set the encoding parameters: you get to choose the tradeoff
between expansion (how much space gets used) and reliability.

The default setting is "3-of-10" (very conservative), which means the
file is encoded into 10 shares, and any 3 are sufficient to reconstruct
it. That means each share is 1/3rd the size of the original file (plus a
small overhead, less than 0.5% for large files). For your 100GB file,
that means 10 shares, each about 33GB in size, which would not fit
(Tahoe could place two shares on each server, but it couldn't place all
ten, so the upload would return an error).

But you could set the encoding to 2-of-2, which would give you two 50GB
shares, and Tahoe would happily put one share on each server. That would
store the file, but it wouldn't give you any redundancy: a failure of
either server would prevent you from recovering the file.

You could also set the encoding to 4-of-6, which would generate six 25GB
shares and put three on each server. This would still be vulnerable to
either server being down (since neither server holds enough shares to
reconstruct the whole file by itself), but it would tolerate damage to
an individual share (if one share file were damaged, five other shares
remain, and we only need four). Many disk errors affect only a single
file, so there's some benefit to this even though you're still
vulnerable to a full disk/server failure.

So you can set the encoding parameters (in the "tahoe.cfg" file) to
whatever suits your goals.

> Do I need to shutdown all clients/servers to add a storage node?

Nope.
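The share arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not Tahoe's real server-selection code: `share_size` and `can_place` are hypothetical helpers, the small encoding overhead is ignored, and shares are assumed to be spread as evenly as possible across servers.

```python
import math

def share_size(file_size_gb, k):
    """Each share is roughly 1/k of the file (encoding overhead ignored)."""
    return math.ceil(file_size_gb / k)

def can_place(file_size_gb, k, n, num_servers, space_per_server_gb):
    """Can n shares, each ~1/k of the file, fit on the given servers?
    Assumes shares are spread as evenly as possible."""
    size = share_size(file_size_gb, k)
    shares_on_fullest = math.ceil(n / num_servers)
    return shares_on_fullest * size <= space_per_server_gb

# 100GB file, two servers with 75GB free each:
print(share_size(100, 3))            # 34 (roughly a third of the file)
print(can_place(100, 3, 10, 2, 75))  # False: 5 shares/server needs 170GB
print(can_place(100, 2, 2, 2, 75))   # True: one 50GB share per server
print(can_place(100, 4, 6, 2, 75))   # True: three 25GB shares per server
```

This makes the tradeoff concrete: smaller k means bigger shares (less expansion headroom), while a larger n-minus-k gap buys redundancy at the cost of total space.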
You can add or remove clients or servers anytime you like. The central
"Introducer" is responsible for telling clients and servers about each
other, and it acts as a simple publish-subscribe hub, so everything is
very dynamic. Clients re-evaluate the list of available servers each
time they do an upload.

This is great for long-term servers, but it can be a bit surprising in
the short term: if you've just started your client and upload a file
before it has had a chance to connect to all of the servers, your file
may be stored on a small subset of the servers, with less reliability
than you wanted. We're still working on a good way to prevent this while
retaining the dynamic server-discovery properties (probably a
client-side configuration statement listing all the servers you expect
to connect to, so the client refuses to upload until it has connected to
at least those). A list like that might require a client restart
whenever you add to it, but we could implement such a feature without a
restart requirement too.

> Finally, I see I can link files on the cluster (very useful!), does this
> make an actual link or copy the data? Does the target file have to
> reside on the same storage node as the source file? I think I know the
> answer to this but just want to clarify.

It's just a link. From the point of view of the directories, each file
just lives "in the cloud" and is not associated with any particular
storage nodes: each file has a "filecap" string, and directories are
just lists of filecaps. Each file has shares on a set of storage nodes
(a different set for each file). Directories are just special kinds of
files, so directories also have shares on a set of storage nodes. The
storage nodes used for a directory are unrelated to the ones used for
the files within it. "Copying" an immutable file from one directory to
another just creates a second link to that file.
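The link model can be sketched with a toy data structure. Everything here is hypothetical for illustration: real Tahoe filecaps are self-authenticating cryptographic capability strings and directories are themselves files stored in the grid, not plain Python dicts.

```python
import secrets

# Toy model: the "grid" maps filecaps to stored data, and each
# directory is just a mapping from child name to filecap.
grid = {}

def upload(data):
    """Store data in the grid and return a fresh filecap (toy version)."""
    filecap = "URI:toy:" + secrets.token_hex(8)
    grid[filecap] = data
    return filecap

home = {}    # one directory
backup = {}  # another directory

# "Uploading a file to a directory" is two steps: upload the data to
# the grid, then add the resulting filecap to the directory.
cap = upload(b"report contents")
home["report.txt"] = cap

# "Copying" an immutable file just links the same filecap into another
# directory; no file data moves and the original is untouched.
backup["report.txt"] = home["report.txt"]
assert backup["report.txt"] == cap   # same object, second link

# Copying a *mutable* file would need a genuinely new object, so the
# data itself gets copied and a new filecap is created:
new_cap = upload(grid[cap])
assert new_cap != cap
```

Note that `home` and `backup` share the immutable file without knowing (or caring) which storage nodes hold its shares.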
In fact, "uploading a file to a directory" actually has two steps: first
the file is uploaded into the grid, which returns a filecap; second, the
directory is modified (by adding the new filecap to its list). Copying
from one directory to another just performs the second step (modifying
the target directory), and the original file isn't touched.

Of course, copying a *mutable* file is different, because the copy must
be a new object (changing the copy should not cause the original to
change). In that case the data itself must be copied. We don't yet
support efficient large mutable files, and Tahoe uses immutable files by
default, so in practice you don't run into this very much.

Hope that helps! Let us know how it goes!

cheers,
 -Brian
_______________________________________________
tahoe-dev mailing list
tahoe-dev@allmydata.org
http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev