At 2010-07-13 15:36 (-0400), Kyle Markley wrote:
> Hey developers,
>
> I've been putting my 4-node grid through some stress and I've encountered
> a few problems I wanted to report.
>
> 1) Sometimes I get backup operations failing like this:

The part of Tahoe-LAFS that is failing is the part that figures out
where shares should be placed. What that does is ask each storage server
if it will eventually hold some of the shares that will be generated
when the file is encoded. The storage server will check its available
space to make sure that it can hold all of the shares that it is asked
to hold, and will refuse to hold shares that it does not have space for.
It then tells the client which shares it will hold, and which it is
already holding.

The upload code in the client concluded that your storage server was
full because the storage server refused to hold one or more of the
shares that the client asked it to. This doesn't necessarily mean that
the storage server is actually full (so maybe that error message should
be reworded to say "of which 1 placed none due to the server not having
enough free space", or something like that), only that the storage
server is unable to accept a share of the size that your upload would
generate. (I have opened bug #1116 [1] for the error message)

From the error message, and from a message you sent to the list before,
I gather that you're using 2-of-4 encoding. Is that right? If so, each
share generated from a particular source file will be about half the
size of the source file. Does this happen with any particular files? If
not, and if you notice it happening again, compare the source file size
to the amount of free space available to Tahoe-LAFS on your storage
servers -- if one of the servers has less free space available to
Tahoe-LAFS than about half the size of the source file, then the storage
server is probably right to reject the share, and the client is probably
right to abort the upload.

> This error report is incorrect -- all of the storage nodes show on their
> status pages that they are still accepting new shares! Further, I've seen
> that if I keep trying to restart the backup, the storage situation degrades
> until eventually it says that all 4 shares couldn't be placed due to the
> server being full. If I restart the tahoe node trying to run the backup,
> this problem goes away, at least for a while.

When a storage server accepts responsibility for a share during peer
selection, it makes a placeholder file of the same size as it was asked
to store. This means that the new share will be accounted for in future
space accounting even if it hasn't been written yet. Unfortunately, it
seems that the peer selection code doesn't tell the storage server that
it won't be using the space that it allocated earlier when it fails, so
the storage server fills up a little bit every time you try and fail to
upload. You notice that it works again because the unused share file
gets deleted (I think) when you restart the Tahoe-LAFS node, since the
storage server at that point notices that you've disconnected and
deletes the share file without being told to. This is a bug, and I've
opened #1117 [2] to fix it.

> 2) A long tahoe backup aborted with this error:
[...]
> assert len(buckets) == sum([len(peer.buckets) for peer in used_peers])

I've opened #1118 [3] to examine this issue. I think that that assert
isn't worded quite right, since it doesn't consider the possibility that
we might have allocated more buckets than we intend to use.

Thanks for the reports,
-- 
Kevan Carstensen

[1] http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1116
[2] http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1117
[3] http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1118
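
P.S. In case it helps when comparing file sizes against free space,
here is a rough toy model of the space accounting described above. It is
only a sketch: the names (approx_share_size, ToyStorageServer, abort,
client_disconnected) are made up for illustration and are not the real
Tahoe-LAFS storage server API, and the share size ignores hashes and
other per-share overhead. It assumes 2-of-4 encoding, so each share is
about half the source file.

```python
def approx_share_size(file_size, k):
    """Rough size of one share under k-of-N encoding (overhead ignored)."""
    return -(-file_size // k)  # integer ceiling division


class ToyStorageServer:
    """Toy model of the space accounting, NOT the real storage server."""

    def __init__(self, free_space):
        self.free_space = free_space
        self.placeholders = {}  # share number -> bytes reserved but unwritten

    def available(self):
        # Reserved placeholders count against free space immediately.
        return self.free_space - sum(self.placeholders.values())

    def allocate(self, sharenum, size):
        """Reserve room for a share; refuse if it does not fit."""
        if size > self.available():
            return False
        self.placeholders[sharenum] = size
        return True

    def abort(self, sharenum):
        """Release a reservation that will never be written."""
        self.placeholders.pop(sharenum, None)

    def client_disconnected(self):
        """Drop every unwritten placeholder, as seems to happen on restart."""
        self.placeholders.clear()


server = ToyStorageServer(free_space=1000)
size = approx_share_size(900, k=2)  # 450 bytes per share for a 900-byte file

# Two upload attempts that fail later and never call abort():
first = server.allocate(0, size)   # accepted, 550 bytes still available
second = server.allocate(1, size)  # accepted, 100 bytes still available

# The next attempt is refused even though nothing was actually written.
third = server.allocate(2, size)

# Restarting the client node releases the leaked placeholders.
server.client_disconnected()
after_restart = server.allocate(2, size)
```

In this model the server never reports itself as full in any global
sense; it only refuses a share that is larger than what remains after
earlier reservations. Calling abort() after each failed attempt would
keep the third allocation from being refused.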
