"Zooko O'Whielacronx" <zo...@zooko.com> writes:

> On the bright side, writing this letter has shown me a solution! Set M
> = the number of servers on your grid (while keeping K/M the same as
> it was before). So if you have 100 servers on your grid, set K=30,
> H=70, M=100 instead of K=3, H=7, M=10! Then there is no small set of
> servers which can fail and cause any file or directory to fail.
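A quick sanity check of the quoted claim: with M shares spread over M distinct servers, a file becomes unrecoverable once fewer than K shares survive, i.e. once M-K+1 of its servers fail. A minimal sketch (the helper name is mine, not Tahoe's):

```python
def min_fatal_failures(k, m):
    """Smallest number of share-holding servers whose simultaneous loss
    makes a K-of-M file unrecoverable (assuming one share per server)."""
    return m - k + 1

print(min_fatal_failures(3, 10))    # 8: losing any 8 of a file's 10 servers kills it
print(min_fatal_failures(30, 100))  # 71: no small subset of the grid can kill any file
```

So at K=3/M=10 on a 100-server grid, some unlucky set of 8 servers can destroy a file; at K=30/M=100 it takes 71 failures, which matches the "no small set of servers" argument above.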

There's a semi-related reliability issue: a grid of N servers, each
available most of the time, should allow a user to store, check, and
repair without a lot of churn.  So rather than setting M to N, I'd want
to (without precise justification) set it to 0.8N or something, and use
the not-yet-implemented facility to sprinkle more shares during repair
without invalidating the ones that are there.
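As a sketch of that parameter choice (the function and the 0.8 fraction are my illustration of the hedged suggestion above, keeping the K:H:M ratios fixed as Zooko proposes):

```python
def scaled_params(n_servers, k=3, h=7, m=10, fraction=0.8):
    """Scale the default K/H/M to a grid of n_servers, keeping the
    K:H:M ratios, with M set to ~80% of the grid rather than all of it
    (per the 'without precise justification' hedge above)."""
    new_m = max(1, round(fraction * n_servers))
    new_k = round(k * new_m / m)
    new_h = round(h * new_m / m)
    return new_k, new_h, new_m

print(scaled_params(100))  # (24, 56, 80): M well under N, so checks tolerate churn
```

The point of M < N is that a check can find all M shares even when a fifth of the grid happens to be unreachable.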

For those who didn't follow my ticket, the example is

  assume 3/7/10.

  10 shares of seqN are placed among 15 servers.

  a check finds 9 shares, because only 13 of the 15 are up.

  currently, this results in writing 10 shares of seqN+1 onto those 13.

  at the next check, a different 13 are up, and the cycle repeats.


Instead, if check synthesized the missing share and placed it, there
would be two copies of one share and still 10 reachable shares, and as
servers fade in and out the verify process can still succeed.  So for
a/b availability and M shares, we now have roughly M*b/a copies of the
M shares placed, and I think that's ok.
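A toy simulation of that repair policy, under my own assumptions (15 servers, 13 up at each check, one share per server initially, and repair that synthesizes only the missing shares rather than re-encoding seqN+1):

```python
import random

N_SERVERS = 15   # total servers in the grid (assumed for illustration)
M = 10           # distinct shares, e.g. a 3-of-10 encoding
UP = 13          # servers reachable during any given check (a=13, b=15)

# shares[s] = set of distinct share numbers stored on server s
shares = {s: set() for s in range(N_SERVERS)}
for i in range(M):
    shares[i].add(i)  # initial placement: one share per server

random.seed(1)
for _ in range(1000):
    up = set(random.sample(range(N_SERVERS), UP))
    reachable = set().union(*(shares[s] for s in up))
    # repair: synthesize only the missing shares, don't bump the seqnum
    for missing in set(range(M)) - reachable:
        target = next(s for s in up if missing not in shares[s])
        shares[target].add(missing)

copies = sum(len(v) for v in shares.values())
print(copies)  # total share-copies placed across the grid
```

In this model the copy count self-limits: once a share has 3 copies on distinct servers, no 2-server outage can hide it, so total placement stays far below the 10-fresh-shares-per-deficient-check cost of the current behavior, consistent with the M*b/a estimate above being a small multiple of M.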


_______________________________________________
tahoe-dev mailing list
tahoe-dev@tahoe-lafs.org
http://tahoe-lafs.org/cgi-bin/mailman/listinfo/tahoe-dev
