Hey all,

Not too long ago I spent some time trying to make the token
allocation algorithm the default.

I didn't foresee it, although it might be obvious to many of
you, but one corollary of the way the algorithm works (or, more
precisely, might not work) with multiple seeds or with simultaneous
multi-node bootstraps or decommissions is that a lot of dtests
start failing due to deterministic token conflicts. I wasn't
able to fix that by changing only ccm and the dtests, unless
careful, sequential node bootstrap was enforced. While users are
strongly advised to do exactly that in the real world, enforcing it
would have exploded dtest run times to unacceptable levels.
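
To illustrate the failure mode, here's a contrived sketch (this is not
the actual allocator code, just the deterministic property that bites):
because the allocation is a pure function of the ring state a node
observes, two nodes that bootstrap simultaneously off the same view
compute the same tokens.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeSet;

    class DeterministicAllocatorSketch {

        // Stand-in for the real algorithm: given the same view of the
        // ring, it always returns the same tokens (here, midpoints of
        // the widest gaps; the real logic differs but is equally
        // deterministic).
        static List<Long> allocate(TreeSet<Long> ring, int numTokens) {
            TreeSet<Long> tokens = new TreeSet<>(ring);
            List<Long> picked = new ArrayList<>();
            for (int i = 0; i < numTokens; i++) {
                long bestLow = 0, bestWidth = -1;
                Long prev = null;
                for (long t : tokens) {
                    if (prev != null && t - prev > bestWidth) {
                        bestWidth = t - prev;
                        bestLow = prev;
                    }
                    prev = t;
                }
                long midpoint = bestLow + bestWidth / 2;
                tokens.add(midpoint);
                picked.add(midpoint);
            }
            return picked;
        }

        public static void main(String[] args) {
            TreeSet<Long> ringView =
                new TreeSet<>(List.of(0L, 1_000_000L, 2_000_000L));

            // Two nodes bootstrapping at once read the same ring view,
            // neither seeing the other's (not yet gossiped) tokens...
            List<Long> nodeA = allocate(ringView, 4);
            List<Long> nodeB = allocate(ringView, 4);

            // ...so they pick identical tokens: a bootstrap conflict.
            System.out.println(nodeA.equals(nodeB)); // prints: true
        }
    }

Presumably that's also why enforcing sequential bootstrap made the
conflicts go away: each node then sees the previous node's tokens
before choosing its own.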

I have to clarify that what I'm working with is not exactly
C*, and my knowledge of the C* codebase is not as up to date as
I would want it to be, but I suspect that the above problem might very
well affect C* too, in which case changing the defaults might
be a less-than-trivial undertaking.

Regards,
Dimitar

On Fri, 31 Jan 2020 at 17:20, Joshua McKenzie <jmcken...@apache.org> wrote:

> > We should be using the default value that benefits the most people,
> > rather than an arbitrary compromise.
>
> I'd caution that we're talking about the default value *we believe* will
> benefit the most people according to our respective understandings of C*
> usage.
>
> > Most clusters don't shrink, they stay the same size or grow. I'd say 90%
> > or more fall in this category.
>
> While I intuitively agree with the "most don't shrink, they stay the same
> or grow" claim, there's a distinct difference between what ratio we think
> stays the same size and what ratio we think grows, and that difference
> impacts the 4 vs. 16 debate and informs this discussion.
>
> There's a *lot* of Cassandra out in the world, and these changes are going
> to impact all of it. I'm not advocating a certain position on 4 vs. 16, but
> I do think we need to be very careful about how strongly we hold our
> beliefs, and about presenting them as facts in discussions like this.
>
> For my unsolicited .02, it sounds an awful lot like we're stuck between a
> rock and a hard place: there is no correct "one size fits all" answer here
> (or, said another way, both 4 and 16 are correct, just for different
> cases, and we don't know or agree on which case is the right one to
> target). So perhaps a discussion on a smart evolution of token allocation
> counts based on quantized tiers of cluster size and dataset growth (either
> automated or through operational best practices) could be valuable along
> with this.
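>
> To make that concrete, one possible shape of such a tiered default
> (a sketch only; the tier boundaries below are invented for
> illustration, not proposed values):
>
>     // Hypothetical tiered default for num_tokens, keyed off expected
>     // datacenter size. The boundaries are made up for illustration.
>     class TieredTokenDefaults {
>         static int defaultNumTokens(int expectedNodesPerDc) {
>             if (expectedNodesPerDc >= 100) return 4; // large: few tokens suffice
>             if (expectedNodesPerDc >= 20) return 8;  // medium tier
>             return 16; // small: more tokens smooth out imbalance
>         }
>
>         public static void main(String[] args) {
>             System.out.println(defaultNumTokens(150)); // 4
>             System.out.println(defaultNumTokens(12));  // 16
>         }
>     }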
>
> On Fri, Jan 31, 2020 at 8:57 AM Alexander Dejanovski
> <a...@thelastpickle.com> wrote:
>
> > While I (mostly) understand the maths behind using 4 vnodes as a default
> > (which really is a question of extreme availability), I don't think they
> > provide noticeable performance improvements over using 16, while 16
> > vnodes will protect folks from imbalances. It is very hard to deal with
> > unbalanced clusters, and people start to deal with it once some nodes are
> > already close to being full. Operationally, it's far from trivial.
> > We're going to run some experiments bootstrapping clusters with 4 tokens
> > on the latest alpha to see how much balance we can expect, and how
> > removing one node could impact it.
> >
> > If we're talking about repairs, using 4 vnodes will generate
> > overstreaming, which can create lots of serious performance issues. Even
> > on clusters with 500GB of node density, we never use less than ~15
> > segments per node with Reaper.
> > Not everyone uses Reaper, obviously, and there will be no protection
> > against overstreaming with such a low default for folks not using
> > subrange repairs.
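> >
> > As rough numbers (a scratch calculation, assuming full-range repairs
> > stream whole vnode ranges of roughly equal size):
> >
> >     // Approximate data handled per repair session at 500GB density.
> >     class RepairSegmentMath {
> >         public static void main(String[] args) {
> >             long densityGb = 500;
> >             System.out.println(densityGb / 4);  // ~125 GB per vnode range
> >             System.out.println(densityGb / 15); // ~33 GB per Reaper segment
> >         }
> >     }
> >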
> > On small clusters, even with 256 vnodes, using Cassandra 3.0/3.x and
> > Reaper already makes it possible to get good repair performance, because
> > token ranges sharing the exact same replicas will be processed in a
> > single repair session. On large clusters, I reckon it's good to have far
> > fewer vnodes to speed up repairs.
> >
> > Cassandra 4.0 is supposed to aim at providing a rock-solid, stable
> > release of Cassandra, fixing past instabilities, and I think lowering the
> > default to 4 tokens defeats that purpose.
> > 16 tokens is a reasonable compromise for clusters of all sizes, without
> > being too aggressive. Those with enough C* experience can still lower
> > that number for their clusters.
> >
> > Cheers,
> >
> > -----------------
> > Alexander Dejanovski
> > France
> > @alexanderdeja
> >
> > Consultant
> > Apache Cassandra Consulting
> > http://www.thelastpickle.com
> >
> >
> > On Fri, Jan 31, 2020 at 1:41 PM Mick Semb Wever <m...@apache.org> wrote:
> >
> > >
> > > > TLDR, based on availability concerns, skew concerns, operational
> > > > concerns, and based on the fact that the new allocation algorithm can
> > > > be configured fairly simply now, this is a proposal to go with 4 as
> > > > the new default and the allocate_tokens_for_local_replication_factor
> > > > set to 3.
> > >
> > > I'm uncomfortable going with the default of `num_tokens: 4`.
> > > I would rather see a default of `num_tokens: 16` based on the following…
> > >
> > > a) 4 num_tokens does not provide a good out-of-the-box experience.
> > > b) 4 num_tokens doesn't provide any significant streaming benefits
> > > over 16.
> > > c) edge-case availability doesn't trump (a) & (b)
> > >
> > >
> > > For (a)…
> > >  The first node in each rack, up to RF racks, in each datacenter can't
> > > use the allocation strategy. With 4 num_tokens, 3 racks and RF=3, the
> > > first three nodes will be poorly balanced. If three poorly balanced
> > > nodes in a cluster are an issue (because the cluster is small enough),
> > > then 4 is the wrong default. From our own experience, we have had to
> > > bootstrap these nodes multiple times until they generated something ok.
> > > In practice 4 num_tokens (over 16) has given clients more headache than
> > > gain.
> > >
> > > Elaborating, 256 was originally chosen because the token randomness
> > > over that many tokens always averaged out. With a default of
> > > `allocate_tokens_for_local_replication_factor: 3` this issue is largely
> > > solved, but you will still have those initial nodes with randomly
> > > generated tokens. Ref:
> > > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/dht/tokenallocator/ReplicationAwareTokenAllocator.java#L80
> > > And to be precise: tokens are randomly generated until there is a node
> > > in each rack, up to RF racks. So, if you have RF=3, in theory (or if you
> > > are a newbie) you could boot 100 nodes in only the first two racks, and
> > > they would all get random tokens regardless of the
> > > allocate_tokens_for_local_replication_factor setting.
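> > >
> > > In pseudo-code, the gist of that rule looks like the following (a
> > > simplified sketch of the linked logic, not the actual implementation):
> > >
> > >     import java.util.HashSet;
> > >     import java.util.List;
> > >     import java.util.Set;
> > >
> > >     class RackCoverageSketch {
> > >         record Node(String name, String rack) {}
> > >
> > >         // Simplified rule: a bootstrapping node only benefits from the
> > >         // replication-aware allocator once RF distinct racks already
> > >         // hold a node; until then its tokens are random.
> > >         public static void main(String[] args) {
> > >             int rf = 3;
> > >             List<Node> bootOrder = List.of(
> > >                 new Node("n1", "rack1"), new Node("n2", "rack2"),
> > >                 new Node("n3", "rack3"), new Node("n4", "rack1"));
> > >
> > >             Set<String> racksSeen = new HashSet<>();
> > >             for (Node n : bootOrder) {
> > >                 boolean random = racksSeen.size() < rf;
> > >                 racksSeen.add(n.rack());
> > >                 System.out.println(n.name() + " -> "
> > >                     + (random ? "random tokens" : "allocated tokens"));
> > >             }
> > >         }
> > >     }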
> > >
> > > For example, using 4 num_tokens, 3 racks and RF=3…
> > >  - in a 6 node cluster, there's a total of 24 tokens, half of which
> > > are random,
> > >  - in a 9 node cluster, there's a total of 36 tokens, a third of which
> > > are random,
> > >  - etc
> > >
> > > Following this logic I would not be willing to apply 4 unless you know
> > > there will be more than 36 nodes in each data centre, i.e. less than
> > > ~8% of your tokens randomly generated. Many clusters don't have that
> > > size, and imho that's why 4 is a bad default.
> > >
> > > A default of 16, by the same logic, only needs 9 nodes in each dc to
> > > overcome that degree of randomness.
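> > >
> > > The arithmetic behind those percentages, as a runnable scratch
> > > calculation (assuming exactly the first RF nodes, one per rack, get
> > > random tokens):
> > >
> > >     class RandomTokenFraction {
> > >         public static void main(String[] args) {
> > >             int rf = 3, numTokens = 4;
> > >             for (int nodes : new int[] {6, 9, 36}) {
> > >                 // the first RF nodes (one per rack) carry random tokens
> > >                 double fraction =
> > >                     (double) (rf * numTokens) / (nodes * numTokens);
> > >                 System.out.printf("%d nodes: %.1f%% random tokens%n",
> > >                                   nodes, fraction * 100);
> > >             }
> > >             // prints 50.0%, 33.3% and 8.3% for 6, 9 and 36 nodes
> > >         }
> > >     }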
> > >
> > > The workaround to all this is having to manually define
> > > `initial_token: …` on those initial nodes. I'm really not keen on
> > > imposing that upon new users.
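> > >
> > > For the record, computing evenly spaced tokens for those first nodes
> > > is just arithmetic over the Murmur3 range (a sketch, not a C* tool):
> > >
> > >     import java.math.BigInteger;
> > >
> > >     class InitialTokens {
> > >         public static void main(String[] args) {
> > >             int nodes = 3, tokensPerNode = 4; // RF seed nodes, num_tokens
> > >             BigInteger total = BigInteger.valueOf((long) nodes * tokensPerNode);
> > >             BigInteger range = BigInteger.TWO.pow(64); // full Murmur3 ring
> > >             BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
> > >             for (int n = 0; n < nodes; n++) {
> > >                 StringBuilder tokens = new StringBuilder();
> > >                 for (int t = 0; t < tokensPerNode; t++) {
> > >                     int idx = t * nodes + n; // interleave nodes on the ring
> > >                     BigInteger tok = min.add(range
> > >                         .multiply(BigInteger.valueOf(idx)).divide(total));
> > >                     tokens.append(tok).append(t + 1 < tokensPerNode ? "," : "");
> > >                 }
> > >                 System.out.println("node" + (n + 1) + " initial_token: " + tokens);
> > >             }
> > >         }
> > >     }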
> > >
> > > For (b)…
> > >  there have been a number of improvements around streaming that resolve
> > > much of whatever difference there is between 4 and 16 num_tokens. And
> > > 4 num_tokens means bigger token ranges, so it could well be
> > > disadvantageous due to over-streaming.
> > >
> > > For (c)…
> > >  we are trying to optimise availability in situations where we can
> > > never guarantee availability. I understand it's a nice operational
> > > advantage to have in a shit-show, but it's not a systems design that
> > > you can rely upon. There's also the question of availability vs the
> > > size of the token-range that becomes unavailable.
> > >
> > >
> > >
> > > regards,
> > > Mick
> > >
> > >
