Re: Quantifying Virtual Node Impact on Cassandra Availability

2018-04-17 Thread Joseph Lynch
As far as I'm aware, if you're using a high number of tokens per host you
can't bootstrap two hosts at once without potentially violating
read-after-write (RaW) consistency when they have overlapping token ranges
(with 256 tokens this is basically guaranteed). I'm definitely not an expert
on this, though; when I've used vnodes I've always scaled up a single node
at a time.
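
For intuition, here is a toy Monte Carlo (function name and parameters are
made up, and this is not the paper's model) that estimates how often two
freshly bootstrapped nodes end up adjacent somewhere on the ring, which is
the simplest way for their replica sets to overlap:

    import random

    def overlap_probability(existing_nodes=30, num_tokens=256, trials=200):
        """Fraction of trials in which the two new nodes own adjacent tokens."""
        hits = 0
        for _ in range(trials):
            ring = []  # (token, node_id) for existing nodes plus the 2 new ones
            for node in range(existing_nodes + 2):
                for _ in range(num_tokens):
                    ring.append((random.random(), node))
            ring.sort()
            new_a, new_b = existing_nodes, existing_nodes + 1
            hits += any(
                {ring[i][1], ring[(i + 1) % len(ring)][1]} == {new_a, new_b}
                for i in range(len(ring))
            )
        return hits / trials

    print(overlap_probability())               # ~1.0 with 256 tokens per host
    print(overlap_probability(num_tokens=1))   # far lower with a single token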

Simultaneous bootstrap with a few (or one) tokens per node is very much
possible and is the fourth solution we proposed in the paper: bootstrap
wherever and then allow a token rebalancer, as proposed in CASSANDRA-1418,
to gradually shift the cluster into balance over time with little impact to
the cluster. Personally I am much more interested in this approach than in
vnodes for balancing clusters because it deals with hot partitions far
better (in the extreme case a single node ends up holding a single partition
as that partition gets hotter and hotter), doesn't impact repair or gossip,
can be throttled to limit its impact on the cluster, and gives incremental
bootstrap for free: you just bootstrap near another token and gradually move
your token over. If others agree this might be a useful direction to
explore, I think we might be interested in working on this instead of moving
to vnodes.
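
Purely to illustrate the "many small moves" idea, here is a deliberately
naive sketch (my own toy, not what CASSANDRA-1418 actually specifies): each
round nudges every token a bounded distance toward an evenly spaced target,
so any single round only re-streams a thin slice of data.

    def rebalance_round(tokens, max_step=0.02):
        """tokens: sorted list of floats in [0, 1), one token per node.
        Move each token at most max_step toward an evenly spaced layout."""
        n = len(tokens)
        first = tokens[0]
        targets = sorted((first + i / n) % 1.0 for i in range(n))
        moved = []
        for tok, target in zip(tokens, targets):
            delta = target - tok
            delta -= round(delta)                  # shortest way around the ring
            step = max(-max_step, min(max_step, delta))
            moved.append((tok + step) % 1.0)
        return sorted(moved)

    ring = [0.00, 0.02, 0.05, 0.50]                # badly clumped 4-node ring
    for _ in range(30):
        ring = rebalance_round(ring)
    print(ring)   # close to [0.0, 0.25, 0.5, 0.75] after many tiny moves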

-Joey

On Tue, Apr 17, 2018 at 10:22 AM, Carl Mueller  wrote:

> Is this a fundamental vnode disadvantage:
>
> do Vnodes preclude cluster expansion faster than 1 at a time? I would think
> with manual management you could expand a datacenter by multiples of
> machines/nodes. Or at least in multiples of ReplicationFactor:
>
> RF3 starts as:
>
> a1 b1 c1
>
> doubles to:
>
> a1 a2 b1 b2 c1 c2
>
> expands again by 3:
>
> a1 a2 a3 b1 b2 b3 c1 c2 c3
>
> all via sneakernet or similar schemes? Or am I wrong about being able to do
> bigger expansions on manual tokens and that vnodes can't safely do that?
>
> Most of the paper seems to surround the streaming time being what exposes
> the cluster to risk. But manual tokens lend themselves to sneakernet
> rebuilds, do they not?
>
>
> On Tue, Apr 17, 2018 at 11:16 AM, Richard Low  wrote:
>
> > I'm also not convinced the problems listed in the paper with removenode
> are
> > so serious. With lots of vnodes per node, removenode causes data to be
> > streamed into all other nodes in parallel, so is (n-1) times quicker than
> > replacement for n nodes. For R=3, the failure rate goes up with vnodes
> > (without vnodes, after the first failure, any 4 neighbouring node
> failures
> > lose quorum but for vnodes, any other node failure loses quorum) by a
> > factor of (n-1)/4. The increase in speed more than offsets this so in
> fact
> > vnodes with removenode give theoretically 4x higher availability than no
> > vnodes.
> >
> > If anyone is interested in using vnodes in large clusters I'd strongly
> > suggest testing this out to see if the concerns in section 4.3.3 are
> valid.
> >
> > Richard.
> >
> > On 17 April 2018 at 08:29, Jeff Jirsa  wrote:
> >
> > > There are two huge advantages
> > >
> > > 1) during expansion / replacement / decom, you stream from far more
> > > ranges. Since streaming is single threaded per stream, this enables you
> > to
> > > max out machines during streaming where single token doesn’t
> > >
> > > 2) when adjusting the size of a cluster, you can often grow
> incrementally
> > > without rebalancing
> > >
> > > Streaming entire wholly covered/contained/owned sstables during range
> > > movements is probably a huge benefit in many use cases that may make
> the
> > > single threaded streaming implementation less of a concern, and likely
> > > works reasonably well without major changes to LCS in particular  - I’m
> > > fairly confident there’s a JIRA for this, if not it’s been discussed in
> > > person among various operators for years as an obvious future
> > improvement.
> > >
> > > --
> > > Jeff Jirsa
> > >
> > >
> > > > On Apr 17, 2018, at 8:17 AM, Carl Mueller <
> > carl.muel...@smartthings.com>
> > > wrote:
> > > >
> > > > Do Vnodes address anything besides alleviating cluster planners from
> > > doing
> > > > token range management on nodes manually? Do we have a centralized
> list
> > > of
> > > > advantages they provide beyond that?
> > > >
> > > > There seem to be lots of downsides. 2i index performance, the above
> > > > availability, etc.
> > > >
> > > > I also wonder if in vnodes (and manually managed tokens... I'll
> return
> > to
> > > > this) the node recovery scenarios are being hampered by sstables
> having
> > > the
> > > > hash ranges of the vnodes intermingled in the same set of sstables. I
> > > > wondered in another thread in vnodes why sstables are separated into
> > sets
> > > > by the vnode ranges they represent. For a manually managed contiguous
> > > token
> > > > range, you could separate the sstables into a fixed number of sets,
> > kind
> > > of
> > > > 

Re: Quantifying Virtual Node Impact on Cassandra Availability

2018-04-17 Thread Carl Mueller
Is this a fundamental vnode disadvantage:

do vnodes preclude cluster expansion faster than one node at a time? I would
think with manual management you could expand a datacenter by multiples of
machines/nodes, or at least in multiples of the replication factor:

RF3 starts as:

a1 b1 c1

doubles to:

a1 a2 b1 b2 c1 c2

expands again by 3:

a1 a2 a3 b1 b2 b3 c1 c2 c3

all via sneakernet or similar schemes? Or am I wrong that bigger expansions
are possible with manual tokens, and that vnodes can't safely do the same?
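
For what it's worth, the token arithmetic for that kind of doubling is
simple with manual tokens. A toy sketch (token placement only, with made-up
values; it says nothing about whether joining several nodes at once is
safe):

    def doubled_ring(ring):
        """ring: sorted list of (token, node) with one token per node.
        Insert a new node at the midpoint of every existing range
        (64-bit Murmur3-style token space)."""
        out = list(ring)
        for i, (tok, node) in enumerate(ring):
            nxt = ring[(i + 1) % len(ring)][0]
            width = (nxt - tok) % 2**64
            out.append(((tok + width // 2) % 2**64, node[0] + "2"))
        return sorted(out)

    ring = [(0, "a1"), (2**64 // 3, "b1"), (2 * 2**64 // 3, "c1")]
    for token, node in doubled_ring(ring):
        print(node, token)     # a1, a2, b1, b2, c1, c2 evenly interleaved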

Most of the paper seems to center on streaming time being what exposes the
cluster to risk. But manual tokens lend themselves to sneakernet rebuilds,
do they not?


On Tue, Apr 17, 2018 at 11:16 AM, Richard Low  wrote:

> I'm also not convinced the problems listed in the paper with removenode are
> so serious. With lots of vnodes per node, removenode causes data to be
> streamed into all other nodes in parallel, so is (n-1) times quicker than
> replacement for n nodes. For R=3, the failure rate goes up with vnodes
> (without vnodes, after the first failure, any 4 neighbouring node failures
> lose quorum but for vnodes, any other node failure loses quorum) by a
> factor of (n-1)/4. The increase in speed more than offsets this so in fact
> vnodes with removenode give theoretically 4x higher availability than no
> vnodes.
>
> If anyone is interested in using vnodes in large clusters I'd strongly
> suggest testing this out to see if the concerns in section 4.3.3 are valid.
>
> Richard.
>
> On 17 April 2018 at 08:29, Jeff Jirsa  wrote:
>
> > There are two huge advantages
> >
> > 1) during expansion / replacement / decom, you stream from far more
> > ranges. Since streaming is single threaded per stream, this enables you
> to
> > max out machines during streaming where single token doesn’t
> >
> > 2) when adjusting the size of a cluster, you can often grow incrementally
> > without rebalancing
> >
> > Streaming entire wholly covered/contained/owned sstables during range
> > movements is probably a huge benefit in many use cases that may make the
> > single threaded streaming implementation less of a concern, and likely
> > works reasonably well without major changes to LCS in particular  - I’m
> > fairly confident there’s a JIRA for this, if not it’s been discussed in
> > person among various operators for years as an obvious future
> improvement.
> >
> > --
> > Jeff Jirsa
> >
> >
> > > On Apr 17, 2018, at 8:17 AM, Carl Mueller <
> carl.muel...@smartthings.com>
> > wrote:
> > >
> > > Do Vnodes address anything besides alleviating cluster planners from
> > doing
> > > token range management on nodes manually? Do we have a centralized list
> > of
> > > advantages they provide beyond that?
> > >
> > > There seem to be lots of downsides. 2i index performance, the above
> > > availability, etc.
> > >
> > > I also wonder if in vnodes (and manually managed tokens... I'll return
> to
> > > this) the node recovery scenarios are being hampered by sstables having
> > the
> > > hash ranges of the vnodes intermingled in the same set of sstables. I
> > > wondered in another thread in vnodes why sstables are separated into
> sets
> > > by the vnode ranges they represent. For a manually managed contiguous
> > token
> > > range, you could separate the sstables into a fixed number of sets,
> kind
> > of
> > > vnode-light.
> > >
> > > So if there was rebalancing or reconstruction, you could sneakernet or
> > > reliably send entire sstable sets that would belong in a range.
> > >
> > > I also thing this would improve compactions and repairs too.
> Compactions
> > > would be naturally parallelizable in all compaction schemes, and
> repairs
> > > would have natural subsets to do merkle tree calculations.
> > >
> > > Granted sending sstables might result in "overstreaming" due to data
> > > replication across the sstables, but you wouldn't have CPU and random
> I/O
> > > to look up the data. Just sequential transfers.
> > >
> > > For manually managed tokens with subdivided sstables, if there was
> > > rebalancing, you would have the "fringe" edges of the hash range
> > subdivided
> > > already, and you would only need to deal with the data in the border
> > areas
> > > of the token range, and again could sneakernet / dumb transfer the
> tables
> > > and then let the new node remove the unneeded in future repairs.
> > > (Compaction does not remove data that is not longer managed by a node,
> > only
> > > repair does? Or does only nodetool clean do that?)
> > >
> > > Pre-subdivided sstables for manually maanged tokens would REALLY pay
> big
> > > dividends in large-scale cluster expansion. Say you wanted to double or
> > > triple the cluster. Since the sstables are already split by some
> numeric
> > > factor that has lots of even divisors (60 for RF 2,3,4,5), you simply
> > bulk
> > > copy the already-subdivided sstables for the new nodes' hash ranges and
> > > you'd basically be done. In AWS 

Re: Quantifying Virtual Node Impact on Cassandra Availability

2018-04-17 Thread Joseph Lynch
I'm pretty worried about large clusters using removenode given my experience
with Elasticsearch. Elasticsearch shard recovery is basically removenode +
bootstrap, and it does work very quickly if not throttled, but it completely
destroys latency-sensitive clusters (P99s spike to multiple hundreds of
milliseconds) unless it is limited to a maximum of 2-4 concurrent shard
recoveries. Since Cassandra doesn't have the ability to limit the number of
nodes impacted at a time during removenode, I'm concerned. I would be
interested to hear from any users at scale with >200 node clusters that
require sub-10ms p99 read latency and successfully use removenode under
load.
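
To be concrete about the kind of limit I mean, here is a hypothetical sketch
(nothing like this exists in Cassandra today; the names are invented): group
the dead node's range movements into waves so only a handful of nodes are
streaming at any one time, in the spirit of Elasticsearch's recovery
throttling.

    def recovery_waves(range_moves, max_concurrent_nodes=4):
        """range_moves: list of (source_node, target_node) stream pairs needed
        to re-replicate a removed node's ranges. Group them into waves that
        each involve at most max_concurrent_nodes distinct nodes."""
        waves, current, busy = [], [], set()
        for src, dst in range_moves:
            if current and len(busy | {src, dst}) > max_concurrent_nodes:
                waves.append(current)
                current, busy = [], set()
            current.append((src, dst))
            busy |= {src, dst}
        if current:
            waves.append(current)
        return waves

    moves = [("n%d" % i, "n%d" % (i + 1)) for i in range(0, 40, 2)]  # 20 fake streams
    print(len(recovery_waves(moves)))   # several small waves instead of one burst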

Anecdotally I'm aware of large scale production users using every proposed
solution in the paper except for removenode (num_tokens=16, EBS data
volumes, range moves + simultaneous bootstrap).

-Joey

On Tue, Apr 17, 2018 at 9:16 AM, Richard Low  wrote:

> I'm also not convinced the problems listed in the paper with removenode are
> so serious. With lots of vnodes per node, removenode causes data to be
> streamed into all other nodes in parallel, so is (n-1) times quicker than
> replacement for n nodes. For R=3, the failure rate goes up with vnodes
> (without vnodes, after the first failure, any 4 neighbouring node failures
> lose quorum but for vnodes, any other node failure loses quorum) by a
> factor of (n-1)/4. The increase in speed more than offsets this so in fact
> vnodes with removenode give theoretically 4x higher availability than no
> vnodes.
>
> If anyone is interested in using vnodes in large clusters I'd strongly
> suggest testing this out to see if the concerns in section 4.3.3 are valid.
>
> Richard.
>
> On 17 April 2018 at 08:29, Jeff Jirsa  wrote:
>
> > There are two huge advantages
> >
> > 1) during expansion / replacement / decom, you stream from far more
> > ranges. Since streaming is single threaded per stream, this enables you
> to
> > max out machines during streaming where single token doesn’t
> >
> > 2) when adjusting the size of a cluster, you can often grow incrementally
> > without rebalancing
> >
> > Streaming entire wholly covered/contained/owned sstables during range
> > movements is probably a huge benefit in many use cases that may make the
> > single threaded streaming implementation less of a concern, and likely
> > works reasonably well without major changes to LCS in particular  - I’m
> > fairly confident there’s a JIRA for this, if not it’s been discussed in
> > person among various operators for years as an obvious future
> improvement.
> >
> > --
> > Jeff Jirsa
> >
> >
> > > On Apr 17, 2018, at 8:17 AM, Carl Mueller <
> carl.muel...@smartthings.com>
> > wrote:
> > >
> > > Do Vnodes address anything besides alleviating cluster planners from
> > doing
> > > token range management on nodes manually? Do we have a centralized list
> > of
> > > advantages they provide beyond that?
> > >
> > > There seem to be lots of downsides. 2i index performance, the above
> > > availability, etc.
> > >
> > > I also wonder if in vnodes (and manually managed tokens... I'll return
> to
> > > this) the node recovery scenarios are being hampered by sstables having
> > the
> > > hash ranges of the vnodes intermingled in the same set of sstables. I
> > > wondered in another thread in vnodes why sstables are separated into
> sets
> > > by the vnode ranges they represent. For a manually managed contiguous
> > token
> > > range, you could separate the sstables into a fixed number of sets,
> kind
> > of
> > > vnode-light.
> > >
> > > So if there was rebalancing or reconstruction, you could sneakernet or
> > > reliably send entire sstable sets that would belong in a range.
> > >
> > > I also thing this would improve compactions and repairs too.
> Compactions
> > > would be naturally parallelizable in all compaction schemes, and
> repairs
> > > would have natural subsets to do merkle tree calculations.
> > >
> > > Granted sending sstables might result in "overstreaming" due to data
> > > replication across the sstables, but you wouldn't have CPU and random
> I/O
> > > to look up the data. Just sequential transfers.
> > >
> > > For manually managed tokens with subdivided sstables, if there was
> > > rebalancing, you would have the "fringe" edges of the hash range
> > subdivided
> > > already, and you would only need to deal with the data in the border
> > areas
> > > of the token range, and again could sneakernet / dumb transfer the
> tables
> > > and then let the new node remove the unneeded in future repairs.
> > > (Compaction does not remove data that is not longer managed by a node,
> > only
> > > repair does? Or does only nodetool clean do that?)
> > >
> > > Pre-subdivided sstables for manually maanged tokens would REALLY pay
> big
> > > dividends in large-scale cluster expansion. Say you wanted to 

Re: Quantifying Virtual Node Impact on Cassandra Availability

2018-04-17 Thread Richard Low
I'm also not convinced the problems listed in the paper with removenode are
so serious. With lots of vnodes per node, removenode causes data to be
streamed into all other nodes in parallel, so it is (n-1) times quicker than
replacement for n nodes. For R=3, the failure rate goes up with vnodes by a
factor of (n-1)/4 (without vnodes, after the first failure, any of the 4
neighbouring nodes failing loses quorum, but with vnodes any other node
failing loses quorum). The increase in speed more than offsets this, so in
fact vnodes with removenode give theoretically 4x higher availability than
no vnodes.
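
Spelling that arithmetic out (same assumptions as above: R=3, recovery time
scaling inversely with the number of nodes streaming; a rough back-of-envelope
sketch, names made up):

    def relative_exposure(n, vnodes, recovery="replace", base_window=1.0):
        """Quorum-loss exposure ~ (# of fatal second failures) x (recovery window)."""
        fatal_second_failures = (n - 1) if vnodes else 4   # R=3 neighbours
        if vnodes and recovery == "removenode":
            window = base_window / (n - 1)    # all n-1 nodes stream in parallel
        else:
            window = base_window              # one replacement node streams alone
        return fatal_second_failures * window

    n = 100
    print(relative_exposure(n, vnodes=False))                        # 4.0
    print(relative_exposure(n, vnodes=True, recovery="removenode"))  # 1.0, ~4x lower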

If anyone is interested in using vnodes in large clusters I'd strongly
suggest testing this out to see if the concerns in section 4.3.3 are valid.

Richard.

On 17 April 2018 at 08:29, Jeff Jirsa  wrote:

> There are two huge advantages
>
> 1) during expansion / replacement / decom, you stream from far more
> ranges. Since streaming is single threaded per stream, this enables you to
> max out machines during streaming where single token doesn’t
>
> 2) when adjusting the size of a cluster, you can often grow incrementally
> without rebalancing
>
> Streaming entire wholly covered/contained/owned sstables during range
> movements is probably a huge benefit in many use cases that may make the
> single threaded streaming implementation less of a concern, and likely
> works reasonably well without major changes to LCS in particular  - I’m
> fairly confident there’s a JIRA for this, if not it’s been discussed in
> person among various operators for years as an obvious future improvement.
>
> --
> Jeff Jirsa
>
>
> > On Apr 17, 2018, at 8:17 AM, Carl Mueller 
> wrote:
> >
> > Do Vnodes address anything besides alleviating cluster planners from
> doing
> > token range management on nodes manually? Do we have a centralized list
> of
> > advantages they provide beyond that?
> >
> > There seem to be lots of downsides. 2i index performance, the above
> > availability, etc.
> >
> > I also wonder if in vnodes (and manually managed tokens... I'll return to
> > this) the node recovery scenarios are being hampered by sstables having
> the
> > hash ranges of the vnodes intermingled in the same set of sstables. I
> > wondered in another thread in vnodes why sstables are separated into sets
> > by the vnode ranges they represent. For a manually managed contiguous
> token
> > range, you could separate the sstables into a fixed number of sets, kind
> of
> > vnode-light.
> >
> > So if there was rebalancing or reconstruction, you could sneakernet or
> > reliably send entire sstable sets that would belong in a range.
> >
> > I also thing this would improve compactions and repairs too. Compactions
> > would be naturally parallelizable in all compaction schemes, and repairs
> > would have natural subsets to do merkle tree calculations.
> >
> > Granted sending sstables might result in "overstreaming" due to data
> > replication across the sstables, but you wouldn't have CPU and random I/O
> > to look up the data. Just sequential transfers.
> >
> > For manually managed tokens with subdivided sstables, if there was
> > rebalancing, you would have the "fringe" edges of the hash range
> subdivided
> > already, and you would only need to deal with the data in the border
> areas
> > of the token range, and again could sneakernet / dumb transfer the tables
> > and then let the new node remove the unneeded in future repairs.
> > (Compaction does not remove data that is not longer managed by a node,
> only
> > repair does? Or does only nodetool clean do that?)
> >
> > Pre-subdivided sstables for manually maanged tokens would REALLY pay big
> > dividends in large-scale cluster expansion. Say you wanted to double or
> > triple the cluster. Since the sstables are already split by some numeric
> > factor that has lots of even divisors (60 for RF 2,3,4,5), you simply
> bulk
> > copy the already-subdivided sstables for the new nodes' hash ranges and
> > you'd basically be done. In AWS EBS volumes, that could just be a drive
> > detach / drive attach.
> >
> >
> >
> >
> >> On Tue, Apr 17, 2018 at 7:37 AM, kurt greaves 
> wrote:
> >>
> >> Great write up. Glad someone finally did the math for us. I don't think
> >> this will come as a surprise for many of the developers. Availability is
> >> only one issue raised by vnodes. Load distribution and performance are
> also
> >> pretty big concerns.
> >>
> >> I'm always a proponent for fixing vnodes, and removing them as a default
> >> until we do. Happy to help on this and we have ideas in mind that at
> some
> >> point I'll create tickets for...
> >>
> >>> On Tue., 17 Apr. 2018, 06:16 Joseph Lynch, 
> wrote:
> >>>
> >>> If the blob link on github doesn't work for the pdf (looks like mobile
> >>> might not like it), try:
> >>>
> >>>
> >>> https://github.com/jolynch/python_performance_toolkit/
> >> 

Re: Quantifying Virtual Node Impact on Cassandra Availability

2018-04-17 Thread Carl Mueller
I've posted a bunch of things here relevant to the commitlog --> sstable
path and the associated compaction / sstable metadata changes. I really need
to learn that section of the code.

On Tue, Apr 17, 2018 at 10:29 AM, Jeff Jirsa  wrote:

> There are two huge advantages
>
> 1) during expansion / replacement / decom, you stream from far more
> ranges. Since streaming is single threaded per stream, this enables you to
> max out machines during streaming where single token doesn’t
>
> 2) when adjusting the size of a cluster, you can often grow incrementally
> without rebalancing
>
> Streaming entire wholly covered/contained/owned sstables during range
> movements is probably a huge benefit in many use cases that may make the
> single threaded streaming implementation less of a concern, and likely
> works reasonably well without major changes to LCS in particular  - I’m
> fairly confident there’s a JIRA for this, if not it’s been discussed in
> person among various operators for years as an obvious future improvement.
>
> --
> Jeff Jirsa
>
>
> > On Apr 17, 2018, at 8:17 AM, Carl Mueller 
> wrote:
> >
> > Do Vnodes address anything besides alleviating cluster planners from
> doing
> > token range management on nodes manually? Do we have a centralized list
> of
> > advantages they provide beyond that?
> >
> > There seem to be lots of downsides. 2i index performance, the above
> > availability, etc.
> >
> > I also wonder if in vnodes (and manually managed tokens... I'll return to
> > this) the node recovery scenarios are being hampered by sstables having
> the
> > hash ranges of the vnodes intermingled in the same set of sstables. I
> > wondered in another thread in vnodes why sstables are separated into sets
> > by the vnode ranges they represent. For a manually managed contiguous
> token
> > range, you could separate the sstables into a fixed number of sets, kind
> of
> > vnode-light.
> >
> > So if there was rebalancing or reconstruction, you could sneakernet or
> > reliably send entire sstable sets that would belong in a range.
> >
> > I also thing this would improve compactions and repairs too. Compactions
> > would be naturally parallelizable in all compaction schemes, and repairs
> > would have natural subsets to do merkle tree calculations.
> >
> > Granted sending sstables might result in "overstreaming" due to data
> > replication across the sstables, but you wouldn't have CPU and random I/O
> > to look up the data. Just sequential transfers.
> >
> > For manually managed tokens with subdivided sstables, if there was
> > rebalancing, you would have the "fringe" edges of the hash range
> subdivided
> > already, and you would only need to deal with the data in the border
> areas
> > of the token range, and again could sneakernet / dumb transfer the tables
> > and then let the new node remove the unneeded in future repairs.
> > (Compaction does not remove data that is not longer managed by a node,
> only
> > repair does? Or does only nodetool clean do that?)
> >
> > Pre-subdivided sstables for manually maanged tokens would REALLY pay big
> > dividends in large-scale cluster expansion. Say you wanted to double or
> > triple the cluster. Since the sstables are already split by some numeric
> > factor that has lots of even divisors (60 for RF 2,3,4,5), you simply
> bulk
> > copy the already-subdivided sstables for the new nodes' hash ranges and
> > you'd basically be done. In AWS EBS volumes, that could just be a drive
> > detach / drive attach.
> >
> >
> >
> >
> >> On Tue, Apr 17, 2018 at 7:37 AM, kurt greaves 
> wrote:
> >>
> >> Great write up. Glad someone finally did the math for us. I don't think
> >> this will come as a surprise for many of the developers. Availability is
> >> only one issue raised by vnodes. Load distribution and performance are
> also
> >> pretty big concerns.
> >>
> >> I'm always a proponent for fixing vnodes, and removing them as a default
> >> until we do. Happy to help on this and we have ideas in mind that at
> some
> >> point I'll create tickets for...
> >>
> >>> On Tue., 17 Apr. 2018, 06:16 Joseph Lynch, 
> wrote:
> >>>
> >>> If the blob link on github doesn't work for the pdf (looks like mobile
> >>> might not like it), try:
> >>>
> >>>
> >>> https://github.com/jolynch/python_performance_toolkit/
> >> raw/master/notebooks/cassandra_availability/whitepaper/cassandra-
> >> availability-virtual.pdf
> >>>
> >>> -Joey
> >>> <
> >>> https://github.com/jolynch/python_performance_toolkit/
> >> raw/master/notebooks/cassandra_availability/whitepaper/cassandra-
> >> availability-virtual.pdf
> 
> >>>
> >>> On Mon, Apr 16, 2018 at 1:14 PM, Joseph Lynch 
> >>> wrote:
> >>>
>  Josh Snyder and I have been working on evaluating virtual nodes for
> >> large
>  scale deployments and while it seems like there is a lot of anecdotal
>  support for reducing the vnode count [1], we 

Re: Quantifying Virtual Node Impact on Cassandra Availability

2018-04-17 Thread Jeff Jirsa
There are two huge advantages 

1) during expansion / replacement / decom, you stream from far more ranges.
Since streaming is single-threaded per stream, this enables you to max out
machines during streaming where a single token doesn't (see the sketch after
this list)

2) when adjusting the size of a cluster, you can often grow incrementally 
without rebalancing 
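
Very roughly (my own back-of-envelope, ignoring racks and DCs), the number
of distinct peers a joining node can stream from, and therefore the number
of single-threaded streams it can run in parallel, looks like:

    def approx_stream_sources(n, rf=3, num_tokens=1):
        """Rough count of peers a bootstrapping node can stream from at once."""
        # Single token: the new node takes slices of roughly rf replicas' data.
        # Many vnodes: its slices are scattered all over the ring, so the cap
        # becomes essentially the whole cluster.
        return min(n - 1, rf * num_tokens)

    print(approx_stream_sources(100, num_tokens=1))    # ~3 peers -> 3 streams
    print(approx_stream_sources(100, num_tokens=256))  # ~99 peers -> 99 streams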

Streaming entire wholly covered/contained/owned sstables during range
movements is probably a huge benefit in many use cases, and may make the
single-threaded streaming implementation less of a concern; it likely works
reasonably well without major changes to LCS in particular. I'm fairly
confident there's a JIRA for this; if not, it's been discussed in person
among various operators for years as an obvious future improvement.

-- 
Jeff Jirsa


> On Apr 17, 2018, at 8:17 AM, Carl Mueller  
> wrote:
> 
> Do Vnodes address anything besides alleviating cluster planners from doing
> token range management on nodes manually? Do we have a centralized list of
> advantages they provide beyond that?
> 
> There seem to be lots of downsides. 2i index performance, the above
> availability, etc.
> 
> I also wonder if in vnodes (and manually managed tokens... I'll return to
> this) the node recovery scenarios are being hampered by sstables having the
> hash ranges of the vnodes intermingled in the same set of sstables. I
> wondered in another thread in vnodes why sstables are separated into sets
> by the vnode ranges they represent. For a manually managed contiguous token
> range, you could separate the sstables into a fixed number of sets, kind of
> vnode-light.
> 
> So if there was rebalancing or reconstruction, you could sneakernet or
> reliably send entire sstable sets that would belong in a range.
> 
> I also thing this would improve compactions and repairs too. Compactions
> would be naturally parallelizable in all compaction schemes, and repairs
> would have natural subsets to do merkle tree calculations.
> 
> Granted sending sstables might result in "overstreaming" due to data
> replication across the sstables, but you wouldn't have CPU and random I/O
> to look up the data. Just sequential transfers.
> 
> For manually managed tokens with subdivided sstables, if there was
> rebalancing, you would have the "fringe" edges of the hash range subdivided
> already, and you would only need to deal with the data in the border areas
> of the token range, and again could sneakernet / dumb transfer the tables
> and then let the new node remove the unneeded in future repairs.
> (Compaction does not remove data that is not longer managed by a node, only
> repair does? Or does only nodetool clean do that?)
> 
> Pre-subdivided sstables for manually maanged tokens would REALLY pay big
> dividends in large-scale cluster expansion. Say you wanted to double or
> triple the cluster. Since the sstables are already split by some numeric
> factor that has lots of even divisors (60 for RF 2,3,4,5), you simply bulk
> copy the already-subdivided sstables for the new nodes' hash ranges and
> you'd basically be done. In AWS EBS volumes, that could just be a drive
> detach / drive attach.
> 
> 
> 
> 
>> On Tue, Apr 17, 2018 at 7:37 AM, kurt greaves  wrote:
>> 
>> Great write up. Glad someone finally did the math for us. I don't think
>> this will come as a surprise for many of the developers. Availability is
>> only one issue raised by vnodes. Load distribution and performance are also
>> pretty big concerns.
>> 
>> I'm always a proponent for fixing vnodes, and removing them as a default
>> until we do. Happy to help on this and we have ideas in mind that at some
>> point I'll create tickets for...
>> 
>>> On Tue., 17 Apr. 2018, 06:16 Joseph Lynch,  wrote:
>>> 
>>> If the blob link on github doesn't work for the pdf (looks like mobile
>>> might not like it), try:
>>> 
>>> 
>>> https://github.com/jolynch/python_performance_toolkit/
>> raw/master/notebooks/cassandra_availability/whitepaper/cassandra-
>> availability-virtual.pdf
>>> 
>>> -Joey
>>> <
>>> https://github.com/jolynch/python_performance_toolkit/
>> raw/master/notebooks/cassandra_availability/whitepaper/cassandra-
>> availability-virtual.pdf
 
>>> 
>>> On Mon, Apr 16, 2018 at 1:14 PM, Joseph Lynch 
>>> wrote:
>>> 
 Josh Snyder and I have been working on evaluating virtual nodes for
>> large
 scale deployments and while it seems like there is a lot of anecdotal
 support for reducing the vnode count [1], we couldn't find any concrete
 math on the topic, so we had some fun and took a whack at quantifying
>> how
 different choices of num_tokens impact a Cassandra cluster.
 
 According to the model we developed [2] it seems that at small cluster
 sizes there isn't much of a negative impact on availability, but when
 clusters scale up to hundreds of hosts, vnodes have a major impact on
 availability. In 

Re: Quantifying Virtual Node Impact on Cassandra Availability

2018-04-17 Thread Carl Mueller
Do vnodes address anything besides relieving cluster planners of manual
token range management on nodes? Do we have a centralized list of advantages
they provide beyond that?

There seem to be lots of downsides: secondary index (2i) performance, the
availability issues above, etc.

I also wonder whether, with vnodes (and manually managed tokens... I'll
return to this), the node recovery scenarios are being hampered by sstables
having the hash ranges of the vnodes intermingled in the same set of
sstables. I wondered in another thread why, with vnodes, sstables aren't
separated into sets by the vnode ranges they represent. For a manually
managed contiguous token range, you could separate the sstables into a fixed
number of sets, kind of vnode-light.

So if there were rebalancing or reconstruction, you could sneakernet or
reliably send entire sstable sets that would belong in a range.

I also think this would improve compactions and repairs. Compactions would
be naturally parallelizable in all compaction schemes, and repairs would
have natural subsets for Merkle tree calculations.

Granted, sending sstables might result in "overstreaming" due to data
replication across the sstables, but you wouldn't spend CPU and random I/O
looking up the data, just sequential transfers.

For manually managed tokens with subdivided sstables, if there were
rebalancing, you would have the "fringe" edges of the hash range subdivided
already, so you would only need to deal with the data in the border areas of
the token range, and again could sneakernet / dumb transfer the tables and
then let the new node remove the unneeded data in future repairs.
(Compaction does not remove data that is no longer managed by a node, only
repair does? Or does only nodetool cleanup do that?)

Pre-subdivided sstables for manually managed tokens would REALLY pay big
dividends in large-scale cluster expansion. Say you wanted to double or
triple the cluster. Since the sstables are already split by some numeric
factor with lots of divisors (60 covers RF 2, 3, 4, 5), you simply bulk copy
the already-subdivided sstables for the new nodes' hash ranges and you'd
basically be done. With AWS EBS volumes, that could just be a drive detach /
drive attach.
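
A sketch of that bucket arithmetic with made-up numbers (the helpers are
hypothetical, not anything that exists in Cassandra): 60 buckets per node's
range, since 60 divides evenly by 2, 3, 4, 5, and 6.

    BUCKETS = 60

    def bucket_for(token, range_start, range_end):
        """Which of the 60 fixed sub-ranges of (range_start, range_end] a
        token falls into, on a 64-bit token ring."""
        width = (range_end - range_start) % 2**64
        offset = (token - range_start) % 2**64
        return min(BUCKETS - 1, offset * BUCKETS // width)

    def buckets_for_new_node(new_node_index, expansion_factor):
        """When one node's range is split among expansion_factor nodes, the
        bucket numbers that move to the node at position new_node_index
        (index 0 keeps the first slice)."""
        per_node = BUCKETS // expansion_factor
        start = new_node_index * per_node
        return list(range(start, start + per_node))

    print(bucket_for(3 * 2**60, 0, 2**63))   # -> bucket 22
    print(buckets_for_new_node(1, 3))        # tripling: buckets 20..39 move
    print(buckets_for_new_node(2, 3))        # tripling: buckets 40..59 move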




On Tue, Apr 17, 2018 at 7:37 AM, kurt greaves  wrote:

> Great write up. Glad someone finally did the math for us. I don't think
> this will come as a surprise for many of the developers. Availability is
> only one issue raised by vnodes. Load distribution and performance are also
> pretty big concerns.
>
> I'm always a proponent for fixing vnodes, and removing them as a default
> until we do. Happy to help on this and we have ideas in mind that at some
> point I'll create tickets for...
>
> On Tue., 17 Apr. 2018, 06:16 Joseph Lynch,  wrote:
>
> > If the blob link on github doesn't work for the pdf (looks like mobile
> > might not like it), try:
> >
> >
> > https://github.com/jolynch/python_performance_toolkit/
> raw/master/notebooks/cassandra_availability/whitepaper/cassandra-
> availability-virtual.pdf
> >
> > -Joey
> > <
> > https://github.com/jolynch/python_performance_toolkit/
> raw/master/notebooks/cassandra_availability/whitepaper/cassandra-
> availability-virtual.pdf
> > >
> >
> > On Mon, Apr 16, 2018 at 1:14 PM, Joseph Lynch 
> > wrote:
> >
> > > Josh Snyder and I have been working on evaluating virtual nodes for
> large
> > > scale deployments and while it seems like there is a lot of anecdotal
> > > support for reducing the vnode count [1], we couldn't find any concrete
> > > math on the topic, so we had some fun and took a whack at quantifying
> how
> > > different choices of num_tokens impact a Cassandra cluster.
> > >
> > > According to the model we developed [2] it seems that at small cluster
> > > sizes there isn't much of a negative impact on availability, but when
> > > clusters scale up to hundreds of hosts, vnodes have a major impact on
> > > availability. In particular, the probability of outage during short
> > > failures (e.g. process restarts or failures) or permanent failure (e.g.
> > > disk or machine failure) appears to be orders of magnitude higher for
> > large
> > > clusters.
> > >
> > > The model attempts to explain why we may care about this and advances a
> > > few existing/new ideas for how to fix the scalability problems that
> > vnodes
> > > fix without the availability (and consistency—due to the effects on
> > repair)
> > > problems high num_tokens create. We would of course be very interested
> in
> > > any feedback. The model source code is on github [3], PRs are welcome
> or
> > > feel free to play around with the jupyter notebook to match your
> > > environment and see what the graphs look like. I didn't attach the pdf
> > here
> > > because it's too large apparently (lots of pretty graphs).
> > >
> > > I know that users can always just pick whichever number they prefer,
> but
> > I
> > > think the current default was 

Re: Quantifying Virtual Node Impact on Cassandra Availability

2018-04-17 Thread kurt greaves
Great write-up. Glad someone finally did the math for us. I don't think
this will come as a surprise to many of the developers. Availability is
only one issue raised by vnodes; load distribution and performance are also
pretty big concerns.

I'm always a proponent of fixing vnodes, and of removing them as a default
until we do. Happy to help on this, and we have ideas in mind that at some
point I'll create tickets for...

On Tue., 17 Apr. 2018, 06:16 Joseph Lynch,  wrote:

> If the blob link on github doesn't work for the pdf (looks like mobile
> might not like it), try:
>
>
> https://github.com/jolynch/python_performance_toolkit/raw/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf
>
> -Joey
> <
> https://github.com/jolynch/python_performance_toolkit/raw/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf
> >
>
> On Mon, Apr 16, 2018 at 1:14 PM, Joseph Lynch 
> wrote:
>
> > Josh Snyder and I have been working on evaluating virtual nodes for large
> > scale deployments and while it seems like there is a lot of anecdotal
> > support for reducing the vnode count [1], we couldn't find any concrete
> > math on the topic, so we had some fun and took a whack at quantifying how
> > different choices of num_tokens impact a Cassandra cluster.
> >
> > According to the model we developed [2] it seems that at small cluster
> > sizes there isn't much of a negative impact on availability, but when
> > clusters scale up to hundreds of hosts, vnodes have a major impact on
> > availability. In particular, the probability of outage during short
> > failures (e.g. process restarts or failures) or permanent failure (e.g.
> > disk or machine failure) appears to be orders of magnitude higher for
> large
> > clusters.
> >
> > The model attempts to explain why we may care about this and advances a
> > few existing/new ideas for how to fix the scalability problems that
> vnodes
> > fix without the availability (and consistency—due to the effects on
> repair)
> > problems high num_tokens create. We would of course be very interested in
> > any feedback. The model source code is on github [3], PRs are welcome or
> > feel free to play around with the jupyter notebook to match your
> > environment and see what the graphs look like. I didn't attach the pdf
> here
> > because it's too large apparently (lots of pretty graphs).
> >
> > I know that users can always just pick whichever number they prefer, but
> I
> > think the current default was chosen when token placement was random,
> and I
> > wonder whether it's still the right default.
> >
> > Thank you,
> > -Joey Lynch
> >
> > [1] https://issues.apache.org/jira/browse/CASSANDRA-13701
> > [2] https://github.com/jolynch/python_performance_toolkit/
> > raw/master/notebooks/cassandra_availability/whitepaper/cassandra-
> > availability-virtual.pdf
> >
> > <
> https://github.com/jolynch/python_performance_toolkit/blob/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf
> >
> > [3] https://github.com/jolynch/python_performance_toolkit/tree/m
> > aster/notebooks/cassandra_availability
> >
>


Re: Quantifying Virtual Node Impact on Cassandra Availability

2018-04-16 Thread Joseph Lynch
If the blob link on github doesn't work for the pdf (looks like mobile
might not like it), try:

https://github.com/jolynch/python_performance_toolkit/raw/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf

-Joey


On Mon, Apr 16, 2018 at 1:14 PM, Joseph Lynch  wrote:

> Josh Snyder and I have been working on evaluating virtual nodes for large
> scale deployments and while it seems like there is a lot of anecdotal
> support for reducing the vnode count [1], we couldn't find any concrete
> math on the topic, so we had some fun and took a whack at quantifying how
> different choices of num_tokens impact a Cassandra cluster.
>
> According to the model we developed [2] it seems that at small cluster
> sizes there isn't much of a negative impact on availability, but when
> clusters scale up to hundreds of hosts, vnodes have a major impact on
> availability. In particular, the probability of outage during short
> failures (e.g. process restarts or failures) or permanent failure (e.g.
> disk or machine failure) appears to be orders of magnitude higher for large
> clusters.
>
> The model attempts to explain why we may care about this and advances a
> few existing/new ideas for how to fix the scalability problems that vnodes
> fix without the availability (and consistency—due to the effects on repair)
> problems high num_tokens create. We would of course be very interested in
> any feedback. The model source code is on github [3], PRs are welcome or
> feel free to play around with the jupyter notebook to match your
> environment and see what the graphs look like. I didn't attach the pdf here
> because it's too large apparently (lots of pretty graphs).
>
> I know that users can always just pick whichever number they prefer, but I
> think the current default was chosen when token placement was random, and I
> wonder whether it's still the right default.
>
> Thank you,
> -Joey Lynch
>
> [1] https://issues.apache.org/jira/browse/CASSANDRA-13701
> [2] https://github.com/jolynch/python_performance_toolkit/
> raw/master/notebooks/cassandra_availability/whitepaper/cassandra-
> availability-virtual.pdf
>
> 
> [3] https://github.com/jolynch/python_performance_toolkit/tree/m
> aster/notebooks/cassandra_availability
>


Quantifying Virtual Node Impact on Cassandra Availability

2018-04-16 Thread Joseph Lynch
Josh Snyder and I have been working on evaluating virtual nodes for
large-scale deployments, and while it seems like there is a lot of anecdotal
support for reducing the vnode count [1], we couldn't find any concrete math
on the topic, so we had some fun and took a whack at quantifying how
different choices of num_tokens impact a Cassandra cluster.

According to the model we developed [2] it seems that at small cluster
sizes there isn't much of a negative impact on availability, but when
clusters scale up to hundreds of hosts, vnodes have a major impact on
availability. In particular, the probability of outage during short
failures (e.g. process restarts or failures) or permanent failure (e.g.
disk or machine failure) appears to be orders of magnitude higher for large
clusters.
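
For readers who want a feel for the effect without opening the notebook,
here is a toy Monte Carlo in the same spirit (heavily simplified: no racks,
exactly two simultaneous failures, made-up cluster size; the notebook [3] is
the real model):

    import random

    def p_quorum_loss(n, num_tokens, rf=3, trials=1000):
        """Probability that 2 simultaneously-down nodes break QUORUM (rf=3)
        for at least one range, with num_tokens random tokens per node."""
        losses = 0
        for _ in range(trials):
            ring = sorted((random.random(), node)
                          for node in range(n) for _ in range(num_tokens))
            down = set(random.sample(range(n), 2))
            for i in range(len(ring)):
                # Replicas of the range ending at ring[i]: the next rf
                # distinct nodes walking clockwise from that token.
                replicas, j = [], i
                while len(replicas) < rf:
                    node = ring[j % len(ring)][1]
                    if node not in replicas:
                        replicas.append(node)
                    j += 1
                if len(down & set(replicas)) >= 2:   # 2 of 3 replicas down
                    losses += 1
                    break
        return losses / trials

    print(p_quorum_loss(60, num_tokens=1))    # small: dead nodes must be close neighbours
    print(p_quorum_loss(60, num_tokens=16))   # much higher: most pairs share some range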

The model attempts to explain why we may care about this and advances a few
existing/new ideas for how to get the scalability benefits that vnodes
provide without the availability (and consistency, due to the effects on
repair) problems that high num_tokens creates. We would of course be very
interested in any feedback. The model source code is on GitHub [3]; PRs are
welcome, or feel free to play around with the Jupyter notebook to match your
environment and see what the graphs look like. I didn't attach the pdf here
because it's apparently too large (lots of pretty graphs).

I know that users can always just pick whichever number they prefer, but I
think the current default was chosen when token placement was random, and I
wonder whether it's still the right default.

Thank you,
-Joey Lynch

[1] https://issues.apache.org/jira/browse/CASSANDRA-13701
[2] https://github.com/jolynch/python_performance_toolkit/raw/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf
[3] https://github.com/jolynch/python_performance_toolkit/tree/master/notebooks/cassandra_availability