Re: [DISCUSS] Client protocol changes (Was: 20200217 4.0 Status Update)

2020-02-18 Thread Benedict Elliott Smith
Behaviours don't have to be switched only with a new protocol version; it's 
possible to support optional feature/modifier flags, the support for which is 
negotiated with a client on connection.

A protocol version change seems reasonable to limit to major releases, but a 
protocol feature seems perfectly reasonable to introduce in a minor, I think?  
Ideally a version change would only be necessary for forced 
deprecation/standardisation of features, behaviour and stream encodings.
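
As a rough illustration of how that negotiation could work (this is a sketch, not actual Cassandra code; the flag names and method shapes are invented for the example): the server advertises the optional features it understands, the client asks for the subset it wants, and both sides only ever use the intersection, so an unknown flag is simply dropped rather than breaking the connection.

import java.util.HashSet;
import java.util.Set;

final class FeatureNegotiation
{
    // Optional features this server build understands; names are illustrative only.
    static final Set<String> SERVER_FEATURES =
            Set.of("PER_REQUEST_TIMEOUT", "CHECKSUMMED_FRAMES");

    // Given the features a client requests at connection time, return the set
    // both sides agree to use; unknown or unsupported flags are silently dropped.
    static Set<String> negotiate(Set<String> requestedByClient)
    {
        Set<String> agreed = new HashSet<>(requestedByClient);
        agreed.retainAll(SERVER_FEATURES);
        return agreed;
    }
}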


On 18/02/2020, 21:53, "Jeff Jirsa"  wrote:

A few notes:

- Protocol changes add work to the rest of the ecosystem. Drivers have to
update, etc.
- Nobody expects protocol changes in minors, though it's less of a concern
if we don't deprecate out the older version. E.g. if 4.0 launches with
protocol v4 and protocol v5, and then 4.0.2 adds protocol v6, do we
deprecate out v4? If yes, you potentially break clients that only supported
v3 and v4 in a minor version upgrade, which is unexpected. If not, how many
protocol versions are you willing to support at any given time?
- Having protocol changes introduces risk. Paging behavior across protocol
versions has been the site of a number of bugs recently.


On Tue, Feb 18, 2020 at 1:46 PM Tolbert, Andrew  
wrote:

> I don't know the technical answer, but I suspect two motivations for
> doing new protocol versions in major releases could include:
>
> * protocol changes can be tied to feature changes that typically come
> in a major release.
> * protocol changes should be as infrequent as major releases.  Each
> new protocol version is another thing in the test matrix that needs to
> be tested.
>
> That last point can make it hard to get new changes in. If something
> doesn't make the upcoming protocol version, it might be years before
> another one, but I also think it's worth it to do this infrequently as
> it makes maintaining client and server code easier if there are fewer
> protocol versions to worry about.
>
> On the client-side, libraries themselves should be avoiding making
> Cassandra version checks when detecting capabilities.  There are a few
> exceptions, such as system table parsing for schema & peers,
> but those aren't related to the protocol.
>
> Thanks,
> Andy
>
>
>
>
>
> On Tue, Feb 18, 2020 at 1:22 PM Nate McCall  wrote:
> >
> > [Moving to new message thread]
> >
> > Thanks for bringing this up, Jordan.
> >
> > IIRC, this was more a convention than a technical reason. Though I could
> be
> > completely misremembering this.
> >
> > -- Forwarded message -
> > From: Jordan West 
> > Date: Wed, Feb 19, 2020 at 10:13 AM
> > Subject: Re: 20200217 4.0 Status Update
> > To: 
> >
> >
> > On Mon, Feb 17, 2020 at 12:52 PM Jeff Jirsa  wrote:
> >
> > >
> > > beyond the client proto change being painful for anything other than
> major
> > > releases
> > >
> > >
> > This came up during the community meeting today and I wanted to bring a
> > question about it to the list: could someone who is *very* familiar with
> > the client proto share w/ the list why changing the proto in anything
> other
> > than a major release is so difficult? I hear this a lot and it seems to
> be
> > fact. So that all of us don't have to go read the code, a brief summary
> > would be super helpful. Or if there is a ticket that already covers this
> > even better! I'd also be curious if there have ever been any thoughts to
> > address it as it seems to be a consistent hurdle during the release cycle
> > and one that tends to further increase scope.
> >
> > Thanks,
> > Jordan
> >
> > >
> > >
> > > > On Feb 17, 2020, at 12:43 PM, Jon Meredith 
> > > wrote:
> > > >
> > > > My turn to give an update on 4.0 status. The 4.0 board created by
> Josh
> > > can
> > > > be found at
> > > >
> > > >
> > > > https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355.
> > > >
> > > >
> > > > We have 94 unresolved tickets marked against the 4.0 release. [1]
> > > >
> > > >
> > > > Things seem to have settled into a phase of working to resolve
> issues,
> > > with
> > > > few new issues added.
> > > >
> > > >
> > > > 2 new tickets opened (that are marked against 4.0)
> > > >
> > > > 11 tickets closed (including one of the newly opened ones)
> > > >
> > > > 39 tickets received updates to JIRA of some kind in the last week
> > > >
> > > >
> > > > Cumulative flow over the last couple of weeks shows todo reducing and
> > > done
> > > > increasing as it should as we continue to close out work for the
> > release.
> > > >
> > > 

Re: [DISCUSS] Client protocol changes (Was: 20200217 4.0 Status Update)

2020-02-18 Thread David Capwell
Given the JIRA in question, if you want to override the timeout to lower
it, then the worst case, if it's not supported yet, is that you get the default
timeout.  So this makes me wonder "is there a way to add metadata to a
message which is ignored if unknown" (aka forward compatibility).  Skimming
the frame code I see we have:

boolean isCustomPayload =
frame.header.flags.contains(Frame.Header.Flag.CUSTOM_PAYLOAD);
boolean hasWarning = frame.header.flags.contains(Frame.Header.Flag.WARNING);

UUID tracingId = isRequest || !isTracing ? null : CBUtil.readUUID(frame.body);
List<String> warnings = isRequest || !hasWarning ? null :
CBUtil.readStringList(frame.body);
Map<String, ByteBuffer> customPayload = !isCustomPayload ? null :
CBUtil.readBytesMap(frame.body);

This makes me wonder if we could piggyback off of that for new features, that
way older servers just ignore them. I have no idea of the negatives of
customPayload (other than strings are more bytes per message, evolution may
have to be based off key names which is annoying, etc.), but tags which are
ignored sound promising.
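
For illustration, a client-side sketch of the idea (the "request-timeout-ms" key is hypothetical and the code below is plain Java rather than any particular driver's API): the entry rides along in the custom payload of a request, and a server that doesn't recognise the key just ignores it, which is exactly the forward-compatible behaviour described above.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class TimeoutPayloadExample
{
    // Build a custom payload carrying a per-request timeout hint.
    // The key name is made up for this example; an older server that doesn't
    // know the key simply ignores the entry.
    static Map<String, ByteBuffer> timeoutPayload(long timeoutMillis)
    {
        Map<String, ByteBuffer> payload = new HashMap<>();
        payload.put("request-timeout-ms",
                    ByteBuffer.wrap(Long.toString(timeoutMillis)
                                        .getBytes(StandardCharsets.UTF_8)));
        return payload;
    }
}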


On Tue, Feb 18, 2020 at 1:53 PM Jeff Jirsa  wrote:

> A few notes:
>
> - Protocol changes add work to the rest of the ecosystem. Drivers have to
> update, etc.
> - Nobody expects protocol changes in minors, though it's less of a concern
> if we don't deprecate out the older version. E.g. if 4.0 launches with
> protocol v4 and protocol v5, and then 4.0.2 adds protocol v6, do we
> deprecate out v4? If yes, you potentially break clients that only supported
> v3 and v4 in a minor version upgrade, which is unexpected. If not, how many
> protocol versions are you willing to support at any given time?
> - Having protocol changes introduces risk. Paging behavior across protocol
> versions has been the site of a number of bugs recently.
>
>
> On Tue, Feb 18, 2020 at 1:46 PM Tolbert, Andrew 
> wrote:
>
> > I don't know the technical answer, but I suspect two motivations for
> > doing new protocol versions in major releases could include:
> >
> > * protocol changes can be tied to feature changes that typically come
> > in a major release.
> > * protocol changes should be as infrequent as major releases.  Each
> > new protocol version is another thing in the test matrix that needs to
> > be tested.
> >
> > That last point can make it hard to get new changes in. If something
> > doesn't make the upcoming protocol version, it might be years before
> > another one, but I also think it's worth it to do this infrequently as
> > it makes maintaining client and server code easier if there are fewer
> > protocol versions to worry about.
> >
> > On the client-side, libraries themselves should be avoiding making
> > Cassandra version checks when detecting capabilities.  There are a few
> > exceptions, such as system table parsing for schema & peers,
> > but those aren't related to the protocol.
> >
> > Thanks,
> > Andy
> >
> >
> >
> >
> >
> > On Tue, Feb 18, 2020 at 1:22 PM Nate McCall  wrote:
> > >
> > > [Moving to new message thread]
> > >
> > > Thanks for bringing this up, Jordan.
> > >
> > > IIRC, this was more a convention than a technical reason. Though I
> could
> > be
> > > completely misremembering this.
> > >
> > > -- Forwarded message -
> > > From: Jordan West 
> > > Date: Wed, Feb 19, 2020 at 10:13 AM
> > > Subject: Re: 20200217 4.0 Status Update
> > > To: 
> > >
> > >
> > > On Mon, Feb 17, 2020 at 12:52 PM Jeff Jirsa  wrote:
> > >
> > > >
> > > > beyond the client proto change being painful for anything other than
> > major
> > > > releases
> > > >
> > > >
> > > This came up during the community meeting today and I wanted to bring a
> > > question about it to the list: could someone who is *very* familiar
> with
> > > the client proto share w/ the list why changing the proto in anything
> > other
> > > than a major release is so difficult? I hear this a lot and it seems to
> > be
> > > fact. So that all of us don't have to go read the code, a brief summary
> > > would be super helpful. Or if there is a ticket that already covers
> this
> > > even better! I'd also be curious if there have ever been any thoughts
> to
> > > address it as it seems to be a consistent hurdle during the release
> cycle
> > > and one that tends to further increase scope.
> > >
> > > Thanks,
> > > Jordan
> > >
> > > >
> > > >
> > > > > On Feb 17, 2020, at 12:43 PM, Jon Meredith 
> > > > wrote:
> > > > >
> > > > > My turn to give an update on 4.0 status. The 4.0 board created by
> > Josh
> > > > can
> > > > > be found at
> > > > >
> > > > >
> > > > >
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355.
> > > > >
> > > > >
> > > > > We have 94 unresolved tickets marked against the 4.0 release. [1]
> > > > >
> > > > >
> > > > > Things seem to have settled into a phase of working to resolve
> > issues,
> > > > with
> > > > > few new issues added.
> > > > >
> > > > >
> > > > > 2 new tickets opened (that are marked against 4.0)
> > > > >
> > > > > 11 tickets closed (including 

Re: [DISCUSS] Client protocol changes (Was: 20200217 4.0 Status Update)

2020-02-18 Thread Jeff Jirsa
A few notes:

- Protocol changes add work to the rest of the ecosystem. Drivers have to
update, etc.
- Nobody expects protocol changes in minors, though it's less of a concern
if we don't deprecate out the older version. E.g. if 4.0 launches with
protocol v4 and protocol v5, and then 4.0.2 adds protocol v6, do we
deprecate out v4? If yes, you potentially break clients that only supported
v3 and v4 in a minor version upgrade, which is unexpected. If not, how many
protocol versions are you willing to support at any given time?
- Having protocol changes introduces risk. Paging behavior across protocol
versions has been the site of a number of bugs recently.


On Tue, Feb 18, 2020 at 1:46 PM Tolbert, Andrew  wrote:

> I don't know the technical answer, but I suspect two motivations for
> doing new protocol versions in major releases could include:
>
> * protocol changes can be tied to feature changes that typically come
> in a major release.
> * protocol changes should be as infrequent as major releases.  Each
> new protocol version is another thing in the test matrix that needs to
> be tested.
>
> That last point can make it hard to get new changes in. If something
> doesn't make the upcoming protocol version, it might be years before
> another one, but I also think it's worth it to do this infrequently as
> it makes maintaining client and server code easier if there are fewer
> protocol versions to worry about.
>
> On the client-side, libraries themselves should be avoiding making
> Cassandra version checks when detecting capabilities.  There are a few
> exceptions, such as system table parsing for schema & peers,
> but those aren't related to the protocol.
>
> Thanks,
> Andy
>
>
>
>
>
> On Tue, Feb 18, 2020 at 1:22 PM Nate McCall  wrote:
> >
> > [Moving to new message thread]
> >
> > Thanks for bringing this up, Jordan.
> >
> > IIRC, this was more a convention than a technical reason. Though I could
> be
> > completely misremembering this.
> >
> > -- Forwarded message -
> > From: Jordan West 
> > Date: Wed, Feb 19, 2020 at 10:13 AM
> > Subject: Re: 20200217 4.0 Status Update
> > To: 
> >
> >
> > On Mon, Feb 17, 2020 at 12:52 PM Jeff Jirsa  wrote:
> >
> > >
> > > beyond the client proto change being painful for anything other than
> major
> > > releases
> > >
> > >
> > This came up during the community meeting today and I wanted to bring a
> > question about it to the list: could someone who is *very* familiar with
> > the client proto share w/ the list why changing the proto in anything
> other
> > than a major release is so difficult? I hear this a lot and it seems to
> be
> > fact. So that all of us don't have to go read the code, a brief summary
> > would be super helpful. Or if there is a ticket that already covers this
> > even better! I'd also be curious if there have ever been any thoughts to
> > address it as it seems to be a consistent hurdle during the release cycle
> > and one that tends to further increase scope.
> >
> > Thanks,
> > Jordan
> >
> > >
> > >
> > > > On Feb 17, 2020, at 12:43 PM, Jon Meredith 
> > > wrote:
> > > >
> > > > My turn to give an update on 4.0 status. The 4.0 board created by
> Josh
> > > can
> > > > be found at
> > > >
> > > >
> > > > https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355.
> > > >
> > > >
> > > > We have 94 unresolved tickets marked against the 4.0 release. [1]
> > > >
> > > >
> > > > Things seem to have settled into a phase of working to resolve
> issues,
> > > with
> > > > few new issues added.
> > > >
> > > >
> > > > 2 new tickets opened (that are marked against 4.0)
> > > >
> > > > 11 tickets closed (including one of the newly opened ones)
> > > >
> > > > 39 tickets received updates to JIRA of some kind in the last week
> > > >
> > > >
> > > > Cumulative flow over the last couple of weeks shows todo reducing and
> > > done
> > > > increasing as it should as we continue to close out work for the
> > release.
> > > >
> > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355=CASSANDRA=reporting=cumulativeFlowDiagram=939=936=931=1505=1506=1514=1509=1512=1507=14
> > > >
> > > >
> > > > Notables
> > > >
> > > > - Python 3 support for cqlsh has been committed (thank you all who
> > > > persevered on this)
> > > >
> > > > - Some activity on Windows support - perhaps not dead yet.
> > > >
> > > > - Lots of movement on documentation
> > > >
> > > > - Lots of activity on flaky tests.
> > > >
> > > > - Oldest ticket with a patch award goes to CASSANDRA-2848
> > > >
> > > >
> > > > There are 18 tickets marked as patch available (easy access from the
> > > > Dashboard [2], apologies if they're already picked up for review)
> > > >
> > > >
> > > > CASSANDRA-15567 Allow EXTRA_CLASSPATH to work in tarball/source
> > > > installations
> > > >
> > > > CASSANDRA-15553 Preview repair should include sstables from finalized
> > > > incremental repair sessions
> > > >
> > > > CASSANDRA-15550 Fix flaky 

Re: [DISCUSS] Client protocol changes (Was: 20200217 4.0 Status Update)

2020-02-18 Thread Tolbert, Andrew
I don't know the technical answer, but I suspect two motivations for
doing new protocol versions in major releases could include:

* protocol changes can be tied to feature changes that typically come
in a major release.
* protocol changes should be as infrequent as major releases.  Each
new protocol version is another thing in the test matrix that needs to
be tested.

That last point can make it hard to get new changes in. If something
doesn't make the upcoming protocol version, it might be years before
another one, but I also think it's worth it to do this infrequently as
it makes maintaining client and server code easier if there are fewer
protocol versions to worry about.

On the client-side, libraries themselves should be avoiding making
Cassandra version checks when detecting capabilities.  There are a few
exceptions, such as system table parsing for schema & peers,
but those aren't related to the protocol.
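
To make that concrete (a rough sketch only; the feature-to-version mapping below is an example, not any specific driver's code), a capability check should key off the negotiated protocol version rather than a parsed release_version string:

// Illustrative only: gate driver features on the negotiated protocol version,
// not on the server's Cassandra release version.
enum ProtocolVersion
{
    V3(3), V4(4), V5(5);

    final int code;
    ProtocolVersion(int code) { this.code = code; }

    // Example capability: protocol v5 allows specifying the keyspace per query.
    boolean supportsPerQueryKeyspace() { return code >= 5; }
}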

Thanks,
Andy





On Tue, Feb 18, 2020 at 1:22 PM Nate McCall  wrote:
>
> [Moving to new message thread]
>
> Thanks for bringing this up, Jordan.
>
> IIRC, this was more a convention than a technical reason. Though I could be
> completely misremembering this.
>
> -- Forwarded message -
> From: Jordan West 
> Date: Wed, Feb 19, 2020 at 10:13 AM
> Subject: Re: 20200217 4.0 Status Update
> To: 
>
>
> On Mon, Feb 17, 2020 at 12:52 PM Jeff Jirsa  wrote:
>
> >
> > beyond the client proto change being painful for anything other than major
> > releases
> >
> >
> This came up during the community meeting today and I wanted to bring a
> question about it to the list: could someone who is *very* familiar with
> the client proto share w/ the list why changing the proto in anything other
> than a major release is so difficult? I hear this a lot and it seems to be
> fact. So that all of us don't have to go read the code, a brief summary
> would be super helpful. Or if there is a ticket that already covers this
> even better! I'd also be curious if there have ever been any thoughts to
> address it as it seems to be a consistent hurdle during the release cycle
> and one that tends to further increase scope.
>
> Thanks,
> Jordan
>
> >
> >
> > > On Feb 17, 2020, at 12:43 PM, Jon Meredith 
> > wrote:
> > >
> > > My turn to give an update on 4.0 status. The 4.0 board created by Josh
> > can
> > > be found at
> > >
> > >
> > > https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355.
> > >
> > >
> > > We have 94 unresolved tickets marked against the 4.0 release. [1]
> > >
> > >
> > > Things seem to have settled into a phase of working to resolve issues,
> > with
> > > few new issues added.
> > >
> > >
> > > 2 new tickets opened (that are marked against 4.0)
> > >
> > > 11 tickets closed (including one of the newly opened ones)
> > >
> > > 39 tickets received updates to JIRA of some kind in the last week
> > >
> > >
> > > Cumulative flow over the last couple of weeks shows todo reducing and
> > done
> > > increasing as it should as we continue to close out work for the
> release.
> > >
> > >
> > >
> >
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355=CASSANDRA=reporting=cumulativeFlowDiagram=939=936=931=1505=1506=1514=1509=1512=1507=14
> > >
> > >
> > > Notables
> > >
> > > - Python 3 support for cqlsh has been committed (thank you all who
> > > persevered on this)
> > >
> > > - Some activity on Windows support - perhaps not dead yet.
> > >
> > > - Lots of movement on documentation
> > >
> > > - Lots of activity on flaky tests.
> > >
> > > - Oldest ticket with a patch award goes to CASSANDRA-2848
> > >
> > >
> > > There are 18 tickets marked as patch available (easy access from the
> > > Dashboard [2], apologies if they're already picked up for review)
> > >
> > >
> > > CASSANDRA-15567 Allow EXTRA_CLASSPATH to work in tarball/source
> > > installations
> > >
> > > CASSANDRA-15553 Preview repair should include sstables from finalized
> > > incremental repair sessions
> > >
> > > CASSANDRA-15550 Fix flaky test
> > > org.apache.cassandra.streaming.StreamTransferTaskTest
> > > testFailSessionDuringTransferShouldNotReleaseReferences
> > >
> > > CASSANDRA-15488/CASSANDRA-15353 Configuration file
> > >
> > > CASSANDRA-15484/CASSANDRA-15353 Read Repair
> > >
> > > CASSANDRA-15482/CASSANDRA-15353 Guarantees
> > >
> > > CASSANDRA-15481/CASSANDRA-15353 Data Modeling
> > >
> > > CASSANDRA-15393/CASSANDRA-15387 Add byte array backed cells
> > >
> > > CASSANDRA-15391/CASSANDRA-15387 Reduce heap footprint of commonly
> > allocated
> > > objects
> > >
> > > CASSANDRA-15367 Memtable memory allocations may deadlock
> > >
> > > CASSANDRA-15308 Fix flakey testAcquireReleaseOutbound -
> > > org.apache.cassandra.net.ConnectionTest
> > >
> > > CASSANDRA-15305 Fix multi DC nodetool status output
> > >
> > > CASSANDRA-14973 Bring v5 driver out of beta, introduce v6 before 4.0
> > > release is cut
> > >
> > > CASSANDRA-14939 fix some operational holes in incremental repair
> > >
> > > 

[DISCUSS] Client protocol changes (Was: 20200217 4.0 Status Update)

2020-02-18 Thread Nate McCall
[Moving to new message thread]

Thanks for bringing this up, Jordan.

IIRC, this was more a convention than a technical reason. Though I could be
completely misremembering this.

-- Forwarded message -
From: Jordan West 
Date: Wed, Feb 19, 2020 at 10:13 AM
Subject: Re: 20200217 4.0 Status Update
To: 


On Mon, Feb 17, 2020 at 12:52 PM Jeff Jirsa  wrote:

>
> beyond the client proto change being painful for anything other than major
> releases
>
>
This came up during the community meeting today and I wanted to bring a
question about it to the list: could someone who is *very* familiar with
the client proto share w/ the list why changing the proto in anything other
than a major release is so difficult? I hear this a lot and it seems to be
fact. So that all of us don't have to go read the code, a brief summary
would be super helpful. Or if there is a ticket that already covers this
even better! I'd also be curious if there have ever been any thoughts to
address it as it seems to be a consistent hurdle during the release cycle
and one that tends to further increase scope.

Thanks,
Jordan

>
>
> > On Feb 17, 2020, at 12:43 PM, Jon Meredith 
> wrote:
> >
> > My turn to give an update on 4.0 status. The 4.0 board created by Josh
> can
> > be found at
> >
> >
> > https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355.
> >
> >
> > We have 94 unresolved tickets marked against the 4.0 release. [1]
> >
> >
> > Things seem to have settled into a phase of working to resolve issues,
> with
> > few new issues added.
> >
> >
> > 2 new tickets opened (that are marked against 4.0)
> >
> > 11 tickets closed (including one of the newly opened ones)
> >
> > 39 tickets received updates to JIRA of some kind in the last week
> >
> >
> > Cumulative flow over the last couple of weeks shows todo reducing and
> done
> > increasing as it should as we continue to close out work for the release.
> >
> >
> >
>
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355=CASSANDRA=reporting=cumulativeFlowDiagram=939=936=931=1505=1506=1514=1509=1512=1507=14
> >
> >
> > Notables
> >
> > - Python 3 support for cqlsh has been committed (thank you all who
> > persevered on this)
> >
> > - Some activity on Windows support - perhaps not dead yet.
> >
> > - Lots of movement on documentation
> >
> > - Lots of activity on flaky tests.
> >
> > - Oldest ticket with a patch award goes to CASSANDRA-2848
> >
> >
> > There are 18 tickets marked as patch available (easy access from the
> > Dashboard [2], apologies if they're already picked up for review)
> >
> >
> > CASSANDRA-15567 Allow EXTRA_CLASSPATH to work in tarball/source
> > installations
> >
> > CASSANDRA-15553 Preview repair should include sstables from finalized
> > incremental repair sessions
> >
> > CASSANDRA-15550 Fix flaky test
> > org.apache.cassandra.streaming.StreamTransferTaskTest
> > testFailSessionDuringTransferShouldNotReleaseReferences
> >
> > CASSANDRA-15488/CASSANDRA-15353 Configuration file
> >
> > CASSANDRA-15484/CASSANDRA-15353 Read Repair
> >
> > CASSANDRA-15482/CASSANDRA-15353 Guarantees
> >
> > CASSANDRA-15481/CASSANDRA-15353 Data Modeling
> >
> > CASSANDRA-15393/CASSANDRA-15387 Add byte array backed cells
> >
> > CASSANDRA-15391/CASSANDRA-15387 Reduce heap footprint of commonly
> allocated
> > objects
> >
> > CASSANDRA-15367 Memtable memory allocations may deadlock
> >
> > CASSANDRA-15308 Fix flakey testAcquireReleaseOutbound -
> > org.apache.cassandra.net.ConnectionTest
> >
> > CASSANDRA-15305 Fix multi DC nodetool status output
> >
> > CASSANDRA-14973 Bring v5 driver out of beta, introduce v6 before 4.0
> > release is cut
> >
> > CASSANDRA-14939 fix some operational holes in incremental repair
> >
> > CASSANDRA-14904 SSTableloader doesn't understand listening for CQL
> > connections on multiple ports
> >
> > CASSANDRA-14842 SSL connection problems when upgrading to 4.0 when
> > upgrading from 3.0.x
> >
> > CASSANDRA-14761 Rename speculative_retry to match additional_write_policy
> >
> > CASSANDRA-2848 Make the Client API support passing down timeouts
> >
> >
> > *LHF / Failing Tests*: We have 7 unassigned test failures that are all
> >
> > great candidates to pick up and get involved in:
> >
> >
>
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355=CASSANDRA=1660=1661=1658
> >
> >
> > Thanks again to everybody for all the contributions. It's really good to
> > see the open issue count start dropping.
> >
> >
> > Feedback on whether this information is useful and how it can be improved
> > is both welcome and appreciated.
> >
> >
> > Cheers, Jon
> >
> >
> > [1] Unresolved 4.0 tickets
> >
>
https://issues.apache.org/jira/browse/CASSANDRA-15567?filter=12347782=project%20%3D%20cassandra%20AND%20fixversion%20in%20(4.0%2C%204.0.0%2C%204.0-alpha%2C%204.0-beta)%20AND%20status%20!%3D%20Resolved
> >
> > [2] Patch Available
> >
> https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12334910
>
> 

Re: 20200217 4.0 Status Update

2020-02-18 Thread Nate McCall
Moving to a new thread.

On Wed, Feb 19, 2020 at 10:13 AM Jordan West  wrote:

> On Mon, Feb 17, 2020 at 12:52 PM Jeff Jirsa  wrote:
>
> >
> > beyond the client proto change being painful for anything other than
> major
> > releases
> >
> >
> This came up during the community meeting today and I wanted to bring a
> question about it to the list: could someone who is *very* familiar with
> the client proto share w/ the list why changing the proto in anything other
> than a major release is so difficult? I hear this a lot and it seems to be
> fact. So that all of us don't have to go read the code, a brief summary
> would be super helpful. Or if there is a ticket that already covers this
> even better! I'd also be curious if there have ever been any thoughts to
> address it as it seems to be a consistent hurdle during the release cycle
> and one that tends to further increase scope.
>
> Thanks,
> Jordan
>
> >
> >
> > > On Feb 17, 2020, at 12:43 PM, Jon Meredith 
> > wrote:
> > >
> > > My turn to give an update on 4.0 status. The 4.0 board created by Josh
> > can
> > > be found at
> > >
> > >
> > > https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355.
> > >
> > >
> > > We have 94 unresolved tickets marked against the 4.0 release. [1]
> > >
> > >
> > > Things seem to have settled into a phase of working to resolve issues,
> > with
> > > few new issues added.
> > >
> > >
> > > 2 new tickets opened (that are marked against 4.0)
> > >
> > > 11 tickets closed (including one of the newly opened ones)
> > >
> > > 39 tickets received updates to JIRA of some kind in the last week
> > >
> > >
> > > Cumulative flow over the last couple of weeks shows todo reducing and
> > done
> > > increasing as it should as we continue to close out work for the
> release.
> > >
> > >
> > >
> >
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355=CASSANDRA=reporting=cumulativeFlowDiagram=939=936=931=1505=1506=1514=1509=1512=1507=14
> > >
> > >
> > > Notables
> > >
> > > - Python 3 support for cqlsh has been committed (thank you all who
> > > persevered on this)
> > >
> > > - Some activity on Windows support - perhaps not dead yet.
> > >
> > > - Lots of movement on documentation
> > >
> > > - Lots of activity on flaky tests.
> > >
> > > - Oldest ticket with a patch award goes to CASSANDRA-2848
> > >
> > >
> > > There are 18 tickets marked as patch available (easy access from the
> > > Dashboard [2], apologies if they're already picked up for review)
> > >
> > >
> > > CASSANDRA-15567 Allow EXTRA_CLASSPATH to work in tarball/source
> > > installations
> > >
> > > CASSANDRA-15553 Preview repair should include sstables from finalized
> > > incremental repair sessions
> > >
> > > CASSANDRA-15550 Fix flaky test
> > > org.apache.cassandra.streaming.StreamTransferTaskTest
> > > testFailSessionDuringTransferShouldNotReleaseReferences
> > >
> > > CASSANDRA-15488/CASSANDRA-15353 Configuration file
> > >
> > > CASSANDRA-15484/CASSANDRA-15353 Read Repair
> > >
> > > CASSANDRA-15482/CASSANDRA-15353 Guarantees
> > >
> > > CASSANDRA-15481/CASSANDRA-15353 Data Modeling
> > >
> > > CASSANDRA-15393/CASSANDRA-15387 Add byte array backed cells
> > >
> > > CASSANDRA-15391/CASSANDRA-15387 Reduce heap footprint of commonly
> > allocated
> > > objects
> > >
> > > CASSANDRA-15367 Memtable memory allocations may deadlock
> > >
> > > CASSANDRA-15308 Fix flakey testAcquireReleaseOutbound -
> > > org.apache.cassandra.net.ConnectionTest
> > >
> > > CASSANDRA-15305 Fix multi DC nodetool status output
> > >
> > > CASSANDRA-14973 Bring v5 driver out of beta, introduce v6 before 4.0
> > > release is cut
> > >
> > > CASSANDRA-14939 fix some operational holes in incremental repair
> > >
> > > CASSANDRA-14904 SSTableloader doesn't understand listening for CQL
> > > connections on multiple ports
> > >
> > > CASSANDRA-14842 SSL connection problems when upgrading to 4.0 when
> > > upgrading from 3.0.x
> > >
> > > CASSANDRA-14761 Rename speculative_retry to match
> additional_write_policy
> > >
> > > CASSANDRA-2848 Make the Client API support passing down timeouts
> > >
> > >
> > > *LHF / Failing Tests*: We have 7 unassigned test failures that are all
> > >
> > > great candidates to pick up and get involved in:
> > >
> > >
> >
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355=CASSANDRA=1660=1661=1658
> > >
> > >
> > > Thanks again to everybody for all the contributions. It's really good
> to
> > > see the open issue count start dropping.
> > >
> > >
> > > Feedback on whether this information is useful and how it can be
> improved
> > > is both welcome and appreciated.
> > >
> > >
> > > Cheers, Jon
> > >
> > >
> > > [1] Unresolved 4.0 tickets
> > >
> >
> https://issues.apache.org/jira/browse/CASSANDRA-15567?filter=12347782=project%20%3D%20cassandra%20AND%20fixversion%20in%20(4.0%2C%204.0.0%2C%204.0-alpha%2C%204.0-beta)%20AND%20status%20!%3D%20Resolved
> > >
> > > [2] Patch Available
> > >
> >
> 

Re: 20200217 4.0 Status Update

2020-02-18 Thread Jordan West
On Mon, Feb 17, 2020 at 12:52 PM Jeff Jirsa  wrote:

>
> beyond the client proto change being painful for anything other than major
> releases
>
>
This came up during the community meeting today and I wanted to bring a
question about it to the list: could someone who is *very* familiar with
the client proto share w/ the list why changing the proto in anything other
than a major release is so difficult? I hear this a lot and it seems to be
fact. So that all of us don't have to go read the code, a brief summary
would be super helpful. Or if there is a ticket that already covers this
even better! I'd also be curious if there have ever been any thoughts to
address it as it seems to be a consistent hurdle during the release cycle
and one that tends to further increase scope.

Thanks,
Jordan

>
>
> > On Feb 17, 2020, at 12:43 PM, Jon Meredith 
> wrote:
> >
> > My turn to give an update on 4.0 status. The 4.0 board created by Josh
> can
> > be found at
> >
> >
> > https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355.
> >
> >
> > We have 94 unresolved tickets marked against the 4.0 release. [1]
> >
> >
> > Things seem to have settled into a phase of working to resolve issues,
> with
> > few new issues added.
> >
> >
> > 2 new tickets opened (that are marked against 4.0)
> >
> > 11 tickets closed (including one of the newly opened ones)
> >
> > 39 tickets received updates to JIRA of some kind in the last week
> >
> >
> > Cumulative flow over the last couple of weeks shows todo reducing and
> done
> > increasing as it should as we continue to close out work for the release.
> >
> >
> >
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355=CASSANDRA=reporting=cumulativeFlowDiagram=939=936=931=1505=1506=1514=1509=1512=1507=14
> >
> >
> > Notables
> >
> > - Python 3 support for cqlsh has been committed (thank you all who
> > persevered on this)
> >
> > - Some activity on Windows support - perhaps not dead yet.
> >
> > - Lots of movement on documentation
> >
> > - Lots of activity on flaky tests.
> >
> > - Oldest ticket with a patch award goes to CASSANDRA-2848
> >
> >
> > There are 18 tickets marked as patch available (easy access from the
> > Dashboard [2], apologies if they're already picked up for review)
> >
> >
> > CASSANDRA-15567 Allow EXTRA_CLASSPATH to work in tarball/source
> > installations
> >
> > CASSANDRA-15553 Preview repair should include sstables from finalized
> > incremental repair sessions
> >
> > CASSANDRA-15550 Fix flaky test
> > org.apache.cassandra.streaming.StreamTransferTaskTest
> > testFailSessionDuringTransferShouldNotReleaseReferences
> >
> > CASSANDRA-15488/CASSANDRA-15353 Configuration file
> >
> > CASSANDRA-15484/CASSANDRA-15353 Read Repair
> >
> > CASSANDRA-15482/CASSANDRA-15353 Guarantees
> >
> > CASSANDRA-15481/CASSANDRA-15353 Data Modeling
> >
> > CASSANDRA-15393/CASSANDRA-15387 Add byte array backed cells
> >
> > CASSANDRA-15391/CASSANDRA-15387 Reduce heap footprint of commonly
> allocated
> > objects
> >
> > CASSANDRA-15367 Memtable memory allocations may deadlock
> >
> > CASSANDRA-15308 Fix flakey testAcquireReleaseOutbound -
> > org.apache.cassandra.net.ConnectionTest
> >
> > CASSANDRA-15305 Fix multi DC nodetool status output
> >
> > CASSANDRA-14973 Bring v5 driver out of beta, introduce v6 before 4.0
> > release is cut
> >
> > CASSANDRA-14939 fix some operational holes in incremental repair
> >
> > CASSANDRA-14904 SSTableloader doesn't understand listening for CQL
> > connections on multiple ports
> >
> > CASSANDRA-14842 SSL connection problems when upgrading to 4.0 when
> > upgrading from 3.0.x
> >
> > CASSANDRA-14761 Rename speculative_retry to match additional_write_policy
> >
> > CASSANDRA-2848 Make the Client API support passing down timeouts
> >
> >
> > *LHF / Failing Tests*: We have 7 unassigned test failures that are all
> >
> > great candidates to pick up and get involved in:
> >
> >
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355=CASSANDRA=1660=1661=1658
> >
> >
> > Thanks again to everybody for all the contributions. It's really good to
> > see the open issue count start dropping.
> >
> >
> > Feedback on whether this information is useful and how it can be improved
> > is both welcome and appreciated.
> >
> >
> > Cheers, Jon
> >
> >
> > [1] Unresolved 4.0 tickets
> >
> https://issues.apache.org/jira/browse/CASSANDRA-15567?filter=12347782=project%20%3D%20cassandra%20AND%20fixversion%20in%20(4.0%2C%204.0.0%2C%204.0-alpha%2C%204.0-beta)%20AND%20status%20!%3D%20Resolved
> >
> > [2] Patch Available
> >
> https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12334910
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
>
>


Re: Testing out JIRA as replacement for cwiki tracking of 4.0 quality testing

2020-02-18 Thread Joshua McKenzie
I went ahead and imported the rest of the issues from cwiki and set up
assignee = shepherd, reviewers = contributors.

Epic in JIRA 

Query in JIRA of the tickets created:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20CASSANDRA%20and%20%22Epic%20Link%22%20%3D%20CASSANDRA-15536

Note: this'll bloat our #'s next week on status update, but that's probably
for the best as this was "invisible scope" of a sort.

Are there any proponents of the cwiki approach or is there any feedback /
thoughts on the JIRA approach?

Thanks.

~Josh

On Mon, Feb 3, 2020 at 3:39 PM Joshua McKenzie  wrote:

> From the people that have modified this page in the past, what are your
> thoughts? Good for me to pull the rest into JIRA and we redirect from the
> wiki?
> +joey lynch
> +scott andreas
> +sumanth pasupuleti
> +marcus eriksson
> +romain hardouin
>
>
> On Mon, Feb 3, 2020 at 8:57 AM Joshua McKenzie 
> wrote:
>
>> what we really need is
>>> some dedicated PM time going forward. Is that something you think you can
>>> help resource from your side?
>>
>> Not a ton, but I think enough yes.
>>
>> (Also, thanks for all the efforts exploring this either way!!)
>>
>> Happy to help.
>>
>> On Sun, Feb 2, 2020 at 2:46 PM Nate McCall  wrote:
>>
>>> > 
>>> > My .02: I think it'd improve our ability to collaborate and lower
>>> friction
>>> > to testing if we could do so on JIRA instead of the cwiki. *I suspect
>>> *the
>>> > edit access restrictions there plus general UX friction (difficult to
>>> have
>>> > collab discussion, comment chains, links to things, etc) make the
>>> confluent
>>> > wiki a worse tool for this job than JIRA. Plus if we do it in JIRA we
>>> can
>>> > track the outstanding scope in the single board and it's far easier to
>>> > visualize everything in one place so we can all know where attention
>>> and
>>> > resources need to be directed to best move the needle on things.
>>> >
>>> > But that's just my opinion. What does everyone else think? Like the
>>> JIRA
>>> > route? Hate it? No opinion?
>>> >
>>> > If we do decide we want to go the epic / JIRA route, I'd be happy to
>>> > migrate the rest of the information in there for things that haven't
>>> been
>>> > completed yet on the wiki (ticket creation, assignee/reviewer chains,
>>> links
>>> > to epic).
>>> >
>>> > So what does everyone think?
>>> >
>>>
>>> I think this is a good idea. Having the resources available to keep the
>>> various bits twiddled correctly on existing and new issues has always
>>> been
>>> the hard part for us. So regardless of the path, what we really need is
>>> some dedicated PM time going forward. Is that something you think you can
>>> help resource from your side?
>>>
>>> (Also, thanks for all the efforts exploring this either way!!)
>>>
>>


Re: [Discuss] num_tokens default in Cassandra 4.0

2020-02-18 Thread Jeremiah D Jordan
+1 for 8 + algorithm assignment being the default.

Why do we have to assume random assignment?  If someone turns off algorithmic 
assignment they are changing away from the defaults, so they should also adjust 
num_tokens.

-Jeremiah

> On Feb 18, 2020, at 1:44 AM, Mick Semb Wever  wrote:
> 
> -1
> 
> Discussions here and on slack have brought up a number of important
> concerns. I think those concerns need to be summarised here before any
> informal vote.
> 
> It was my understanding that some of those concerns may even be blockers to
> a move to 16. That is, we have to presume the worst-case scenario where all
> tokens get randomly generated.
> 
> Can we ask for some analysis and data against the risks different
> num_tokens choices present. We shouldn't rush into a new default, and such
> background information and data is operator value added. Maybe I missed any
> info/experiments that have happened?
> 
> 
> 
> On Mon., 17 Feb. 2020, 11:14 pm Jeremy Hanna, 
> wrote:
> 
>> I just wanted to close the loop on this if possible.  After some discussion
>> in slack about various topics, I would like to see if people are okay with
>> num_tokens=8 by default (as it's not much different operationally than
>> 16).  Joey brought up a few small changes that I can put on the ticket.  It
>> also requires some documentation for things like decommission order and
>> skew.
>> 
>> Are people okay with this change moving forward like this?  If so, I'll
>> comment on the ticket and we can move forward.
>> 
>> Thanks,
>> 
>> Jeremy
>> 
>> On Tue, Feb 4, 2020 at 8:45 AM Jon Haddad  wrote:
>> 
>>> I think it's a good idea to take a step back and get a high level view of
>>> the problem we're trying to solve.
>>> 
>>> First, high token counts result in decreased availability as each node
>> has
>>> data overlap with with more nodes in the cluster.  Specifically, a node
>> can
>>> share data with RF-1 * 2 * num_tokens.  So a 256 token cluster at RF=3 is
>>> going to almost always share data with every other node in the cluster
>> that
>>> isn't in the same rack, unless you're doing something wild like using
>> more
>>> than a thousand nodes in a cluster.  We advertise
>>> 
>>> With 16 tokens, that is vastly improved, but you still have up to 64
>> nodes
>>> each node needs to query against, so you're again, hitting every node
>>> unless you go above ~96 nodes in the cluster (assuming 3 racks / AZs).  I
>>> wouldn't use 16 here, and I doubt any of you would either.  I've
>> advocated
>>> for 4 tokens because you'd have overlap with only 16 nodes, which works
>>> well for small clusters as well as large.  Assuming I was creating a new
>>> cluster for myself (in a hypothetical brand new application I'm
>> building) I
>>> would put this in production.  I have worked with several teams where I
>>> helped them put 4 token clusters in prod and it has worked very well.  We
>>> didn't see any wild imbalance issues.
>>> 
>>> As Mick's pointed out, our current method of using random token
>> assignment
>>> for the default number of problematic for 4 tokens.  I fully agree with
>>> this, and I think if we were to try to use 4 tokens, we'd want to address
>>> this in tandem.  We can discuss how to better allocate tokens by default
>>> (something more predictable than random), but I'd like to avoid the
>>> specifics of that for the sake of this email.
>>> 
>>> To Alex's point, repairs are problematic with lower token counts due to
>>> over streaming.  I think this is a pretty serious issue and I we'd have
>> to
>>> address it before going all the way down to 4.  This, in my opinion, is a
>>> more complex problem to solve and I think trying to fix it here could
>> make
>>> shipping 4.0 take even longer, something none of us want.
>>> 
>>> For the sake of shipping 4.0 without adding extra overhead and time, I'm
>> ok
>>> with moving to 16 tokens, and in the process adding extensive
>> documentation
>>> outlining what we recommend for production use.  I think we should also
>> try
>>> to figure out something better than random as the default to fix the data
>>> imbalance issues.  I've got a few ideas here I've been noodling on.
>>> 
>>> As long as folks are fine with potentially changing the default again in
>> C*
>>> 5.0 (after another discussion / debate), 16 is enough of an improvement
>>> that I'm OK with the change, and willing to author the docs to help
>> people
>>> set up their first cluster.  For folks that go into production with the
>>> defaults, we're at least not setting them up for total failure once their
>>> clusters get large like we are now.
>>> 
>>> In future versions, we'll probably want to address the issue of data
>>> imbalance by building something in that shifts individual tokens
>> around.  I
>>> don't think we should try to do this in 4.0 either.
>>> 
>>> Jon
>>> 
>>> 
>>> 
>>> On Fri, Jan 31, 2020 at 2:04 PM Jeremy Hanna >> 
>>> wrote:
>>> 
 I think Mick and Anthony make some valid operational and skew points
>> 

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-02-18 Thread Joshua McKenzie
>
> Discussions here and on slack have brought up a number of important
> concerns.

Sounds like we're letting the perfect be the enemy of the good. Is anyone
arguing that 256 is a better default than 16? Or is the fear that going to
16 now would make a default change in, say, 5.0 more painful?


On Tue, Feb 18, 2020 at 3:12 AM Ben Slater 
wrote:

> In case it helps move the decision along, we moved to 16 vnodes as default
> in Nov 2018 and haven't looked back (many clusters from 3-100s of nodes
> later). The testing we did in making that decision is summarised here:
> https://www.instaclustr.com/cassandra-vnodes-how-many-should-i-use/
>
> Cheers
> Ben
>
> ---
>
>
> *Ben Slater*
> *Chief Product Officer*
>
> Read our latest technical blog posts here.
>
> This email has been sent on behalf of Instaclustr Pty. Limited (Australia)
> and Instaclustr Inc (USA).
>
> This email and any attachments may contain confidential and legally
> privileged information.  If you are not the intended recipient, do not copy
> or disclose its content, but please reply to this email immediately and
> highlight the error to the sender and then immediately delete the message.
>
>
> On Tue, 18 Feb 2020 at 18:44, Mick Semb Wever 
> wrote:
>
> > -1
> >
> > Discussions here and on slack have brought up a number of important
> > concerns. I think those concerns need to be summarised here before any
> > informal vote.
> >
> > It was my understanding that some of those concerns may even be blockers
> to
> > a move to 16. That is, we have to presume the worst-case scenario where
> all
> > tokens get randomly generated.
> >
> > Can we ask for some analysis and data against the risks different
> > num_tokens choices present. We shouldn't rush into a new default, and
> such
> > background information and data is operator value added. Maybe I missed
> any
> > info/experiments that have happened?
> >
> >
> >
> > On Mon., 17 Feb. 2020, 11:14 pm Jeremy Hanna, <
> jeremy.hanna1...@gmail.com>
> > wrote:
> >
> > > I just wanted to close the loop on this if possible.  After some
> > discussion
> > > in slack about various topics, I would like to see if people are okay
> > with
> > > num_tokens=8 by default (as it's not much different operationally than
> > > 16).  Joey brought up a few small changes that I can put on the ticket.
> > It
> > > also requires some documentation for things like decommission order and
> > > skew.
> > >
> > > Are people okay with this change moving forward like this?  If so, I'll
> > > comment on the ticket and we can move forward.
> > >
> > > Thanks,
> > >
> > > Jeremy
> > >
> > > On Tue, Feb 4, 2020 at 8:45 AM Jon Haddad  wrote:
> > >
> > > > I think it's a good idea to take a step back and get a high level
> view
> > of
> > > > the problem we're trying to solve.
> > > >
> > > > First, high token counts result in decreased availability as each
> node
> > > has
> > > > data overlap with with more nodes in the cluster.  Specifically, a
> node
> > > can
> > > > share data with RF-1 * 2 * num_tokens.  So a 256 token cluster at
> RF=3
> > is
> > > > going to almost always share data with every other node in the
> cluster
> > > that
> > > > isn't in the same rack, unless you're doing something wild like using
> > > more
> > > > than a thousand nodes in a cluster.  We advertise
> > > >
> > > > With 16 tokens, that is vastly improved, but you still have up to 64
> > > nodes
> > > > each node needs to query against, so you're again, hitting every node
> > > > unless you go above ~96 nodes in the cluster (assuming 3 racks /
> > AZs).  I
> > > > wouldn't use 16 here, and I doubt any of you would either.  I've
> > > advocated
> > > > for 4 tokens because you'd have overlap with only 16 nodes, which
> works
> > > > well for small clusters as well as large.  Assuming I was creating a
> > new
> > > > cluster for myself (in a hypothetical brand new application I'm
> > > building) I
> > > > would put this in production.  I have worked with several teams
> where I
> > > > helped them put 4 token clusters in prod and it has worked very well.
> > We
> > > > didn't see any wild imbalance issues.
> > > >
> > > > As Mick's pointed out, our current method of using random token
> > > assignment
> > > > for the default number of problematic for 4 tokens.  I fully agree
> with
> > > > this, and I think if we were to try to use 4 tokens, we'd want to
> > address
> > > > this in tandem.  We can discuss how to better allocate tokens by
> > default
> > > > (something more predictable than random), but I'd like to avoid the
> > > > specifics of that for the sake of this email.
> > > >
> > > > To Alex's point, repairs are problematic with lower token counts due
> to
> > > > over 

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-02-18 Thread Ben Slater
In case it helps move the decision along, we moved to 16 vnodes as default
in Nov 2018 and haven't looked back (many clusters from 3-100s of nodes
later). The testing we did in making that decision is summarised here:
https://www.instaclustr.com/cassandra-vnodes-how-many-should-i-use/

Cheers
Ben

---


*Ben Slater*
*Chief Product Officer*

Read our latest technical blog posts here.

This email has been sent on behalf of Instaclustr Pty. Limited (Australia)
and Instaclustr Inc (USA).

This email and any attachments may contain confidential and legally
privileged information.  If you are not the intended recipient, do not copy
or disclose its content, but please reply to this email immediately and
highlight the error to the sender and then immediately delete the message.


On Tue, 18 Feb 2020 at 18:44, Mick Semb Wever 
wrote:

> -1
>
> Discussions here and on slack have brought up a number of important
> concerns. I think those concerns need to be summarised here before any
> informal vote.
>
> It was my understanding that some of those concerns may even be blockers to
> a move to 16. That is, we have to presume the worst-case scenario where all
> tokens get randomly generated.
>
> Can we ask for some analysis and data against the risks different
> num_tokens choices present. We shouldn't rush into a new default, and such
> background information and data is operator value added. Maybe I missed any
> info/experiments that have happened?
>
>
>
> On Mon., 17 Feb. 2020, 11:14 pm Jeremy Hanna, 
> wrote:
>
> > I just wanted to close the loop on this if possible.  After some
> discussion
> > in slack about various topics, I would like to see if people are okay
> with
> > num_tokens=8 by default (as it's not much different operationally than
> > 16).  Joey brought up a few small changes that I can put on the ticket.
> It
> > also requires some documentation for things like decommission order and
> > skew.
> >
> > Are people okay with this change moving forward like this?  If so, I'll
> > comment on the ticket and we can move forward.
> >
> > Thanks,
> >
> > Jeremy
> >
> > On Tue, Feb 4, 2020 at 8:45 AM Jon Haddad  wrote:
> >
> > > I think it's a good idea to take a step back and get a high level view
> of
> > > the problem we're trying to solve.
> > >
> > > First, high token counts result in decreased availability as each node
> > has
> > > data overlap with with more nodes in the cluster.  Specifically, a node
> > can
> > > share data with RF-1 * 2 * num_tokens.  So a 256 token cluster at RF=3
> is
> > > going to almost always share data with every other node in the cluster
> > that
> > > isn't in the same rack, unless you're doing something wild like using
> > more
> > > than a thousand nodes in a cluster.  We advertise
> > >
> > > With 16 tokens, that is vastly improved, but you still have up to 64
> > nodes
> > > each node needs to query against, so you're again, hitting every node
> > > unless you go above ~96 nodes in the cluster (assuming 3 racks /
> AZs).  I
> > > wouldn't use 16 here, and I doubt any of you would either.  I've
> > advocated
> > > for 4 tokens because you'd have overlap with only 16 nodes, which works
> > > well for small clusters as well as large.  Assuming I was creating a
> new
> > > cluster for myself (in a hypothetical brand new application I'm
> > building) I
> > > would put this in production.  I have worked with several teams where I
> > > helped them put 4 token clusters in prod and it has worked very well.
> We
> > > didn't see any wild imbalance issues.
> > >
> > > As Mick's pointed out, our current method of using random token
> > assignment
> > > for the default number of problematic for 4 tokens.  I fully agree with
> > > this, and I think if we were to try to use 4 tokens, we'd want to
> address
> > > this in tandem.  We can discuss how to better allocate tokens by
> default
> > > (something more predictable than random), but I'd like to avoid the
> > > specifics of that for the sake of this email.
> > >
> > > To Alex's point, repairs are problematic with lower token counts due to
> > > over streaming.  I think this is a pretty serious issue and I we'd have
> > to
> > > address it before going all the way down to 4.  This, in my opinion,
> is a
> > > more complex problem to solve and I think trying to fix it here could
> > make
> > > shipping 4.0 take even longer, something none of us want.
> > >
> > > For the sake of shipping 4.0 without adding extra overhead and time,
> I'm
> > ok
> > > with moving to 16 tokens, and in the process adding extensive
> > documentation
> > > outlining what we recommend for production use.  I think we should also
> > try
> > > to figure out something better than random as the