Re: Measuring Release Quality

2018-09-22 Thread kurt greaves
Yep agreed with that. Count me in.

On Sun., 23 Sep. 2018, 00:33 Benedict Elliott Smith, 
wrote:

> Thanks Kurt.  I think the goal would be to get JIRA into a state where it
> can hold all the information we want, and for it to be easy to get all the
> information correct when filing.
>
> My feeling is that it would be easiest to do this with a small group, so
> we can make rapid progress on an initial proposal, then bring that to the
> community for final tweaking / approval (or, perhaps, rejection - but I
> hope it won’t be a contentious topic).  I don’t think it should be a huge
> job to come up with a proposal - though we might need to organise a
> community effort to clean up the JIRA history!
>
> It would be great if we could get a few more volunteers from other
> companies/backgrounds to participate.
>
>
> > On 22 Sep 2018, at 11:54, kurt greaves  wrote:
> >
> > I'm interested. Better defining the components and labels we use in our
> > docs would be a good start and LHF. I'd prefer if we kept all the
> > information within JIRA through the use of fields/labels though, and
> > generated reports off those tags. Keeping all the information in one
> place
> > is much better in my experience. Not applicable for CI obviously, but
> > ideally we can generate testing reports directly from the testing
> systems.
> >
> > I don't see this as a huge amount of work so I think the overall risk is
> > pretty small, especially considering it can easily be done in a way that
> > doesn't affect anyone until we get consensus on methodology.
> >
> >
> >
> > On Sat, 22 Sep 2018 at 03:44, Scott Andreas 
> wrote:
> >
> >> Josh, thanks for reading and sharing feedback. Agreed with starting
> simple
> >> and measuring inputs that are high-signal; that’s a good place to begin.
> >>
> >> To the challenge of building consensus, point taken + agreed. Perhaps
> the
> >> distinction is between producing something that’s “useful” vs. something
> >> that’s “authoritative” for decisionmaking purposes. My motivation is to
> >> work toward something “useful” (as measured by the value contributors
> >> find). I’d be happy to start putting some of these together as part of
> an
> >> experiment – and agreed on evaluating “value relative to cost” after we
> see
> >> how things play out.
> >>
> >> To Benedict’s point on JIRA, agreed that plotting a value from messy
> input
> >> wouldn’t produce useful output. Some questions a small working group
> might
> >> take on toward better categorization might look like:
> >>
> >> –––
> >> – Revisiting the list of components: e.g., “Core” captures a lot right
> now.
> >> – Revisiting which fields should be required when filing a ticket – and
> if
> >> there are any that should be removed from the form.
> >> – Reviewing active labels: understanding what people have been trying to
> >> capture, and how they could be organized + documented better.
> >> – Documenting “priority”: (e.g., a common standard we can point to, even
> >> if we’re pretty good now).
> >> – Considering adding a "severity” field to capture the distinction
> between
> >> priority and severity.
> >> –––
> >>
> >> If there’s appetite for spending a little time on this, I’d put effort
> >> toward it if others are interested; is anyone?
> >>
> >> Otherwise, I’m equally fine with an experiment to measure basics via the
> >> current structure as Josh mentioned, too.
> >>
> >> – Scott
> >>
> >>
> >> On September 20, 2018 at 8:22:55 AM, Benedict Elliott Smith (
> >> bened...@apache.org) wrote:
> >>
> >> I think it would be great to start getting some high quality info out of
> >> JIRA, but I think we need to clean up and standardise how we use it to
> >> facilitate this.
> >>
> >> Take the Component field as an example. This is the current list of
> >> options:
> >>
> >> 4.0
> >> Auth
> >> Build
> >> Compaction
> >> Configuration
> >> Core
> >> CQL
> >> Distributed Metadata
> >> Documentation and Website
> >> Hints
> >> Libraries
> >> Lifecycle
> >> Local Write-Read Paths
> >> Materialized Views
> >> Metrics
> >> Observability
> >> Packaging
> >> Repair
> >> SASI
> >> Secondary Indexes
> >> Streaming and Messaging
> >> Stress
> >> Testing
> >> Tools
> >>
> >> In some cases there's duplication (Metrics + Observability, Coordination
> >> (=“Storage Proxy, Hints, Batchlog, Counters…") + Hints, Local Write-Read
> >> Paths + Core)
> >> In others, there’s a lack of granularity (Streaming + Messaging, Core,
> >> Coordination, Distributed Metadata)
> >> In others, there’s a lack of clarity (Core, Lifecycle, Coordination)
> >> Others are probably missing entirely (Transient Replication, …?)
> >>
> >> Labels are also used fairly haphazardly, and there’s no clear definition
> >> of “priority”
> >>
> >> Perhaps we should form a working group to suggest a methodology for
> >> filling out JIRA, standardise the necessary components, labels etc, and
> put
> >> together a wiki page with step-by-step instructions on how to do it?

Re: [DISCUSS] changing default token behavior for 4.0

2018-09-22 Thread kurt greaves
Only that it makes it easier to spin up a cluster.

I'm for removing it entirely as well; however, I think we should keep it
around at least until the next major, just as a safety precaution until the
algorithm is properly battle-tested.

This is not a strongly held opinion though; I'm just foreseeing the "new
defaults don't work for my edge case" problem.

On Sun., 23 Sep. 2018, 04:12 Jonathan Haddad,  wrote:

> Is there a use case for random allocation? How does it help with testing? I
> can’t see a reason to keep it around.
>
> On Sat, Sep 22, 2018 at 3:06 AM kurt greaves  wrote:
>
> > +1. I've been making a case for this for some time now, and was actually
> a
> > focus of my talk last week. I'd be very happy to get this into 4.0.
> >
> > We've tested various num_tokens with the algorithm on various sized
> > clusters and we've found that typically 16 works best. With lower numbers
> > we found that balance is good initially but as a cluster gets larger you
> > have some problems. E.g We saw that on a 60 node cluster with 8 tokens
> per
> > node we were seeing a difference of 22% in token ownership, but on a <=12
> > node cluster a difference of only 12%. 16 tokens on the other hand wasn't
> > perfect but generally gave a better balance regardless of cluster size at
> > least up to 100 nodes. TBH we should probably do some proper testing and
> > record all the results for this before we pick a default (I'm happy to do
> > this - think we can use the original testing script for this).
> >
> > But anyway, I'd say Jon is on the right track. Personally how I'd like to
> > see it is that we:
> >
> >1. Change allocate_tokens_for_keyspace to allocate_tokens_for_rf in
> the
> >same way that DSE does it. Allowing a user to specify a RF to allocate
> >from, and allowing multiple DC's.
> >2. Add a new boolean property random_token_allocation, defaults to
> > false.
> >3. Make allocate_tokens_for_rf default to *unset**.
> >4. Make allocate_tokens_for_rf *required*** if num_tokens > 1 and
> >random_token_allocation != true.
> >5. Default num_tokens to 16 (or whatever we find appropriate)
> >
> > * I think setting a default is asking for trouble. When people are going
> to
> > add new DC's/nodes we don't want to risk them adding a node with the
> wrong
> > RF. I think it's safe to say that a user should have to think about this
> > before they spin up their cluster.
> > ** Following above, it should be required to be set so that we don't have
> > people accidentally using random allocation. I think we should really be
> > aiming to get rid of random allocation completely, but provide a new
> > property to enable it for backwards compatibility (also for testing).
> >
> > It's worth noting that a smaller number of tokens *theoretically*
> decreases
> > the time for replacement/rebuild, so if we're considering QUORUM
> > availability with vnodes there's an argument against having a very low
> > num_tokens. I think it's better to utilise NTS and racks to reduce the
> > chance of a QUORUM outage over banking on having a lower number of
> tokens,
> > as with just a low number of tokens unless you go all the way to 1 you
> are
> > just relying on luck that 2 nodes don't overlap. Guess what I'm saying is
> > that I think we should be choosing a num_tokens that gives the best
> > distribution for most cluster sizes rather than choosing one that
> > "decreases" the probability of an outage.
> >
> > Also I think we should continue using CASSANDRA-13701 to track this. TBH
> I
> > think in general we should be a bit better at searching for and using
> > existing tickets...
> >
> > On Sat, 22 Sep 2018 at 18:13, Stefan Podkowinski 
> wrote:
> >
> > > There already have been some discussions on this here:
> > > https://issues.apache.org/jira/browse/CASSANDRA-13701
> > >
> > > The mentioned blocker there on the token allocation shouldn't exist
> > > anymore. Although it would be good to get more feedback on it, in case
> > > we want to enable it by default, along with new defaults for number of
> > > tokens.
> > >
> > >
> > > On 22.09.18 06:30, Dinesh Joshi wrote:
> > > > Jon, thanks for starting this thread!
> > > >
> > > > I have created CASSANDRA-14784 to track this.
> > > >
> > > > Dinesh
> > > >
> > > >> On Sep 21, 2018, at 9:18 PM, Sankalp Kohli 
> > > wrote:
> > > >>
> > > >> Putting it on JIRA is to make sure someone is assigned to it and it
> is
> > > tracked. Changes should be discussed over ML like you are saying.
> > > >>
> > > >> On Sep 21, 2018, at 21:02, Jonathan Haddad 
> wrote:
> > > >>
> > >  We should create a JIRA to find what other defaults we need
> revisit.
> > > >>> Changing a default is a pretty big deal, I think we should discuss
> > any
> > > >>> changes to defaults here on the ML before moving it into JIRA.
> It's
> > > nice
> > > >>> to get a bit more discussion around the change than what happens in
> > > JIRA.
> > > >>>
> > >>> We (TLP) did some testing on 4 tokens and found it to work surprisingly well.

Re: [DISCUSS] changing default token behavior for 4.0

2018-09-22 Thread Jonathan Haddad
Is there a use case for random allocation? How does it help with testing? I
can’t see a reason to keep it around.

On Sat, Sep 22, 2018 at 3:06 AM kurt greaves  wrote:

> +1. I've been making a case for this for some time now, and was actually a
> focus of my talk last week. I'd be very happy to get this into 4.0.
>
> We've tested various num_tokens with the algorithm on various sized
> clusters and we've found that typically 16 works best. With lower numbers
> we found that balance is good initially but as a cluster gets larger you
> have some problems. E.g We saw that on a 60 node cluster with 8 tokens per
> node we were seeing a difference of 22% in token ownership, but on a <=12
> node cluster a difference of only 12%. 16 tokens on the other hand wasn't
> perfect but generally gave a better balance regardless of cluster size at
> least up to 100 nodes. TBH we should probably do some proper testing and
> record all the results for this before we pick a default (I'm happy to do
> this - think we can use the original testing script for this).
>
> But anyway, I'd say Jon is on the right track. Personally how I'd like to
> see it is that we:
>
>1. Change allocate_tokens_for_keyspace to allocate_tokens_for_rf in the
>same way that DSE does it. Allowing a user to specify a RF to allocate
>from, and allowing multiple DC's.
>2. Add a new boolean property random_token_allocation, defaults to
> false.
>3. Make allocate_tokens_for_rf default to *unset**.
>4. Make allocate_tokens_for_rf *required*** if num_tokens > 1 and
>random_token_allocation != true.
>5. Default num_tokens to 16 (or whatever we find appropriate)
>
> * I think setting a default is asking for trouble. When people are going to
> add new DC's/nodes we don't want to risk them adding a node with the wrong
> RF. I think it's safe to say that a user should have to think about this
> before they spin up their cluster.
> ** Following above, it should be required to be set so that we don't have
> people accidentally using random allocation. I think we should really be
> aiming to get rid of random allocation completely, but provide a new
> property to enable it for backwards compatibility (also for testing).
>
> It's worth noting that a smaller number of tokens *theoretically* decreases
> the time for replacement/rebuild, so if we're considering QUORUM
> availability with vnodes there's an argument against having a very low
> num_tokens. I think it's better to utilise NTS and racks to reduce the
> chance of a QUORUM outage over banking on having a lower number of tokens,
> as with just a low number of tokens unless you go all the way to 1 you are
> just relying on luck that 2 nodes don't overlap. Guess what I'm saying is
> that I think we should be choosing a num_tokens that gives the best
> distribution for most cluster sizes rather than choosing one that
> "decreases" the probability of an outage.
>
> Also I think we should continue using CASSANDRA-13701 to track this. TBH I
> think in general we should be a bit better at searching for and using
> existing tickets...
>
> On Sat, 22 Sep 2018 at 18:13, Stefan Podkowinski  wrote:
>
> > There already have been some discussions on this here:
> > https://issues.apache.org/jira/browse/CASSANDRA-13701
> >
> > The mentioned blocker there on the token allocation shouldn't exist
> > anymore. Although it would be good to get more feedback on it, in case
> > we want to enable it by default, along with new defaults for number of
> > tokens.
> >
> >
> > On 22.09.18 06:30, Dinesh Joshi wrote:
> > > Jon, thanks for starting this thread!
> > >
> > > I have created CASSANDRA-14784 to track this.
> > >
> > > Dinesh
> > >
> > >> On Sep 21, 2018, at 9:18 PM, Sankalp Kohli 
> > wrote:
> > >>
> > >> Putting it on JIRA is to make sure someone is assigned to it and it is
> > tracked. Changes should be discussed over ML like you are saying.
> > >>
> > >> On Sep 21, 2018, at 21:02, Jonathan Haddad  wrote:
> > >>
> >  We should create a JIRA to find what other defaults we need revisit.
> > >>> Changing a default is a pretty big deal, I think we should discuss
> any
> > >>> changes to defaults here on the ML before moving it into JIRA.  It's
> > nice
> > >>> to get a bit more discussion around the change than what happens in
> > JIRA.
> > >>>
> > >>> We (TLP) did some testing on 4 tokens and found it to work
> surprisingly
> > >>> well.   It wasn't particularly formal, but we verified the load stays
> > >>> pretty even with only 4 tokens as we added nodes to the cluster.
> > Higher
> > >>> token count hurts availability by increasing the number of nodes any
> > given
> > >>> node is a neighbor with, meaning any 2 nodes that fail have an
> > increased
> > >>> chance of downtime when using QUORUM.  In addition, with the recent
> > >>> streaming optimization it seems the token counts will give a greater
> > chance
> > >>> of a node streaming entire sstables (with LCS), meaning we'll do a
> > better job with node density out of the box.

Re: Proposing an Apache Cassandra Management process

2018-09-22 Thread Sankalp Kohli
This is not part of the core database; it lives in a separate repo, so my
impression is that it can continue to make progress. Also, we can always make
progress and simply not merge it until the freeze is lifted.

Open to ideas/suggestions if someone thinks otherwise. 

> On Sep 22, 2018, at 03:13, kurt greaves  wrote:
> 
> Is this something we're moving ahead with despite the feature freeze?
> 
> On Sat, 22 Sep 2018 at 08:32, dinesh.jo...@yahoo.com.INVALID
>  wrote:
> 
>> I have created a sub-task - CASSANDRA-14783. Could we get some feedback
>> before we begin implementing anything?
>> 
>> Dinesh
>> 
>>On Thursday, September 20, 2018, 11:22:33 PM PDT, Dinesh Joshi <
>> dinesh.jo...@yahoo.com.INVALID> wrote:
>> 
>> I have updated the doc with a short paragraph providing the
>> clarification. Sankalp's suggestion is already part of the doc. If there
>> aren't further objections could we move this discussion over to the jira
>> (CASSANDRA-14395)?
>> 
>> Dinesh
>> 
>>> On Sep 18, 2018, at 10:31 AM, sankalp kohli 
>> wrote:
>>> 
>>> How about we start with a few basic features in side car. How about
>> starting with this
>>> 1. Bulk nodetool commands: User can curl any sidecar and be able to run
>> a nodetool command in bulk across the cluster.
>>> 
>> <node>:<port>/bulk/nodetool/tablestats?arg0=keyspace_name.table_name=<other args if required>
>>> 
>>> And later
>>> 2: Health checks.
>>> 
>>> On Thu, Sep 13, 2018 at 11:34 AM dinesh.jo...@yahoo.com.INVALID <
>> dinesh.jo...@yahoo.com.invalid> wrote:
>>> I will update the document to add that point. The document did not mean
>> to serve as a design or architectural document but rather something that
>> would spark a discussion on the idea.
>>> Dinesh
>>> 
>>>   On Thursday, September 13, 2018, 10:59:34 AM PDT, Jonathan Haddad <
>> j...@jonhaddad.com > wrote:
>>> 
>>> Most of the discussion and work was done off the mailing list - there's
>> a
>>> big risk involved when folks disappear for months at a time and resurface
>>> with big pile of code plus an agenda that you failed to loop everyone in
>>> on. In addition, by your own words the design document didn't accurately
>>> describe what was being built.  I don't write this to try to argue about
>>> it, I just want to put some perspective for those of us that weren't part
>>> of this discussion on a weekly basis over the last several months.  Going
>>> forward let's keep things on the ML so we can avoid confusion and
>>> frustration for all parties.
>>> 
>>> With that said - I think Blake made a really good point here and it's
>>> helped me understand the scope of what's being built better.  Looking at
>> it
>>> from a different perspective it doesn't seem like there's as much overlap
>>> as I had initially thought.  There's the machinery that runs certain
>> tasks
>>> (what Joey has been working on) and the user facing side of exposing that
>>> information in management tool.
>>> 
>>> I do appreciate (and like) the idea of not trying to boil the ocean, and
>>> working on things incrementally.  Putting a thin layer on top of
>> Cassandra
>>> that can perform cluster wide tasks does give us an opportunity to move
>> in
>>> the direction of a general purpose user-facing admin tool without
>>> committing to trying to write the full stack all at once (or even make
>>> decisions on it now).  We do need a sensible way of doing rolling
>> restarts
>>> / scrubs / scheduling and Reaper wasn't built for that, and even though
>> we
>>> can add it I'm not sure if it's the best mechanism for the long term.
>>> 
>>> So if your goal is to add maturity to the project by making cluster wide
>>> tasks easier by providing a framework to build on top of, I'm in favor of
>>> that and I don't see it as antithetical to what I had in mind with
>> Reaper.
>>> Rather, the two are more complementary than I had originally realized.
>>> 
>>> Jon
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Sep 13, 2018 at 10:39 AM dinesh.jo...@yahoo.com.INVALID
>>> wrote:
>>> 
 I have a few clarifications -
 The scope of the management process is not to simply run repair
 scheduling. Repair scheduling is one of the many features we could
 implement or adopt from existing sources. So could we please split the
 Management Process discussion and the repair scheduling?
 After re-reading the management process proposal, I see we missed to
 communicate a basic idea in the document. We wanted to take a pluggable
 approach to various activities that the management process could
>> perform.
 This could accommodate different implementations of common activities
>> such
 as repair. The management process would provide the basic framework
>> and it
 would have default implementations for some of the basic activities.
>> This
 would allow for speedier iteration cycles and keep things extensible.
 Turning to some questions that Jon and others have raised, when I +1,
>> my
 intention is to fully contribute and stay with this community.

Re: Measuring Release Quality

2018-09-22 Thread Benedict Elliott Smith
Thanks Kurt.  I think the goal would be to get JIRA into a state where it can 
hold all the information we want, and for it to be easy to get all the 
information correct when filing.

My feeling is that it would be easiest to do this with a small group, so we can 
make rapid progress on an initial proposal, then bring that to the community 
for final tweaking / approval (or, perhaps, rejection - but I hope it won’t be 
a contentious topic).  I don’t think it should be a huge job to come up with a 
proposal - though we might need to organise a community effort to clean up the 
JIRA history!

It would be great if we could get a few more volunteers from other 
companies/backgrounds to participate.


> On 22 Sep 2018, at 11:54, kurt greaves  wrote:
> 
> I'm interested. Better defining the components and labels we use in our
> docs would be a good start and LHF. I'd prefer if we kept all the
> information within JIRA through the use of fields/labels though, and
> generated reports off those tags. Keeping all the information in one place
> is much better in my experience. Not applicable for CI obviously, but
> ideally we can generate testing reports directly from the testing systems.
> 
> I don't see this as a huge amount of work so I think the overall risk is
> pretty small, especially considering it can easily be done in a way that
> doesn't affect anyone until we get consensus on methodology.
> 
> 
> 
> On Sat, 22 Sep 2018 at 03:44, Scott Andreas  wrote:
> 
>> Josh, thanks for reading and sharing feedback. Agreed with starting simple
>> and measuring inputs that are high-signal; that’s a good place to begin.
>> 
>> To the challenge of building consensus, point taken + agreed. Perhaps the
>> distinction is between producing something that’s “useful” vs. something
>> that’s “authoritative” for decisionmaking purposes. My motivation is to
>> work toward something “useful” (as measured by the value contributors
>> find). I’d be happy to start putting some of these together as part of an
>> experiment – and agreed on evaluating “value relative to cost” after we see
>> how things play out.
>> 
>> To Benedict’s point on JIRA, agreed that plotting a value from messy input
>> wouldn’t produce useful output. Some questions a small working group might
>> take on toward better categorization might look like:
>> 
>> –––
>> – Revisiting the list of components: e.g., “Core” captures a lot right now.
>> – Revisiting which fields should be required when filing a ticket – and if
>> there are any that should be removed from the form.
>> – Reviewing active labels: understanding what people have been trying to
>> capture, and how they could be organized + documented better.
>> – Documenting “priority”: (e.g., a common standard we can point to, even
>> if we’re pretty good now).
>> – Considering adding a "severity” field to capture the distinction between
>> priority and severity.
>> –––
>> 
>> If there’s appetite for spending a little time on this, I’d put effort
>> toward it if others are interested; is anyone?
>> 
>> Otherwise, I’m equally fine with an experiment to measure basics via the
>> current structure as Josh mentioned, too.
>> 
>> – Scott
>> 
>> 
>> On September 20, 2018 at 8:22:55 AM, Benedict Elliott Smith (
>> bened...@apache.org) wrote:
>> 
>> I think it would be great to start getting some high quality info out of
>> JIRA, but I think we need to clean up and standardise how we use it to
>> facilitate this.
>> 
>> Take the Component field as an example. This is the current list of
>> options:
>> 
>> 4.0
>> Auth
>> Build
>> Compaction
>> Configuration
>> Core
>> CQL
>> Distributed Metadata
>> Documentation and Website
>> Hints
>> Libraries
>> Lifecycle
>> Local Write-Read Paths
>> Materialized Views
>> Metrics
>> Observability
>> Packaging
>> Repair
>> SASI
>> Secondary Indexes
>> Streaming and Messaging
>> Stress
>> Testing
>> Tools
>> 
>> In some cases there's duplication (Metrics + Observability, Coordination
>> (=“Storage Proxy, Hints, Batchlog, Counters…") + Hints, Local Write-Read
>> Paths + Core)
>> In others, there’s a lack of granularity (Streaming + Messaging, Core,
>> Coordination, Distributed Metadata)
>> In others, there’s a lack of clarity (Core, Lifecycle, Coordination)
>> Others are probably missing entirely (Transient Replication, …?)
>> 
>> Labels are also used fairly haphazardly, and there’s no clear definition
>> of “priority”
>> 
>> Perhaps we should form a working group to suggest a methodology for
>> filling out JIRA, standardise the necessary components, labels etc, and put
>> together a wiki page with step-by-step instructions on how to do it?
>> 
>> 
>>> On 20 Sep 2018, at 15:29, Joshua McKenzie  wrote:
>>> 
>>> I've spent a good bit of time thinking about the above and bounced off
>> both
>>> different ways to measure quality and progress as well as trying to
>>> influence community behavior on this topic. My advice: start small and
>>> simple (KISS, YAGNI, all that).

Re: Measuring Release Quality

2018-09-22 Thread kurt greaves
I'm interested. Better defining the components and labels we use in our
docs would be a good start and is low-hanging fruit. I'd prefer it if we kept
all the information within JIRA through the use of fields/labels, though, and
generated reports off those tags. Keeping all the information in one place
is much better in my experience. That's not applicable for CI, obviously, but
ideally we can generate testing reports directly from the testing systems.
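
For example, once the fields/labels are standardised, a report like "open bugs
per component" could be generated straight off JIRA's REST search endpoint.
A rough sketch (the component list here is just a placeholder for whatever we
settle on):

    import requests  # sketch only; uses the public JIRA REST search API

    JIRA_SEARCH = "https://issues.apache.org/jira/rest/api/2/search"
    COMPONENTS = ["Compaction", "Repair", "Streaming and Messaging"]

    for component in COMPONENTS:
        jql = ('project = CASSANDRA AND issuetype = Bug AND '
               'resolution = Unresolved AND component = "%s"' % component)
        # maxResults=0 returns just the match count, which is all we need here
        total = requests.get(JIRA_SEARCH,
                             params={"jql": jql, "maxResults": 0}).json()["total"]
        print("%-25s %d open bugs" % (component, total))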

I don't see this as a huge amount of work so I think the overall risk is
pretty small, especially considering it can easily be done in a way that
doesn't affect anyone until we get consensus on methodology.



On Sat, 22 Sep 2018 at 03:44, Scott Andreas  wrote:

> Josh, thanks for reading and sharing feedback. Agreed with starting simple
> and measuring inputs that are high-signal; that’s a good place to begin.
>
> To the challenge of building consensus, point taken + agreed. Perhaps the
> distinction is between producing something that’s “useful” vs. something
> that’s “authoritative” for decisionmaking purposes. My motivation is to
> work toward something “useful” (as measured by the value contributors
> find). I’d be happy to start putting some of these together as part of an
> experiment – and agreed on evaluating “value relative to cost” after we see
> how things play out.
>
> To Benedict’s point on JIRA, agreed that plotting a value from messy input
> wouldn’t produce useful output. Some questions a small working group might
> take on toward better categorization might look like:
>
> –––
> – Revisiting the list of components: e.g., “Core” captures a lot right now.
> – Revisiting which fields should be required when filing a ticket – and if
> there are any that should be removed from the form.
> – Reviewing active labels: understanding what people have been trying to
> capture, and how they could be organized + documented better.
> – Documenting “priority”: (e.g., a common standard we can point to, even
> if we’re pretty good now).
> – Considering adding a "severity” field to capture the distinction between
> priority and severity.
> –––
>
> If there’s appetite for spending a little time on this, I’d put effort
> toward it if others are interested; is anyone?
>
> Otherwise, I’m equally fine with an experiment to measure basics via the
> current structure as Josh mentioned, too.
>
> – Scott
>
>
> On September 20, 2018 at 8:22:55 AM, Benedict Elliott Smith (
> bened...@apache.org) wrote:
>
> I think it would be great to start getting some high quality info out of
> JIRA, but I think we need to clean up and standardise how we use it to
> facilitate this.
>
> Take the Component field as an example. This is the current list of
> options:
>
> 4.0
> Auth
> Build
> Compaction
> Configuration
> Core
> CQL
> Distributed Metadata
> Documentation and Website
> Hints
> Libraries
> Lifecycle
> Local Write-Read Paths
> Materialized Views
> Metrics
> Observability
> Packaging
> Repair
> SASI
> Secondary Indexes
> Streaming and Messaging
> Stress
> Testing
> Tools
>
> In some cases there's duplication (Metrics + Observability, Coordination
> (=“Storage Proxy, Hints, Batchlog, Counters…") + Hints, Local Write-Read
> Paths + Core)
> In others, there’s a lack of granularity (Streaming + Messaging, Core,
> Coordination, Distributed Metadata)
> In others, there’s a lack of clarity (Core, Lifecycle, Coordination)
> Others are probably missing entirely (Transient Replication, …?)
>
> Labels are also used fairly haphazardly, and there’s no clear definition
> of “priority”
>
> Perhaps we should form a working group to suggest a methodology for
> filling out JIRA, standardise the necessary components, labels etc, and put
> together a wiki page with step-by-step instructions on how to do it?
>
>
> > On 20 Sep 2018, at 15:29, Joshua McKenzie  wrote:
> >
> > I've spent a good bit of time thinking about the above and bounced off
> both
> > different ways to measure quality and progress as well as trying to
> > influence community behavior on this topic. My advice: start small and
> > simple (KISS, YAGNI, all that). Get metrics for pass/fail on
> > utest/dtest/flakiness over time, perhaps also aggregate bug count by
> > component over time. After spending a predetermined time doing that (a
> > couple months?) as an experiment, we retrospect as a project and see if
> > these efforts are adding value commensurate with the time investment
> > required to perform the measurement and analysis.
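
For the flakiness part of that, something as simple as diffing JUnit XML
results across CI runs would probably do. A rough sketch, assuming one
directory of junit XML files per run (the layout is just a placeholder):

    import glob
    import xml.etree.ElementTree as ET
    from collections import defaultdict

    # A test counts as "flaky" if it both passed and failed across the runs.
    outcomes = defaultdict(set)
    for report in glob.glob("ci-runs/*/TEST-*.xml"):
        for case in ET.parse(report).getroot().iter("testcase"):
            name = case.get("classname", "") + "." + case.get("name", "")
            failed = (case.find("failure") is not None
                      or case.find("error") is not None)
            outcomes[name].add("fail" if failed else "pass")

    flaky = sorted(n for n, o in outcomes.items() if o == {"pass", "fail"})
    print("%d flaky tests out of %d" % (len(flaky), len(outcomes)))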
> >
> > There's a lot of really good ideas in that linked wiki article / this
> email
> > thread. The biggest challenge, and risk of failure, is in translating
> good
> > ideas into action and selling project participants on the value of
> changing
> > their behavior. The latter is where we've fallen short over the years;
> > building consensus (especially regarding process /shudder) is Very Hard.
> >
> > Also - thanks for spearheading this discussion Scott. It's one we come
> back
> > to with some 

Re: Proposing an Apache Cassandra Management process

2018-09-22 Thread kurt greaves
Is this something we're moving ahead with despite the feature freeze?

On Sat, 22 Sep 2018 at 08:32, dinesh.jo...@yahoo.com.INVALID
 wrote:

> I have created a sub-task - CASSANDRA-14783. Could we get some feedback
> before we begin implementing anything?
>
> Dinesh
>
> On Thursday, September 20, 2018, 11:22:33 PM PDT, Dinesh Joshi <
> dinesh.jo...@yahoo.com.INVALID> wrote:
>
>  I have updated the doc with a short paragraph providing the
> clarification. Sankalp's suggestion is already part of the doc. If there
> aren't further objections could we move this discussion over to the jira
> (CASSANDRA-14395)?
>
> Dinesh
>
> > On Sep 18, 2018, at 10:31 AM, sankalp kohli 
> wrote:
> >
> > How about we start with a few basic features in side car. How about
> starting with this
> > 1. Bulk nodetool commands: User can curl any sidecar and be able to run
> a nodetool command in bulk across the cluster.
> >
> <node>:<port>/bulk/nodetool/tablestats?arg0=keyspace_name.table_name=<other args if required>
> >
> > And later
> > 2: Health checks.
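
A sketch of what driving that could look like from a client, just to make the
idea concrete (host, port, endpoint path and response shape are all
hypothetical here, not an agreed API):

    import requests

    # Hypothetical: any node's sidecar fans the nodetool command out across
    # the cluster and returns a per-node map of results.
    resp = requests.get(
        "http://node1.example.com:8080/bulk/nodetool/tablestats",
        params={"arg0": "my_keyspace.my_table"},
    )
    for node, result in resp.json().items():
        print(node, result.get("status"), str(result.get("output", ""))[:80])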
> >
> > On Thu, Sep 13, 2018 at 11:34 AM dinesh.jo...@yahoo.com.INVALID <
> dinesh.jo...@yahoo.com.invalid> wrote:
> > I will update the document to add that point. The document did not mean
> to serve as a design or architectural document but rather something that
> would spark a discussion on the idea.
> > Dinesh
> >
> >On Thursday, September 13, 2018, 10:59:34 AM PDT, Jonathan Haddad <
> j...@jonhaddad.com > wrote:
> >
> >  Most of the discussion and work was done off the mailing list - there's
> a
> > big risk involved when folks disappear for months at a time and resurface
> > with big pile of code plus an agenda that you failed to loop everyone in
> > on. In addition, by your own words the design document didn't accurately
> > describe what was being built.  I don't write this to try to argue about
> > it, I just want to put some perspective for those of us that weren't part
> > of this discussion on a weekly basis over the last several months.  Going
> > forward let's keep things on the ML so we can avoid confusion and
> > frustration for all parties.
> >
> > With that said - I think Blake made a really good point here and it's
> > helped me understand the scope of what's being built better.  Looking at
> it
> > from a different perspective it doesn't seem like there's as much overlap
> > as I had initially thought.  There's the machinery that runs certain
> tasks
> > (what Joey has been working on) and the user facing side of exposing that
> > information in management tool.
> >
> > I do appreciate (and like) the idea of not trying to boil the ocean, and
> > working on things incrementally.  Putting a thin layer on top of
> Cassandra
> > that can perform cluster wide tasks does give us an opportunity to move
> in
> > the direction of a general purpose user-facing admin tool without
> > committing to trying to write the full stack all at once (or even make
> > decisions on it now).  We do need a sensible way of doing rolling
> restarts
> > / scrubs / scheduling and Reaper wasn't built for that, and even though
> we
> > can add it I'm not sure if it's the best mechanism for the long term.
> >
> > So if your goal is to add maturity to the project by making cluster wide
> > tasks easier by providing a framework to build on top of, I'm in favor of
> > that and I don't see it as antithetical to what I had in mind with
> Reaper.
> > Rather, the two are more complementary than I had originally realized.
> >
> > Jon
> >
> >
> >
> >
> > On Thu, Sep 13, 2018 at 10:39 AM dinesh.jo...@yahoo.com.INVALID
> > wrote:
> >
> > > I have a few clarifications -
> > > The scope of the management process is not to simply run repair
> > > scheduling. Repair scheduling is one of the many features we could
> > > implement or adopt from existing sources. So could we please split the
> > > Management Process discussion and the repair scheduling?
> > > After re-reading the management process proposal, I see we missed to
> > > communicate a basic idea in the document. We wanted to take a pluggable
> > > approach to various activities that the management process could
> perform.
> > > This could accommodate different implementations of common activities
> such
> > > as repair. The management process would provide the basic framework
> and it
> > > would have default implementations for some of the basic activities.
> This
> > > would allow for speedier iteration cycles and keep things extensible.
> > > Turning to some questions that Jon and others have raised, when I +1,
> my
> > > intention is to fully contribute and stay with this community. That
> said,
> > > things feel rushed for some but for me it feels like analysis
> paralysis.
> > > We're looking for actionable feedback and to discuss the management
> process
> > > _not_ repair scheduling solutions.
> > > Thanks,
> > > Dinesh
> > >
> > >
> > >
> > > On Sep 12, 2018, at 6:24 PM, sankalp kohli  

Re: [DISCUSS] changing default token behavior for 4.0

2018-09-22 Thread kurt greaves
+1. I've been making a case for this for some time now, and was actually a
focus of my talk last week. I'd be very happy to get this into 4.0.

We've tested various num_tokens settings with the allocation algorithm on
various-sized clusters, and we've found that typically 16 works best. With
lower numbers we found that balance is good initially, but as a cluster gets
larger you have some problems. E.g. on a 60-node cluster with 8 tokens per
node we were seeing a difference of 22% in token ownership, but on a <=12
node cluster a difference of only 12%. 16 tokens, on the other hand, wasn't
perfect but generally gave a better balance regardless of cluster size, at
least up to 100 nodes. TBH we should probably do some proper testing and
record all the results before we pick a default (I'm happy to do this - I
think we can use the original testing script for it).
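
For reference, the kind of ownership-spread number I'm quoting can be computed
from any set of tokens with something like the sketch below (primary-range
ownership only, RF and racks ignored; it uses random tokens as a baseline,
whereas the allocator places tokens deliberately, but the measurement is the
same):

    import random
    from collections import defaultdict

    RING = 2 ** 64  # Murmur3Partitioner token space

    def ownership(node_tokens):
        # node_tokens: node -> list of tokens. Returns each node's share of
        # the ring by primary range (RF and racks ignored for simplicity).
        ring = sorted((t, n) for n, ts in node_tokens.items() for t in ts)
        share = defaultdict(float)
        for (tok, node), (prev, _) in zip(ring, [ring[-1]] + ring[:-1]):
            share[node] += ((tok - prev) % RING) / RING
        return share

    # e.g. 60 nodes with 8 random tokens each; rerun with 16 to see the
    # spread tighten
    nodes = {i: [random.randrange(-2 ** 63, 2 ** 63) for _ in range(8)]
             for i in range(60)}
    share = ownership(nodes)
    print("min %.2f%%  max %.2f%%  even share %.2f%%"
          % (min(share.values()) * 100, max(share.values()) * 100, 100.0 / 60))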

But anyway, I'd say Jon is on the right track. Personally how I'd like to
see it is that we:

   1. Change allocate_tokens_for_keyspace to allocate_tokens_for_rf, in the
   same way that DSE does it, allowing a user to specify an RF to allocate
   from and allowing multiple DCs.
   2. Add a new boolean property random_token_allocation, defaulting to false.
   3. Make allocate_tokens_for_rf default to *unset**.
   4. Make allocate_tokens_for_rf *required*** if num_tokens > 1 and
   random_token_allocation != true.
   5. Default num_tokens to 16 (or whatever we find appropriate).

* I think setting a default is asking for trouble. When people are going to
add new DCs/nodes we don't want to risk them adding a node with the wrong
RF. I think it's safe to say that a user should have to think about this
before they spin up their cluster.
** Following on from the above, it should be required to be set so that we
don't have people accidentally using random allocation. I think we should
really be aiming to get rid of random allocation completely, but provide a new
property to enable it for backwards compatibility (also for testing).
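
In cassandra.yaml terms, the out-of-the-box experience I'm proposing would look
roughly like this (property names as per the list above; the value shown for
allocate_tokens_for_rf is only an example, since we'd ship it unset):

    num_tokens: 16
    # Must be set explicitly when num_tokens > 1, unless
    # random_token_allocation is turned on. No default is shipped.
    allocate_tokens_for_rf: 3
    # Opt-in escape hatch for the old random behaviour (and for testing):
    # random_token_allocation: true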

It's worth noting that a smaller number of tokens *theoretically* decreases
the time for replacement/rebuild, so if we're considering QUORUM
availability with vnodes there's an argument against having a very low
num_tokens. I think it's better to utilise NTS and racks to reduce the
chance of a QUORUM outage than to bank on having a lower number of tokens:
with a low number of tokens, unless you go all the way down to 1, you are
just relying on luck that 2 nodes don't overlap. I guess what I'm saying is
that I think we should be choosing a num_tokens that gives the best
distribution for most cluster sizes rather than choosing one that
"decreases" the probability of an outage.

Also I think we should continue using CASSANDRA-13701 to track this. TBH I
think in general we should be a bit better at searching for and using
existing tickets...

On Sat, 22 Sep 2018 at 18:13, Stefan Podkowinski  wrote:

> There already have been some discussions on this here:
> https://issues.apache.org/jira/browse/CASSANDRA-13701
>
> The mentioned blocker there on the token allocation shouldn't exist
> anymore. Although it would be good to get more feedback on it, in case
> we want to enable it by default, along with new defaults for number of
> tokens.
>
>
> On 22.09.18 06:30, Dinesh Joshi wrote:
> > Jon, thanks for starting this thread!
> >
> > I have created CASSANDRA-14784 to track this.
> >
> > Dinesh
> >
> >> On Sep 21, 2018, at 9:18 PM, Sankalp Kohli 
> wrote:
> >>
> >> Putting it on JIRA is to make sure someone is assigned to it and it is
> tracked. Changes should be discussed over ML like you are saying.
> >>
> >> On Sep 21, 2018, at 21:02, Jonathan Haddad  wrote:
> >>
>  We should create a JIRA to find what other defaults we need revisit.
> >>> Changing a default is a pretty big deal, I think we should discuss any
> >>> changes to defaults here on the ML before moving it into JIRA.  It's
> nice
> >>> to get a bit more discussion around the change than what happens in
> JIRA.
> >>>
> >>> We (TLP) did some testing on 4 tokens and found it to work surprisingly
> >>> well.   It wasn't particularly formal, but we verified the load stays
> >>> pretty even with only 4 tokens as we added nodes to the cluster.
> Higher
> >>> token count hurts availability by increasing the number of nodes any
> given
> >>> node is a neighbor with, meaning any 2 nodes that fail have an
> increased
> >>> chance of downtime when using QUORUM.  In addition, with the recent
> >>> streaming optimization it seems the token counts will give a greater
> chance
> >>> of a node streaming entire sstables (with LCS), meaning we'll do a
> better
> >>> job with node density out of the box.
> >>>
> >>> Next week I can try to put together something a little more convincing.
> >>> Weekend time.
> >>>
> >>> Jon
> >>>
> >>>
> >>> On Fri, Sep 21, 2018 at 8:45 PM sankalp kohli 
> >>> wrote:
> >>>
>  +1 to lowering it.
>  Thanks Jon for starting this.We should create a JIRA to find what
> other
> 

Re: [DISCUSS] changing default token behavior for 4.0

2018-09-22 Thread Stefan Podkowinski
There have already been some discussions on this here:
https://issues.apache.org/jira/browse/CASSANDRA-13701

The blocker mentioned there on the token allocation shouldn't exist anymore,
although it would be good to get more feedback on it, in case we want to
enable it by default along with new defaults for the number of tokens.


On 22.09.18 06:30, Dinesh Joshi wrote:
> Jon, thanks for starting this thread!
>
> I have created CASSANDRA-14784 to track this. 
>
> Dinesh
>
>> On Sep 21, 2018, at 9:18 PM, Sankalp Kohli  wrote:
>>
>> Putting it on JIRA is to make sure someone is assigned to it and it is 
>> tracked. Changes should be discussed over ML like you are saying. 
>>
>> On Sep 21, 2018, at 21:02, Jonathan Haddad  wrote:
>>
 We should create a JIRA to find what other defaults we need revisit.
>>> Changing a default is a pretty big deal, I think we should discuss any
>>> changes to defaults here on the ML before moving it into JIRA.  It's nice
>>> to get a bit more discussion around the change than what happens in JIRA.
>>>
>>> We (TLP) did some testing on 4 tokens and found it to work surprisingly
>>> well.   It wasn't particularly formal, but we verified the load stays
>>> pretty even with only 4 tokens as we added nodes to the cluster.  Higher
>>> token count hurts availability by increasing the number of nodes any given
>>> node is a neighbor with, meaning any 2 nodes that fail have an increased
>>> chance of downtime when using QUORUM.  In addition, with the recent
>>> streaming optimization it seems the token counts will give a greater chance
>>> of a node streaming entire sstables (with LCS), meaning we'll do a better
>>> job with node density out of the box.
>>>
>>> Next week I can try to put together something a little more convincing.
>>> Weekend time.
>>>
>>> Jon
>>>
>>>
>>> On Fri, Sep 21, 2018 at 8:45 PM sankalp kohli 
>>> wrote:
>>>
 +1 to lowering it.
 Thanks Jon for starting this.We should create a JIRA to find what other
 defaults we need revisit. (Please keep this discussion for "default token"
 only.  )

> On Fri, Sep 21, 2018 at 8:26 PM Jeff Jirsa  wrote:
>
> Also agree it should be lowered, but definitely not to 1, and probably
> something closer to 32 than 4.
>
> --
> Jeff Jirsa
>
>
>> On Sep 21, 2018, at 8:24 PM, Jeremy Hanna 
> wrote:
>> I agree that it should be lowered. What I’ve seen debated a bit in the
> past is the number but I don’t think anyone thinks that it should remain
> 256.
>>> On Sep 21, 2018, at 7:05 PM, Jonathan Haddad 
 wrote:
>>> One thing that's really, really bothered me for a while is how we
> default
>>> to 256 tokens still.  There's no experienced operator that leaves it
 as
> is
>>> at this point, meaning the only people using 256 are the poor folks
 that
>>> just got started using C*.  I've worked with over a hundred clusters
 in
> the
>>> last couple years, and I think I only worked with one that had lowered
> it
>>> to something else.
>>>
>>> I think it's time we changed the default to 4 (or 8, up for debate).
>>>
>>> To improve the behavior, we need to change a couple other things.  The
>>> allocate_tokens_for_keyspace setting is... odd.  It requires you have
 a
>>> keyspace already created, which doesn't help on new clusters.  What
 I'd
>>> like to do is add a new setting, allocate_tokens_for_rf, and set it to
> 3 by
>>> default.
>>>
>>> To handle clusters that are already using 256 tokens, we could prevent
> the
>>> new node from joining unless a -D flag is set to explicitly allow
>>> imbalanced tokens.
>>>
>>> We've agreed to a trunk freeze, but I feel like this is important
 enough
>>> (and pretty trivial) to do now.  I'd also personally characterize this
> as a
>>> bug fix since 256 is horribly broken when the cluster gets to any
>>> reasonable size, but maybe I'm alone there.
>>>
>>> I honestly can't think of a use case where random tokens is a good
> choice
>>> anymore, so I'd be fine / ecstatic with removing it completely and
>>> requiring either allocate_tokens_for_keyspace (for existing clusters)
>>> or allocate_tokens_for_rf
>>> to be set.
>>>
>>> Thoughts?  Objections?
>>> --
>>> Jon Haddad
>>> http://www.rustyrazorblade.com
>>> twitter: rustyrazorblade
>
>>>
>>> -- 
>>> Jon Haddad
>>> http://www.rustyrazorblade.com
>>>