Re: [VOTE] Release Apache Cassandra 4.0-alpha3

2020-02-03 Thread Michael Shuler

On 2/3/20 5:21 PM, Mick Semb Wever wrote:



Summary of notes:
- Artifact set checks out OK with regards to key sigs and checksums.
- CASSANDRA-14962 is an issue when not using the current deb build
method (using new docker method results in different source artifact
creation & use). The docker rpm build suffers the same source problem
and the src.rpm is significantly larger, since I think it copies all the
downloaded maven artifacts in. It's fine for now, though :)
- UNRELEASED deb build



Thanks for the thorough review Michael.

I did not know about CASSANDRA-14962, but it should be easy to fix now that the 
-src.tar.gz is in the dev dist location and easy to re-use. I'll see if I can 
create a patch for that (aiming to use it on alpha4).


Yep! Similarly, the rpm build has been wrong all along, but it's what we 
have. The -src.tar.gz should get copied to the /$build/$path/SOURCE dir (I 
think it already is?). I think that might cure the larger .src.rpm.



And I was unaware of the UNRELEASED version issue. I can put a patch in for 
that too, going into the prepare_release.sh script.


`dch -r` is usually a step I do before building, along with checking that the 
NEWS, CHANGES, and build.xml versions all align. Then the correct commit gets 
tagged -tentative. Building `dch -r` in would be OK, if all the other 
ducks are in a row.
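
For the record, those pre-build checks look roughly like this (a sketch from 
memory; the file locations and property name are assumptions, and 
prepare_release.sh may already cover parts of it):

    # run from the release branch of a cassandra checkout
    VERSION=4.0-alpha3

    # NEWS, CHANGES, build.xml and the debian changelog should all agree on the version
    grep -n 'name="base.version"' build.xml
    head -n 1 debian/changelog       # version here should match too
    head -n 5 NEWS.txt CHANGES.txt

    # finalize the debian changelog (replaces UNRELEASED with the target distribution)
    dch -r

    # then tag the release candidate commit
    git tag "$VERSION-tentative"
    git push origin "$VERSION-tentative"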



Next step
would be to do each package-type install and startup functional testing,
but I don't have that time right now :)



I'm going to presume others that have voted have done package-type installs and 
the basic testing, and move ahead. If I close the vote, I will need your help 
Michael with the final steps running the patched finish_release.sh from the 
`mck/14970_sha512-checksums` branch, found in 
https://github.com/thelastpickle/cassandra-builds/blob/mck/14970_sha512-checksums/
     Because only PMC can `svn move` the files into 
dist.apache.org/repos/dist/release/


I usually do this before vote. I don't know how many other people, if 
any, test that all the packages can install and start.
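
For anyone who wants to pick this up, the per-package smoke test is roughly 
the following (a sketch; the package file name and service handling are 
assumptions, adjust for deb vs rpm):

    # debian flavour
    sudo dpkg -i cassandra_4.0~alpha3_all.deb || sudo apt-get -f -y install
    sudo service cassandra start
    nodetool status
    cqlsh -e "SELECT release_version FROM system.local;"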



And for the upload_bintray.sh script, how do I get credentials, an infra ticket 
i presume? (ie to https://bintray.com/apache )


If I recall, I did an infra ticket with my github user id - this is how 
I log in. Once logged into bintray, you can find a token down in the 
user profile somewhere, which is used in the script.
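
For the archives: that API key from the profile is what the script 
authenticates with. Roughly (a sketch only; the repo/package path components 
below are placeholders, not the real Apache bintray layout, so check 
upload_bintray.sh itself):

    # bintray's REST API takes the key as the basic-auth password, e.g.
    curl -u "$BINTRAY_USER:$BINTRAY_API_KEY" \
         -T some-artifact.deb \
         "https://api.bintray.com/content/apache/<repo>/<package>/<version>/<path>"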


Thanks again for walking through these steps.

Michael

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: Fwd: [CI] What are the troubles projects face with CI and Infra

2020-02-03 Thread David Capwell
Following Mick's format =)

** Lack of trust (aka reliability)

Mick said it best, but I should also add that we have slow tests and tests
which don't do anything.  Effort is needed to improve our current tests and
to make sure future tests are stable (cleanup works, isolation, etc.);
this is not a negligible amount of work, nor work which can be done by a
single person.

** Lack of resources (throughput and response)

Our slowest unit tests are around 2 minutes (materialized views), our
slowest dtests (not high resource) are around 30 minutes; given enough
resources we could run unit in < 10 minutes and dtest in 30-60 minutes.

There is also another thing to point out: testing is also a combinatorics
problem; we support Java 8/11 (more to come), vnode and no-vnode, security
and no security, and the list goes on.  Bugs are more likely to happen when
two features interact, so it is important to test against many combinations.
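
To make the combinatorics concrete, a toy sketch (the dimensions and values
are illustrative only, not our actual test matrix):

    for jdk in 8 11; do
      for vnodes in on off; do
        for security in on off; do
          echo "run suite with jdk=$jdk vnodes=$vnodes security=$security"
        done
      done
    done
    # already 2 x 2 x 2 = 8 configurations, before adding compaction
    # strategies, compression, upgrade paths, ...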

There is work going on in the community to add new kinds of tests (harry,
diff, etc.); these tests require even more resources than normal tests.

** Difficulty in use

Many people rely on CircleCI as the core CI for the project, but this has a
few issues called out in other forms: the low resource version (free) is
even more flaky than high (paid), and people get locked out (I have lost
access twice so far; others have said the same).

The thing which worries me the most is that new members to the project
won't have the high resource CircleCI plan, nor do they really have access
to Jenkins.  This puts a burden on new authors, who either wait 24+ hours to
run the tests... or just don't run them.

** Lack of visibility into quality

This is two things for me: commit and pre-commit.

For commit, this is more what Mick was referring to as "post-commit CI".
There are a few questions I would like answered about our current tests
(the most flaky tests, which sections of code cause the most failures,
etc.); these are hard to answer at the moment.

We don't have a good pre-commit story since it mostly relies on CircleCI.
I find that some JIRAs link CircleCI and some don't.  I find that if I
follow the CircleCI link months later (to see if the build was stable
pre-commit), Circle fails to show the workflow.

On Mon, Feb 3, 2020 at 3:42 PM Michael Shuler 
wrote:

> Only have a moment to respond, but Mick hit the highlights with
> containerization, parallelization, these help solve cleanup, speed, and
> cascading failures. Dynamic disposable slaves would be icing on that
> cake, which may require a dedicated master.
>
> One more note on jobs, or more correctly unnecessary jobs - pipelines
> have a `changeset` build condition we should tinker with. There is zero
> reason to run a job with no actual code diff. For instance, I committed
> to 2.1 this morning and merged `-s ours` nothing to the newer branches -
> there's really no reason to run and take up valuable resources with no
> actual diff changes.
> https://jenkins.io/doc/book/pipeline/syntax/#built-in-conditions
>
> Michael
>
> On 2/3/20 3:45 PM, Nate McCall wrote:
> > Mick, this is fantastic!
> >
> > I'll wait another day to see if anyone else chimes in. (Would also love
> to
> > hear from CassCI folks, anyone else really who has wrestled with this
> even
> > for internal forks).
> >
> > On Tue, Feb 4, 2020 at 10:37 AM Mick Semb Wever  wrote:
> >
> >> Nate, I leave it to you to forward what-you-chose to the board@'s
> thread.
> >>
> >>
> >>> Are there still troubles and what are they?
> >>
> >>
> >> TL;DR
> >>the ASF could provide the Cassandra community with an isolated
> jenkins
> >> installation: so that we can manage and control the Jenkins master,  as
> >> well as ensure all donated hardware for Jenkins agents are dedicated and
> >> isolated to us.
> >>
> >>
> >> The long writeup…
> >>
> >> For Cassandra's use of ASF's Jenkins I see the following problems.
> >>
> >> ** Lack of trust (aka reliability)
> >>
> >> The Jenkins agents re-use their workspaces, as opposed to using new
> >> containers per test run, leading to broken agents, disks, git clones,
> etc.
> >> One broken test run, or a broken agent, too easily affects subsequent
> test
> >> executions.
> >>
> >> The complexity (and flakiness) around our tests is a real problem.  CI
> on
> >> a project like Cassandra is a beast and the community is very limited in
> >> what it can do, it really needs the help of larger companies. Effort is
> >> required in fixing the broken, the flakey, and the ignored tests.
> >> Parallelising the tests will help by better isolating failures, but
> tests
> >> (and their execution scripts) also need to be better at cleaning up
> after
> >> themselves, or a more container approach needs to be taken.
> >>
> >> Another issue is that other projects sometimes use the agents, and
> >> Infra sometimes edits our build configurations (out of necessity).
> >>
> >>
> >> ** Lack of resources (throughput and response)
> >>
> >> Having only 9 

Feedback from the last Apache Cassandra Contributor Meeting

2020-02-03 Thread Patrick McFadin
Hi everyone,

One action item I took from our first contributor meeting was to gather
feedback for the next meetings. I've created a short survey if you would
like to offer feedback. I'll let it run for the week and report back on the
results.

https://www.surveymonkey.com/r/C95B7ZP

Thanks,

Patrick


Re: [VOTE] Release Apache Cassandra 4.0-alpha3

2020-02-03 Thread Dinesh Joshi
+1, this at least starts up on Windows ;)

Dinesh

> On Feb 3, 2020, at 3:21 PM, Mick Semb Wever  wrote:
> 
> 
>> Summary of notes:
>> - Artifact set checks out OK with regards to key sigs and checksums.
>> - CASSANDRA-14962 is an issue when not using the current deb build 
>> method (using new docker method results in different source artifact 
>> creation & use). The docker rpm build suffers the same source problem 
>> and the src.rpm is significantly larger, since I think it copies all the 
>> downloaded maven artifacts in. It's fine for now, though :)
>> - UNRELEASED deb build
> 
> 
> Thanks for the thorough review Michael.
> 
> I did not know about CASSANDRA-14962, but it should be easy to fix now that 
> the -src.tar.gz is in the dev dist location and easy to re-use. I'll see if I 
> can create a patch for that (aiming to use it on alpha4).
> 
> And I was unaware of the UNRELEASED version issue. I can put a patch in for 
> that too, going into the prepare_release.sh script. 
> 
> 
>> Next step 
>> would be to do each package-type install and startup functional testing, 
>> but I don't have that time right now :)
> 
> 
> I'm going to presume others that have voted have done package-type installs 
> and the basic testing, and move ahead. If I close the vote, I will need your 
> help Michael with the final steps running the patched finish_release.sh from 
> the `mck/14970_sha512-checksums` branch, found in 
> https://github.com/thelastpickle/cassandra-builds/blob/mck/14970_sha512-checksums/
> Because only PMC can `svn move` the files into 
> dist.apache.org/repos/dist/release/ 
> 
> And for the upload_bintray.sh script, how do I get credentials, an infra 
> ticket i presume? (ie to https://bintray.com/apache )
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
> 


-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: Fwd: [CI] What are the troubles projects face with CI and Infra

2020-02-03 Thread Michael Shuler
Only have a moment to respond, but Mick hit the highlights with 
containerization and parallelization; these help solve cleanup, speed, and 
cascading failures. Dynamic disposable slaves would be icing on that 
cake, which may require a dedicated master.


One more note on jobs, or more correctly unnecessary jobs - pipelines 
have a `changeset` build condition we should tinker with. There is zero 
reason to run a job with no actual code diff. For instance, I committed 
to 2.1 this morning and merged with `-s ours` (no changes) to the newer 
branches - there's really no reason to run and take up valuable resources 
with no actual diff changes.

https://jenkins.io/doc/book/pipeline/syntax/#built-in-conditions
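
For the archive, the same guard can be sketched as a shell check too (purely 
illustrative; the declarative `changeset` condition linked above is the 
native way to do this in a pipeline):

    # skip the run when the new commit introduces no diff, e.g. a `-s ours` merge
    if git diff --quiet HEAD^ HEAD; then
        echo "no code changes against the previous commit, skipping tests"
        exit 0
    fi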

Michael

On 2/3/20 3:45 PM, Nate McCall wrote:

Mick, this is fantastic!

I'll wait another day to see if anyone else chimes in. (Would also love to
hear from CassCI folks, anyone else really who has wrestled with this even
for internal forks).

On Tue, Feb 4, 2020 at 10:37 AM Mick Semb Wever  wrote:


Nate, I leave it to you to forward what-you-chose to the board@'s thread.



Are there still troubles and what are they?



TL;DR
   the ASF could provide the Cassandra community with an isolated jenkins
installation: so that we can manage and control the Jenkins master,  as
well as ensure all donated hardware for Jenkins agents are dedicated and
isolated to us.


The long writeup…

For Cassandra's use of ASF's Jenkins I see the following problems.

** Lack of trust (aka reliability)

The Jenkins agents re-use their workspaces, as opposed to using new
containers per test run, leading to broken agents, disks, git clones, etc.
One broken test run, or a broken agent, too easily affects subsequent test
executions.

The complexity (and flakiness) around our tests is a real problem.  CI on
a project like Cassandra is a beast and the community is very limited in
what it can do, it really needs the help of larger companies. Effort is
required in fixing the broken, the flakey, and the ignored tests.
Parallelising the tests will help by better isolating failures, but tests
(and their execution scripts) also need to be better at cleaning up after
themselves, or a more container approach needs to be taken.

Another issue is that other projects sometimes use the agents, and Infra
sometimes edits our build configurations (out of necessity).


** Lack of resources (throughput and response)

Having only 9 agents: none of which can run the large dtests; is a
problem. All 9 are from Instaclustr, much kudos! Three companies recently
have said they will donate resources, this is work in progress.

We have four release branches where we would like to provide per-commit
post-commit testing. Each complete test execution currently takes 24hr+.
Parallelising tests atm won't help much as the agents are generally
saturated (with the pipelines doing the top-level parallelisation). Once we
get more hardware in place: for the sake of improving throughput; it will
make sense to look into parallelising the tests more.

The throughput of tests will also improve with effort put into
removing/rewriting long running and inefficient tests. Also, and i think
this is LHF, throughput could be improved by using (or taking inspiration
from) Apache Yetus so as to only run tests on what is relevant in the
patch/commit. Ref:
http://yetus.apache.org/documentation/0.11.1/precommit-basic/


** Difficulty in use

Jenkins is clumsy to use compared to the CI systems we use more often
today: Travis, CircleCI, GH Actions.

One of the complaints has been that only committers can kick off CI for
patches (ie pre-commit CI runs).  But I don't believe this to be a crucial
issue for a number of reasons.

1. Thorough CI testing of a patch only needs to happen during the review
process, to which a committer needs to be involved in anyway.
2.  We don't have enough jenkins agents to handle the amount of throughput
that automated branch/patch/pull-request testing would require.
3. Our tests could allow unknown contributors to take ownership of the
agent servers (eg via the execution of bash scripts).
4. We have CircleCI working that provides basic testing for
work-in-progress patches.


Focusing on post-commit CI and having canonical results for our release
branches, i think then it boils down to the stability and throughput of
tests, and the persistence and permanence of results.

The persistence and permanence of results is a bug bear for me. It has
been partially addressed with posting the build results to the builds@
ML. But this only provides a (pretty raw) summary of the results. I'm keen
to take the next step of the posting of CI results back to committed jira
tickets (but am waiting on seeing Jenkins run stable for a while).  If we
had our own Jenkins master we could then look into retaining more/all build
results. Being able to see the longer term trends of test results as well
as execution times I hope would add the incentive to get more folk involved.

Looping back to 

Re: [VOTE] Release Apache Cassandra 4.0-alpha3

2020-02-03 Thread Mick Semb Wever


> Summary of notes:
> - Artifact set checks out OK with regards to key sigs and checksums.
> - CASSANDRA-14962 is an issue when not using the current deb build 
> method (using new docker method results in different source artifact 
> creation & use). The docker rpm build suffers the same source problem 
> and the src.rpm is significantly larger, since I think it copies all the 
> downloaded maven artifacts in. It's fine for now, though :)
> - UNRELEASED deb build


Thanks for the thorough review Michael.

I did not know about CASSANDRA-14962, but it should be easy to fix now that the 
-src.tar.gz is in the dev dist location and easy to re-use. I'll see if I can 
create a patch for that (aiming to use it on alpha4).

And I was unaware of the UNRELEASED version issue. I can put a patch in for 
that too, going into the prepare_release.sh script. 


> Next step 
> would be to do each package-type install and startup functional testing, 
> but I don't have that time right now :)


I'm going to presume others who have voted have done package-type installs and 
the basic testing, and move ahead. If I close the vote, I will need your help, 
Michael, with the final steps: running the patched finish_release.sh from the 
`mck/14970_sha512-checksums` branch, found in 
https://github.com/thelastpickle/cassandra-builds/blob/mck/14970_sha512-checksums/
because only PMC can `svn move` the files into 
dist.apache.org/repos/dist/release/
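
For reference, that promotion step is roughly the following (a sketch; the 
exact paths under dist/dev and dist/release are assumptions, finish_release.sh 
is the canonical version):

    # PMC-only: promote the voted artifacts from the dev staging area
    svn mv -m "Apache Cassandra 4.0-alpha3 release" \
        https://dist.apache.org/repos/dist/dev/cassandra/4.0-alpha3 \
        https://dist.apache.org/repos/dist/release/cassandra/4.0-alpha3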

And for the upload_bintray.sh script, how do I get credentials, an infra ticket 
i presume? (ie to https://bintray.com/apache )

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [Discuss] num_tokens default in Cassandra 4.0

2020-02-03 Thread Jon Haddad
I think it's a good idea to take a step back and get a high level view of
the problem we're trying to solve.

First, high token counts result in decreased availability as each node has
data overlap with more nodes in the cluster.  Specifically, a node can
share data with up to (RF-1) * 2 * num_tokens other nodes.  So a 256 token
cluster at RF=3 is going to almost always share data with every other node in
the cluster that isn't in the same rack, unless you're doing something wild
like using more than a thousand nodes in a cluster.  We advertise

With 16 tokens, that is vastly improved, but you still have up to 64 nodes
that each node needs to query against, so you're again hitting every node
unless you go above ~96 nodes in the cluster (assuming 3 racks / AZs).  I
wouldn't use 16 here, and I doubt any of you would either.  I've advocated
for 4 tokens because you'd have overlap with only 16 nodes, which works
well for small clusters as well as large.  Assuming I was creating a new
cluster for myself (in a hypothetical brand new application I'm building) I
would put this in production.  I have worked with several teams where I
helped them put 4 token clusters in prod and it has worked very well.  We
didn't see any wild imbalance issues.
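
Taking the (RF-1) * 2 * num_tokens figure above at face value, with RF=3:

    256 tokens: 2 * (3-1) * 256 = 1024 potential replica neighbours (effectively everyone)
     16 tokens: 2 * (3-1) * 16  =   64
      4 tokens: 2 * (3-1) * 4   =   16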

As Mick's pointed out, our current method of using random token assignment
for the default number of tokens is problematic for 4 tokens.  I fully agree with
this, and I think if we were to try to use 4 tokens, we'd want to address
this in tandem.  We can discuss how to better allocate tokens by default
(something more predictable than random), but I'd like to avoid the
specifics of that for the sake of this email.

To Alex's point, repairs are problematic with lower token counts due to
over streaming.  I think this is a pretty serious issue and we'd have to
address it before going all the way down to 4.  This, in my opinion, is a
more complex problem to solve and I think trying to fix it here could make
shipping 4.0 take even longer, something none of us want.

For the sake of shipping 4.0 without adding extra overhead and time, I'm ok
with moving to 16 tokens, and in the process adding extensive documentation
outlining what we recommend for production use.  I think we should also try
to figure out something better than random as the default to fix the data
imbalance issues.  I've got a few ideas here I've been noodling on.

As long as folks are fine with potentially changing the default again in C*
5.0 (after another discussion / debate), 16 is enough of an improvement
that I'm OK with the change, and willing to author the docs to help people
set up their first cluster.  For folks that go into production with the
defaults, we're at least not setting them up for total failure once their
clusters get large like we are now.

In future versions, we'll probably want to address the issue of data
imbalance by building something in that shifts individual tokens around.  I
don't think we should try to do this in 4.0 either.

Jon



On Fri, Jan 31, 2020 at 2:04 PM Jeremy Hanna 
wrote:

> I think Mick and Anthony make some valid operational and skew points for
> smaller/starting clusters with 4 num_tokens. There’s an arbitrary line
> between small and large clusters but I think most would agree that most
> clusters are on the small to medium side. (A small nuance is afaict the
> probabilities have to do with quorum on a full token range, ie it has to do
> with the size of a datacenter, not the full cluster.)
>
> As I read this discussion I’m personally more inclined to go with 16 for
> now. It’s true that if we could fix the skew and topology gotchas for those
> starting things up, 4 would be ideal from an availability perspective.
> However we’re still in the brainstorming stage for how to address those
> challenges. I think we should create tickets for those issues and go with
> 16 for 4.0.
>
> This is about an out of the box experience. It balances availability,
> operations (such as skew and general bootstrap friendliness and
> streaming/repair), and cluster sizing. Balancing all of those, I think for
> now I’m more comfortable with 16 as the default with docs on considerations
> and tickets to unblock 4 as the default for all users.
>
> >>> On Feb 1, 2020, at 6:30 AM, Jeff Jirsa  wrote:
> >> On Fri, Jan 31, 2020 at 11:25 AM Joseph Lynch 
> wrote:
> >> I think that we might be bikeshedding this number a bit because it is
> >> easy
> >> to debate and there is not yet one right answer.
> >
> >
> > https://www.youtube.com/watch?v=v465T5u9UKo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
>
>


Re: Fwd: [CI] What are the troubles projects face with CI and Infra

2020-02-03 Thread Nate McCall
Mick, this is fantastic!

I'll wait another day to see if anyone else chimes in. (Would also love to
hear from CassCI folks, anyone else really who has wrestled with this even
for internal forks).

On Tue, Feb 4, 2020 at 10:37 AM Mick Semb Wever  wrote:

> Nate, I leave it to you to forward what-you-chose to the board@'s thread.
>
>
> > Are there still troubles and what are they?
>
>
> TL;DR
>   the ASF could provide the Cassandra community with an isolated jenkins
> installation: so that we can manage and control the Jenkins master,  as
> well as ensure all donated hardware for Jenkins agents are dedicated and
> isolated to us.
>
>
> The long writeup…
>
> For Cassandra's use of ASF's Jenkins I see the following problems.
>
> ** Lack of trust (aka reliability)
>
> The Jenkins agents re-use their workspaces, as opposed to using new
> containers per test run, leading to broken agents, disks, git clones, etc.
> One broken test run, or a broken agent, too easily affects subsequent test
> executions.
>
> The complexity (and flakiness) around our tests is a real problem.  CI on
> a project like Cassandra is a beast and the community is very limited in
> what it can do, it really needs the help of larger companies. Effort is
> required in fixing the broken, the flakey, and the ignored tests.
> Parallelising the tests will help by better isolating failures, but tests
> (and their execution scripts) also need to be better at cleaning up after
> themselves, or a more container approach needs to be taken.
>
> Another issue is that other projects sometimes use the agents, and Infra
> sometimes edits our build configurations (out of necessity).
>
>
> ** Lack of resources (throughput and response)
>
> Having only 9 agents: none of which can run the large dtests; is a
> problem. All 9 are from Instaclustr, much kudos! Three companies recently
> have said they will donate resources, this is work in progress.
>
> We have four release branches where we would like to provide per-commit
> post-commit testing. Each complete test execution currently takes 24hr+.
> Parallelising tests atm won't help much as the agents are generally
> saturated (with the pipelines doing the top-level parallelisation). Once we
> get more hardware in place: for the sake of improving throughput; it will
> make sense to look into parallelising the tests more.
>
> The throughput of tests will also improve with effort put into
> removing/rewriting long running and inefficient tests. Also, and i think
> this is LHF, throughput could be improved by using (or taking inspiration
> from) Apache Yetus so as to only run tests on what is relevant in the
> patch/commit. Ref:
> http://yetus.apache.org/documentation/0.11.1/precommit-basic/
>
>
> ** Difficulty in use
>
> Jenkins is clumsy to use compared to the CI systems we use more often
> today: Travis, CircleCI, GH Actions.
>
> One of the complaints has been that only committers can kick off CI for
> patches (ie pre-commit CI runs).  But I don't believe this to be a crucial
> issue for a number of reasons.
>
> 1. Thorough CI testing of a patch only needs to happen during the review
> process, to which a committer needs to be involved in anyway.
> 2.  We don't have enough jenkins agents to handle the amount of throughput
> that automated branch/patch/pull-request testing would require.
> 3. Our tests could allow unknown contributors to take ownership of the
> agent servers (eg via the execution of bash scripts).
> 4. We have CircleCI working that provides basic testing for
> work-in-progress patches.
>
>
> Focusing on post-commit CI and having canonical results for our release
> branches, i think then it boils down to the stability and throughput of
> tests, and the persistence and permanence of results.
>
> The persistence and permanence of results is a bug bear for me. It has
> been partially addressed with posting the build results to the builds@
> ML. But this only provides a (pretty raw) summary of the results. I'm keen
> to take the next step of the posting of CI results back to committed jira
> tickets (but am waiting on seeing Jenkins run stable for a while).  If we
> had our own Jenkins master we could then look into retaining more/all build
> results. Being able to see the longer term trends of test results as well
> as execution times I hope would add the incentive to get more folk involved.
>
> Looping back to the ASF and what they could do: it would help us a lot in
> improving the stability and usability issues by providing us an isolated
> jenkins. Having our own master would simplify the setup, use and debugging,
> of Jenkins. It would still require some sunk cost but hopefully we'd end up
> with something better tailored to our needs. And with isolated agents help
> restore confidence.
>
> regards,
> Mick
>
> PS i really want to hear from those that were involved in the past with
> cassci, your skills and experience on this topic surpass anything i got.
>
>
>
> On Sun, 2 Feb 2020, at 22:51, 

Re: Fwd: [CI] What are the troubles projects face with CI and Infra

2020-02-03 Thread Mick Semb Wever
Nate, I leave it to you to forward what-you-chose to the board@'s thread.


> Are there still troubles and what are they?


TL;DR
  the ASF could provide the Cassandra community with an isolated jenkins 
installation, so that we can manage and control the Jenkins master, as well as 
ensure all donated hardware for Jenkins agents is dedicated and isolated to us.


The long writeup…

For Cassandra's use of ASF's Jenkins I see the following problems.

** Lack of trust (aka reliability)

The Jenkins agents re-use their workspaces, as opposed to using new containers 
per test run, leading to broken agents, disks, git clones, etc. One broken test 
run, or a broken agent, too easily affects subsequent test executions.

The complexity (and flakiness) around our tests is a real problem.  CI on a 
project like Cassandra is a beast and the community is very limited in what it 
can do; it really needs the help of larger companies. Effort is required in 
fixing the broken, the flaky, and the ignored tests. Parallelising the tests 
will help by better isolating failures, but tests (and their execution scripts) 
also need to be better at cleaning up after themselves, or a more 
container-based approach needs to be taken.
 
Another issue is that other projects sometimes use the agents, and Infra 
sometimes edits our build configurations (out of necessity).


** Lack of resources (throughput and response)

Having only 9 agents, none of which can run the large dtests, is a problem. All 
9 are from Instaclustr, much kudos! Three companies recently have said they 
will donate resources; this is a work in progress.

We have four release branches where we would like to provide per-commit 
post-commit testing. Each complete test execution currently takes 24hr+. 
Parallelising tests atm won't help much as the agents are generally saturated 
(with the pipelines doing the top-level parallelisation). Once we get more 
hardware in place, for the sake of improving throughput, it will make sense to 
look into parallelising the tests more.

The throughput of tests will also improve with effort put into 
removing/rewriting long-running and inefficient tests. Also, and i think this 
is LHF, throughput could be improved by using (or taking inspiration from) 
Apache Yetus so as to only run tests on what is relevant in the patch/commit. Ref: 
http://yetus.apache.org/documentation/0.11.1/precommit-basic/
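
A very rough sketch of that LHF idea (the path-to-suite mapping here is 
invented purely for illustration; Yetus does this properly):

    changed=$(git diff --name-only HEAD^ HEAD)

    # docs-only patches need no test run at all
    if ! echo "$changed" | grep -qv '^doc/'; then
        echo "docs-only change, skipping test suites"; exit 0
    fi
    # otherwise pick suites from what was touched
    echo "$changed" | grep -q '^src/java/' && echo "would run: ant test"
    echo "$changed" | grep -q '^pylib/'    && echo "would run: cqlsh tests"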


** Difficulty in use

Jenkins is clumsy to use compared to the CI systems we use more often today: 
Travis, CircleCI, GH Actions.

One of the complaints has been that only committers can kick off CI for patches 
(ie pre-commit CI runs).  But I don't believe this to be a crucial issue for a 
number of reasons. 

1. Thorough CI testing of a patch only needs to happen during the review 
process, which a committer needs to be involved in anyway.
2.  We don't have enough jenkins agents to handle the amount of throughput that 
automated branch/patch/pull-request testing would require.
3. Our tests could allow unknown contributors to take ownership of the agent 
servers (eg via the execution of bash scripts).
4. We have CircleCI working that provides basic testing for work-in-progress 
patches.


Focusing on post-commit CI and having canonical results for our release 
branches, i think then it boils down to the stability and throughput of tests, 
and the persistence and permanence of results.

The persistence and permanence of results is a bugbear for me. It has been 
partially addressed with posting the build results to the builds@ ML. But this 
only provides a (pretty raw) summary of the results. I'm keen to take the next 
step of posting CI results back to committed jira tickets (but am waiting on 
seeing Jenkins run stable for a while).  If we had our own Jenkins master we 
could then look into retaining more/all build results. Being able to see the 
longer-term trends of test results as well as execution times would, I hope, 
add the incentive to get more folk involved.

Looping back to the ASF and what they could do: it would help us a lot in 
improving the stability and usability issues by providing us with an isolated 
jenkins. Having our own master would simplify the setup, use, and debugging of 
Jenkins. It would still require some sunk cost, but hopefully we'd end up with 
something better tailored to our needs, and the isolated agents would help 
restore confidence.

regards,
Mick

PS i really want to hear from those that were involved in the past with cassci, 
your skills and experience on this topic surpass anything i got.



On Sun, 2 Feb 2020, at 22:51, Nate McCall wrote:
> Hi folks,
> The board is looking for feedback on CI infrastructure. I'm happy to take
> some (constructive) comments back. (Shuler, Mick and David Capwell
> specifically as folks who've most recently wrestled with this a fair bit).
> 
> Thanks,
> -Nate
> 
> -- Forwarded message -
> From: Dave Fisher 
> Date: Mon, Feb 3, 2020 at 8:58 AM
> Subject: [CI] 

Re: Testing out JIRA as replacement for cwiki tracking of 4.0 quality testing

2020-02-03 Thread Joshua McKenzie
From the people that have modified this page in the past, what are your
thoughts? Good for me to pull the rest into JIRA and redirect from the
wiki?
+joey lynch
+scott andreas
+sumanth pasupuleti
+marcus eriksson
+romain hardouin


On Mon, Feb 3, 2020 at 8:57 AM Joshua McKenzie  wrote:

> what we really need is
>> some dedicated PM time going forward. Is that something you think you can
>> help resource from your side?
>
> Not a ton, but I think enough yes.
>
> (Also, thanks for all the efforts exploring this either way!!)
>
> Happy to help.
>
> On Sun, Feb 2, 2020 at 2:46 PM Nate McCall  wrote:
>
>> > 
>> > My .02: I think it'd improve our ability to collaborate and lower
>> friction
>> > to testing if we could do so on JIRA instead of the cwiki. *I suspect
>> *the
>> > edit access restrictions there plus general UX friction (difficult to
>> have
>> > collab discussion, comment chains, links to things, etc) make the
>> Confluence
>> > wiki a worse tool for this job than JIRA. Plus if we do it in JIRA we
>> can
>> > track the outstanding scope in the single board and it's far easier to
>> > visualize everything in one place so we can all know where attention and
>> > resources need to be directed to best move the needle on things.
>> >
>> > But that's just my opinion. What does everyone else think? Like the JIRA
>> > route? Hate it? No opinion?
>> >
>> > If we do decide we want to go the epic / JIRA route, I'd be happy to
>> > migrate the rest of the information in there for things that haven't
>> been
>> > completed yet on the wiki (ticket creation, assignee/reviewer chains,
>> links
>> > to epic).
>> >
>> > So what does everyone think?
>> >
>>
>> I think this is a good idea. Having the resources available to keep the
>> various bits twiddled correctly on existing and new issues has always been
>> the hard part for us. So regardless of the path, what we really need is
>> some dedicated PM time going forward. Is that something you think you can
>> help resource from your side?
>>
>> (Also, thanks for all the efforts exploring this either way!!)
>>
>


Re: Testing out JIRA as replacement for cwiki tracking of 4.0 quality testing

2020-02-03 Thread Joshua McKenzie
>
> what we really need is
> some dedicated PM time going forward. Is that something you think you can
> help resource from your side?

Not a ton, but I think enough yes.

(Also, thanks for all the efforts exploring this either way!!)

Happy to help.

On Sun, Feb 2, 2020 at 2:46 PM Nate McCall  wrote:

> > 
> > My .02: I think it'd improve our ability to collaborate and lower
> friction
> > to testing if we could do so on JIRA instead of the cwiki. *I suspect
> *the
> > edit access restrictions there plus general UX friction (difficult to
> have
> > collab discussion, comment chains, links to things, etc) make the
> Confluence
> > wiki a worse tool for this job than JIRA. Plus if we do it in JIRA we can
> > track the outstanding scope in the single board and it's far easier to
> > visualize everything in one place so we can all know where attention and
> > resources need to be directed to best move the needle on things.
> >
> > But that's just my opinion. What does everyone else think? Like the JIRA
> > route? Hate it? No opinion?
> >
> > If we do decide we want to go the epic / JIRA route, I'd be happy to
> > migrate the rest of the information in there for things that haven't been
> > completed yet on the wiki (ticket creation, assignee/reviewer chains,
> links
> > to epic).
> >
> > So what does everyone think?
> >
>
> I think this is a good idea. Having the resources available to keep the
> various bits twiddled correctly on existing and new issues has always been
> the hard part for us. So regardless of the path, what we really need is
> some dedicated PM time going forward. Is that something you think you can
> help resource from your side?
>
> (Also, thanks for all the efforts exploring this either way!!)
>