Re: [DISCUSS] ci-hbase capacity and what we'd like to see tested

2022-04-20 Thread Gavin McDonald
Hi Sean, all 

Where are we with this? I haven't seen any INFRA JIRAs about this so far.

I'm eager to help and get this going.

Gav... (ASF Infra)


Re: [DISCUSS] ci-hbase capacity and what we'd like to see tested

2022-03-15 Thread Sean Busbey
It sounds like we have consensus on the general approach.

I'll file some JIRAs for the work. I have some time to put towards
implementing things. If folks want to help move things along faster
they're welcome to pitch in.


Re: [DISCUSS] ci-hbase capacity and what we'd like to see tested

2022-03-14 Thread Nick Dimiduk
I think this would be a major improvement for our nightly test coverage.

Dare I ask, could we provision a fully distributed Kubernetes cluster
for our testing?

Regardless, +1 from me on your proposal.


Re: [DISCUSS] ci-hbase capacity and what we'd like to see tested

2022-03-14 Thread Peter Somogyi
+1, Great idea!


Re: [DISCUSS] ci-hbase capacity and what we'd like to see tested

2022-03-13 Thread Andrew Purtell
I like the idea of multi-node testing, because the mini cluster does an
admirable job but cannot truly emulate a production deployment because of
various singletons in our code or that of our dependencies. It would also be
pretty nice if k8s were the substrate (and ultimately were used to inject
chaos too, via hbase-it), because it would help us detect if we violate
required discipline for that common deployment target, like inappropriate
caching of DNS resolutions concurrent with pod cycling, to pick a historical
example.
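
(To sketch that pod-cycling case with made-up resource names: cycle a region
server pod so it comes back under a new IP, then confirm clients recover
without a restart. Nothing below exists today; it is only the shape of the
check a k8s-based chaos step could automate.)

  # illustrative only; resource names here are hypothetical
  kubectl delete pod hbase-regionserver-0
  kubectl wait --for=condition=Ready pod/hbase-regionserver-0 --timeout=10m
  # a client that cached the old resolution indefinitely would now fail to
  # reconnect; capping the JVM DNS cache (e.g. -Dsun.net.inetaddr.ttl=30 in
  # HBASE_OPTS) is the kind of discipline this would keep honest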

Even just periodic execution of ITBLL would be nice. 

So I guess the next question is what that requires of us, the larger
community. Who proposes the work? Who performs it? Should we open some JIRAs
to kick things off?


Re: [DISCUSS] ci-hbase capacity and what we'd like to see tested

2022-03-13 Thread Duo Zhang
I'm +1 on having some nodes to run ITBLL. This could improve stability
a lot.


[DISCUSS] ci-hbase capacity and what we'd like to see tested

2022-03-10 Thread Sean Busbey
Hi folks!

Quick background: all of the automated testing for nightly and PR
contributions is now running on a dedicated Jenkins instance
(ci-hbase.apache.org). We moved our existing 10 dedicated nodes off
the ci-hadoop controller and, thanks to a new anonymous donor, we were
able to add an additional 10 nodes.

The new donor contributed enough that we can make some
decisions as a community about expanding these resources further.

The 10 new nodes run 2 executors each (same as our old nodes), are
considered "medium" by the provider we're getting them from, and have
this shape:

64GB DDR4 ECC RAM
Intel® Xeon® E-2176G hexa-core (Coffee Lake) processor with Hyper-Threading
2 x 960 GB NVMe SSD Datacenter Edition (RAID 1)

To give an idea of what the current testing workload of our project
looks like, we can use the built-in Jenkins utilization tooling for
our general-purpose label 'hbase'[0].

If you look at the last 4 days of utilization[1], we have a couple of
periods with a small backlog of ~2 executors' worth of work. The
measurements are heavily rolled up, so it's hard to tell specifics. On
the chart of the last day or so[2] we can see two periods of 1-2 hours
where we have a backlog of 2-4 executors' worth of work.

For comparison, the chart from immediately after we had to burn off ~3
days of backlog (our worker nodes were offline at the end of February)
shows no queue[3].

I think we could benefit from adding 1-2 additional medium worker
nodes, but the long periods where ~half our executors sit idle make me
think some refactoring or timing changes would be a better way to
improve our current steady-state workload.

One thing we currently lack is robust integration testing of a
cluster deployment. At the moment our nightly jobs spin up a test that
stands up a single-node Hadoop and then a single-node HBase on top of
it, and then runs a trivial functionality test[4].
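
(Very roughly, that check has the shape sketched below. This is only a sketch,
not the actual contents of the script in [4]; the HADOOP_HOME/HBASE_HOME paths
and the table/column names are illustrative.)

  # sketch: stand up single-node HDFS, single-node HBase, poke it, tear it down
  "${HADOOP_HOME}/bin/hdfs" namenode -format
  "${HADOOP_HOME}/bin/hdfs" --daemon start namenode
  "${HADOOP_HOME}/bin/hdfs" --daemon start datanode
  "${HBASE_HOME}/bin/start-hbase.sh"
  echo "create 't1', 'f'; put 't1', 'r1', 'f:c', 'v'; scan 't1'" | \
    "${HBASE_HOME}/bin/hbase" shell -n
  "${HBASE_HOME}/bin/stop-hbase.sh"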

The host provider we use for Jenkins worker nodes has a large node shaped like:
160GB RAM
Intel® Xeon® W-2295 18-core (Cascade Lake-W) processor with Hyper-Threading
2 x 960GB NVMe drives as RAID 1

A pretty short path to improvement would be to get 1 or 2 of these
nodes and move our integration test to use the minikube project[5] to
run a local Kubernetes environment. We could then deploy a small but
multi-node Hadoop and HBase cluster and run e.g. ITBLL against it, in
addition to whatever checking of CLI commands, shell expectations,
etc.
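
(To make that concrete, a minimal sketch of what such a job might run follows.
It assumes a recent minikube with multi-node support plus Hadoop/HBase
Kubernetes manifests we would still have to write; the paths, labels, and
ITBLL arguments below are placeholders, not existing artifacts.)

  # sketch: small multi-node k8s, deploy Hadoop + HBase, run ITBLL against it
  minikube start --nodes=4 --cpus=4 --memory=16g
  kubectl apply -f dev-support/k8s/hadoop/     # hypothetical manifests
  kubectl apply -f dev-support/k8s/hbase/
  kubectl wait --for=condition=Ready pod -l app=hbase-regionserver --timeout=15m
  # hbase-it's IntegrationTestBigLinkedList; these arguments are illustrative,
  # the real ones follow the class's usage text
  hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList loop 1 1 1000000 /tmp/ITBLL 1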

What do y'all think?

[0]: https://ci-hbase.apache.org/label/hbase/load-statistics
[1]: 
https://issues.apache.org/jira/secure/attachment/13040941/ci-hbase-long-graph-20220310.png
[2]: 
https://issues.apache.org/jira/secure/attachment/13040940/ci-hbase-medium-graph-20220310.png
[3]: 
https://issues.apache.org/jira/secure/attachment/13040939/ci-hbase-medium-graph-20220223.png
[4]: 
https://github.com/apache/hbase/blob/master/dev-support/hbase_nightly_pseudo-distributed-test.sh
[5]: https://minikube.sigs.k8s.io/docs/