Re: Replication-related IT failures

2020-02-19 Thread Christopher
Update: since it has been 3 weeks without indication of a maintainer
picking up responsibility for this feature and its tests, I've opened
a pull request to disable the related ITs
(https://github.com/apache/accumulo/pull/1524). I have no intention of
advocating for removal of the feature until at least 3.0 (whenever
that is, but probably not any time soon). In fairness to users, though,
it may be beneficial to call out the feature in the release notes for
the next few releases as being in need of a new maintainer, if the
status doesn't change before release.

On Fri, Jan 31, 2020 at 7:07 PM Christopher  wrote:
>
> On Fri, Jan 31, 2020 at 10:30 AM Josh Elser  wrote:
> >
> > I'm really upset that you think suggesting removal of the feature is
> > appropriate.
>
> I'm sorry you are upset.
>
> If it's any comfort at all, I've already stated that I think it's
> premature to discuss removal. So, my intent isn't to advocate for
> that. Rather, my intent is to honestly communicate the current state
> and the stakes that *could* come to pass *if* that feature is left
> unmaintained. I think it's important to communicate that as a risk, so
> that people who have an interest in the feature understand how
> critical it is (to their own interests) for somebody (perhaps them) to
> step in and maintain it, given the current situation they've just been
> informed about.
>
>
> >
> > More installations of HBase than not use replication (and IMO HBase
> > should be considered Accumulo's biggest competitor). The only HBase
> > users I see without a disaster recovery plan are running
> > developer-focused instances with zero uptime guarantees. I'll even go
> > so far as to say: any user who deploys a database into a production
> > scenario would *require* a D/R solution for that database before it
> > would be allowed to be called "production".
> >
> > Yes, there are D/R solutions that can be implemented at the data
> > processing layer, but this is almost always less ideal, as the cost of
> > reprocessing and shipping the raw data is much greater than the cost
> > of Accumulo replication.
>
> From my perspective, this isn't about the replication feature at all.
> It's about resources to maintain a feature that is suffering from
> technical debt. If there's nobody to maintain it, and it starts
> interfering with the maintenance of the rest of Accumulo, what can we
> do? I feel like our options are limited... but for now, my goal is
> just to get this on potential contributors' radars by talking about
> it.
>
> >
> > While I am deflated that no other developers have seen this and have any
> > interest in helping work through bugs/issues, they are volunteers and I
> > can only be sad about this. However, I will not let an argument which
> > equates to "we should junk the car because it has a flat tire" go
> > without response.
>
> I'm not saying we should "junk the car" just yet. What I am saying is
> "If nobody fixes that tire, we're going to have to decide in the
> future what to do with it, because we can't keep driving a car with a
> flat tire... it will keep doing damage to the car and putting its
> passengers' lives at risk". But, first and foremost, I'm saying "Hey,
> we have a flat tire! Can anybody fix it?".
>
>
> >
> > On 1/28/20 10:58 PM, Christopher wrote:
> > > As succinctly as I can:
> > >
> > > 1. Replication-related ITs have been flakey for a long time,
> > > 2. The feature is not actively maintained (critical, or at least,
> > > untriaged issues exist dating back to 2014 in JIRA),
> > > 3. No volunteers have stepped up thus far to maintain the ITs and
> > > make them reliable, or to develop/maintain replication itself,
> > > 4. I don't have time to fix the flakey ITs, and have neither the
> > > interest nor a use case for maintaining the feature,
> > > 5. The IT breakages interfere with build testing on CI servers and
> > > for releases.
> > >
> > > Therefore:
> > >
> > > A. I want to @Ignore the flakey ITs, so they don't keep interfering
> > > with test builds,
> > > B. We can re-enable the ITs if/when a volunteer contributes
> > > reliability fixes for them,
> > > C. If nobody steps up, we should have a separate conversation about
> > > possibly phasing out the feature and what that would look like.
> > >
> > > The conversation I suggest in "C" is a bit premature right now. I'm
> > > starting with this email to see if any volunteers want to step up.
> > >
> > > Even if somebody steps up immediately, they may not have a fix
> > > immediately. So, if there are no objections, I'm going to disable the
> > > flakey tests soon by adding the '@Ignore' JUnit annotation until a fix
> > > is contributed, so they don't keep getting in the way of
> > > troubleshooting other build-related issues. We already know they are
> > > flakey... the constant failures aren't telling us anything new, so the
> > > tests aren't useful as is.
> > >


Re: Replication-related IT failures

2020-01-31 Thread Christopher
On Fri, Jan 31, 2020 at 10:30 AM Josh Elser  wrote:
>
> I'm really upset that you think suggesting removal of the feature is
> appropriate.

I'm sorry you are upset.

If it's any comfort at all, I've already stated that I think it's
premature to discuss removal. So, my intent isn't to advocate for
that. Rather, my intent is to honestly communicate the current state
and the stakes that *could* come to pass *if* that feature is left
unmaintained. I think it's important to communicate that as a risk, so
that people who have an interest in the feature understand how
critical it is (to their own interests) for somebody (perhaps them) to
step in and maintain it, given the current situation they've just been
informed about.


>
> More installations of HBase than not use replication (and IMO HBase
> should be considered Accumulo's biggest competitor). The only HBase
> users I see without a disaster recovery plan are running
> developer-focused instances with zero uptime guarantees. I'll even go
> so far as to say: any user who deploys a database into a production
> scenario would *require* a D/R solution for that database before it
> would be allowed to be called "production".
>
> Yes, there are D/R solutions that can be implemented at the data
> processing layer, but this is almost always less ideal, as the cost of
> reprocessing and shipping the raw data is much greater than the cost
> of Accumulo replication.

From my perspective, this isn't about the replication feature at all.
It's about resources to maintain a feature that is suffering from
technical debt. If there's nobody to maintain it, and it starts
interfering with the maintenance of the rest of Accumulo, what can we
do? I feel like our options are limited... but for now, my goal is
just to get this on potential contributors' radars by talking about
it.

>
> While I am deflated that no other developers have seen this and have any
> interest in helping work through bugs/issues, they are volunteers and I
> can only be sad about this. However, I will not let an argument which
> equates to "we should junk the car because it has a flat tire" go
> without response.

I'm not saying we should "junk the car" just yet. What I am saying is
"If nobody fixes that tire, we're going to have to decide in the
future what to do with it, because we can't keep driving a car with a
flat tire... it will keep doing damage to the car and putting its
passengers' lives at risk". But, first and foremost, I'm saying "Hey,
we have a flat tire! Can anybody fix it?".


>
> On 1/28/20 10:58 PM, Christopher wrote:
> > As succinctly as I can:
> >
> > 1. Replication-related ITs have been flakey for a long time,
> > 2. The feature is not actively maintained (critical, or at least,
> > untriaged issues exist dating back to 2014 in JIRA),
> > 3. No volunteers have stepped up thus far to maintain the ITs and
> > make them reliable, or to develop/maintain replication itself,
> > 4. I don't have time to fix the flakey ITs, and have neither the
> > interest nor a use case for maintaining the feature,
> > 5. The IT breakages interfere with build testing on CI servers and
> > for releases.
> >
> > Therefore:
> >
> > A. I want to @Ignore the flakey ITs, so they don't keep interfering
> > with test builds,
> > B. We can re-enable the ITs if/when a volunteer contributes
> > reliability fixes for them,
> > C. If nobody steps up, we should have a separate conversation about
> > possibly phasing out the feature and what that would look like.
> >
> > The conversation I suggest in "C" is a bit premature right now. I'm
> > starting with this email to see if any volunteers want to step up.
> >
> > Even if somebody steps up immediately, they may not have a fix
> > immediately. So, if there are no objections, I'm going to disable the
> > flakey tests soon by adding the '@Ignore' JUnit annotation until a fix
> > is contributed, so they don't keep getting in the way of
> > troubleshooting other build-related issues. We already know they are
> > flakey... the constant failures aren't telling us anything new, so the
> > tests aren't useful as is.
> >


Re: Replication-related IT failures

2020-01-31 Thread Adam J. Shook
I see the value of having the replication system remain intact in Accumulo
and would vote to keep it.  From my personal experience using it, it works
well, but we ultimately ended up disabling replication in production due to
the high latency.  It is still used in lower, non-production environments
where latency is less of a concern.  Additionally, since we don't know the
full user base of Accumulo, I cannot personally recommend that the feature
be phased out.

As far as the flakey integration tests are concerned, if no one steps up to
work on them, I am +1 on adding @Ignore.

On Fri, Jan 31, 2020 at 10:30 AM Josh Elser  wrote:

> I'm really upset that you think suggesting removal of the feature is
> appropriate.
>
> More installations of HBase than not use replication (and IMO HBase
> should be considered Accumulo's biggest competitor). The only HBase
> users I see without a disaster recovery plan are running
> developer-focused instances with zero uptime guarantees. I'll even go
> so far as to say: any user who deploys a database into a production
> scenario would *require* a D/R solution for that database before it
> would be allowed to be called "production".
>
> Yes, there are D/R solutions that can be implemented at the data
> processing layer, but this is almost always less ideal, as the cost of
> reprocessing and shipping the raw data is much greater than the cost
> of Accumulo replication.
>
> While I am deflated that no other developers have seen this and have any
> interest in helping work through bugs/issues, they are volunteers and I
> can only be sad about this. However, I will not let an argument which
> equates to "we should junk the car because it has a flat tire" go
> without response.
>
> On 1/28/20 10:58 PM, Christopher wrote:
> > As succinctly as I can:
> >
> > 1. Replication-related ITs have been flakey for a long time,
> > 2. The feature is not actively maintained (critical, or at least,
> > untriaged issues exist dating back to 2014 in JIRA),
> > 3. No volunteers have stepped up thus far to maintain the ITs and
> > make them reliable, or to develop/maintain replication itself,
> > 4. I don't have time to fix the flakey ITs, and have neither the
> > interest nor a use case for maintaining the feature,
> > 5. The IT breakages interfere with build testing on CI servers and
> > for releases.
> >
> > Therefore:
> >
> > A. I want to @Ignore the flakey ITs, so they don't keep interfering
> > with test builds,
> > B. We can re-enable the ITs if/when a volunteer contributes
> > reliability fixes for them,
> > C. If nobody steps up, we should have a separate conversation about
> > possibly phasing out the feature and what that would look like.
> >
> > The conversation I suggest in "C" is a bit premature right now. I'm
> > starting with this email to see if any volunteers want to step up.
> >
> > Even if somebody steps up immediately, they may not have a fix
> > immediately. So, if there are no objections, I'm going to disable the
> > flakey tests soon by adding the '@Ignore' JUnit annotation until a fix
> > is contributed, so they don't keep getting in the way of
> > troubleshooting other build-related issues. We already know they are
> > flakey... the constant failures aren't telling us anything new, so the
> > tests aren't useful as is.
> >
>


Re: Replication-related IT failures

2020-01-31 Thread Josh Elser
I'm really upset that you think suggesting removal of the feature is 
appropriate.


More installations of HBase than not use replication (and IMO HBase
should be considered Accumulo's biggest competitor). The only HBase
users I see without a disaster recovery plan are running
developer-focused instances with zero uptime guarantees. I'll even go
so far as to say: any user who deploys a database into a production
scenario would *require* a D/R solution for that database before it
would be allowed to be called "production".


Yes, there are D/R solutions that can be implemented at the data
processing layer, but this is almost always less ideal, as the cost of
reprocessing and shipping the raw data is much greater than the cost
of Accumulo replication.


While I am deflated that no other developers have seen this and have any 
interest in helping work through bugs/issues, they are volunteers and I 
can only be sad about this. However, I will not let an argument which 
equates to "we should junk the car because it has a flat tire" go 
without response.


On 1/28/20 10:58 PM, Christopher wrote:

As succinctly as I can:

1. Replication-related ITs have been flakey for a long time,
2. The feature is not actively maintained (critical, or at least,
untriaged issues exist dating back to 2014 in JIRA),
3. No volunteers have stepped up thus far to maintain the ITs and make
them reliable, or to develop/maintain replication itself,
4. I don't have time to fix the flakey ITs, and have neither the
interest nor a use case for maintaining the feature,
5. The IT breakages interfere with build testing on CI servers and for releases.

Therefore:

A. I want to @Ignore the flakey ITs, so they don't keep interfering
with test builds,
B. We can re-enable the ITs if/when a volunteer contributes
reliability fixes for them,
C. If nobody steps up, we should have a separate conversation about
possibly phasing out the feature and what that would look like.

The conversation I suggest in "C" is a bit premature right now. I'm
starting with this email to see if any volunteers want to step up.

Even if somebody steps up immediately, they may not have a fix
immediately. So, if there are no objections, I'm going to disable the
flakey tests soon by adding the '@Ignore' JUnit annotation until a fix
is contributed, so they don't keep getting in the way of
troubleshooting other build-related issues. We already know they are
flakey... the constant failures aren't telling us anything new, so the
tests aren't useful as is.



Replication-related IT failures

2020-01-28 Thread Christopher
As succinctly as I can:

1. Replication-related ITs have been flakey for a long time,
2. The feature is not actively maintained (critical, or at least,
untriaged issues exist dating back to 2014 in JIRA),
3. No volunteers have stepped up thus far to maintain the ITs and make
them reliable, or to develop/maintain replication itself,
4. I don't have time to fix the flakey ITs, and have neither the
interest nor a use case for maintaining the feature,
5. The IT breakages interfere with build testing on CI servers and for releases.

Therefore:

A. I want to @Ignore the flakey ITs, so they don't keep interfering
with test builds,
B. We can re-enable the ITs if/when a volunteer contributes
reliability fixes for them,
C. If nobody steps up, we should have a separate conversation about
possibly phasing out the feature and what that would look like.

The conversation I suggest in "C" is a bit premature right now. I'm
starting with this email to see if any volunteers want to step up.

Even if somebody steps up immediately, they may not have a fix
immediately. So, if there are no objections, I'm going to disable the
flakey tests soon by adding the '@Ignore' JUnit annotation until a fix
is contributed, so they don't keep getting in the way of
troubleshooting other build-related issues. We already know they are
flakey... the constant failures aren't telling us anything new, so the
tests aren't useful as is.
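
To make the mechanics concrete, here is a minimal sketch of what
disabling one of these tests would look like in JUnit 4 (the class and
method names below are hypothetical, not the actual replication ITs):

import org.junit.Ignore;
import org.junit.Test;

// Hypothetical example class; the real replication ITs have other names.
public class ExampleReplicationIT {

  // @Ignore tells JUnit 4 to skip this test entirely (with an optional
  // reason string), so a known-flakey test stops failing CI builds.
  @Ignore("Flakey; disabled until a maintainer contributes a reliability fix")
  @Test
  public void replicationEndToEnd() {
    // ... test body omitted ...
  }
}

The annotation can also be placed on the class itself to skip every
test in the file, and removing it is all that is needed to re-enable a
test once a reliability fix is contributed.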