Re: Replication-related IT failures
Update: since it has been 3 weeks without any indication of a maintainer picking up responsibility for this feature and its tests, I've opened a pull request to disable the related ITs (https://github.com/apache/accumulo/pull/1524). I have no intention of advocating for removal of the feature until at least 3.0 (whenever that is, but probably not any time soon). In fairness to users, though, it may be beneficial to call out the feature in the release notes as being in need of a new maintainer, for the next few releases, if the status doesn't change before release.

On Fri, Jan 31, 2020 at 7:07 PM Christopher wrote:
>
> On Fri, Jan 31, 2020 at 10:30 AM Josh Elser wrote:
> >
> > I'm really upset that you think suggesting removal of the feature is appropriate.
>
> I'm sorry you are upset.
>
> If it's any comfort at all, I've already stated that I think it's premature to discuss removal. So, my intent isn't to advocate for that. Rather, my intent is to honestly communicate the current state and the stakes that *could* come to pass *if* that feature is left unmaintained. I think it's important to communicate that as a risk, so that people who have an interest in the feature understand how critical it is (to their own interests) for somebody (perhaps them) to step in and maintain it, given the current situation they've just been informed about.
>
> > More installations than not of HBase (which, IMO, should be considered Accumulo's biggest competitor) use replication. The only users of HBase I see without a disaster recovery plan are developer-focused instances with zero uptime guarantees. I'll even go farther to say: any user who deploys a database into a production scenario would *require* a D/R solution for that database before it would be allowed to be called "production".
> >
> > Yes, there are D/R solutions that can be implemented at the data processing layer, but this is almost always less ideal, as the cost of reprocessing and shipping the raw data is much greater than what Accumulo replication could do.
>
> From my perspective, this isn't about the replication feature at all. It's about resources to maintain a feature that is suffering from technical debt. If there's nobody to maintain it, and it starts interfering with the maintenance of the rest of Accumulo, what can we do? I feel like our options are limited... but for now, my goal is just to get this on potential contributors' radars by talking about it.
>
> > While I am deflated that no other developers have seen this and have any interest in helping work through bugs/issues, they are volunteers and I can only be sad about this. However, I will not let an argument which equates to "we should junk the car because it has a flat tire" go without response.
>
> I'm not saying we should "junk the car" just yet. What I am saying is "If nobody fixes that tire, we're going to have to decide in the future what to do with it, because we can't keep driving a car with a flat tire... it will keep doing damage to the car and putting its passengers' lives at risk". But, first and foremost, I'm saying "Hey, we have a flat tire! Can anybody fix it?".
>
> > On 1/28/20 10:58 PM, Christopher wrote:
> > > As succinctly as I can:
> > >
> > > 1. Replication-related ITs have been flakey for a long time,
> > > 2. The feature is not actively maintained (critical, or at least, untriaged issues exist dating back to 2014 in JIRA),
> > > 3. No volunteers have stepped up thus far to maintain them and make them reliable or to develop/maintain replication,
> > > 4. I don't have time to fix the flakey ITs, and don't have interest in or a use case for maintaining the feature,
> > > 5. The IT breakages interfere with build testing on CI servers and for releases.
> > >
> > > Therefore:
> > >
> > > A. I want to @Ignore the flakey ITs, so they don't keep interfering with test builds,
> > > B. We can re-enable the ITs if/when a volunteer contributes reliability fixes for them,
> > > C. If nobody steps up, we should have a separate conversation about possibly phasing out the feature and what that would look like.
> > >
> > > The conversation I suggest in "C" is a bit premature right now. I'm starting with this email to see if any volunteers want to step up.
> > >
> > > Even if somebody steps up immediately, they may not have a fix immediately. So, if there are no objections, I'm going to disable the flakey tests soon by adding the '@Ignore' JUnit annotation until a fix is contributed, so they don't keep getting in the way of troubleshooting other build-related issues. We already know they are flakey... the constant failures aren't telling us anything new, so the tests aren't useful as is.
Re: Replication-related IT failures
On Fri, Jan 31, 2020 at 10:30 AM Josh Elser wrote:
>
> I'm really upset that you think suggesting removal of the feature is appropriate.

I'm sorry you are upset.

If it's any comfort at all, I've already stated that I think it's premature to discuss removal. So, my intent isn't to advocate for that. Rather, my intent is to honestly communicate the current state and the stakes that *could* come to pass *if* that feature is left unmaintained. I think it's important to communicate that as a risk, so that people who have an interest in the feature understand how critical it is (to their own interests) for somebody (perhaps them) to step in and maintain it, given the current situation they've just been informed about.

> More installations than not of HBase (which, IMO, should be considered Accumulo's biggest competitor) use replication. The only users of HBase I see without a disaster recovery plan are developer-focused instances with zero uptime guarantees. I'll even go farther to say: any user who deploys a database into a production scenario would *require* a D/R solution for that database before it would be allowed to be called "production".
>
> Yes, there are D/R solutions that can be implemented at the data processing layer, but this is almost always less ideal, as the cost of reprocessing and shipping the raw data is much greater than what Accumulo replication could do.

From my perspective, this isn't about the replication feature at all. It's about resources to maintain a feature that is suffering from technical debt. If there's nobody to maintain it, and it starts interfering with the maintenance of the rest of Accumulo, what can we do? I feel like our options are limited... but for now, my goal is just to get this on potential contributors' radars by talking about it.

> While I am deflated that no other developers have seen this and have any interest in helping work through bugs/issues, they are volunteers and I can only be sad about this. However, I will not let an argument which equates to "we should junk the car because it has a flat tire" go without response.

I'm not saying we should "junk the car" just yet. What I am saying is "If nobody fixes that tire, we're going to have to decide in the future what to do with it, because we can't keep driving a car with a flat tire... it will keep doing damage to the car and putting its passengers' lives at risk". But, first and foremost, I'm saying "Hey, we have a flat tire! Can anybody fix it?".

> On 1/28/20 10:58 PM, Christopher wrote:
> > As succinctly as I can:
> >
> > 1. Replication-related ITs have been flakey for a long time,
> > 2. The feature is not actively maintained (critical, or at least, untriaged issues exist dating back to 2014 in JIRA),
> > 3. No volunteers have stepped up thus far to maintain them and make them reliable or to develop/maintain replication,
> > 4. I don't have time to fix the flakey ITs, and don't have interest in or a use case for maintaining the feature,
> > 5. The IT breakages interfere with build testing on CI servers and for releases.
> >
> > Therefore:
> >
> > A. I want to @Ignore the flakey ITs, so they don't keep interfering with test builds,
> > B. We can re-enable the ITs if/when a volunteer contributes reliability fixes for them,
> > C. If nobody steps up, we should have a separate conversation about possibly phasing out the feature and what that would look like.
> >
> > The conversation I suggest in "C" is a bit premature right now. I'm starting with this email to see if any volunteers want to step up.
> >
> > Even if somebody steps up immediately, they may not have a fix immediately. So, if there are no objections, I'm going to disable the flakey tests soon by adding the '@Ignore' JUnit annotation until a fix is contributed, so they don't keep getting in the way of troubleshooting other build-related issues. We already know they are flakey... the constant failures aren't telling us anything new, so the tests aren't useful as is.
Re: Replication-related IT failures
I see the value of having the replication system remain intact in Accumulo and would vote to keep it. From my personal experience using it, it works well, but we ultimately ended up disabling replication in production due to the high latency. It is still used in lower, non-production environments where latency is less of a concern. Additionally, since we don't know the full user base of Accumulo, I cannot personally recommend the feature be phased out.

As far as the flakey integration tests are concerned, if no one steps up to work on them, I am +1 on adding @Ignore.

On Fri, Jan 31, 2020 at 10:30 AM Josh Elser wrote:
> I'm really upset that you think suggesting removal of the feature is appropriate.
>
> More installations than not of HBase (which, IMO, should be considered Accumulo's biggest competitor) use replication. The only users of HBase I see without a disaster recovery plan are developer-focused instances with zero uptime guarantees. I'll even go farther to say: any user who deploys a database into a production scenario would *require* a D/R solution for that database before it would be allowed to be called "production".
>
> Yes, there are D/R solutions that can be implemented at the data processing layer, but this is almost always less ideal, as the cost of reprocessing and shipping the raw data is much greater than what Accumulo replication could do.
>
> While I am deflated that no other developers have seen this and have any interest in helping work through bugs/issues, they are volunteers and I can only be sad about this. However, I will not let an argument which equates to "we should junk the car because it has a flat tire" go without response.
>
> On 1/28/20 10:58 PM, Christopher wrote:
> > As succinctly as I can:
> >
> > 1. Replication-related ITs have been flakey for a long time,
> > 2. The feature is not actively maintained (critical, or at least, untriaged issues exist dating back to 2014 in JIRA),
> > 3. No volunteers have stepped up thus far to maintain them and make them reliable or to develop/maintain replication,
> > 4. I don't have time to fix the flakey ITs, and don't have interest in or a use case for maintaining the feature,
> > 5. The IT breakages interfere with build testing on CI servers and for releases.
> >
> > Therefore:
> >
> > A. I want to @Ignore the flakey ITs, so they don't keep interfering with test builds,
> > B. We can re-enable the ITs if/when a volunteer contributes reliability fixes for them,
> > C. If nobody steps up, we should have a separate conversation about possibly phasing out the feature and what that would look like.
> >
> > The conversation I suggest in "C" is a bit premature right now. I'm starting with this email to see if any volunteers want to step up.
> >
> > Even if somebody steps up immediately, they may not have a fix immediately. So, if there are no objections, I'm going to disable the flakey tests soon by adding the '@Ignore' JUnit annotation until a fix is contributed, so they don't keep getting in the way of troubleshooting other build-related issues. We already know they are flakey... the constant failures aren't telling us anything new, so the tests aren't useful as is.
Re: Replication-related IT failures
I'm really upset that you think suggesting removal of the feature is appropriate.

More installations than not of HBase (which, IMO, should be considered Accumulo's biggest competitor) use replication. The only users of HBase I see without a disaster recovery plan are developer-focused instances with zero uptime guarantees. I'll even go farther to say: any user who deploys a database into a production scenario would *require* a D/R solution for that database before it would be allowed to be called "production".

Yes, there are D/R solutions that can be implemented at the data processing layer, but this is almost always less ideal, as the cost of reprocessing and shipping the raw data is much greater than what Accumulo replication could do.

While I am deflated that no other developers have seen this and have any interest in helping work through bugs/issues, they are volunteers and I can only be sad about this. However, I will not let an argument which equates to "we should junk the car because it has a flat tire" go without response.

On 1/28/20 10:58 PM, Christopher wrote:
> As succinctly as I can:
>
> 1. Replication-related ITs have been flakey for a long time,
> 2. The feature is not actively maintained (critical, or at least, untriaged issues exist dating back to 2014 in JIRA),
> 3. No volunteers have stepped up thus far to maintain them and make them reliable or to develop/maintain replication,
> 4. I don't have time to fix the flakey ITs, and don't have interest in or a use case for maintaining the feature,
> 5. The IT breakages interfere with build testing on CI servers and for releases.
>
> Therefore:
>
> A. I want to @Ignore the flakey ITs, so they don't keep interfering with test builds,
> B. We can re-enable the ITs if/when a volunteer contributes reliability fixes for them,
> C. If nobody steps up, we should have a separate conversation about possibly phasing out the feature and what that would look like.
>
> The conversation I suggest in "C" is a bit premature right now. I'm starting with this email to see if any volunteers want to step up.
>
> Even if somebody steps up immediately, they may not have a fix immediately. So, if there are no objections, I'm going to disable the flakey tests soon by adding the '@Ignore' JUnit annotation until a fix is contributed, so they don't keep getting in the way of troubleshooting other build-related issues. We already know they are flakey... the constant failures aren't telling us anything new, so the tests aren't useful as is.
Replication-related IT failures
As succinctly as I can:

1. Replication-related ITs have been flakey for a long time,
2. The feature is not actively maintained (critical, or at least, untriaged issues exist dating back to 2014 in JIRA),
3. No volunteers have stepped up thus far to maintain them and make them reliable or to develop/maintain replication,
4. I don't have time to fix the flakey ITs, and don't have interest in or a use case for maintaining the feature,
5. The IT breakages interfere with build testing on CI servers and for releases.

Therefore:

A. I want to @Ignore the flakey ITs, so they don't keep interfering with test builds,
B. We can re-enable the ITs if/when a volunteer contributes reliability fixes for them,
C. If nobody steps up, we should have a separate conversation about possibly phasing out the feature and what that would look like.

The conversation I suggest in "C" is a bit premature right now. I'm starting with this email to see if any volunteers want to step up.

Even if somebody steps up immediately, they may not have a fix immediately. So, if there are no objections, I'm going to disable the flakey tests soon by adding the '@Ignore' JUnit annotation until a fix is contributed, so they don't keep getting in the way of troubleshooting other build-related issues. We already know they are flakey... the constant failures aren't telling us anything new, so the tests aren't useful as is.
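[Editor's note: for readers unfamiliar with the mechanism discussed above, JUnit 4's @Ignore simply tells the test runner to report an annotated test as skipped instead of executing it, which is why it keeps flakey ITs from breaking CI builds without deleting the test code. The sketch below illustrates that mechanism using only the JDK; the nested Ignore annotation and the runner are simplified stand-ins for JUnit's, and ReplicationLatencyIT is a hypothetical test class, not an actual Accumulo IT.]

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class IgnoreSketch {

    // Simplified stand-in for JUnit 4's org.junit.Ignore annotation.
    @Retention(RetentionPolicy.RUNTIME)
    @Target({ElementType.METHOD, ElementType.TYPE})
    @interface Ignore {
        String value() default "";
    }

    // Hypothetical IT class: one flakey test disabled with @Ignore, one stable test.
    public static class ReplicationLatencyIT {
        @Ignore("flakey; re-enable when a reliability fix is contributed")
        public void testReplicationLatency() {
            throw new AssertionError("intermittent failure");
        }

        public void testTableCreation() {
            // stable test: passes
        }
    }

    // Tiny runner: skips any test method carrying @Ignore, which is
    // essentially what a real JUnit runner does with the annotation.
    public static List<String> run() throws Exception {
        List<String> results = new ArrayList<>();
        Object instance = new ReplicationLatencyIT();
        for (Method m : ReplicationLatencyIT.class.getDeclaredMethods()) {
            if (!m.getName().startsWith("test")) {
                continue;
            }
            if (m.getAnnotation(Ignore.class) != null) {
                results.add("SKIPPED " + m.getName());
                continue;
            }
            try {
                m.invoke(instance);
                results.add("PASSED " + m.getName());
            } catch (Exception e) {
                results.add("FAILED " + m.getName());
            }
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        for (String line : run()) {
            System.out.println(line);
        }
    }
}
```

With the real annotation, the change proposed in the thread amounts to `import org.junit.Ignore;` plus `@Ignore` on the flakey test class or method; the build then reports those ITs as skipped rather than failed, so they stop masking other build problems.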