Re: [Discuss] Repair inside C*

2024-02-25 Jaydeep Chovatia
Thanks, Josh. I've just updated the CEP and included all the solutions you mentioned below.

Jaydeep

On Thu, Feb 22, 2024 at 9:33 AM Josh McKenzie  wrote:

> Very late response from me here (basically necro'ing this thread).
>
> I think it'd be useful to get this condensed into a CEP that we can then
> discuss in that format. It's clearly something we all agree we need and
> having an implementation that works, even if it's not in your preferred
> execution domain, is vastly better than nothing IMO.
>
> I don't have cycles (nor background ;) ) to do that, but it sounds like
> you do, Jaydeep, given the implementation you have on a private fork + design.
>
> A non-exhaustive list of things that might be useful to incorporate into or
> reference from a CEP:
> Slack thread:
> https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
> Joey's old C* ticket:
> https://issues.apache.org/jira/browse/CASSANDRA-14346
> Even older automatic repair scheduling:
> https://issues.apache.org/jira/browse/CASSANDRA-10070
> Your design gdoc:
> https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0
> PR with automated repair:
> https://github.com/jaydeepkumar1984/cassandra/commit/ef6456d652c0d07cf29d88dfea03b73704814c2c
>
> My intuition is that we're all basically in agreement that this is
> something the DB needs, we're all willing to bikeshed for our personal
> preference on where it lives and how it's implemented, and at the end of
> the day, code talks. I don't think anyone's said they'll die on the hill of
> implementation details, so that feels like CEP time to me.
>
> If you were willing and able to get a CEP together for automated repair
> based on the above material, given you've done the work and have the proof
> points it's working at scale, I think this would be a *huge contribution*
> to the community.

Re: [Discuss] Repair inside C*

2024-02-23 Štefan Miklošovič
There are already some community solutions for scheduling repairs, like this
one (1); it runs alongside a Cassandra node, though... anyway. I would like to
see us pick the best of what is already out there and try to integrate it
rather than trying to figure it all out again. That seems like a waste of
time and resources. If there is already something that "works", it would be
cool to spend some time first getting as much value from it as possible.

Just my 2 cents here

(1) https://github.com/Ericsson/ecchronos

On Fri, Feb 23, 2024 at 3:31 PM Josh McKenzie  wrote:

> we're all willing to bikeshed for our personal preference on where it
> lives and how it's implemented, and at the end of the day, code talks. I
> don't think anyone's said they'll die on the hill of implementation details
>
>
> :D
>
> I don't think we're going to be able to reach a consensus on an email
> thread with higher-level abstractions and indicative statements. For
> instance: "a lot of complexity around repair in the main process" vs. "a
> lot of complexity in signaling between a sidecar and a main process and
> supporting multiple versions of C*". Both resonate with me at face value
> and neither contains enough detail to weigh against one another.
>
> A more granular, lower-level CEP that includes a tradeoff of the two
> designs with a recommendation on a path forward might help unstick us from
> the ML back-and-forth.
>
> We could also take an indicative vote on "in-process vs. in-sidecar" to
> see if we can get a read on temperature.

Re: [Discuss] Repair inside C*

2024-02-23 Josh McKenzie
> we're all willing to bikeshed for our personal preference on where it lives 
> and how it's implemented, and at the end of the day, code talks. I don't 
> think anyone's said they'll die on the hill of implementation details

:D

I don't think we're going to be able to reach a consensus on an email thread 
with higher-level abstractions and indicative statements. For instance: "a lot 
of complexity around repair in the main process" vs. "a lot of complexity in 
signaling between a sidecar and a main process and supporting multiple versions 
of C*". Both resonate with me at face value and neither contains enough detail 
to weigh against one another.

A more granular, lower-level CEP that includes a tradeoff of the two designs 
with a recommendation on a path forward might help unstick us from the ML 
back-and-forth.

We could also take an indicative vote on "in-process vs. in-sidecar" to see if 
we can get a read on temperature.

On Thu, Feb 22, 2024, at 2:06 PM, Paulo Motta wrote:
> Apologies, I just read the previous message and missed the previous 
> discussion on sidecar vs main process on this thread. :-)
> 
> It does not look like a final agreement was reached about this and there are 
> lots of good arguments for both sides, but perhaps it would be nice to agree 
> on this before a CEP is proposed since this will significantly influence the 
> initial design?
> 
> I tend to agree with Dinesh and Scott's pragmatic stance of providing initial 
> support for repair scheduling in the sidecar, since this has fewer 
> dependencies, and progressively move what makes sense to the main process as 
> TCM/Accord primitives become available and mature.

Re: [Discuss] Repair inside C*

2024-02-22 Paulo Motta
Apologies, I just read the previous message and missed the previous
discussion on sidecar vs main process on this thread. :-)

It does not look like a final agreement was reached about this and there
are lots of good arguments for both sides, but perhaps it would be nice to
agree on this before a CEP is proposed since this will significantly
influence the initial design?

I tend to agree with Dinesh and Scott's pragmatic stance of providing
initial support for repair scheduling in the sidecar, since this has fewer
dependencies, and progressively move what makes sense to the main process
as TCM/Accord primitives become available and mature.

On Thu, Feb 22, 2024 at 1:44 PM Paulo Motta  wrote:

> +1 to Josh's points. The project has considered native repair scheduling
> for a long time but it was never made a reality due to the complex
> considerations involved and availability of custom implementations/tools
> like cassandra-reaper, which is a popular way of scheduling repairs in
> Cassandra.
>
> Unfortunately I did not have cycles to review this proposal, but it looks
> promising from a quick glance.
>
> One important consideration that I think we need to discuss is: where
> should repair scheduling live: in the main process or the sidecar?
>
> I think there is a lot of complexity around repair in the main process and
> we need to be extra careful about adding additional complexity on top of
> that.
>
> Perhaps this could be a good opportunity to consider having the sidecar host
> repair scheduling, since this looks to be a control plane responsibility?
> One downside is that this would not make repair scheduling available to
> users who do not use the sidecar.
>
> What do you think? It would be great to have input from sidecar
> maintainers if this is something that would make sense for that subproject.

Re: [Discuss] Repair inside C*

2024-02-22 Paulo Motta
+1 to Josh's points. The project has considered native repair scheduling
for a long time but it was never made a reality due to the complex
considerations involved and availability of custom implementations/tools
like cassandra-reaper, which is a popular way of scheduling repairs in
Cassandra.

Unfortunately I did not have cycles to review this proposal, but it looks
promising from a quick glance.

One important consideration that I think we need to discuss is: where
should repair scheduling live: in the main process or the sidecar?

I think there is a lot of complexity around repair in the main process and
we need to be extra careful about adding additional complexity on top of
that.

Perhaps this could be a good opportunity to consider having the sidecar host
repair scheduling, since this looks to be a control plane responsibility?
One downside is that this would not make repair scheduling available to
users who do not use the sidecar.

What do you think? It would be great to have input from sidecar maintainers
if this is something that would make sense for that subproject.

On Thu, Feb 22, 2024 at 12:33 PM Josh McKenzie  wrote:

> Very late response from me here (basically necro'ing this thread).
>
> I think it'd be useful to get this condensed into a CEP that we can then
> discuss in that format. It's clearly something we all agree we need and
> having an implementation that works, even if it's not in your preferred
> execution domain, is vastly better than nothing IMO.
>
> I don't have cycles (nor background ;) ) to do that, but it sounds like
> you do, Jaydeep, given the implementation you have on a private fork + design.
>
> A non-exhaustive list of things that might be useful to incorporate into or
> reference from a CEP:
> Slack thread:
> https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
> Joey's old C* ticket:
> https://issues.apache.org/jira/browse/CASSANDRA-14346
> Even older automatic repair scheduling:
> https://issues.apache.org/jira/browse/CASSANDRA-10070
> Your design gdoc:
> https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0
> PR with automated repair:
> https://github.com/jaydeepkumar1984/cassandra/commit/ef6456d652c0d07cf29d88dfea03b73704814c2c
>
> My intuition is that we're all basically in agreement that this is
> something the DB needs, we're all willing to bikeshed for our personal
> preference on where it lives and how it's implemented, and at the end of
> the day, code talks. I don't think anyone's said they'll die on the hill of
> implementation details, so that feels like CEP time to me.
>
> If you were willing and able to get a CEP together for automated repair
> based on the above material, given you've done the work and have the proof
> points it's working at scale, I think this would be a *huge contribution*
> to the community.

Re: [Discuss] Repair inside C*

2024-02-22 Josh McKenzie
Very late response from me here (basically necro'ing this thread).

I think it'd be useful to get this condensed into a CEP that we can then 
discuss in that format. It's clearly something we all agree we need and having 
an implementation that works, even if it's not in your preferred execution 
domain, is vastly better than nothing IMO.

I don't have cycles (nor background ;) ) to do that, but it sounds like you do, 
Jaydeep, given the implementation you have on a private fork + design.

A non-exhaustive list of things that might be useful to incorporate into or 
reference from a CEP:
Slack thread: https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
Joey's old C* ticket: https://issues.apache.org/jira/browse/CASSANDRA-14346
Even older automatic repair scheduling: 
https://issues.apache.org/jira/browse/CASSANDRA-10070
Your design gdoc: 
https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0
PR with automated repair: 
https://github.com/jaydeepkumar1984/cassandra/commit/ef6456d652c0d07cf29d88dfea03b73704814c2c

My intuition is that we're all basically in agreement that this is something 
the DB needs, we're all willing to bikeshed for our personal preference on 
where it lives and how it's implemented, and at the end of the day, code talks. 
I don't think anyone's said they'll die on the hill of implementation details, 
so that feels like CEP time to me.

If you were willing and able to get a CEP together for automated repair based 
on the above material, given you've done the work and have the proof points 
it's working at scale, I think this would be a *huge contribution* to the 
community.

On Thu, Aug 24, 2023, at 7:26 PM, Jaydeep Chovatia wrote:
> Is anyone going to file an official CEP for this?
> As mentioned in this email thread, here is the design doc for one of the 
> solutions and the source code in a private Apache Cassandra patch. Could you 
> go through it and let me know what you think?
> 
> Jaydeep

Re: [Discuss] Repair inside C*

2023-08-24 Jaydeep Chovatia
Is anyone going to file an official CEP for this?
As mentioned in this email thread, here is the design doc for one of the
solutions and the source code in a private Apache Cassandra patch. Could you
go through it and let me know what you think?

Jaydeep

On Wed, Aug 2, 2023 at 3:54 PM Jon Haddad 
wrote:

> > That said I would happily support an effort to bring repair scheduling
> to the sidecar immediately. This has nothing blocking it, and would
> potentially enable the sidecar to provide an official repair scheduling
> solution that is compatible with current or even previous versions of the
> database.
>
> This is something I hadn't thought much about, and is a pretty good
> argument for using the sidecar initially.  There are a lot of deployments out
> there and having an official repair option would be a big win.
>
>
> On 2023/07/26 23:20:07 "C. Scott Andreas" wrote:
> > I agree that it would be ideal for Cassandra to have a repair scheduler
> in-DB.
> >
> > That said I would happily support an effort to bring repair scheduling
> to the sidecar immediately. This has nothing blocking it, and would
> potentially enable the sidecar to provide an official repair scheduling
> solution that is compatible with current or even previous versions of the
> database.
> >
> > Once TCM has landed, we’ll have much stronger primitives for repair
> orchestration in the database itself. But I don’t think that should block
> progress on a repair scheduling solution in the sidecar, and there is
> nothing that would prevent someone from continuing to use a sidecar-based
> solution in perpetuity if they preferred.
> >
> > - Scott
> >
> > > On Jul 26, 2023, at 3:25 PM, Jon Haddad 
> wrote:
> > >
> > > I'm 100% in favor of repair being part of the core DB, not the
> sidecar.  The current (and past) state of things where running the DB
> correctly *requires* running a separate process (either community
> maintained or official C* sidecar) is incredibly painful for folks.  The
> idea that your data integrity needs to be opt-in has never made sense to me
> from the perspective of either the product or the end user.
> > >
> > > I've worked with way too many teams that have either configured this
> incorrectly or not at all.
> > >
> > > Ideally Cassandra would ship with repair built in and on by default.
> Power users can disable if they want to continue to maintain their own
> repair tooling for some reason.
> > >
> > > Jon
> > >
> > >> On 2023/07/24 20:44:14 German Eichberger via dev wrote:
> > >> All,
> > >> We had a brief discussion in [2] about the Uber article [1] where
> they talk about having integrated repair into Cassandra and how great that
> is. I expressed my disappointment that they didn't work with the community
> on that (Uber, if you are listening, time to make amends) and it turns
> out Joey already had the idea and wrote the code [3] - so I wanted to start
> a discussion to gauge interest and maybe how to revive that effort.
> > >> Thanks,
> > >> German
> > >> [1]
> https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
> > >> [2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
> > >> [3] https://issues.apache.org/jira/browse/CASSANDRA-14346
> >
>


Re: [Discuss] Repair inside C*

2023-08-02 Thread Jon Haddad
> That said I would happily support an effort to bring repair scheduling to the 
> sidecar immediately. This has nothing blocking it, and would potentially 
> enable the sidecar to provide an official repair scheduling solution that is 
> compatible with current or even previous versions of the database.

This is something I hadn't thought much about, and is a pretty good argument 
for using the sidecar initially.  There are a lot of deployments out there and 
having an official repair option would be a big win.  


On 2023/07/26 23:20:07 "C. Scott Andreas" wrote:
> I agree that it would be ideal for Cassandra to have a repair scheduler in-DB.
> 
> That said I would happily support an effort to bring repair scheduling to the 
> sidecar immediately. This has nothing blocking it, and would potentially 
> enable the sidecar to provide an official repair scheduling solution that is 
> compatible with current or even previous versions of the database.
> 
> Once TCM has landed, we’ll have much stronger primitives for repair 
> orchestration in the database itself. But I don’t think that should block 
> progress on a repair scheduling solution in the sidecar, and there is nothing 
> that would prevent someone from continuing to use a sidecar-based solution in 
> perpetuity if they preferred.
> 
> - Scott
> 
> > On Jul 26, 2023, at 3:25 PM, Jon Haddad  wrote:
> > 
> > I'm 100% in favor of repair being part of the core DB, not the sidecar.  
> > The current (and past) state of things where running the DB correctly 
> > *requires* running a separate process (either community maintained or 
> > official C* sidecar) is incredibly painful for folks.  The idea that your 
> > data integrity needs to be opt-in has never made sense to me from the 
> > perspective of either the product or the end user.
> > 
> > I've worked with way too many teams that have either configured this 
> > incorrectly or not at all.  
> > 
> > Ideally Cassandra would ship with repair built in and on by default.  Power 
> > users can disable if they want to continue to maintain their own repair 
> > tooling for some reason.
> > 
> > Jon
> > 
> >> On 2023/07/24 20:44:14 German Eichberger via dev wrote:
> >> All,
> >> We had a brief discussion in [2] about the Uber article [1] where they 
> >> talk about having integrated repair into Cassandra and how great that is. 
> >> I expressed my disappointment that they didn't work with the community on 
> >> that (Uber, if you are listening, time to make amends) and it turns out 
> >> Joey already had the idea and wrote the code [3] - so I wanted to start a 
> >> discussion to gauge interest and maybe how to revive that effort.
> >> Thanks,
> >> German
> >> [1] 
> >> https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
> >> [2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
> >> [3] https://issues.apache.org/jira/browse/CASSANDRA-14346
> 


Re: [Discuss] Repair inside C*

2023-07-27 Thread Josh McKenzie
> The idea that your data integrity needs to be opt-in has never made sense to 
> me from the perspective of either the product or the end user.
I could not agree with this more. 100%.

> The current (and past) state of things where running the DB correctly 
> *requires* running a separate process (either community maintained or 
> official C* sidecar) is incredibly painful for folks.
I'm 50/50 on this (and I have some opinions here; bear with me :D ).

To me this goes beyond the question of just "where do we coordinate repair" 
into "what role does a node play vs. the sidecar and how does that intersect 
w/the industry today".

Having just 1 process you run on N machines is much nicer from an operations 
standpoint and it's *much* cleaner and easier for us as a project to not have 
to deal with signaling, shmem, and going down the IPC rabbit hole. A modular 
monolith, if you will.

That said, I feel like the zeitgeist has been all-in on microservices and 
control planes, whether they're the right solution or not. The affordances of 
being able to build out independent teams and scale dev velocity in large 
organizations, never mind the ideal of being able to cleanly upgrade or rewrite 
internal components, are attractive enough on paper that it feels like most 
groups have gone that direction and accepted the perceived costs; I view 
Cassandra as something of an architectural anachronism at this point. And to 
call back to the prior paragraph, I *think* you get all those positive 
affordances w/a modular monolith. Sadly, Google Trends doesn't really give me 
a lot of hope there.

In an ideal world operators (or better yet, an automated operations process) 
would be able to dynamically adjust resource allocation to nodes based on 
"burstiness of the buffering" (i.e. lots of data building up in CLs needing to 
be flushed, or compaction need, or repair need). It's not immediately obvious 
to me how we'd gracefully do that in a single-process paradigm in containers 
w/out becoming a noisy neighbor, but it's not impossible. It kind of goes meta, 
outside C*'s scope, into how you're coordinating your hardware and software 
interactions; maybe that's the cleaner route: we clearly signal metrics for 
each major operation the DB needs to do to indicate its backlog, and an 
external orchestration process / system / ??? handles the resource allocation, 
i.e. we don't take that on.
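
To make the "signal backlog, let something external allocate" idea concrete, 
here is a minimal Python sketch of the allocation side only. Everything in it 
is hypothetical: the Backlog record, the allocate() function, and the use of 
pending bytes as the backlog signal are illustrative assumptions, not existing 
Cassandra metrics or APIs. It splits a fixed resource budget across operations 
in proportion to their reported backlogs, after reserving a floor for each.

```python
from dataclasses import dataclass


@dataclass
class Backlog:
    """Hypothetical backlog signal reported for one DB operation."""
    name: str
    pending_bytes: int


def allocate(budget: int, backlogs: list[Backlog], floor: int = 0) -> dict[str, int]:
    """Split an integer resource budget across operations in proportion to
    their backlogs, after reserving `floor` units for every operation."""
    if floor * len(backlogs) > budget:
        raise ValueError("budget is smaller than the sum of the floors")
    shares = {b.name: floor for b in backlogs}
    remaining = budget - floor * len(backlogs)
    total = sum(b.pending_bytes for b in backlogs)
    if total == 0:
        return shares  # nothing is backed up: hand out only the floors
    # Exact integer proportional shares: (whole units, remainder) per operation.
    quotas = {b.name: divmod(remaining * b.pending_bytes, total) for b in backlogs}
    for name, (whole, _) in quotas.items():
        shares[name] += whole
    # Largest-remainder rounding so the shares add up to the whole budget.
    leftover = remaining - sum(whole for whole, _ in quotas.values())
    by_remainder = sorted(quotas, key=lambda n: quotas[n][1], reverse=True)
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares
```

With a budget of 100 (say, MB/s of disk I/O) and backlogs of 600/300/100 for 
flush/compaction/repair, this yields 60/30/10; passing floor=10 yields 52/31/17, 
so a quiet operation such as repair is never starved entirely.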

Certainly we can do a lot better than we do today when it comes to internally 
scheduling DB operations relative to one another (start using CQL rate limiting, 
dynamically determine a rolling average of needs to smooth out burst requests, 
make byte-based rate limiting an option, user-space threads w/Loom and some 
kind of QoS prioritization based on backlogs, etc).
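
Two of the items in that list, byte-based rate limiting and a rolling average 
to smooth out bursts, can be sketched in a few lines. The Python below is 
illustrative only: ByteRateLimiter and RollingAverage are hypothetical names, 
not Cassandra classes, and a token bucket plus an exponentially weighted moving 
average is just one common way to implement each idea.

```python
import time
from typing import Callable, Optional


class ByteRateLimiter:
    """Token bucket where tokens are bytes, so a 1 MiB request costs more
    than a 1 KiB one (unlike a requests-per-second limiter)."""

    def __init__(self, bytes_per_sec: float, burst_bytes: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = bytes_per_sec
        self.capacity = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def _refill(self) -> None:
        now = self.clock()
        # Accrue tokens for the elapsed time, never beyond the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self, nbytes: int) -> bool:
        """Admit the operation only if enough byte tokens are available."""
        self._refill()
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


class RollingAverage:
    """Exponentially weighted moving average of observed demand."""

    def __init__(self, alpha: float):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value: Optional[float] = None

    def update(self, sample: float) -> float:
        if self.value is None:
            self.value = sample  # first sample seeds the average
        else:
            self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value
```

A scheduler could, for example, feed per-second byte counts into RollingAverage 
and size the limiter's rate from the smoothed value rather than the 
instantaneous one. The injectable clock exists only to make the sketch 
testable, and in this version a single request larger than burst_bytes is never 
admitted, so the burst size must cover the largest expected request.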

I personally view moving maintenance tasks into the sidecar as a reasonable 
"first step satisficing compromise". If anything, that'd potentially give us 
some breathing room to get our house in order on the "I/O" process (as opposed 
to the sidecar as the "maintenance process") and then reintegrate things in a 
cleaner, more planned fashion with some better tools to do it right.

~Josh


On Wed, Jul 26, 2023, at 7:20 PM, C. Scott Andreas wrote:
> I agree that it would be ideal for Cassandra to have a repair scheduler in-DB.
> 
> That said I would happily support an effort to bring repair scheduling to the 
> sidecar immediately. This has nothing blocking it, and would potentially 
> enable the sidecar to provide an official repair scheduling solution that is 
> compatible with current or even previous versions of the database.
> 
> Once TCM has landed, we’ll have much stronger primitives for repair 
> orchestration in the database itself. But I don’t think that should block 
> progress on a repair scheduling solution in the sidecar, and there is nothing 
> that would prevent someone from continuing to use a sidecar-based solution in 
> perpetuity if they preferred.
> 
> - Scott
> 
> > On Jul 26, 2023, at 3:25 PM, Jon Haddad  wrote:
> > 
> > I'm 100% in favor of repair being part of the core DB, not the sidecar.  
> > The current (and past) state of things where running the DB correctly 
> > *requires* running a separate process (either community maintained or 
> > official C* sidecar) is incredibly painful for folks.  The idea that your 
> > data integrity needs to be opt-in has never made sense to me from the 
> > perspective of either the product or the end user.
> > 
> > I've worked with way too many teams that have either configured this 
> > incorrectly or not at all.  
> > 
> > Ideally Cassandra would ship with repair built in and on by default.  Power 
> > users can disable if they want to continue to maintain their own repair 
> > tooling for some reason.
> > 
> > Jon
> > 
> >> On 2023/07/24 20:44:14 German Eichberger via dev wrote:
> >> All,
> >> We had a brief discussion in [2] 

Re: [Discuss] Repair inside C*

2023-07-26 Thread C. Scott Andreas
I agree that it would be ideal for Cassandra to have a repair scheduler in-DB.

That said I would happily support an effort to bring repair scheduling to the 
sidecar immediately. This has nothing blocking it, and would potentially enable 
the sidecar to provide an official repair scheduling solution that is 
compatible with current or even previous versions of the database.

Once TCM has landed, we’ll have much stronger primitives for repair 
orchestration in the database itself. But I don’t think that should block 
progress on a repair scheduling solution in the sidecar, and there is nothing 
that would prevent someone from continuing to use a sidecar-based solution in 
perpetuity if they preferred.

- Scott

> On Jul 26, 2023, at 3:25 PM, Jon Haddad  wrote:
> 
> I'm 100% in favor of repair being part of the core DB, not the sidecar.  The 
> current (and past) state of things where running the DB correctly *requires* 
> running a separate process (either community maintained or official C* 
> sidecar) is incredibly painful for folks.  The idea that your data integrity 
> needs to be opt-in has never made sense to me from the perspective of either 
> the product or the end user.
> 
> I've worked with way too many teams that have either configured this 
> incorrectly or not at all.  
> 
> Ideally Cassandra would ship with repair built in and on by default.  Power 
> users can disable if they want to continue to maintain their own repair 
> tooling for some reason.
> 
> Jon
> 
>> On 2023/07/24 20:44:14 German Eichberger via dev wrote:
>> All,
>> We had a brief discussion in [2] about the Uber article [1] where they talk 
>> about having integrated repair into Cassandra and how great that is. I 
>> expressed my disappointment that they didn't work with the community on that 
>> (Uber, if you are listening time to make amends ) and it turns out Joey 
>> already had the idea and wrote the code [3] - so I wanted to start a 
>> discussion to gauge interest and maybe how to revive that effort.
>> Thanks,
>> German
>> [1] 
>> https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
>> [2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
>> [3] https://issues.apache.org/jira/browse/CASSANDRA-14346


Re: [Discuss] Repair inside C*

2023-07-26 Thread Dinesh Joshi
I concur, repair is an intrinsic part of the database and belongs inside it. We 
can certainly expose a REST control plane API via the sidecar for triggering it 
on demand, scheduling, etc.

That said, there are various implementations of repair scheduling and 
orchestration that a lot of organizations maintain in their proprietary 
sidecars. It would be beneficial in the interim to consolidate on a common 
solution in the sidecar. Eventually we need a version of repair in the database 
that just works without the need for any operator intervention.
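
For readers less familiar with what a "repair scheduler" has to decide, here 
is a deliberately simple Python sketch of one possible core policy, independent 
of whether it runs in the sidecar or in the database. The NodeRepairState 
record and next_node_to_repair() function are hypothetical, and real schedulers 
also have to deal with token ranges, failures, and per-datacenter concurrency. 
This one repairs at most one node at a time cluster-wide and picks the due node 
whose last successful repair is oldest.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeRepairState:
    """Hypothetical per-node bookkeeping kept by the scheduler."""
    node: str
    last_success: float  # epoch seconds of the last completed repair
    in_progress: bool = False


def next_node_to_repair(states: list[NodeRepairState], now: float,
                        min_interval: float) -> Optional[str]:
    """Return the node to repair next, or None if nothing should start now."""
    if any(s.in_progress for s in states):
        return None  # this sketch allows only one repair at a time
    due = [s for s in states if now - s.last_success >= min_interval]
    if not due:
        return None
    # Oldest successful repair first; node name breaks ties deterministically.
    return min(due, key=lambda s: (s.last_success, s.node)).node
```

Where the last_success bookkeeping lives (sidecar state versus something inside 
the database) is exactly the placement question this thread is debating.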


> On Jul 26, 2023, at 3:25 PM, Jon Haddad  wrote:
> 
> I'm 100% in favor of repair being part of the core DB, not the sidecar.  The 
> current (and past) state of things where running the DB correctly *requires* 
> running a separate process (either community maintained or official C* 
> sidecar) is incredibly painful for folks.  The idea that your data integrity 
> needs to be opt-in has never made sense to me from the perspective of either 
> the product or the end user.
> 
> I've worked with way too many teams that have either configured this 
> incorrectly or not at all.  
> 
> Ideally Cassandra would ship with repair built in and on by default.  Power 
> users can disable if they want to continue to maintain their own repair 
> tooling for some reason. 
> 
> Jon
> 
> On 2023/07/24 20:44:14 German Eichberger via dev wrote:
>> All,
>> 
>> We had a brief discussion in [2] about the Uber article [1] where they talk 
>> about having integrated repair into Cassandra and how great that is. I 
>> expressed my disappointment that they didn't work with the community on that 
>> (Uber, if you are listening time to make amends ) and it turns out Joey 
>> already had the idea and wrote the code [3] - so I wanted to start a 
>> discussion to gauge interest and maybe how to revive that effort.
>> 
>> Thanks,
>> German
>> 
>> [1] 
>> https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
>> [2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
>> [3] https://issues.apache.org/jira/browse/CASSANDRA-14346
>> 



Re: [Discuss] Repair inside C*

2023-07-26 Thread Jon Haddad
I'm 100% in favor of repair being part of the core DB, not the sidecar.  The 
current (and past) state of things where running the DB correctly *requires* 
running a separate process (either community maintained or official C* sidecar) 
is incredibly painful for folks.  The idea that your data integrity needs to be 
opt-in has never made sense to me from the perspective of either the product or 
the end user.

I've worked with way too many teams that have either configured this 
incorrectly or not at all.  

Ideally Cassandra would ship with repair built in and on by default.  Power 
users can disable if they want to continue to maintain their own repair tooling 
for some reason. 

Jon

On 2023/07/24 20:44:14 German Eichberger via dev wrote:
> All,
> 
> We had a brief discussion in [2] about the Uber article [1] where they talk 
> about having integrated repair into Cassandra and how great that is. I 
> expressed my disappointment that they didn't work with the community on that 
> (Uber, if you are listening time to make amends ) and it turns out Joey 
> already had the idea and wrote the code [3] - so I wanted to start a 
> discussion to gauge interest and maybe how to revive that effort.
> 
> Thanks,
> German
> 
> [1] 
> https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
> [2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
> [3] https://issues.apache.org/jira/browse/CASSANDRA-14346
> 


Re: [Discuss] Repair inside C*

2023-07-26 Thread David Capwell
+0 to sidecar. In order to make that work well, we need to expose state that the 
node has so the sidecar can make good calls; if it runs in the node, then 
nothing has to be exposed.  One thing to flesh out is where do the “smarts” 
live?  If the range has too many partitions, which system knows to subdivide 
the range and sequence the repairs (else you OOM)?  “Should” repair itself be 
better and take all input and make sure it works correctly, so the caller just 
worries about scheduling?  “Should” the scheduler understand limitations with 
repair and work around them?
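
One possible answer to the subdivision question, sketched in Python purely for 
illustration: estimate how many partitions a token range holds, split it into 
equal-width subranges so each stays under a cap, and repair them one after 
another, collecting failures for retry. split_range() and repair_sequentially() 
are hypothetical helpers, the repair_fn callback stands in for whatever actually 
triggers a subrange repair, and the uniform-distribution assumption is a 
simplification; in practice the estimate would have to come from the node's own 
statistics, which is the state-exposure issue raised above.

```python
from typing import Callable


def split_range(start: int, end: int, estimated_partitions: int,
                max_partitions_per_subrange: int) -> list[tuple[int, int]]:
    """Split the token range (start, end] into contiguous subranges holding
    roughly at most max_partitions_per_subrange partitions each, assuming
    partitions are spread uniformly across the tokens."""
    if end <= start:
        raise ValueError("this sketch does not handle wrap-around ranges")
    # Ceiling division without floats, so huge token values stay exact.
    pieces = max(1, -(-estimated_partitions // max_partitions_per_subrange))
    width = end - start
    pieces = min(pieces, width)  # cannot split finer than one token
    bounds = [start + (width * i) // pieces for i in range(pieces + 1)]
    return list(zip(bounds, bounds[1:]))


def repair_sequentially(subranges: list[tuple[int, int]],
                        repair_fn: Callable[[int, int], bool]) -> list[tuple[int, int]]:
    """Repair one subrange at a time and return the ones that failed, so a
    scheduler can retry them instead of restarting the whole range."""
    failed = []
    for lo, hi in subranges:
        if not repair_fn(lo, hi):
            failed.append((lo, hi))
    return failed
```

For example, split_range(0, 1000, 250_000, 100_000) yields three subranges, 
(0, 333], (333, 666], and (666, 1000]. Whether this logic sits in repair itself 
or in the scheduler is the open design question.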

> On Jul 25, 2023, at 11:26 AM, Jeremiah Jordan  
> wrote:
> 
> +1 for the sidecar being the right location.
> 
> -Jeremiah
> 
> On Jul 25, 2023 at 1:16:14 PM, Chris Lohfink wrote:
>> I think a CEP is the next step. Considering the number of companies 
>> involved, this might necessitate several drafts and rounds of discussions. I 
>> appreciate your initiative in starting this process, and I'm eager to 
>> contribute to the ensuing discussions. Maybe in a Google Doc or something 
>> initially for more interactive feedback?
>> 
>> Regarding https://issues.apache.org/jira/browse/CASSANDRA-14346, we at 
>> Netflix are currently putting effort into moving this into the sidecar, 
>> as the idea was to start moving non-read/write-path things into separate 
>> processes and JVMs so they don't impact each other.
>> 
>> I think the sidecar/in-process discussion might be a bit contentious, as I 
>> know some feel even things like compaction should be moved out of process in 
>> the future. On a personal note, my primary interest lies in seeing the 
>> implementation realized, so I am willing to support whatever consensus 
>> emerges. Whichever direction these go we will help with the implementation.
>> 
>> Chris
>> 
>> On Tue, Jul 25, 2023 at 1:09 PM Jaydeep Chovatia wrote:
>>> Sounds good, German. Feel free to let me know if you need my help in filing 
>>> CEP, adding supporting content to the CEP, etc.
>>> As I mentioned previously, I have already been working (going through an 
>>> internal review) on creating a one-pager doc, code, etc., that has been 
>>> working for us for the last six years at an immense scale, and I will share 
>>> it soon on a private fork.
>>> 
>>> Thanks,
>>> Jaydeep
>>> 
>>> On Tue, Jul 25, 2023 at 9:48 AM German Eichberger via dev 
>>> <dev@cassandra.apache.org> wrote:
>>>> In [2] we suggested that the next step should be a CEP.
>>>> 
>>>> I am happy to lend a hand to this effort as well.
>>>> 
>>>> Thanks Jaydeep and David - really appreciated.
>>>> 
>>>> German
>>>> 
>>>> From: David Capwell
>>>> Sent: Tuesday, July 25, 2023 8:32 AM
>>>> To: dev <dev@cassandra.apache.org>
>>>> Cc: German Eichberger
>>>> Subject: [EXTERNAL] Re: [Discuss] Repair inside C*
>>>>  
>>>> As someone who has done a lot of work trying to make repair stable, I 
>>>> approve of this message ^_^
>>>> 
>>>> More than glad to help mentor this work
>>>> 
>>>> On Jul 24, 2023, at 6:29 PM, Jaydeep Chovatia wrote:
>>>> 
>>>> To clarify the repair solution timing, the one we have listed in the 
>>>> article is not the recently developed one. We were hitting some 
>>>> high-priority production challenges back in early 2018, and to address 
>>>> that, we developed and rolled out the solution in production in just a few 
>>>> months. Timing-wise, the solution was developed and productized by Q3 
>>>> 2018 and, of course, continued to evolve thereafter. Usually, we explore the 
>>>> existing solutions we can leverage, but when we started our journey in 
>>>> early 2018, most of the solutions were based on sidecar solutions. There 
>>>> is nothing against the sidecar solution; it was just a pure business 
>>>> decision, and in that, we wanted to avoid the sidecar to avoid a 
>>>> dependency on the control plane. Every solution developed has its deep 
>>>> context, merits, and pros and cons; they are all great solutions! 
>>>> 
>>>> An appeal to the community members is to think one more time about having 
>>>> re

Re: [Discuss] Repair inside C*

2023-07-25 Thread Jeremiah Jordan
+1 for the sidecar being the right location.

-Jeremiah

On Jul 25, 2023 at 1:16:14 PM, Chris Lohfink  wrote:

> I think a CEP is the next step. Considering the number of companies
> involved, this might necessitate several drafts and rounds of discussions.
> I appreciate your initiative in starting this process, and I'm eager to
> contribute to the ensuing discussions. Maybe in a Google Doc or something
> initially for more interactive feedback?
>
> Regarding https://issues.apache.org/jira/browse/CASSANDRA-14346, we at
> Netflix are currently putting effort into moving this into the sidecar,
> as the idea was to start moving non-read/write-path things into separate
> processes and JVMs so they don't impact each other.
>
> I think the sidecar/in-process discussion might be a bit contentious, as I
> know some feel even things like compaction should be moved out of process
> in the future. On a personal note, my primary interest lies in seeing the
> implementation realized, so I am willing to support whatever consensus
> emerges. Whichever direction these go we will help with the implementation.
>
> Chris
>
> On Tue, Jul 25, 2023 at 1:09 PM Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> Sounds good, German. Feel free to let me know if you need my help
>> in filing CEP, adding supporting content to the CEP, etc.
>> As I mentioned previously, I have already been working (going through an
>> internal review) on creating a one-pager doc, code, etc., that has been
>> working for us for the last six years at an immense scale, and I will share
>> it soon on a private fork.
>>
>> Thanks,
>> Jaydeep
>>
>> On Tue, Jul 25, 2023 at 9:48 AM German Eichberger via dev <
>> dev@cassandra.apache.org> wrote:
>>
>>> In [2] we suggested that the next step should be a CEP.
>>>
>>> I am happy to lend a hand to this effort as well.
>>>
>>> Thanks Jaydeep and David - really appreciated.
>>>
>>> German
>>>
>>> --
>>> *From:* David Capwell 
>>> *Sent:* Tuesday, July 25, 2023 8:32 AM
>>> *To:* dev 
>>> *Cc:* German Eichberger 
>>> *Subject:* [EXTERNAL] Re: [Discuss] Repair inside C*
>>>
>>> As someone who has done a lot of work trying to make repair stable, I
>>> approve of this message ^_^
>>>
>>> More than glad to help mentor this work
>>>
>>> On Jul 24, 2023, at 6:29 PM, Jaydeep Chovatia <
>>> chovatia.jayd...@gmail.com> wrote:
>>>
>>> To clarify the repair solution timing, the one we have listed in the
>>> article is not the recently developed one. We were hitting some
>>> high-priority production challenges back in early 2018, and to address
>>> that, we developed and rolled out the solution in production in just a few
>>> months. Timing-wise, the solution was developed and productized by Q3
>>> 2018 and, of course, continued to evolve thereafter. Usually, we explore the
>>> existing solutions we can leverage, but when we started our journey in
>>> early 2018, most of the solutions were based on sidecar solutions. There is
>>> nothing against the sidecar solution; it was just a pure business decision,
>>> and in that, we wanted to avoid the sidecar to avoid a dependency on the
>>> control plane. Every solution developed has its deep context, merits, and
>>> pros and cons; they are all great solutions!
>>>
>>> An appeal to the community members is to think one more time about
>>> having repairs in the Open Source Cassandra itself. As mentioned in my
>>> previous email, any solution getting adopted is fine; the important aspect
>>> is to have a repair solution in the OSS Cassandra itself!
>>>
>>> Yours Faithfully,
>>> Jaydeep
>>>
>>> On Mon, Jul 24, 2023 at 3:46 PM Jaydeep Chovatia <
>>> chovatia.jayd...@gmail.com> wrote:
>>>
>>> Hi German,
>>>
>>> The goal is always to backport our learnings back to the community. For
>>> example, I have already successfully backported the following two
>>> enhancements/bug fixes back to the Open Source Cassandra, which are
>>> described in the article. I am currently working on open-sourcing a
>>> few more enhancements mentioned in the article.
>>>
>>>1. https://issues.apache.org/jira/browse/CASSANDRA-18555
>>>2. https://issues.apache.org/jira/browse/CASSANDRA-13740
>>>
>>> There is definitely heavy interest in having the repai

Re: [Discuss] Repair inside C*

2023-07-25 Thread Chris Lohfink
I think a CEP is the next step. Considering the number of companies
involved, this might necessitate several drafts and rounds of discussions.
I appreciate your initiative in starting this process, and I'm eager to
contribute to the ensuing discussions. Maybe in a Google Doc or something
initially for more interactive feedback?

Regarding https://issues.apache.org/jira/browse/CASSANDRA-14346, we at
Netflix are currently putting effort into moving this into the sidecar,
as the idea was to start moving non-read/write-path things into separate
processes and JVMs so they don't impact each other.

I think the sidecar/in-process discussion might be a bit contentious, as I
know some feel even things like compaction should be moved out of process
in the future. On a personal note, my primary interest lies in seeing the
implementation realized, so I am willing to support whatever consensus
emerges. Whichever direction these go we will help with the implementation.

Chris

On Tue, Jul 25, 2023 at 1:09 PM Jaydeep Chovatia 
wrote:

> Sounds good, German. Feel free to let me know if you need my help
> in filing CEP, adding supporting content to the CEP, etc.
> As I mentioned previously, I have already been working (going through an
> internal review) on creating a one-pager doc, code, etc., that has been
> working for us for the last six years at an immense scale, and I will share
> it soon on a private fork.
>
> Thanks,
> Jaydeep
>
> On Tue, Jul 25, 2023 at 9:48 AM German Eichberger via dev <
> dev@cassandra.apache.org> wrote:
>
>> In [2] we suggested that the next step should be a CEP.
>>
>> I am happy to lend a hand to this effort as well.
>>
>> Thanks Jaydeep and David - really appreciated.
>>
>> German
>>
>> --
>> *From:* David Capwell 
>> *Sent:* Tuesday, July 25, 2023 8:32 AM
>> *To:* dev 
>> *Cc:* German Eichberger 
>> *Subject:* [EXTERNAL] Re: [Discuss] Repair inside C*
>>
>> As someone who has done a lot of work trying to make repair stable, I
>> approve of this message ^_^
>>
>> More than glad to help mentor this work
>>
>> On Jul 24, 2023, at 6:29 PM, Jaydeep Chovatia 
>> wrote:
>>
>> To clarify the repair solution timing, the one we have listed in the
>> article is not the recently developed one. We were hitting some
>> high-priority production challenges back in early 2018, and to address
>> that, we developed and rolled out the solution in production in just a few
>> months. The timing-wise, the solution was developed and productized by Q3
>> 2018, of course, continued to evolve thereafter. Usually, we explore the
>> existing solutions we can leverage, but when we started our journey in
>> early 2018, most of the solutions were based on sidecar solutions. There is
>> nothing against the sidecar solution; it was just a pure business decision,
>> and in that, we wanted to avoid the sidecar to avoid a dependency on the
>> control plane. Every solution developed has its deep context, merits, and
>> pros and cons; they are all great solutions!
>>
>> An appeal to the community members is to think one more time about having
>> repairs in the Open Source Cassandra itself. As mentioned in my previous
>> email, any solution getting adopted is fine; the important aspect is to
>> have a repair solution in the OSS Cassandra itself!
>>
>> Yours Faithfully,
>> Jaydeep
>>
>> On Mon, Jul 24, 2023 at 3:46 PM Jaydeep Chovatia <
>> chovatia.jayd...@gmail.com> wrote:
>>
>> Hi German,
>>
>> The goal is always to backport our learnings back to the community. For
>> example, I have already successfully backported the following two
>> enhancements/bug fixes back to the Open Source Cassandra, which are
>> described in the article. I am currently working on open-sourcing a
>> few more enhancements mentioned in the article.
>>
>>1. https://issues.apache.org/jira/browse/CASSANDRA-18555
>>2. https://issues.apache.org/jira/browse/CASSANDRA-13740
>>
>> There is definitely heavy interest in having the repair solution inside
>> the Open Source Cassandra itself, very much like Compaction. As I write
>> this email, we are internally working on a one-pager proposal doc to all
>> the community members on having a repair inside the OSS Apache Cassandra
>> along with our private fork - I will share it soon.
>>
>> Generally, we are ok with any solution getting adopted (either Joey's
>> solution or our repair solution or any other solution). The primary
>> motivation is to have the repair embedded inside the open-source Cassandra

Re: [Discuss] Repair inside C*

2023-07-25 Thread Jaydeep Chovatia
Sounds good, German. Feel free to let me know if you need my help in filing
CEP, adding supporting content to the CEP, etc.
As I mentioned previously, I have already been working (going through an
internal review) on creating a one-pager doc, code, etc., that has been
working for us for the last six years at an immense scale, and I will share
it soon on a private fork.

Thanks,
Jaydeep

On Tue, Jul 25, 2023 at 9:48 AM German Eichberger via dev <
dev@cassandra.apache.org> wrote:

> In [2] we suggested that the next step should be a CEP.
>
> I am happy to lend a hand to this effort as well.
>
> Thanks Jaydeep and David - really appreciated.
>
> German
>
> --
> *From:* David Capwell 
> *Sent:* Tuesday, July 25, 2023 8:32 AM
> *To:* dev 
> *Cc:* German Eichberger 
> *Subject:* [EXTERNAL] Re: [Discuss] Repair inside C*
>
> As someone who has done a lot of work trying to make repair stable, I
> approve of this message ^_^
>
> More than glad to help mentor this work
>
> On Jul 24, 2023, at 6:29 PM, Jaydeep Chovatia 
> wrote:
>
> To clarify the repair solution timing, the one we have listed in the
> article is not the recently developed one. We were hitting some
> high-priority production challenges back in early 2018, and to address
> that, we developed and rolled out the solution in production in just a few
> months. Timing-wise, the solution was developed and productized by Q3
> 2018 and, of course, continued to evolve thereafter. Usually, we explore the
> existing solutions we can leverage, but when we started our journey in
> early 2018, most of the solutions were based on sidecar solutions. There is
> nothing against the sidecar solution; it was just a pure business decision,
> and in that, we wanted to avoid the sidecar to avoid a dependency on the
> control plane. Every solution developed has its deep context, merits, and
> pros and cons; they are all great solutions!
>
> An appeal to the community members is to think one more time about having
> repairs in the Open Source Cassandra itself. As mentioned in my previous
> email, any solution getting adopted is fine; the important aspect is to
> have a repair solution in the OSS Cassandra itself!
>
> Yours Faithfully,
> Jaydeep
>
> On Mon, Jul 24, 2023 at 3:46 PM Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
> Hi German,
>
> The goal is always to backport our learnings back to the community. For
> example, I have already successfully backported the following two
> enhancements/bug fixes back to the Open Source Cassandra, which are
> described in the article. I am currently working on open-sourcing a
> few more enhancements mentioned in the article.
>
>1. https://issues.apache.org/jira/browse/CASSANDRA-18555
>2. https://issues.apache.org/jira/browse/CASSANDRA-13740
>
> There is definitely heavy interest in having the repair solution inside
> the Open Source Cassandra itself, very much like Compaction. As I write
> this email, we are internally working on a one-pager proposal doc to all
> the community members on having a repair inside the OSS Apache Cassandra
> along with our private fork - I will share it soon.
>
> Generally, we are ok with any solution getting adopted (either Joey's
> solution or our repair solution or any other solution). The primary
> motivation is to have the repair embedded inside the open-source Cassandra
> itself, so we can retire all various privately developed solutions
> eventually :)
>
> I am also happy to help (drive conversation, discussion, etc.) in any way
> to have a repair solution adopted inside Cassandra itself, please let me
> know. Happy to help!
>
> Yours Faithfully,
> Jaydeep
>
> On Mon, Jul 24, 2023 at 1:44 PM German Eichberger via dev <
> dev@cassandra.apache.org> wrote:
>
> All,
>
> We had a brief discussion in [2] about the Uber article [1], where they
> talk about having integrated repair into Cassandra and how great that is. I
> expressed my disappointment that they didn't work with the community on
> that (Uber, if you are listening, time to make amends), and it turns out
> Joey already had the idea and wrote the code [3], so I wanted to start a
> discussion to gauge interest and maybe figure out how to revive that effort.
>
> Thanks,
> German
>
> [1]
> https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
> [2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
> [3] https://issues.apache.org/jira/browse/CASSANDRA-14346
>
>
>


Re: [Discuss] Repair inside C*

2023-07-25 Thread German Eichberger via dev
In [2] we suggested that the next step should be a CEP.

I am happy to lend a hand to this effort as well.

Thanks, Jaydeep and David; really appreciated.

German


From: David Capwell 
Sent: Tuesday, July 25, 2023 8:32 AM
To: dev 
Cc: German Eichberger 
Subject: Re: [Discuss] Repair inside C*

As someone who has done a lot of work trying to make repair stable, I approve 
of this message ^_^

More than glad to help mentor this work

On Jul 24, 2023, at 6:29 PM, Jaydeep Chovatia  
wrote:

To clarify the repair solution timing: the one described in the article is not
a recent development. We were hitting some high-priority production challenges
back in early 2018, and to address them, we developed the solution and rolled it
out to production in just a few months. Timing-wise, the solution was developed
and productized by Q3 2018 and, of course, continued to evolve thereafter.
Usually, we explore existing solutions we can leverage, but when we started our
journey in early 2018, most of the available solutions were sidecar-based. There
is nothing against the sidecar approach; it was purely a business decision, and
we wanted to avoid a sidecar so as not to depend on the control plane. Every
solution was developed in its own context and has its own merits, pros, and
cons; they are all great solutions!

An appeal to the community members: please think one more time about having
repair in open-source Cassandra itself. As mentioned in my previous email, any
solution being adopted is fine; the important thing is to have a repair solution
in OSS Cassandra itself!

Yours Faithfully,
Jaydeep

On Mon, Jul 24, 2023 at 3:46 PM Jaydeep Chovatia 
<chovatia.jayd...@gmail.com> wrote:
Hi German,

The goal is always to contribute our learnings back to the community. For
example, I have already successfully contributed the following two
enhancements/bug fixes, which are described in the article, back to open-source
Cassandra. I am currently working on open-sourcing a few more of the
enhancements mentioned in the article.

  1.  https://issues.apache.org/jira/browse/CASSANDRA-18555
  2.  https://issues.apache.org/jira/browse/CASSANDRA-13740

There is definitely strong interest in having a repair solution inside
open-source Cassandra itself, very much like compaction. As I write this email,
we are working internally on a one-page proposal to the community for having
repair inside OSS Apache Cassandra, along with our private fork; I will share it
soon.

Generally, we are OK with any solution being adopted (Joey's solution, our
repair solution, or any other solution). The primary motivation is to have
repair embedded inside open-source Cassandra itself, so we can eventually retire
all the various privately developed solutions :)

I am also happy to help (drive conversation, discussion, etc.) in any way to get
a repair solution adopted inside Cassandra itself; please let me know. Happy to
help!

Yours Faithfully,
Jaydeep

On Mon, Jul 24, 2023 at 1:44 PM German Eichberger via dev 
<dev@cassandra.apache.org> wrote:
All,

We had a brief discussion in [2] about the Uber article [1], where they talk
about having integrated repair into Cassandra and how great that is. I
expressed my disappointment that they didn't work with the community on that
(Uber, if you are listening, time to make amends), and it turns out Joey
already had the idea and wrote the code [3], so I wanted to start a discussion
to gauge interest and maybe figure out how to revive that effort.

Thanks,
German

[1] https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
[2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
[3] https://issues.apache.org/jira/browse/CASSANDRA-14346



Re: [Discuss] Repair inside C*

2023-07-25 Thread David Capwell
As someone who has done a lot of work trying to make repair stable, I approve 
of this message ^_^

More than glad to help mentor this work

> On Jul 24, 2023, at 6:29 PM, Jaydeep Chovatia  
> wrote:
> 
> To clarify the repair solution timing: the one described in the article is
> not a recent development. We were hitting some high-priority production
> challenges back in early 2018, and to address them, we developed the
> solution and rolled it out to production in just a few months. Timing-wise,
> the solution was developed and productized by Q3 2018 and, of course,
> continued to evolve thereafter. Usually, we explore existing solutions we
> can leverage, but when we started our journey in early 2018, most of the
> available solutions were sidecar-based. There is nothing against the
> sidecar approach; it was purely a business decision, and we wanted to avoid
> a sidecar so as not to depend on the control plane. Every solution was
> developed in its own context and has its own merits, pros, and cons; they
> are all great solutions!
> 
> An appeal to the community members: please think one more time about
> having repair in open-source Cassandra itself. As mentioned in my previous
> email, any solution being adopted is fine; the important thing is to have a
> repair solution in OSS Cassandra itself!
> 
> Yours Faithfully,
> Jaydeep
> 
> On Mon, Jul 24, 2023 at 3:46 PM Jaydeep Chovatia wrote:
>> Hi German,
>> 
>> The goal is always to contribute our learnings back to the community. For 
>> example, I have already successfully contributed the following two 
>> enhancements/bug fixes, which are described in the article, back to 
>> open-source Cassandra. I am currently working on open-sourcing a few more 
>> of the enhancements mentioned in the article.
>> https://issues.apache.org/jira/browse/CASSANDRA-18555
>> https://issues.apache.org/jira/browse/CASSANDRA-13740
>> There is definitely strong interest in having a repair solution inside 
>> open-source Cassandra itself, very much like compaction. As I write this 
>> email, we are working internally on a one-page proposal to the community 
>> for having repair inside OSS Apache Cassandra, along with our private 
>> fork; I will share it soon.
>> 
>> Generally, we are OK with any solution being adopted (Joey's solution, our 
>> repair solution, or any other solution). The primary motivation is to have 
>> repair embedded inside open-source Cassandra itself, so we can eventually 
>> retire all the various privately developed solutions :)
>> 
>> I am also happy to help (drive conversation, discussion, etc.) in any way 
>> to get a repair solution adopted inside Cassandra itself; please let me 
>> know. Happy to help!
>> 
>> Yours Faithfully,
>> Jaydeep
>> 
>> On Mon, Jul 24, 2023 at 1:44 PM German Eichberger via dev 
>> <dev@cassandra.apache.org> wrote:
>>> All,
>>> 
>>> We had a brief discussion in [2] about the Uber article [1], where they 
>>> talk about having integrated repair into Cassandra and how great that is. I 
>>> expressed my disappointment that they didn't work with the community on 
>>> that (Uber, if you are listening, time to make amends), and it turns out 
>>> Joey already had the idea and wrote the code [3], so I wanted to start a 
>>> discussion to gauge interest and maybe figure out how to revive that effort.
>>> 
>>> Thanks,
>>> German
>>> 
>>> [1] 
>>> https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
>>> [2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
>>> [3] https://issues.apache.org/jira/browse/CASSANDRA-14346



Re: [Discuss] Repair inside C*

2023-07-24 Thread Jaydeep Chovatia
To clarify the repair solution timing: the one described in the article is
not a recent development. We were hitting some high-priority production
challenges back in early 2018, and to address them, we developed the
solution and rolled it out to production in just a few months. Timing-wise,
the solution was developed and productized by Q3 2018 and, of course,
continued to evolve thereafter. Usually, we explore existing solutions we
can leverage, but when we started our journey in early 2018, most of the
available solutions were sidecar-based. There is nothing against the
sidecar approach; it was purely a business decision, and we wanted to avoid
a sidecar so as not to depend on the control plane. Every solution was
developed in its own context and has its own merits, pros, and cons; they
are all great solutions!

An appeal to the community members: please think one more time about having
repair in open-source Cassandra itself. As mentioned in my previous email,
any solution being adopted is fine; the important thing is to have a repair
solution in OSS Cassandra itself!

Yours Faithfully,
Jaydeep

On Mon, Jul 24, 2023 at 3:46 PM Jaydeep Chovatia 
wrote:

> Hi German,
>
> The goal is always to contribute our learnings back to the community. For
> example, I have already successfully contributed the following two
> enhancements/bug fixes, which are described in the article, back to
> open-source Cassandra. I am currently working on open-sourcing a few more
> of the enhancements mentioned in the article.
>
>    1. https://issues.apache.org/jira/browse/CASSANDRA-18555
>    2. https://issues.apache.org/jira/browse/CASSANDRA-13740
>
> There is definitely strong interest in having a repair solution inside
> open-source Cassandra itself, very much like compaction. As I write this
> email, we are working internally on a one-page proposal to the community
> for having repair inside OSS Apache Cassandra, along with our private
> fork; I will share it soon.
>
> Generally, we are OK with any solution being adopted (Joey's solution, our
> repair solution, or any other solution). The primary motivation is to have
> repair embedded inside open-source Cassandra itself, so we can eventually
> retire all the various privately developed solutions :)
>
> I am also happy to help (drive conversation, discussion, etc.) in any way
> to get a repair solution adopted inside Cassandra itself; please let me
> know. Happy to help!
>
> Yours Faithfully,
> Jaydeep
>
> On Mon, Jul 24, 2023 at 1:44 PM German Eichberger via dev <
> dev@cassandra.apache.org> wrote:
>
>> All,
>>
>> We had a brief discussion in [2] about the Uber article [1], where they
>> talk about having integrated repair into Cassandra and how great that is. I
>> expressed my disappointment that they didn't work with the community on
>> that (Uber, if you are listening, time to make amends), and it turns out
>> Joey already had the idea and wrote the code [3], so I wanted to start a
>> discussion to gauge interest and maybe figure out how to revive that effort.
>>
>> Thanks,
>> German
>>
>> [1]
>> https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
>> [2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
>> [3] https://issues.apache.org/jira/browse/CASSANDRA-14346
>>
>


Re: [Discuss] Repair inside C*

2023-07-24 Thread Jaydeep Chovatia
Hi German,

The goal is always to contribute our learnings back to the community. For
example, I have already successfully contributed the following two
enhancements/bug fixes, which are described in the article, back to
open-source Cassandra. I am currently working on open-sourcing a few more of
the enhancements mentioned in the article.

   1. https://issues.apache.org/jira/browse/CASSANDRA-18555
   2. https://issues.apache.org/jira/browse/CASSANDRA-13740

There is definitely strong interest in having a repair solution inside
open-source Cassandra itself, very much like compaction. As I write this
email, we are working internally on a one-page proposal to the community for
having repair inside OSS Apache Cassandra, along with our private fork; I
will share it soon.

Generally, we are OK with any solution being adopted (Joey's solution, our
repair solution, or any other solution). The primary motivation is to have
repair embedded inside open-source Cassandra itself, so we can eventually
retire all the various privately developed solutions :)

I am also happy to help (drive conversation, discussion, etc.) in any way
to get a repair solution adopted inside Cassandra itself; please let me
know. Happy to help!

Yours Faithfully,
Jaydeep

On Mon, Jul 24, 2023 at 1:44 PM German Eichberger via dev <
dev@cassandra.apache.org> wrote:

> All,
>
> We had a brief discussion in [2] about the Uber article [1], where they
> talk about having integrated repair into Cassandra and how great that is. I
> expressed my disappointment that they didn't work with the community on
> that (Uber, if you are listening, time to make amends), and it turns out
> Joey already had the idea and wrote the code [3], so I wanted to start a
> discussion to gauge interest and maybe figure out how to revive that effort.
>
> Thanks,
> German
>
> [1]
> https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
> [2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
> [3] https://issues.apache.org/jira/browse/CASSANDRA-14346
>


[Discuss] Repair inside C*

2023-07-24 Thread German Eichberger via dev
All,

We had a brief discussion in [2] about the Uber article [1], where they talk
about having integrated repair into Cassandra and how great that is. I
expressed my disappointment that they didn't work with the community on that
(Uber, if you are listening, time to make amends), and it turns out Joey
already had the idea and wrote the code [3], so I wanted to start a discussion
to gauge interest and maybe figure out how to revive that effort.

Thanks,
German

[1] https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
[2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
[3] https://issues.apache.org/jira/browse/CASSANDRA-14346