Re: [DISCUSS] Tooling to repair MV through a Spark job

2024-12-06 Thread Jaydeep Chovatia
Thanks for the information, Yifan and James!

Given that, we can scope this email discussion only to this specific MV repair. Two points:

1. Can this MV repair job provide some value addition?
2. If yes, does it even make sense to merge this MV repair tooling, which uses Spark as its underlying technology, with Cassandra Analytics?

Jaydeep

On Dec 6, 2024, at 3:58 PM, Yifan Cai  wrote:

> Oh, I just noticed that James already mentioned it.











Re: [DISCUSS] Tooling to repair MV through a Spark job

2024-12-06 Thread Yifan Cai
Oh, I just noticed that James already mentioned it.

On Fri, Dec 6, 2024 at 3:51 PM Yifan Cai  wrote:

> I would like to highlight an existing tooling for "many things beyond the
> MV work, such as counting rows, etc."
>
> The Apache Cassandra Analytics project (
> http://github.com/apache/cassandra-analytics/) could be a great resource
> for this type of task. It reads directly from the SSTables in the Spark
> executors, which avoids sending CQL queries that could stress the cluster
> or interfere with the production traffic.
>
> - Yifan
>
> On Fri, Dec 6, 2024 at 8:27 AM Jaydeep Chovatia <
> [email protected]> wrote:
>
>> Hi,
>>
>> *NOTE: *This email does not promote using Cassandra's Materialized View
>> (MV) but assists those stuck with it for various reasons.
>>
>> The primary issue with MV is that once it goes out of sync with the base
>> table, no tooling is available to remediate it. This Spark job aims to fill
>> this gap by logically comparing the MV with the base table and identifying
>> inconsistencies. The job primarily does the following:
>>
>>- Scans the base table (A) and the MV (B), and does an {A}-{B} analysis
>>- Categorizes each record into one of four areas: a) Consistent,
>>b) Inconsistent, c) MissingInMV, d) MissingInBaseTable
>>- Provides a detailed view of each mismatch, including the primary key, all
>>the non-primary-key fields, and the mismatched columns
>>- Dumps the detailed information to an output folder path provided to
>>the job (one can extend the interface to dump the records to an object
>>store as well)
>>- Optionally fixes the MV inconsistencies
>>- Rich configuration (throttling, actionable output, the capability to
>>specify a time range for the records, etc.) to run the job at scale in a
>>production environment
>>
>> Design doc: link
>> 
>> The Git Repository: link
>> 
>>
>> *Motivation*
>>
>>1. This email's primary objective is to share with the community that
>>something like this is available for MV (in a private repository), which
>>may be helpful in emergencies to folks stuck with MV in production.
>>2. If we, as a community, want to officially foster tooling using
>>Spark because it can be helpful to do many things beyond the MV work, such
>>as counting rows, etc., then I am happy to drive the efforts.
>>
>> Please let me know what you think.
>>
>> Jaydeep
>>
>


Re: [DISCUSS] Tooling to repair MV through a Spark job

2024-12-06 Thread Yifan Cai
I would like to highlight an existing tooling for "many things beyond the
MV work, such as counting rows, etc."

The Apache Cassandra Analytics project (
http://github.com/apache/cassandra-analytics/) could be a great resource
for this type of task. It reads directly from the SSTables in the Spark
executors, which avoids sending CQL queries that could stress the cluster
or interfere with the production traffic.

- Yifan

On Fri, Dec 6, 2024 at 8:27 AM Jaydeep Chovatia 
wrote:

> Hi,
>
> *NOTE: *This email does not promote using Cassandra's Materialized View
> (MV) but assists those stuck with it for various reasons.
>
> The primary issue with MV is that once it goes out of sync with the base
> table, no tooling is available to remediate it. This Spark job aims to fill
> this gap by logically comparing the MV with the base table and identifying
> inconsistencies. The job primarily does the following:
>
>- Scans the base table (A) and the MV (B), and does an {A}-{B} analysis
>- Categorizes each record into one of four areas: a) Consistent, b)
>Inconsistent, c) MissingInMV, d) MissingInBaseTable
>- Provides a detailed view of each mismatch, including the primary key, all
>the non-primary-key fields, and the mismatched columns
>- Dumps the detailed information to an output folder path provided to
>the job (one can extend the interface to dump the records to an object
>store as well)
>- Optionally fixes the MV inconsistencies
>- Rich configuration (throttling, actionable output, the capability to
>specify a time range for the records, etc.) to run the job at scale in a
>production environment
>
> Design doc: link
> 
> The Git Repository: link
> 
>
> *Motivation*
>
>1. This email's primary objective is to share with the community that
>something like this is available for MV (in a private repository), which
>may be helpful in emergencies to folks stuck with MV in production.
>2. If we, as a community, want to officially foster tooling using
>Spark because it can be helpful to do many things beyond the MV work, such
>as counting rows, etc., then I am happy to drive the efforts.
>
> Please let me know what you think.
>
> Jaydeep
>


Re: [DISCUSS] Tooling to repair MV through a Spark job

2024-12-06 Thread Jaydeep Chovatia
There are two approaches I have been thinking about for MV.

*1. Short Term (Status Quo)*
Here, we do not improve the Cassandra MV architecture to drastically reduce
data inconsistencies; thus, we continue to mark MV as an experimental
feature.

In this case, we have two sub-options to make the data eventually
consistent.

   1. Use external tools (such as Spark)
   2. Do it internally within Cassandra

*2. Long Term (Fix MV Architecturally)*
@[email protected]  and I have been discussing a few
strategies to solve the fundamental issues with the current MV
architecture, such as reducing the possibility of inconsistency by an order
of magnitude. We are considering solving this within the DB itself.

tl;dr

For #1, using an external framework like Spark makes it much easier,
because we need to hold all the data for the base table and the MV in
memory to do the {A}-{B} record comparison. It is not infeasible within
Cassandra, but it is challenging, and we might end up building a
mini-Spark type of application inside Cassandra.
So the current thinking for #1 is to rely on external frameworks to make
quick progress, so that something is available in emergencies, and to
devote the energy towards #2.
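To make the {A}-{B} comparison concrete, here is a minimal sketch of the four-way categorization in plain Python. This is not the actual job's code: the row layout and function name are invented for illustration, and the real tool would run this as a distributed comparison across Spark executors rather than on in-memory dicts.

```python
# Minimal sketch of the {A}-{B} categorization step, assuming the base
# table and the MV have each been materialized as {primary_key: row} maps.
# A real Spark job would express this as a distributed full outer join.

def categorize(base_rows, mv_rows):
    """Classify every primary key into one of the four buckets."""
    result = {"Consistent": [], "Inconsistent": [],
              "MissingInMV": [], "MissingInBaseTable": []}
    for pk in base_rows.keys() | mv_rows.keys():
        if pk not in mv_rows:
            result["MissingInMV"].append(pk)
        elif pk not in base_rows:
            result["MissingInBaseTable"].append(pk)
        elif base_rows[pk] == mv_rows[pk]:
            result["Consistent"].append(pk)
        else:
            # Record which non-primary-key columns disagree.
            mismatched = [c for c in base_rows[pk]
                          if base_rows[pk][c] != mv_rows[pk].get(c)]
            result["Inconsistent"].append((pk, mismatched))
    return result

base = {1: {"v": "a"}, 2: {"v": "b"}, 3: {"v": "c"}}
mv   = {1: {"v": "a"}, 2: {"v": "x"}, 4: {"v": "d"}}
print(categorize(base, mv))
```

The "Inconsistent" bucket carries the mismatched column names alongside the key, which is what makes the job's output actionable for a targeted fix-up pass.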

Jaydeep

On Fri, Dec 6, 2024 at 10:28 AM Jeff Jirsa  wrote:

> It feels uncomfortable asking users to rely on a third party that’s as
> heavy-weight as Spark to use a built-in feature.
>
> Can we really not do this internally? I get that the obvious way with
> merkle trees is hard because the range fanout of the MV using a different
> partitioner, but have we tried to think up a way to do this (somewhat
> efficiently) within the db?
>
>
> On Dec 6, 2024, at 9:08 AM, James Berragan  wrote:
>
> I think this would be useful and - having never really used Materialized
> Views - I didn't know it was an issue for some users. I would say the
> Cassandra Analytics library (http://github.com/apache/cassandra-analytics/)
> could be utilized for much of this, with a specialized Spark job for this
> purpose.
>
> On Fri, 6 Dec 2024 at 08:26, Jaydeep Chovatia 
> wrote:
>
>> Hi,
>>
>> *NOTE: *This email does not promote using Cassandra's Materialized View
>> (MV) but assists those stuck with it for various reasons.
>>
>> The primary issue with MV is that once it goes out of sync with the base
>> table, no tooling is available to remediate it. This Spark job aims to fill
>> this gap by logically comparing the MV with the base table and identifying
>> inconsistencies. The job primarily does the following:
>>
>>- Scans the base table (A) and the MV (B), and does an {A}-{B} analysis
>>- Categorizes each record into one of four areas: a) Consistent,
>>b) Inconsistent, c) MissingInMV, d) MissingInBaseTable
>>- Provides a detailed view of each mismatch, including the primary key, all
>>the non-primary-key fields, and the mismatched columns
>>- Dumps the detailed information to an output folder path provided to
>>the job (one can extend the interface to dump the records to an object
>>store as well)
>>- Optionally fixes the MV inconsistencies
>>- Rich configuration (throttling, actionable output, the capability to
>>specify a time range for the records, etc.) to run the job at scale in a
>>production environment
>>
>> Design doc: link
>> 
>> The Git Repository: link
>> 
>>
>> *Motivation*
>>
>>1. This email's primary objective is to share with the community that
>>something like this is available for MV (in a private repository), which
>>may be helpful in emergencies to folks stuck with MV in production.
>>2. If we, as a community, want to officially foster tooling using
>>Spark because it can be helpful to do many things beyond the MV work, such
>>as counting rows, etc., then I am happy to drive the efforts.
>>
>> Please let me know what you think.
>>
>> Jaydeep
>>
>
>


Re: [DISCUSS] Tooling to repair MV through a Spark job

2024-12-06 Thread Jeff Jirsa
It feels uncomfortable asking users to rely on a third party that’s as
heavy-weight as Spark to use a built-in feature.

Can we really not do this internally? I get that the obvious way with merkle 
trees is hard because the range fanout of the MV using a different partitioner, 
but have we tried to think up a way to do this (somewhat efficiently) within 
the db? 
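The difficulty with a merkle-tree approach can be illustrated in a few lines: the base table and the MV partition the same logical row by different keys, so the two copies hash to unrelated tokens and are owned by different replicas. A toy sketch follows, using md5 as a stand-in for Cassandra's Murmur3 partitioner; the table layout and key values are hypothetical.

```python
import hashlib

def token(partition_key: str) -> int:
    # Stand-in for Cassandra's Murmur3 partitioner: map a partition key
    # to a position on the token ring.
    return int.from_bytes(
        hashlib.md5(partition_key.encode()).digest()[:8], "big")

# Base table partitioned by user_id; an MV over the same data
# partitioned by email.
row = {"user_id": "u123", "email": "[email protected]"}

base_token = token(row["user_id"])  # locates the base-table copy
mv_token = token(row["email"])      # locates the MV copy

# The two tokens are unrelated, so a merkle tree built over base-table
# token ranges cannot be compared range-for-range with one built over
# MV token ranges: every base range fans out to arbitrary MV ranges.
print(base_token != mv_token)  # → True
```

Because the mapping between the two rings is effectively random, comparing one base-table range requires touching data scattered across the entire MV, which is the fanout problem described above.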


> On Dec 6, 2024, at 9:08 AM, James Berragan  wrote:
> 
> I think this would be useful and - having never really used Materialized 
> Views - I didn't know it was an issue for some users. I would say the 
> Cassandra Analytics library (http://github.com/apache/cassandra-analytics/) 
> could be utilized for much of this, with a specialized Spark job for this 
> purpose.
> 
> On Fri, 6 Dec 2024 at 08:26, Jaydeep Chovatia  wrote:
>> Hi,
>> 
>> NOTE: This email does not promote using Cassandra's Materialized View (MV) 
>> but assists those stuck with it for various reasons.
>> 
>> The primary issue with MV is that once it goes out of sync with the base 
>> table, no tooling is available to remediate it. This Spark job aims to fill 
>> this gap by logically comparing the MV with the base table and identifying 
>> inconsistencies. The job primarily does the following:
>> - Scans the base table (A) and the MV (B), and does an {A}-{B} analysis
>> - Categorizes each record into one of four areas: a) Consistent, b)
>> Inconsistent, c) MissingInMV, d) MissingInBaseTable
>> - Provides a detailed view of each mismatch, including the primary key, all
>> the non-primary-key fields, and the mismatched columns
>> - Dumps the detailed information to an output folder path provided to the
>> job (one can extend the interface to dump the records to an object store as
>> well)
>> - Optionally fixes the MV inconsistencies
>> - Rich configuration (throttling, actionable output, the capability to
>> specify a time range for the records, etc.) to run the job at scale in a
>> production environment
>> Design doc: link 
>> 
>> The Git Repository: link 
>> 
>> 
>> Motivation
>> This email's primary objective is to share with the community that something 
>> like this is available for MV (in a private repository), which may be 
>> helpful in emergencies to folks stuck with MV in production.
>> If we, as a community, want to officially foster tooling using Spark because 
>> it can be helpful to do many things beyond the MV work, such as counting 
>> rows, etc., then I am happy to drive the efforts.
>> Please let me know what you think.
>> 
>> Jaydeep



Re: [DISCUSS] Tooling to repair MV through a Spark job

2024-12-06 Thread James Berragan
I think this would be useful and - having never really used Materialized
Views - I didn't know it was an issue for some users. I would say the
Cassandra Analytics library (http://github.com/apache/cassandra-analytics/)
could be utilized for much of this, with a specialized Spark job for this
purpose.

On Fri, 6 Dec 2024 at 08:26, Jaydeep Chovatia 
wrote:

> Hi,
>
> *NOTE: *This email does not promote using Cassandra's Materialized View
> (MV) but assists those stuck with it for various reasons.
>
> The primary issue with MV is that once it goes out of sync with the base
> table, no tooling is available to remediate it. This Spark job aims to fill
> this gap by logically comparing the MV with the base table and identifying
> inconsistencies. The job primarily does the following:
>
>- Scans the base table (A) and the MV (B), and does an {A}-{B} analysis
>- Categorizes each record into one of four areas: a) Consistent, b)
>Inconsistent, c) MissingInMV, d) MissingInBaseTable
>- Provides a detailed view of each mismatch, including the primary key, all
>the non-primary-key fields, and the mismatched columns
>- Dumps the detailed information to an output folder path provided to
>the job (one can extend the interface to dump the records to an object
>store as well)
>- Optionally fixes the MV inconsistencies
>- Rich configuration (throttling, actionable output, the capability to
>specify a time range for the records, etc.) to run the job at scale in a
>production environment
>
> Design doc: link
> 
> The Git Repository: link
> 
>
> *Motivation*
>
>1. This email's primary objective is to share with the community that
>something like this is available for MV (in a private repository), which
>may be helpful in emergencies to folks stuck with MV in production.
>2. If we, as a community, want to officially foster tooling using
>Spark because it can be helpful to do many things beyond the MV work, such
>as counting rows, etc., then I am happy to drive the efforts.
>
> Please let me know what you think.
>
> Jaydeep
>