Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-08-05 Thread Jaydeep Chovatia
> What if the CEP includes an interface for MV repair that calls out to some
> user pluggable solution and the spark-based solution you've developed is
> the first / only reference solution available at the outset? That way we
> could integrate it into the control plane (nodetool, JMX, a post CEP-37
> world), have a solution available for those comfortable taking on the spark
> dependency, and have a clear paved path for future anti-entropy work on
> where to fit into the ecosystem.

It is acceptable to include an orchestration mechanism as a stretch goal of
this CEP, which would provide the following capabilities:

   1. Trigger MV Repair: invoke a pluggable JAR to start the MV repair
      process via a method such as:

      String invokeMVRepair(String keyspace, String baseTable, String mvTable,
                            List<String> options)

   2. Track Job Status: expose an API to query the status of the repair job:

      JobStatus jobStatus(String id)

This orchestration framework could be built on top of the existing
AutoRepair infrastructure from CEP-37, and extended within Cassandra to
manage MV-specific repair workflows. It would process one or more MVs at a
time by invoking the repair via API #1 and tracking the status via API #2.
The default implementation can be a noop, allowing users to plug in their
own logic. This design leaves room for integration with external repair
solutions, such as the cassandra-mv-repair-spark-job, or for users to build
custom repair mechanisms.
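For concreteness, a minimal sketch of what such a pluggable surface might look like, with the noop default described above. All names here (interface, enum, constant) are illustrative assumptions, not part of the CEP text:

```java
import java.util.List;

// Illustrative sketch only: the interface, enum, and constant names are
// hypothetical and do not reflect an agreed design.
interface MVRepairProvider {

    enum JobStatus { NOT_STARTED, RUNNING, SUCCEEDED, FAILED }

    // API #1: invoke a pluggable implementation to start repairing one MV;
    // returns an opaque job ID for later status polling.
    String invokeMVRepair(String keyspace, String baseTable, String mvTable, List<String> options);

    // API #2: query the status of a previously started repair job.
    JobStatus jobStatus(String id);

    // The proposed default: a noop, so users can plug in their own logic
    // (e.g. an external Spark-based job) without Cassandra shipping one.
    MVRepairProvider NOOP = new MVRepairProvider() {
        @Override
        public String invokeMVRepair(String keyspace, String baseTable, String mvTable, List<String> options) {
            return "noop"; // nothing is scheduled
        }

        @Override
        public JobStatus jobStatus(String id) {
            return JobStatus.NOT_STARTED;
        }
    };
}
```

The orchestration layer would then only ever talk to this interface, leaving the choice of repair engine entirely to the operator.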
However, it's important to note that the cassandra-mv-repair-spark-job
itself is explicitly out of scope for this CEP, as it has already been
addressed in a separate discussion thread.

Jaydeep

On Fri, Aug 1, 2025 at 6:42 AM Josh McKenzie  wrote:

> Definitely want to avoid scope creep, *however*... ;)
>
> What if the CEP includes an interface for MV repair that calls out to some
> user pluggable solution and the spark-based solution you've developed is
> the first / only reference solution available at the outset? That way we
> could integrate it into the control plane (nodetool, JMX, a post CEP-37
> world), have a solution available for those comfortable taking on the spark
> dependency, and have a clear paved path for future anti-entropy work on
> where to fit into the ecosystem.
>
> On Thu, Jul 31, 2025, at 5:20 PM, Runtian Liu wrote:
>
> Based on our discussion, it seems we’ve reached broad agreement on the
> hot‑path optimization for strict MV consistency mode. We still have
> concerns about the different approaches for the repair path. I’ve updated
> the CEP’s scope and retitled it *‘Cassandra Materialized View
> Consistency, Reliability & Backfill Enhancement’.* I removed the repair
> section and added a backfill section that proposes an optimized strategy to
> backfill MVs from the base table. Please take a look and share your
> thoughts. I believe this backfill strategy will also benefit future MV
> repair work.
>
> Regarding repair,
>
> While repair is essential to maintain Materialized View (MV) consistency
> in the face of one-off bugs or bit rot, implementing a robust, native
> repair mechanism within Apache Cassandra remains a highly complex and still
> largely theoretical challenge, irrespective of the strategies considered in
> prior discussions.
>
> To ensure steady progress, we propose addressing the problem
> incrementally. This CEP focuses on making the hot-path & backfill reliable,
> laying a solid foundation. A future CEP can then address repair mechanisms
> within the Apache Cassandra ecosystem in a more focused and well-scoped
> manner.
> For now, the proposed CEP clarifies that users may choose to either
> evaluate the external Spark-based repair tool (which is not part of this
> proposal) or develop their own repair solution
> tailored to their operational needs.
>
> On Thu, Jul 3, 2025 at 4:32 PM Runtian Liu  wrote:
>
> I believe that, with careful design, we could make the row-level repair
> work using the index-based approach.
> However, regarding the CEP-proposed solution, I want to highlight that the
> entire repair job can be divided into two parts:
>
>1.
>
>Inconsistency detection
>2.
>
>Rebuilding the inconsistent ranges
>
> The second step could look like this:
>
> nodetool mv_rebuild --base_table_range <range> --mv_ranges
> [<range1>, <range2>, ...]
>
>
> The base table range and MV ranges would come from the first step. But
> it’s also possible to run this rebuild directly—without detecting
> inconsistencies—if we already know that some nodes need repair. This means
> we don’t have to wait for the snapshot processing to start the repair.
> Snapshot-based detection works well when the cluster is generally healthy.
>
> Compared to the current rebuild, the proposed approach doesn’t require
> dropping the MV and rebuilding it from scratch across the entire
> cluster—which is a major blocker for production environments.

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-08-01 Thread Josh McKenzie
Definitely want to avoid scope creep, *however*... ;)

What if the CEP includes an interface for MV repair that calls out to some user 
pluggable solution and the spark-based solution you've developed is the first / 
only reference solution available at the outset? That way we could integrate it 
into the control plane (nodetool, JMX, a post CEP-37 world), have a solution 
available for those comfortable taking on the spark dependency, and have a 
clear paved path for future anti-entropy work on where to fit into the 
ecosystem.

On Thu, Jul 31, 2025, at 5:20 PM, Runtian Liu wrote:
> Based on our discussion, it seems we’ve reached broad agreement on the 
> hot‑path optimization for strict MV consistency mode. We still have concerns 
> about the different approaches for the repair path. I’ve updated the CEP’s 
> scope and retitled it *‘Cassandra Materialized View Consistency, Reliability 
> & Backfill Enhancement’.* I removed the repair section and added a backfill 
> section that proposes an optimized strategy to backfill MVs from the base 
> table. Please take a look and share your thoughts. I believe this backfill 
> strategy will also benefit future MV repair work.
> 
> Regarding repair,
> 
> While repair is essential to maintain Materialized View (MV) consistency in 
> the face of one-off bugs or bit rot, implementing a robust, native repair 
> mechanism within Apache Cassandra remains a highly complex and still largely 
> theoretical challenge, irrespective of the strategies considered in prior 
> discussions.
> 
> To ensure steady progress, we propose addressing the problem incrementally. 
> This CEP focuses on making the hot-path & backfill reliable, laying a solid 
> foundation. A future CEP can then address repair mechanisms within the Apache 
> Cassandra ecosystem in a more focused and well-scoped manner. 
> 
> For now, the proposed CEP clarifies that users may choose to either evaluate 
> the external Spark-based repair tool (which is
> not part of this proposal) or develop their own repair solution tailored to 
> their operational needs.
> 
> On Thu, Jul 3, 2025 at 4:32 PM Runtian Liu  wrote:
>> I believe that, with careful design, we could make the row-level repair work 
>> using the index-based approach.
>> However, regarding the CEP-proposed solution, I want to highlight that the 
>> entire repair job can be divided into two parts:
>> 
>>  1. Inconsistency detection
>> 
>>  2. Rebuilding the inconsistent ranges
>> 
>> The second step could look like this:
>> 
>> nodetool mv_rebuild --base_table_range <range> --mv_ranges
>> [<range1>, <range2>, ...]
>> 
>> 
>> 
>> The base table range and MV ranges would come from the first step. But it’s 
>> also possible to run this rebuild directly—without detecting 
>> inconsistencies—if we already know that some nodes need repair. This means 
>> we don’t have to wait for the snapshot processing to start the repair. 
>> Snapshot-based detection works well when the cluster is generally healthy.
>> 
>> Compared to the current rebuild, the proposed approach doesn’t require 
>> dropping the MV and rebuilding it from scratch across the entire 
>> cluster—which is a major blocker for production environments.
>> 
>> That said, for the index-based approach, I still think it introduces 
>> additional load on both the write path and compaction. This increases the 
>> resource cost of using MVs. While it’s a nice feature to enable row-level MV 
>> repair, its implementation is complex. On the other hand, the original 
>> proposal’s rebuild strategy is more versatile and likely to perform 
>> better—especially when only a few nodes need repair. In contrast, 
>> large-scale row-level comparisons via the index-based approach could be 
>> prohibitively expensive. Just to clarify my intention here: this is not a 
>> complete 'no' to the index-based approach. However, for the initial version, 
>> I believe it's more prudent to avoid impacting the hot path. We can start 
>> with a simpler design that keeps the critical path untouched, and then 
>> iterate to make it more robust over time—much like how other features in 
>> Apache Cassandra have evolved. For example, 'full repair' came first, 
>> followed by 'incremental repair'; similarly, STCS was later complemented by 
>> LCS. This phased evolution allows us to balance safety, stability, and 
>> long-term capability.
>> 
>> Given all this, I still prefer to move forward with the original proposal. 
>> It allows us to avoid paying the repair overhead most of the time.
>> 
>> 
>> 
>> On Tue, Jun 24, 2025 at 3:25 PM Blake Eggleston  wrote:
>>> Those are both fair points. Once you start dealing with data loss though, 
>>> maintaining guarantees is often impossible, so I’m not sure that torn 
>>> writes or updated timestamps are dealbreakers, but I’m fine with tabling 
>>> option 2 for now and seeing if we can figure out something better.
>>> 
>>> Regarding the assassin cells, if you wanted to avoid explicitly agreeing
>>> on a value, you might be able to only issue them for repaired base data,
>>> which has been implicitly agreed upon.

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-07-31 Thread Runtian Liu
Based on our discussion, it seems we’ve reached broad agreement on the
hot‑path optimization for strict MV consistency mode. We still have
concerns about the different approaches for the repair path. I’ve updated
the CEP’s scope and retitled it ‘Cassandra Materialized View Consistency,
Reliability & Backfill Enhancement’. I removed the repair section and added
a backfill section that proposes an optimized strategy to backfill MVs from
the base table. Please take a look and share your thoughts. I believe this
backfill strategy will also benefit future MV repair work.

Regarding repair,

While repair is essential to maintain Materialized View (MV) consistency in
the face of one-off bugs or bit rot, implementing a robust, native repair
mechanism within Apache Cassandra remains a highly complex and still
largely theoretical challenge, irrespective of the strategies considered in
prior discussions.

To ensure steady progress, we propose addressing the problem incrementally.
This CEP focuses on making the hot-path & backfill reliable, laying a solid
foundation. A future CEP can then address repair mechanisms within the
Apache Cassandra ecosystem in a more focused and well-scoped manner.
For now, the proposed CEP clarifies that users may choose to either
evaluate the external Spark-based repair tool (which
is not part of this proposal) or develop their own repair solution tailored
to their operational needs.

On Thu, Jul 3, 2025 at 4:32 PM Runtian Liu  wrote:

> I believe that, with careful design, we could make the row-level repair
> work using the index-based approach.
> However, regarding the CEP-proposed solution, I want to highlight that the
> entire repair job can be divided into two parts:
>
>1.
>
>Inconsistency detection
>
>2.
>
>Rebuilding the inconsistent ranges
>
>
> The second step could look like this:
>
> nodetool mv_rebuild --base_table_range <range> --mv_ranges
> [<range1>, <range2>, ...]
>
>
> The base table range and MV ranges would come from the first step. But
> it’s also possible to run this rebuild directly—without detecting
> inconsistencies—if we already know that some nodes need repair. This means
> we don’t have to wait for the snapshot processing to start the repair.
> Snapshot-based detection works well when the cluster is generally healthy.
>
> Compared to the current rebuild, the proposed approach doesn’t require
> dropping the MV and rebuilding it from scratch across the entire
> cluster—which is a major blocker for production environments.
>
> That said, for the index-based approach, I still think it introduces
> additional load on both the write path and compaction. This increases the
> resource cost of using MVs. While it’s a nice feature to enable row-level
> MV repair, its implementation is complex. On the other hand, the original
> proposal’s rebuild strategy is more versatile and likely to perform
> better—especially when only a few nodes need repair. In contrast,
> large-scale row-level comparisons via the index-based approach could be
> prohibitively expensive. Just to clarify my intention here: this is not a
> complete 'no' to the index-based approach. However, for the initial
> version, I believe it's more prudent to avoid impacting the hot path. We
> can start with a simpler design that keeps the critical path untouched, and
> then iterate to make it more robust over time—much like how other features
> in Apache Cassandra have evolved. For example, 'full repair' came first,
> followed by 'incremental repair'; similarly, STCS was later complemented by
> LCS. This phased evolution allows us to balance safety, stability, and
> long-term capability.
>
> Given all this, I still prefer to move forward with the original proposal.
> It allows us to avoid paying the repair overhead most of the time.
>
>
> On Tue, Jun 24, 2025 at 3:25 PM Blake Eggleston 
> wrote:
>
>> Those are both fair points. Once you start dealing with data loss though,
>> maintaining guarantees is often impossible, so I’m not sure that torn
>> writes or updated timestamps are dealbreakers, but I’m fine with tabling
>> option 2 for now and seeing if we can figure out something better.
>>
>> Regarding the assassin cells, if you wanted to avoid explicitly agreeing
>> on a value, you might be able to only issue them for repaired base data,
>> which has been implicitly agreed upon.
>>
>> I think that or something like it is worth exploring. The idea would be
>> to solve this issue as completely as anti-compaction would - but without
>> having to rewrite sstables. I’d be interested to hear any ideas you have
>> about how that might work.
>>
>> You basically need a mechanism to erase some piece of data that was
>> written before a given wall clock time - regardless of cell timestamp, and
>> without precluding future updates (in wall clock time) with earlier
>> timestamps.
>>
>> On Mon, Jun 23, 2025, at 4:28 PM, Runtian Liu wrote:
>>
>> In the second option, we use the repair timestamp to re-update any cell
>> or row we want to fix in the base table.

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-07-03 Thread Runtian Liu
I believe that, with careful design, we could make the row-level repair
work using the index-based approach.
However, regarding the CEP-proposed solution, I want to highlight that the
entire repair job can be divided into two parts:

   1. Inconsistency detection

   2. Rebuilding the inconsistent ranges


The second step could look like this:

nodetool mv_rebuild --base_table_range <range> --mv_ranges
[<range1>, <range2>, ...]


The base table range and MV ranges would come from the first step. But it’s
also possible to run this rebuild directly—without detecting
inconsistencies—if we already know that some nodes need repair. This means
we don’t have to wait for the snapshot processing to start the repair.
Snapshot-based detection works well when the cluster is generally healthy.
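To make the two-part split concrete, here is a rough Java sketch (all type and method names are hypothetical) of how detection could feed the rebuild, while still allowing the rebuild to be invoked directly with operator-supplied ranges:

```java
import java.util.List;

// Hypothetical sketch of the two-step repair flow described above.
// Type and method names are illustrative only.
final class MvRepairFlowSketch {
    record TokenRange(long start, long end) {}

    // Step 1: snapshot-based (or other) inconsistency detection.
    interface Detector { List<TokenRange> findInconsistentRanges(String baseTable, String mv); }

    // Step 2: range-scoped rebuild, as in the proposed `nodetool mv_rebuild`.
    interface Rebuilder { void rebuild(String mv, List<TokenRange> ranges); }

    // Detection feeds rebuild; an operator who already knows which nodes need
    // repair can call Rebuilder.rebuild directly and skip detection entirely.
    static void repair(Detector detector, Rebuilder rebuilder, String baseTable, String mv) {
        List<TokenRange> bad = detector.findInconsistentRanges(baseTable, mv);
        if (!bad.isEmpty())
            rebuilder.rebuild(mv, bad);
    }
}
```

The point of the split is that the expensive detection step is optional, not a prerequisite for every rebuild.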

Compared to the current rebuild, the proposed approach doesn’t require
dropping the MV and rebuilding it from scratch across the entire
cluster—which is a major blocker for production environments.

That said, for the index-based approach, I still think it introduces
additional load on both the write path and compaction. This increases the
resource cost of using MVs. While it’s a nice feature to enable row-level
MV repair, its implementation is complex. On the other hand, the original
proposal’s rebuild strategy is more versatile and likely to perform
better—especially when only a few nodes need repair. In contrast,
large-scale row-level comparisons via the index-based approach could be
prohibitively expensive. Just to clarify my intention here: this is not a
complete 'no' to the index-based approach. However, for the initial
version, I believe it's more prudent to avoid impacting the hot path. We
can start with a simpler design that keeps the critical path untouched, and
then iterate to make it more robust over time—much like how other features
in Apache Cassandra have evolved. For example, 'full repair' came first,
followed by 'incremental repair'; similarly, STCS was later complemented by
LCS. This phased evolution allows us to balance safety, stability, and
long-term capability.

Given all this, I still prefer to move forward with the original proposal.
It allows us to avoid paying the repair overhead most of the time.


On Tue, Jun 24, 2025 at 3:25 PM Blake Eggleston 
wrote:

> Those are both fair points. Once you start dealing with data loss though,
> maintaining guarantees is often impossible, so I’m not sure that torn
> writes or updated timestamps are dealbreakers, but I’m fine with tabling
> option 2 for now and seeing if we can figure out something better.
>
> Regarding the assassin cells, if you wanted to avoid explicitly agreeing
> on a value, you might be able to only issue them for repaired base data,
> which has been implicitly agreed upon.
>
> I think that or something like it is worth exploring. The idea would be to
> solve this issue as completely as anti-compaction would - but without
> having to rewrite sstables. I’d be interested to hear any ideas you have
> about how that might work.
>
> You basically need a mechanism to erase some piece of data that was
> written before a given wall clock time - regardless of cell timestamp, and
> without precluding future updates (in wall clock time) with earlier
> timestamps.
>
> On Mon, Jun 23, 2025, at 4:28 PM, Runtian Liu wrote:
>
> In the second option, we use the repair timestamp to re-update any cell or
> row we want to fix in the base table. This approach is problematic because
> it alters the write time of user-supplied data. Although Cassandra does not
> allow users to set timestamps for LWT writes, users may still rely on the
> update time. A key limitation of this approach is that it cannot fix cases
> where a view cell ends up in a future state while the base table remains
> correct. I now understand your point that Cassandra cannot handle this
> scenario today. However, as I mentioned earlier, the important distinction
> is that when this issue occurs in the base table, we accept the "incorrect"
> data as valid—but this is not acceptable for materialized views, since the
> source of truth (the base table) still holds the correct data.
>
> On Mon, Jun 23, 2025 at 12:05 PM Blake Eggleston 
> wrote:
>
>
> > Sorry, Blake—I was traveling last week and couldn’t reply to your email
> sooner.
>
> No worries, I’ll be taking some time off soon as well.
>
> > I don’t think the first or second option is ideal. We should treat the
> base table as the source of truth. Modifying it—or forcing an update on it,
> even if it’s just a timestamp change—is not a good approach and won’t solve
> all problems.
>
> I agree the first option probably isn’t the right way to go. Could you say
> a bit more about why the second option is not a good approach and which
> problems it won’t solve?
>
> On Sun, Jun 22, 2025, at 6:09 PM, Runtian Liu wrote:
>
> Sorry, Blake—I was traveling last week and couldn’t reply to your email
> sooner.
>
> First - we interpret view data with higher timestamps than the base
> table as data that’s missing from the base and replicate it into the base
> table.

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-24 Thread Blake Eggleston
Those are both fair points. Once you start dealing with data loss though, 
maintaining guarantees is often impossible, so I’m not sure that torn writes or 
updated timestamps are dealbreakers, but I’m fine with tabling option 2 for now 
and seeing if we can figure out something better.

Regarding the assassin cells, if you wanted to avoid explicitly agreeing on a 
value, you might be able to only issue them for repaired base data, which has 
been implicitly agreed upon.

I think that or something like it is worth exploring. The idea would be to 
solve this issue as completely as anti-compaction would - but without having to 
rewrite sstables. I’d be interested to hear any ideas you have about how that 
might work.

You basically need a mechanism to erase some piece of data that was written 
before a given wall clock time - regardless of cell timestamp, and without 
precluding future updates (in wall clock time) with earlier timestamps.
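As a toy model of that requirement (purely illustrative, not a proposed design), the extra bookkeeping would be a wall-clock write time recorded per cell, with erasure keyed on it rather than on the logical timestamp:

```java
// Toy model of the erase-by-wall-clock idea above; all names are hypothetical.
final class AssassinCellSketch {
    // A cell tagged with its logical timestamp and the wall-clock time at
    // which this replica persisted it (the extra bookkeeping the idea needs).
    record Cell(long logicalTimestamp, String value, long writtenAtMillis) {}

    // An assassin erases data persisted before a wall-clock cutoff,
    // regardless of the data's logical timestamp.
    record Assassin(long cutoffMillis) {}

    // Erased iff written (wall clock) before the cutoff. A later re-write
    // carrying the same, or an earlier, logical timestamp survives, which is
    // the "without precluding future updates with earlier timestamps" property.
    static boolean erases(Assassin a, Cell c) {
        return c.writtenAtMillis() < a.cutoffMillis();
    }
}
```

The hard part, which this sketch deliberately ignores, is that Cassandra does not currently persist a per-cell wall-clock write time, so the mechanism implies a storage format change.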

On Mon, Jun 23, 2025, at 4:28 PM, Runtian Liu wrote:
> In the second option, we use the repair timestamp to re-update any cell or 
> row we want to fix in the base table. This approach is problematic because it 
> alters the write time of user-supplied data. Although Cassandra does not 
> allow users to set timestamps for LWT writes, users may still rely on the 
> update time. A key limitation of this approach is that it cannot fix cases 
> where a view cell ends up in a future state while the base table remains 
> correct. I now understand your point that Cassandra cannot handle this 
> scenario today. However, as I mentioned earlier, the important distinction is 
> that when this issue occurs in the base table, we accept the "incorrect" data 
> as valid—but this is not acceptable for materialized views, since the source 
> of truth (the base table) still holds the correct data.
> 
> On Mon, Jun 23, 2025 at 12:05 PM Blake Eggleston  wrote:
>> > Sorry, Blake—I was traveling last week and couldn’t reply to your email 
>> > sooner.
>> 
>> No worries, I’ll be taking some time off soon as well.
>> 
>> > I don’t think the first or second option is ideal. We should treat the 
>> > base table as the source of truth. Modifying it—or forcing an update on 
>> > it, even if it’s just a timestamp change—is not a good approach and won’t 
>> > solve all problems.
>> 
>> I agree the first option probably isn’t the right way to go. Could you say a 
>> bit more about why the second option is not a good approach and which 
>> problems it won’t solve?
>> 
>> On Sun, Jun 22, 2025, at 6:09 PM, Runtian Liu wrote:
>>> Sorry, Blake—I was traveling last week and couldn’t reply to your email 
>>> sooner.
>>> 
>>> > First - we interpret view data with higher timestamps than the base table 
>>> > as data that’s missing from the base and replicate it into the base 
>>> > table. The timestamp of the missing data may be below the paxos timestamp 
>>> > low bound so we’d have to adjust the paxos coordination logic to allow 
>>> > that in this case. Depending on how the view got this way it may also 
>>> > tear writes to the base table, breaking the write atomicity promise.
>>> 
>>> As discussed earlier, we want this MV repair mechanism to handle all edge 
>>> cases. However, it would be difficult to design it in a way that detects 
>>> the root cause of each mismatch and repairs it accordingly. Additionally, 
>>> as you mentioned, this approach could introduce other issues, such as 
>>> violating the write atomicity guarantee.
>>> 
>>> > Second - If this happens it means that we’ve either lost base table data 
>>> > or paxos metadata. If that happened, we could force a base table update 
>>> > that rewrites the current base state with new timestamps making the extra 
>>> > view data removable. However this wouldn’t fix the case where the view 
>>> > cell has a timestamp from the future - although that’s not a case that C* 
>>> > can fix today either.
>>> 
>>> I don’t think the first or second option is ideal. We should treat the base 
>>> table as the source of truth. Modifying it—or forcing an update on it, even 
>>> if it’s just a timestamp change—is not a good approach and won’t solve all 
>>> problems.
>>> 
>>> > the idea to use anti-compaction makes a lot more sense now (in principle 
>>> > - I don’t think it’s workable in practice)
>>> 
>>> I have one question regarding anti-compaction. Is the main concern that 
>>> processing too much data during anti-compaction could cause issues for the 
>>> cluster? 
>>> 
>>> > I guess you could add some sort of assassin cell that is meant to remove 
>>> > a cell with a specific timestamp and value, but is otherwise invisible. 
>>> 
>>> The idea of the assassination cell is interesting. To prevent data from 
>>> being incorrectly removed during the repair process, we need to ensure a 
>>> quorum of nodes is available and agrees on the same value before repairing 
>>> a materialized view (MV) row or cell. However, this could be very 
>>> expensive, as it requires coordination to repair even a single row.

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-23 Thread Runtian Liu
In the second option, we use the repair timestamp to re-update any cell or
row we want to fix in the base table. This approach is problematic because
it alters the write time of user-supplied data. Although Cassandra does not
allow users to set timestamps for LWT writes, users may still rely on the
update time. A key limitation of this approach is that it cannot fix cases
where a view cell ends up in a future state while the base table remains
correct. I now understand your point that Cassandra cannot handle this
scenario today. However, as I mentioned earlier, the important distinction
is that when this issue occurs in the base table, we accept the "incorrect"
data as valid—but this is not acceptable for materialized views, since the
source of truth (the base table) still holds the correct data.

On Mon, Jun 23, 2025 at 12:05 PM Blake Eggleston 
wrote:

> > Sorry, Blake—I was traveling last week and couldn’t reply to your email
> sooner.
>
> No worries, I’ll be taking some time off soon as well.
>
> > I don’t think the first or second option is ideal. We should treat the
> base table as the source of truth. Modifying it—or forcing an update on it,
> even if it’s just a timestamp change—is not a good approach and won’t solve
> all problems.
>
> I agree the first option probably isn’t the right way to go. Could you say
> a bit more about why the second option is not a good approach and which
> problems it won’t solve?
>
> On Sun, Jun 22, 2025, at 6:09 PM, Runtian Liu wrote:
>
> Sorry, Blake—I was traveling last week and couldn’t reply to your email
> sooner.
>
> > First - we interpret view data with higher timestamps than the base
> table as data that’s missing from the base and replicate it into the base
> table. The timestamp of the missing data may be below the paxos timestamp
> low bound so we’d have to adjust the paxos coordination logic to allow that
> in this case. Depending on how the view got this way it may also tear
> writes to the base table, breaking the write atomicity promise.
>
> As discussed earlier, we want this MV repair mechanism to handle all edge
> cases. However, it would be difficult to design it in a way that detects
> the root cause of each mismatch and repairs it accordingly. Additionally,
> as you mentioned, this approach could introduce other issues, such as
> violating the write atomicity guarantee.
>
> > Second - If this happens it means that we’ve either lost base table data
> or paxos metadata. If that happened, we could force a base table update
> that rewrites the current base state with new timestamps making the extra
> view data removable. However this wouldn’t fix the case where the view cell
> has a timestamp from the future - although that’s not a case that C* can
> fix today either.
>
> I don’t think the first or second option is ideal. We should treat the
> base table as the source of truth. Modifying it—or forcing an update on it,
> even if it’s just a timestamp change—is not a good approach and won’t solve
> all problems.
>
> > the idea to use anti-compaction makes a lot more sense now (in principle
> - I don’t think it’s workable in practice)
>
> I have one question regarding anti-compaction. Is the main concern that
> processing too much data during anti-compaction could cause issues for the
> cluster?
>
> > I guess you could add some sort of assassin cell that is meant to remove
> a cell with a specific timestamp and value, but is otherwise invisible.
>
> The idea of the assassination cell is interesting. To prevent data from
> being incorrectly removed during the repair process, we need to ensure a
> quorum of nodes is available and agrees on the same value before repairing
> a materialized view (MV) row or cell. However, this could be very
> expensive, as it requires coordination to repair even a single row.
>
> I think there are a few key differences between MV repair and normal
> anti-entropy repair:
>
>1.
>
>For normal anti-entropy repair, there is no single source of truth.
>All replicas act as sources of truth, and eventual consistency is
>achieved by merging data across them. Even if some data is lost or bugs
>result in future timestamps, the replicas will eventually converge on the
>same (possibly incorrect) value, and that value becomes the accepted truth.
>In contrast, MV repair relies on the base table as the source of truth and
>modifies the MV if inconsistencies are detected. This means that in some
>cases, simply merging data won’t resolve the issue, since Cassandra
>resolves conflicts using timestamps and there’s no guarantee the base table
>will always "win" unless we change the MV merging logic—which I’ve been
>trying to avoid.
>2.
>
>I see MV repair as more of an on-demand operation, whereas normal
>anti-entropy repair needs to run regularly. This means we shouldn’t treat
>MV repair the same way as existing repairs. When an operator initiates MV
>repair, they need to ensure t

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-23 Thread Blake Eggleston
> Sorry, Blake—I was traveling last week and couldn’t reply to your email 
> sooner.

No worries, I’ll be taking some time off soon as well.

> I don’t think the first or second option is ideal. We should treat the base 
> table as the source of truth. Modifying it—or forcing an update on it, even 
> if it’s just a timestamp change—is not a good approach and won’t solve all 
> problems.

I agree the first option probably isn’t the right way to go. Could you say a 
bit more about why the second option is not a good approach and which problems 
it won’t solve?

On Sun, Jun 22, 2025, at 6:09 PM, Runtian Liu wrote:
> Sorry, Blake—I was traveling last week and couldn’t reply to your email 
> sooner.
> 
> > First - we interpret view data with higher timestamps than the base table 
> > as data that’s missing from the base and replicate it into the base table. 
> > The timestamp of the missing data may be below the paxos timestamp low 
> > bound so we’d have to adjust the paxos coordination logic to allow that in 
> > this case. Depending on how the view got this way it may also tear writes 
> > to the base table, breaking the write atomicity promise.
> 
> As discussed earlier, we want this MV repair mechanism to handle all edge 
> cases. However, it would be difficult to design it in a way that detects the 
> root cause of each mismatch and repairs it accordingly. Additionally, as you 
> mentioned, this approach could introduce other issues, such as violating the 
> write atomicity guarantee.
> 
> > Second - If this happens it means that we’ve either lost base table data or 
> > paxos metadata. If that happened, we could force a base table update that 
> > rewrites the current base state with new timestamps making the extra view 
> > data removable. However this wouldn’t fix the case where the view cell has 
> > a timestamp from the future - although that’s not a case that C* can fix 
> > today either.
> 
> I don’t think the first or second option is ideal. We should treat the base 
> table as the source of truth. Modifying it—or forcing an update on it, even 
> if it’s just a timestamp change—is not a good approach and won’t solve all 
> problems.
> 
> > the idea to use anti-compaction makes a lot more sense now (in principle - 
> > I don’t think it’s workable in practice)
> 
> I have one question regarding anti-compaction. Is the main concern that 
> processing too much data during anti-compaction could cause issues for the 
> cluster? 
> 
> > I guess you could add some sort of assassin cell that is meant to remove a 
> > cell with a specific timestamp and value, but is otherwise invisible. 
> 
> The idea of the assassination cell is interesting. To prevent data from being 
> incorrectly removed during the repair process, we need to ensure a quorum of 
> nodes is available and agrees on the same value before repairing a 
> materialized view (MV) row or cell. However, this could be very expensive, as 
> it requires coordination to repair even a single row.
> 
> I think there are a few key differences between MV repair and normal 
> anti-entropy repair:
> 
>  1. For normal anti-entropy repair, there is no single source of truth.
> All replicas act as sources of truth, and eventual consistency is achieved by 
> merging data across them. Even if some data is lost or bugs result in future 
> timestamps, the replicas will eventually converge on the same (possibly 
> incorrect) value, and that value becomes the accepted truth. In contrast, MV 
> repair relies on the base table as the source of truth and modifies the MV if 
> inconsistencies are detected. This means that in some cases, simply merging 
> data won’t resolve the issue, since Cassandra resolves conflicts using 
> timestamps and there’s no guarantee the base table will always "win" unless 
> we change the MV merging logic—which I’ve been trying to avoid.
> 
>  2. I see MV repair as more of an on-demand operation, whereas normal 
> anti-entropy repair needs to run regularly. This means we shouldn’t treat MV 
> repair the same way as existing repairs. When an operator initiates MV 
> repair, they need to ensure that sufficient resources are available to 
> support it.
> 
> 
> On Thu, Jun 12, 2025 at 8:53 AM Blake Eggleston  wrote:
>> Got it, thanks for clearing that up. I’d seen the strict liveness code 
>> around but didn’t realize it was MV related and hadn’t dug into what it did 
>> or how it worked. 
>> 
>> I think you’re right about the row liveness update working for extra data 
>> with timestamps lower than the most recent base table update.
>> 
>> I see what you mean about the timestamp from the future case. I thought of 3 
>> options:
>> 
>> First - we interpret view data with higher timestamps than the base table as 
>> data that’s missing from the base and replicate it into the base table. The 
>> timestamp of the missing data may be below the paxos timestamp low bound so 
>> we’d have to adjust the paxos coordination logic to allow that in this case. 
>> Depending on how the view got this way it may also tear writes to the base 
>> table, breaking the write atomicity promise.

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-22 Thread Runtian Liu
Sorry, Blake—I was traveling last week and couldn’t reply to your email
sooner.

> First - we interpret view data with higher timestamps than the base table
as data that’s missing from the base and replicate it into the base table.
The timestamp of the missing data may be below the paxos timestamp low
bound so we’d have to adjust the paxos coordination logic to allow that in
this case. Depending on how the view got this way it may also tear writes
to the base table, breaking the write atomicity promise.

As discussed earlier, we want this MV repair mechanism to handle all edge
cases. However, it would be difficult to design it in a way that detects
the root cause of each mismatch and repairs it accordingly. Additionally,
as you mentioned, this approach could introduce other issues, such as
violating the write atomicity guarantee.

> Second - If this happens it means that we’ve either lost base table data
or paxos metadata. If that happened, we could force a base table update
that rewrites the current base state with new timestamps making the extra
view data removable. However this wouldn’t fix the case where the view cell
has a timestamp from the future - although that’s not a case that C* can
fix today either.

I don’t think the first or second option is ideal. We should treat the base
table as the source of truth. Modifying it—or forcing an update on it, even
if it’s just a timestamp change—is not a good approach and won’t solve all
problems.

> the idea to use anti-compaction makes a lot more sense now (in principle
- I don’t think it’s workable in practice)

I have one question regarding anti-compaction. Is the main concern that
processing too much data during anti-compaction could cause issues for the
cluster?

> I guess you could add some sort of assassin cell that is meant to remove
a cell with a specific timestamp and value, but is otherwise invisible.

The idea of the assassination cell is interesting. To prevent data from
being incorrectly removed during the repair process, we need to ensure a
quorum of nodes is available and agrees on the same value before repairing
a materialized view (MV) row or cell. However, this could be very
expensive, as it requires coordination to repair even a single row.

I think there are a few key differences between MV repair and normal
anti-entropy repair:

   1. For normal anti-entropy repair, there is no single source of truth.
   All replicas act as sources of truth, and eventual consistency is
   achieved by merging data across them. Even if some data is lost or bugs
   result in future timestamps, the replicas will eventually converge on the
   same (possibly incorrect) value, and that value becomes the accepted truth.
   In contrast, MV repair relies on the base table as the source of truth and
   modifies the MV if inconsistencies are detected. This means that in some
   cases, simply merging data won’t resolve the issue, since Cassandra
   resolves conflicts using timestamps and there’s no guarantee the base table
   will always "win" unless we change the MV merging logic—which I’ve been
   trying to avoid.

   2. I see MV repair as more of an on-demand operation, whereas normal
   anti-entropy repair needs to run regularly. This means we shouldn’t treat
   MV repair the same way as existing repairs. When an operator initiates MV
   repair, they need to ensure that sufficient resources are available to
   support it.


On Thu, Jun 12, 2025 at 8:53 AM Blake Eggleston 
wrote:

> Got it, thanks for clearing that up. I’d seen the strict liveness code
> around but didn’t realize it was MV related and hadn’t dug into what it did
> or how it worked.
>
> I think you’re right about the row liveness update working for extra data
> with timestamps lower than the most recent base table update.
>
> I see what you mean about the timestamp from the future case. I thought of
> 3 options:
>
> First - we interpret view data with higher timestamps than the base table
> as data that’s missing from the base and replicate it into the base table.
> The timestamp of the missing data may be below the paxos timestamp low
> bound so we’d have to adjust the paxos coordination logic to allow that in
> this case. Depending on how the view got this way it may also tear writes
> to the base table, breaking the write atomicity promise.
>
> Second - If this happens it means that we’ve either lost base table data
> or paxos metadata. If that happened, we could force a base table update
> that rewrites the current base state with new timestamps making the extra
> view data removable. However this wouldn’t fix the case where the view cell
> has a timestamp from the future - although that’s not a case that C* can
> fix today either.
>
> Third - we add a new tombstone type or some mechanism to delete specific
> cells that doesn’t preclude correct writes with lower timestamps from being
> visible. I’m not sure how this would work, and the idea to use
> anti-compaction makes a lot more sense now (in principle - I don’t think
> it’s workable in practice).

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-12 Thread Blake Eggleston
Got it, thanks for clearing that up. I’d seen the strict liveness code around 
but didn’t realize it was MV related and hadn’t dug into what it did or how it 
worked. 

I think you’re right about the row liveness update working for extra data with 
timestamps lower than the most recent base table update.

I see what you mean about the timestamp from the future case. I thought of 3 
options:

First - we interpret view data with higher timestamps than the base table as 
data that’s missing from the base and replicate it into the base table. The 
timestamp of the missing data may be below the paxos timestamp low bound so 
we’d have to adjust the paxos coordination logic to allow that in this case. 
Depending on how the view got this way it may also tear writes to the base 
table, breaking the write atomicity promise.

Second - If this happens it means that we’ve either lost base table data or 
paxos metadata. If that happened, we could force a base table update that 
rewrites the current base state with new timestamps making the extra view data 
removable. However this wouldn’t fix the case where the view cell has a 
timestamp from the future - although that’s not a case that C* can fix today 
either.

Third - we add a new tombstone type or some mechanism to delete specific cells 
that doesn’t preclude correct writes with lower timestamps from being visible. 
I’m not sure how this would work, and the idea to use anti-compaction makes a 
lot more sense now (in principle - I don’t think it’s workable in practice). I 
guess you could add some sort of assassin cell that is meant to remove a cell 
with a specific timestamp and value, but is otherwise invisible. This seems 
dangerous though, since it’s likely there’s a replication problem that may 
resolve itself and the repair process would actually be removing data that the 
user intended to write.

Paulo - I don’t think storage changes are off the table, but they do expand the 
scope and risk of the proposal, so we should be careful.

On Wed, Jun 11, 2025, at 4:44 PM, Paulo Motta wrote:
>  > I’m not sure if this is the only edge case—there may be other issues as 
> well. I’m also unsure whether we should redesign the tombstone handling for 
> MVs, since that would involve changes to the storage engine. To minimize 
> impact there, the original proposal was to rebuild the affected ranges using 
> anti-compaction, just to be safe.
> 
> I haven't been following the discussion but I think one of the issues with 
> the materialized view "strict liveness" fix[1] is that we avoided making 
> invasive changes to the storage engine at the time, but this was considered 
> by Zhao on [1]. I think we shouldn't be trying to avoid updates to the 
> storage format as part of the MV implementation, if this is what it takes to 
> make MVs V2 reliable.
> 
> [1] - 
> https://issues.apache.org/jira/browse/CASSANDRA-11500?focusedCommentId=16101603&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16101603
> 
> On Wed, Jun 11, 2025 at 7:02 PM Runtian Liu  wrote:
>> The current design leverages strict liveness to shadow the old view row. 
>> When the view-indexed value changes from 'a' to 'b', no tombstone is 
>> written; instead, the old row is marked as expired by updating its liveness 
>> info with the timestamp of the change. If the column is later set back to 
>> 'a', the view row is re-inserted with a new, non-expired liveness info 
>> reflecting the latest timestamp.
>> 
>> To delete an extra row in the materialized view (MV), we can likely use the 
>> same approach—marking it as shadowed by updating the liveness info. However, 
>> in the case of inconsistencies where a column in the MV has a higher 
>> timestamp than the corresponding column in the base table, this row-level 
>> liveness mechanism is insufficient.
>> 
>> Even for the case where we delete the row by marking its liveness info as 
>> expired during repair, there are concerns. Since this introduces a data 
>> mutation as part of the repair process, it’s unclear whether there could be 
>> edge cases we’re missing. This approach may risk unexpected side effects if 
>> the repair logic is not carefully aligned with write path semantics.
>> 
>> 
>> On Thu, Jun 12, 2025 at 3:59 AM Blake Eggleston  wrote:
>>> That’s a good point, although as described I don’t think that could ever 
>>> work properly, even in normal operation. Either we’re misunderstanding 
>>> something, or this is a flaw in the current MV design.
>>> 
>>> Assuming changing the view indexed column results in a tombstone being 
>>> applied to the view row for the previous value, if we wrote the other base 
>>> columns (the non view indexed ones) to the view with the same timestamps 
>>> they have on the base, then changing the view indexed value from ‘a’ to 
>>> ‘b’, then back to ‘a’ would always cause this problem. I think you’d need 
>>> to always update the column timestamps on the view to be >= the view column 
>>> timestamp on the base

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-11 Thread Paulo Motta
 > I’m not sure if this is the only edge case—there may be other issues as
well. I’m also unsure whether we should redesign the tombstone handling for
MVs, since that would involve changes to the storage engine. To minimize
impact there, the original proposal was to rebuild the affected ranges
using anti-compaction, just to be safe.

I haven't been following the discussion but I think one of the issues with
the materialized view "strict liveness" fix[1] is that we avoided making
invasive changes to the storage engine at the time, but this was considered
by Zhao on [1]. I think we shouldn't be trying to avoid updates to the
storage format as part of the MV implementation, if this is what it takes
to make MVs V2 reliable.

[1] -
https://issues.apache.org/jira/browse/CASSANDRA-11500?focusedCommentId=16101603&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16101603

On Wed, Jun 11, 2025 at 7:02 PM Runtian Liu  wrote:

> The current design leverages strict liveness to shadow the old view row.
> When the view-indexed value changes from 'a' to 'b', no tombstone is
> written; instead, the old row is marked as expired by updating its liveness
> info with the timestamp of the change. If the column is later set back to
> 'a', the view row is re-inserted with a new, non-expired liveness info
> reflecting the latest timestamp.
>
> To delete an extra row in the materialized view (MV), we can likely use
> the same approach—marking it as shadowed by updating the liveness info.
> However, in the case of inconsistencies where a column in the MV has a
> higher timestamp than the corresponding column in the base table, this
> row-level liveness mechanism is insufficient.
>
> Even for the case where we delete the row by marking its liveness info as
> expired during repair, there are concerns. Since this introduces a data
> mutation as part of the repair process, it’s unclear whether there could be
> edge cases we’re missing. This approach may risk unexpected side effects if
> the repair logic is not carefully aligned with write path semantics.
>
>
> On Thu, Jun 12, 2025 at 3:59 AM Blake Eggleston 
> wrote:
>
>> That’s a good point, although as described I don’t think that could ever
>> work properly, even in normal operation. Either we’re misunderstanding
>> something, or this is a flaw in the current MV design.
>>
>> Assuming changing the view indexed column results in a tombstone being
>> applied to the view row for the previous value, if we wrote the other base
>> columns (the non view indexed ones) to the view with the same timestamps
>> they have on the base, then changing the view indexed value from ‘a’ to
>> ‘b’, then back to ‘a’ would always cause this problem. I think you’d need
>> to always update the column timestamps on the view to be >= the view column
>> timestamp on the base
>>
>> On Tue, Jun 10, 2025, at 11:38 PM, Runtian Liu wrote:
>>
>> > In the case of a missed update, we'll have a new value and we can send
>> a tombstone to the view with the timestamp of the most recent update.
>>
>> > then something has gone wrong and we should issue a tombstone using the
>> paxos repair timestamp as the tombstone timestamp.
>>
>> The current MV implementation uses “strict liveness” to determine whether
>> a row is live. I believe that using regular tombstones during repair could
>> cause problems. For example, consider a base table with schema (pk, ck, v1,
>> v2) and a materialized view with schema (v1, pk, ck) -> v2. If, for some
>> reason, we detect an extra row in the MV and delete it using a tombstone
>> with the latest update timestamp, we may run into issues. Suppose we later
>> update the base table’s v1 field to match the MV row we previously deleted,
>> and the v2 value now has an older timestamp. In that case, the previously
>> issued tombstone could still shadow the v2 column, which is unintended.
>> That is why I was asking if we are going to introduce a new kind of
>> tombstones. I’m not sure if this is the only edge case—there may be
>> other issues as well. I’m also unsure whether we should redesign the
>> tombstone handling for MVs, since that would involve changes to the storage
>> engine. To minimize impact there, the original proposal was to rebuild the
>> affected ranges using anti-compaction, just to be safe.
>>
>> On Wed, Jun 11, 2025 at 1:20 AM Blake Eggleston 
>> wrote:
>>
>>
>>  Extra row in MV (assuming the tombstone is gone in the base table) — how
>> should we fix this?
>>
>>
>>
>> This would mean that the base table had either updated or deleted a row
>> and the view didn't receive the corresponding delete.
>>
>> In the case of a missed update, we'll have a new value and we can send a
>> tombstone to the view with the timestamp of the most recent update. Since
>> timestamps issued by paxos and accord writes are always increasing
>> monotonically and don't have collisions, this is safe.
>>
>> In the case of a row deletion, we'd also want to send a tombstone with
>>

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-11 Thread Runtian Liu
The current design leverages strict liveness to shadow the old view row.
When the view-indexed value changes from 'a' to 'b', no tombstone is
written; instead, the old row is marked as expired by updating its liveness
info with the timestamp of the change. If the column is later set back to
'a', the view row is re-inserted with a new, non-expired liveness info
reflecting the latest timestamp.
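As a rough illustration of the shadowing mechanism described above, here is a toy Python model. The names and structure here are purely illustrative, not Cassandra's actual storage internals:

```python
from dataclasses import dataclass

@dataclass
class ViewRow:
    # Simplified model of an MV row: key is the view-indexed value,
    # liveness_ts models the row's liveness info, expired marks a shadowed row.
    key: str
    liveness_ts: int
    expired: bool = False

def on_base_update(view: dict, old_value, new_value: str, ts: int) -> None:
    """Toy model of the strict-liveness update path: no tombstone is written;
    the old view row is shadowed by marking its liveness info expired."""
    if old_value in view:
        old_row = view[old_value]
        if ts > old_row.liveness_ts:
            old_row.liveness_ts = ts
            old_row.expired = True  # shadowed, not deleted
    # The new view row is (re-)inserted with fresh, non-expired liveness.
    view[new_value] = ViewRow(new_value, ts)

view = {}
on_base_update(view, None, 'a', ts=1)  # v1 set to 'a'
on_base_update(view, 'a', 'b', ts=2)   # 'a' row shadowed, 'b' inserted
on_base_update(view, 'b', 'a', ts=3)   # 'a' re-inserted, live again
live = [r.key for r in view.values() if not r.expired]
# live is ['a']: the row for 'a' is visible again despite having been shadowed
```

The point of the sketch is the last step: because shadowing is expressed through liveness info rather than a tombstone, setting the column back to 'a' simply re-inserts a live row.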

To delete an extra row in the materialized view (MV), we can likely use the
same approach—marking it as shadowed by updating the liveness info.
However, in the case of inconsistencies where a column in the MV has a
higher timestamp than the corresponding column in the base table, this
row-level liveness mechanism is insufficient.

Even for the case where we delete the row by marking its liveness info as
expired during repair, there are concerns. Since this introduces a data
mutation as part of the repair process, it’s unclear whether there could be
edge cases we’re missing. This approach may risk unexpected side effects if
the repair logic is not carefully aligned with write path semantics.


On Thu, Jun 12, 2025 at 3:59 AM Blake Eggleston 
wrote:

> That’s a good point, although as described I don’t think that could ever
> work properly, even in normal operation. Either we’re misunderstanding
> something, or this is a flaw in the current MV design.
>
> Assuming changing the view indexed column results in a tombstone being
> applied to the view row for the previous value, if we wrote the other base
> columns (the non view indexed ones) to the view with the same timestamps
> they have on the base, then changing the view indexed value from ‘a’ to
> ‘b’, then back to ‘a’ would always cause this problem. I think you’d need
> to always update the column timestamps on the view to be >= the view column
> timestamp on the base
>
> On Tue, Jun 10, 2025, at 11:38 PM, Runtian Liu wrote:
>
> > In the case of a missed update, we'll have a new value and we can send
> a tombstone to the view with the timestamp of the most recent update.
>
> > then something has gone wrong and we should issue a tombstone using the
> paxos repair timestamp as the tombstone timestamp.
>
> The current MV implementation uses “strict liveness” to determine whether
> a row is live. I believe that using regular tombstones during repair could
> cause problems. For example, consider a base table with schema (pk, ck, v1,
> v2) and a materialized view with schema (v1, pk, ck) -> v2. If, for some
> reason, we detect an extra row in the MV and delete it using a tombstone
> with the latest update timestamp, we may run into issues. Suppose we later
> update the base table’s v1 field to match the MV row we previously deleted,
> and the v2 value now has an older timestamp. In that case, the previously
> issued tombstone could still shadow the v2 column, which is unintended.
> That is why I was asking if we are going to introduce a new kind of
> tombstones. I’m not sure if this is the only edge case—there may be other
> issues as well. I’m also unsure whether we should redesign the tombstone
> handling for MVs, since that would involve changes to the storage engine.
> To minimize impact there, the original proposal was to rebuild the affected
> ranges using anti-compaction, just to be safe.
>
> On Wed, Jun 11, 2025 at 1:20 AM Blake Eggleston 
> wrote:
>
>
>  Extra row in MV (assuming the tombstone is gone in the base table) — how
> should we fix this?
>
>
>
> This would mean that the base table had either updated or deleted a row
> and the view didn't receive the corresponding delete.
>
> In the case of a missed update, we'll have a new value and we can send a
> tombstone to the view with the timestamp of the most recent update. Since
> timestamps issued by paxos and accord writes are always increasing
> monotonically and don't have collisions, this is safe.
>
> In the case of a row deletion, we'd also want to send a tombstone with the
> same timestamp, however since tombstones can be purged, we may not have
> that information and would have to treat it like the view has a higher
> timestamp than the base table.
>
> Inconsistency (timestamps don’t match) — it’s easy to fix when the base
> table has higher timestamps, but how do we resolve it when the MV columns
> have higher timestamps?
>
>
> There are 2 ways this could happen. First is that a write failed and paxos
> repair hasn't completed it, which is expected, and the second is a
> replication bug or base table data loss. You'd need to compare the view
> timestamp to the paxos repair history to tell which it is. If the view
> timestamp is higher than the most recent paxos repair timestamp for the
> key, then it may just be a failed write and we should do nothing. If the
> view timestamp is less than the most recent paxos repair timestamp for that
> key and higher than the base timestamp, then something has gone wrong and
> we should issue a tombstone using the paxos repair timestamp as the
> tombstone timestamp. 
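For what it's worth, the decision rule quoted above can be sketched as a small function. This is only a summary of the rule as stated; the lookup of the most recent paxos repair timestamp for a key is assumed to exist elsewhere:

```python
from typing import Optional

def resolve_view_ahead(view_ts: int, base_ts: int,
                       paxos_repair_ts: int) -> Optional[int]:
    """Decide how to handle a view cell whose timestamp exceeds the base's.

    Returns the timestamp to use for a repair tombstone, or None if we
    should do nothing. paxos_repair_ts is the most recent paxos repair
    timestamp for the key (assumed available from repair history).
    """
    if view_ts > paxos_repair_ts:
        # Possibly a failed write that paxos repair will complete later:
        # expected, so do nothing.
        return None
    if base_ts < view_ts <= paxos_repair_ts:
        # Something has gone wrong (replication bug or base data loss).
        # Issue a tombstone at the paxos repair timestamp; repair timestamps
        # act as a low bound for ballots, so no legitimate write is shadowed.
        return paxos_repair_ts
    return None  # base_ts >= view_ts: the base wins through normal merging
```

For example, with a paxos repair timestamp of 100 and a base timestamp of 50, a view timestamp of 150 is left alone (possibly in-flight), while a view timestamp of 80 gets a tombstone at 100.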

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-11 Thread Blake Eggleston
That’s a good point, although as described I don’t think that could ever work 
properly, even in normal operation. Either we’re misunderstanding something, or 
this is a flaw in the current MV design. 

Assuming changing the view indexed column results in a tombstone being applied 
to the view row for the previous value, if we wrote the other base columns (the 
non view indexed ones) to the view with the same timestamps they have on the 
base, then changing the view indexed value from ‘a’ to ‘b’, then back to ‘a’ 
would always cause this problem. I think you’d need to always update the column 
timestamps on the view to be >= the view column timestamp on the base

On Tue, Jun 10, 2025, at 11:38 PM, Runtian Liu wrote:
> > In the case of a missed update, we'll have a new value and we can send a 
> > tombstone to the view with the timestamp of the most recent update.
> 
> > then something has gone wrong and we should issue a tombstone using the 
> > paxos repair timestamp as the tombstone timestamp.
> 
> The current MV implementation uses “strict liveness” to determine whether a 
> row is live. I believe that using regular tombstones during repair could 
> cause problems. For example, consider a base table with schema (pk, ck, v1, 
> v2) and a materialized view with schema (v1, pk, ck) -> v2. If, for some 
> reason, we detect an extra row in the MV and delete it using a tombstone with 
> the latest update timestamp, we may run into issues. Suppose we later update 
> the base table’s v1 field to match the MV row we previously deleted, and the 
> v2 value now has an older timestamp. In that case, the previously issued 
> tombstone could still shadow the v2 column, which is unintended. That is why 
> I was asking if we are going to introduce a new kind of tombstones. I’m not 
> sure if this is the only edge case—there may be other issues as well. I’m 
> also unsure whether we should redesign the tombstone handling for MVs, since 
> that would involve changes to the storage engine. To minimize impact there, 
> the original proposal was to rebuild the affected ranges using 
> anti-compaction, just to be safe.
> 
> 
> On Wed, Jun 11, 2025 at 1:20 AM Blake Eggleston  wrote:
>>>  Extra row in MV (assuming the tombstone is gone in the base table) — how 
>>> should we fix this?
>>> 
>>> 
>>> 
>> 
>> This would mean that the base table had either updated or deleted a row and 
>> the view didn't receive the corresponding delete.
>> 
>> In the case of a missed update, we'll have a new value and we can send a 
>> tombstone to the view with the timestamp of the most recent update. Since 
>> timestamps issued by paxos and accord writes are always increasing 
>> monotonically and don't have collisions, this is safe.
>> 
>> In the case of a row deletion, we'd also want to send a tombstone with the 
>> same timestamp, however since tombstones can be purged, we may not have that 
>> information and would have to treat it like the view has a higher timestamp 
>> than the base table.
>> 
>>> Inconsistency (timestamps don’t match) — it’s easy to fix when the base 
>>> table has higher timestamps, but how do we resolve it when the MV columns 
>>> have higher timestamps?
>>> 
>> 
>> There are 2 ways this could happen. First is that a write failed and paxos 
>> repair hasn't completed it, which is expected, and the second is a 
>> replication bug or base table data loss. You'd need to compare the view 
>> timestamp to the paxos repair history to tell which it is. If the view 
>> timestamp is higher than the most recent paxos repair timestamp for the key, 
>> then it may just be a failed write and we should do nothing. If the view 
>> timestamp is less than the most recent paxos repair timestamp for that key 
>> and higher than the base timestamp, then something has gone wrong and we 
>> should issue a tombstone using the paxos repair timestamp as the tombstone 
>> timestamp. This is safe to do because the paxos repair timestamps act as a 
>> low bound for ballots paxos will process, so it wouldn't be possible for a 
>> legitimate write to be shadowed by this tombstone.
>> 
>>> Do we need to introduce a new kind of tombstone to shadow the rows in the 
>>> MV for cases 2 and 3? If yes, how will this tombstone work? If no, how 
>>> should we fix the MV data?
>>> 
>> 
>> No, a normal tombstone would work.
>> 
>> On Tue, Jun 10, 2025, at 2:42 AM, Runtian Liu wrote:
>>> Okay, let’s put the efficiency discussion on hold for now. I want to make 
>>> sure the actual repair process after detecting inconsistencies will work 
>>> with the index-based solution.
>>> 
>>> When a mismatch is detected, the MV replica will need to stream its index 
>>> file to the base table replica. The base table will then perform a 
>>> comparison between the two files.
>>> 
>>> There are three cases we need to handle:
>>> 
>>>  1. Missing row in MV — this is straightforward; we can propagate the data 
>>> to the MV.
>>> 
>>>  2. Extra row in MV (assuming the tombstone is gone in the base table) — 
>>> how should we fix this?

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-11 Thread Runtian Liu
> In the case of a missed update, we'll have a new value and we can send a
tombstone to the view with the timestamp of the most recent update.

> then something has gone wrong and we should issue a tombstone using the
paxos repair timestamp as the tombstone timestamp.

The current MV implementation uses “strict liveness” to determine whether a
row is live. I believe that using regular tombstones during repair could
cause problems. For example, consider a base table with schema (pk, ck, v1,
v2) and a materialized view with schema (v1, pk, ck) -> v2. If, for some
reason, we detect an extra row in the MV and delete it using a tombstone
with the latest update timestamp, we may run into issues. Suppose we later
update the base table’s v1 field to match the MV row we previously deleted,
and the v2 value now has an older timestamp. In that case, the previously
issued tombstone could still shadow the v2 column, which is unintended.
That is why I was asking if we are going to introduce a new kind of
tombstones. I’m not sure if this is the only edge case—there may be other
issues as well. I’m also unsure whether we should redesign the tombstone
handling for MVs, since that would involve changes to the storage engine.
To minimize impact there, the original proposal was to rebuild the affected
ranges using anti-compaction, just to be safe.
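The edge case above can be demonstrated with a toy last-write-wins merge. This is a deliberate simplification of Cassandra's reconciliation, just to show why a regular tombstone issued at the latest update timestamp is risky:

```python
def merge_cell(cell, tombstone_ts):
    """Toy last-write-wins reconciliation: a cell is visible only if its
    timestamp is strictly greater than the covering tombstone's timestamp."""
    value, ts = cell
    return value if ts > tombstone_ts else None

# Repair deletes the extra MV row with a regular tombstone at the latest
# update timestamp it has seen (here, 100).
repair_tombstone_ts = 100

# Later the base table's v1 is set back to the deleted row's value, so the
# view row is re-created -- but its v2 cell carries an older timestamp (90).
v2 = merge_cell(('v2-value', 90), repair_tombstone_ts)

# The earlier repair tombstone still shadows v2 (v2 is None): the unintended
# outcome described above.
assert v2 is None
```

Only a cell with a timestamp above 100 would survive the merge, even though the re-created row is a perfectly legitimate write.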

On Wed, Jun 11, 2025 at 1:20 AM Blake Eggleston 
wrote:

>  Extra row in MV (assuming the tombstone is gone in the base table) — how
> should we fix this?
>
>
>
> This would mean that the base table had either updated or deleted a row
> and the view didn't receive the corresponding delete.
>
> In the case of a missed update, we'll have a new value and we can send a
> tombstone to the view with the timestamp of the most recent update. Since
> timestamps issued by paxos and accord writes are always increasing
> monotonically and don't have collisions, this is safe.
>
> In the case of a row deletion, we'd also want to send a tombstone with the
> same timestamp, however since tombstones can be purged, we may not have
> that information and would have to treat it like the view has a higher
> timestamp than the base table.
>
> Inconsistency (timestamps don’t match) — it’s easy to fix when the base
> table has higher timestamps, but how do we resolve it when the MV columns
> have higher timestamps?
>
>
> There are 2 ways this could happen. First is that a write failed and paxos
> repair hasn't completed it, which is expected, and the second is a
> replication bug or base table data loss. You'd need to compare the view
> timestamp to the paxos repair history to tell which it is. If the view
> timestamp is higher than the most recent paxos repair timestamp for the
> key, then it may just be a failed write and we should do nothing. If the
> view timestamp is less than the most recent paxos repair timestamp for that
> key and higher than the base timestamp, then something has gone wrong and
> we should issue a tombstone using the paxos repair timestamp as the
> tombstone timestamp. This is safe to do because the paxos repair timestamps
> act as a low bound for ballots paxos will process, so it wouldn't be
> possible for a legitimate write to be shadowed by this tombstone.
>
> Do we need to introduce a new kind of tombstone to shadow the rows in the
> MV for cases 2 and 3? If yes, how will this tombstone work? If no, how
> should we fix the MV data?
>
>
> No, a normal tombstone would work.
>
> On Tue, Jun 10, 2025, at 2:42 AM, Runtian Liu wrote:
>
> Okay, let’s put the efficiency discussion on hold for now. I want to make
> sure the actual repair process after detecting inconsistencies will work
> with the index-based solution.
>
> When a mismatch is detected, the MV replica will need to stream its index
> file to the base table replica. The base table will then perform a
> comparison between the two files.
>
> There are three cases we need to handle:
>
>1.
>
>Missing row in MV — this is straightforward; we can propagate the data
>to the MV.
>2.
>
>Extra row in MV (assuming the tombstone is gone in the base table) —
>how should we fix this?
>3.
>
>Inconsistency (timestamps don’t match) — it’s easy to fix when the
>base table has higher timestamps, but how do we resolve it when the MV
>columns have higher timestamps?
>
> Do we need to introduce a new kind of tombstone to shadow the rows in the
> MV for cases 2 and 3? If yes, how will this tombstone work? If no, how
> should we fix the MV data?
>
> On Mon, Jun 9, 2025 at 11:00 AM Blake Eggleston 
> wrote:
>
>
> > hopefully we can come up with a solution that everyone agrees on.
>
> I’m sure we can, I think we’ve been making good progress
>
> > My main concern with the index-based solution is the overhead it adds to
> the hot path, as well as having to build indexes periodically.
>
> So the additional overhead of maintaining a storage attached index on the
> client write path is pretty minimal - it’s basica

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-10 Thread Blake Eggleston
>  Extra row in MV (assuming the tombstone is gone in the base table) — how 
> should we fix this?
> 
> 
> 

This would mean that the base table had either updated or deleted a row and the 
view didn't receive the corresponding delete. 

In the case of a missed update, we'll have a new value and we can send a 
tombstone to the view with the timestamp of the most recent update. Since 
timestamps issued by paxos and accord writes are always increasing 
monotonically and don't have collisions, this is safe. 

In the case of a row deletion, we'd also want to send a tombstone with the same 
timestamp, however since tombstones can be purged, we may not have that 
information and would have to treat it like the view has a higher timestamp 
than the base table.

> Inconsistency (timestamps don’t match) — it’s easy to fix when the base table 
> has higher timestamps, but how do we resolve it when the MV columns have 
> higher timestamps?
> 

There are 2 ways this could happen. First is that a write failed and paxos 
repair hasn't completed it, which is expected, and the second is a replication 
bug or base table data loss. You'd need to compare the view timestamp to the 
paxos repair history to tell which it is. If the view timestamp is higher than 
the most recent paxos repair timestamp for the key, then it may just be a 
failed write and we should do nothing. If the view timestamp is less than the 
most recent paxos repair timestamp for that key and higher than the base 
timestamp, then something has gone wrong and we should issue a tombstone using 
the paxos repair timestamp as the tombstone timestamp. This is safe to do 
because the paxos repair timestamps act as a lower bound for ballots paxos will 
process, so it wouldn't be possible for a legitimate write to be shadowed by 
this tombstone.

> Do we need to introduce a new kind of tombstone to shadow the rows in the MV 
> for cases 2 and 3? If yes, how will this tombstone work? If no, how should we 
> fix the MV data?
> 

No, a normal tombstone would work.
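[Editor's note] The three answers above amount to a small decision table. A minimal, hedged sketch of that table in executable form — all names are hypothetical, timestamps are plain integers, and this is not Cassandra code:

```python
# Hedged sketch of the per-row reconciliation rules described in this thread.
# `None` marks a row (or purged tombstone) that is absent on that side.

def reconcile_view_row(base_ts, view_ts, last_paxos_repair_ts):
    """Decide how the base replica fixes one mismatching view row.

    Returns ("propagate", ts)  -- send the base state (data or tombstone) at ts
            ("noop", None)     -- likely an incomplete write; leave it alone
            ("tombstone", ts)  -- shadow the view row at the paxos repair ts
    """
    if view_ts is None:
        # Case 1: row missing in the MV -- propagate the base row.
        return ("propagate", base_ts)
    if base_ts is None:
        # Case 2 with a purged base tombstone: the delete's timestamp is
        # unknown, so treat the view as having the higher timestamp.
        base_ts = -1
    if base_ts >= view_ts:
        # Base wins; for a missed update this sends a tombstone/new value
        # carrying the timestamp of the most recent base update.
        return ("propagate", base_ts)
    # Case 3: view timestamp is higher than the base timestamp.
    if view_ts > last_paxos_repair_ts:
        # May just be a failed write paxos repair hasn't completed; do nothing.
        return ("noop", None)
    # View ts is below the latest paxos repair ts yet above the base ts:
    # something went wrong. Issue a tombstone at the paxos repair timestamp;
    # safe because that timestamp is a lower bound for ballots paxos accepts.
    return ("tombstone", last_paxos_repair_ts)
```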

On Tue, Jun 10, 2025, at 2:42 AM, Runtian Liu wrote:
> Okay, let’s put the efficiency discussion on hold for now. I want to make 
> sure the actual repair process after detecting inconsistencies will work with 
> the index-based solution.
> 
> When a mismatch is detected, the MV replica will need to stream its index 
> file to the base table replica. The base table will then perform a comparison 
> between the two files.
> 
> There are three cases we need to handle:
> 
>  1. Missing row in MV — this is straightforward; we can propagate the data to 
> the MV.
> 
>  2. Extra row in MV (assuming the tombstone is gone in the base table) — how 
> should we fix this?
> 
>  3. Inconsistency (timestamps don’t match) — it’s easy to fix when the base 
> table has higher timestamps, but how do we resolve it when the MV columns 
> have higher timestamps?
> 
> Do we need to introduce a new kind of tombstone to shadow the rows in the MV 
> for cases 2 and 3? If yes, how will this tombstone work? If no, how should we 
> fix the MV data?
> 
> 
> On Mon, Jun 9, 2025 at 11:00 AM Blake Eggleston  wrote:
>> __
>> > hopefully we can come up with a solution that everyone agrees on.
>> 
>> I’m sure we can, I think we’ve been making good progress
>> 
>> > My main concern with the index-based solution is the overhead it adds to 
>> > the hot path, as well as having to build indexes periodically.
>> 
>> So the additional overhead of maintaining a storage attached index on the 
>> client write path is pretty minimal - it’s basically adding data to an in 
>> memory trie. It’s a little extra work and memory usage, but there isn’t any 
>> extra io or other blocking associated with it. I’d expect the latency impact 
>> to be negligible.
>> 
>> > As mentioned earlier, this MV repair should be an infrequent operation
>> 
>> I don’t think that’s a safe assumption. There are a lot of situations outside 
>> of data loss bugs where repair would need to be run. 
>> 
>> These use cases could probably be handled by repairing the view with other 
>> view replicas:
>> 
>> Scrubbing corrupt sstables
>> Node replacement via backup
>> 
>> These use cases would need an actual MV repair to check consistency with the 
>> base table:
>> 
>> Restoring a cluster from a backup
>> Imported sstables via nodetool import
>> Data loss from operator error
>> Proactive consistency checks - ie preview repairs
>> 
>> Even if it is an infrequent operation, when operators need it, it needs to 
>> be available and reliable.
>> 
>> It’s a fact that there are clusters where non-incremental repairs are run on 
>> a cadence of a week or more to manage the overhead of validation 
>> compactions. Assuming the cluster doesn’t have any additional headroom, that 
>> would mean that any one of the above events could cause views to remain out 
>> of sync for up to a week while the full set of merkle trees is being built.
>> 
>> This delay eliminates a lot of the value of repair as a

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-10 Thread Runtian Liu
Okay, let’s put the efficiency discussion on hold for now. I want to make
sure the actual repair process after detecting inconsistencies will work
with the index-based solution.

When a mismatch is detected, the MV replica will need to stream its index
file to the base table replica. The base table will then perform a
comparison between the two files.

There are three cases we need to handle:

   1.

   Missing row in MV — this is straightforward; we can propagate the data
   to the MV.

   2.

   Extra row in MV (assuming the tombstone is gone in the base table) — how
   should we fix this?

   3.

   Inconsistency (timestamps don’t match) — it’s easy to fix when the base
   table has higher timestamps, but how do we resolve it when the MV columns
   have higher timestamps?

Do we need to introduce a new kind of tombstone to shadow the rows in the
MV for cases 2 and 3? If yes, how will this tombstone work? If no, how
should we fix the MV data?

On Mon, Jun 9, 2025 at 11:00 AM Blake Eggleston 
wrote:

> > hopefully we can come up with a solution that everyone agrees on.
>
> I’m sure we can, I think we’ve been making good progress
>
> > My main concern with the index-based solution is the overhead it adds to
> the hot path, as well as having to build indexes periodically.
>
> So the additional overhead of maintaining a storage attached index on the
> client write path is pretty minimal - it’s basically adding data to an in
> memory trie. It’s a little extra work and memory usage, but there isn’t any
> extra io or other blocking associated with it. I’d expect the latency
> impact to be negligible.
>
> > As mentioned earlier, this MV repair should be an infrequent operation
>
> I don’t think that’s a safe assumption. There are a lot of situations
> outside of data loss bugs where repair would need to be run.
>
> These use cases could probably be handled by repairing the view with other
> view replicas:
>
> Scrubbing corrupt sstables
> Node replacement via backup
>
> These use cases would need an actual MV repair to check consistency with
> the base table:
>
> Restoring a cluster from a backup
> Imported sstables via nodetool import
> Data loss from operator error
> Proactive consistency checks - ie preview repairs
>
> Even if it is an infrequent operation, when operators need it, it needs to
> be available and reliable.
>
> It’s a fact that there are clusters where non-incremental repairs are run
> on a cadence of a week or more to manage the overhead of validation
> compactions. Assuming the cluster doesn’t have any additional headroom,
> that would mean that any one of the above events could cause views to
> remain out of sync for up to a week while the full set of merkle trees is
> being built.
>
> This delay eliminates a lot of the value of repair as a risk mitigation
> tool. If I had to make a recommendation where a bad call could cost me my
> job, the prospect of a 7 day delay on repair would mean a strong no.
>
> Some users also run preview repair continuously to detect data consistency
> errors, so at least a subset of users will probably be running MV repairs
> continuously - at least in preview mode.
>
> That’s why I say that the replication path should be designed to never
> need repair, and MV repair should be designed to be prepared for the worst.
>
> > I’m wondering if it’s possible to enable or disable index building
> dynamically so that we don’t always incur the cost for something that’s
> rarely needed.
>
> I think this would be a really reasonable compromise as long as the
> default is on. That way it’s as safe as possible by default, but users who
> don’t care or have a separate system for repairing MVs can opt out.
>
> > I’m not sure what you mean by “data problems” here.
>
> I mean out of sync views - either due to bugs, operator error, corruption,
> etc
>
> > Also, this does scale with cluster size—I’ve compared it to full repair,
> and this MV repair should behave similarly. That means as long as full
> repair works, this repair should work as well.
>
> You could build the merkle trees at about the same cost as a full repair,
> but the actual data repair path is completely different for MV, and that’s
> the part that doesn’t scale well. As you know, with normal repair, we just
> stream data for ranges detected as out of sync. For MVs, since the data
> isn’t in base partition order, the view data for an out of sync view range
> needs to be read out and streamed to every base replica that it’s detected
> a mismatch against. So in the example I gave with the 300 node cluster,
> you’re looking at reading and transmitting the same partition at least 100
> times in the best case, and the cost of this keeps going up as the cluster
> increases in size. That's the part that doesn't scale well.
>
> This is also one of the benefits of the index design. Since it stores data in
> segments that roughly correspond to points on the grid, you’re not
> rereading the same data over and over. A repair for a g

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-08 Thread Blake Eggleston
> hopefully we can come up with a solution that everyone agrees on.

I’m sure we can, I think we’ve been making good progress

> My main concern with the index-based solution is the overhead it adds to the 
> hot path, as well as having to build indexes periodically.

So the additional overhead of maintaining a storage attached index on the 
client write path is pretty minimal - it’s basically adding data to an in 
memory trie. It’s a little extra work and memory usage, but there isn’t any 
extra io or other blocking associated with it. I’d expect the latency impact to 
be negligible.
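[Editor's note] A toy illustration of the claim above — the write-path cost is essentially an in-memory trie insert, with no extra IO. The names here are illustrative only; this is not SAI code:

```python
# Toy model: indexing one term on the write path is pointer chasing plus a
# small allocation in an in-memory trie -- no disk access, no blocking.

class TrieNode:
    __slots__ = ("children", "postings")

    def __init__(self):
        self.children = {}   # next character -> child node
        self.postings = []   # row ids whose indexed term ends here

def index_insert(root, term, row_id):
    """Record one indexed term -> row id pair; purely in-memory work."""
    node = root
    for ch in term:
        node = node.children.setdefault(ch, TrieNode())
    node.postings.append(row_id)

root = TrieNode()
index_insert(root, "abc", 1)
index_insert(root, "abd", 2)
```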

> As mentioned earlier, this MV repair should be an infrequent operation

I don’t think that’s a safe assumption. There are a lot of situations outside of 
data loss bugs where repair would need to be run. 

These use cases could probably be handled by repairing the view with other view 
replicas:

Scrubbing corrupt sstables
Node replacement via backup

These use cases would need an actual MV repair to check consistency with the 
base table:

Restoring a cluster from a backup
Imported sstables via nodetool import
Data loss from operator error
Proactive consistency checks - ie preview repairs

Even if it is an infrequent operation, when operators need it, it needs to be 
available and reliable.

It’s a fact that there are clusters where non-incremental repairs are run on a 
cadence of a week or more to manage the overhead of validation compactions. 
Assuming the cluster doesn’t have any additional headroom, that would mean that 
any one of the above events could cause views to remain out of sync for up to a 
week while the full set of merkle trees is being built.

This delay eliminates a lot of the value of repair as a risk mitigation tool. 
If I had to make a recommendation where a bad call could cost me my job, the 
prospect of a 7 day delay on repair would mean a strong no.

Some users also run preview repair continuously to detect data consistency 
errors, so at least a subset of users will probably be running MV repairs 
continuously - at least in preview mode.

That’s why I say that the replication path should be designed to never need 
repair, and MV repair should be designed to be prepared for the worst.

> I’m wondering if it’s possible to enable or disable index building 
> dynamically so that we don’t always incur the cost for something that’s 
> rarely needed.

I think this would be a really reasonable compromise as long as the default is 
on. That way it’s as safe as possible by default, but users who don’t care or 
have a separate system for repairing MVs can opt out.

> I’m not sure what you mean by “data problems” here.

I mean out of sync views - either due to bugs, operator error, corruption, etc

> Also, this does scale with cluster size—I’ve compared it to full repair, and 
> this MV repair should behave similarly. That means as long as full repair 
> works, this repair should work as well.

You could build the merkle trees at about the same cost as a full repair, but 
the actual data repair path is completely different for MV, and that’s the part 
that doesn’t scale well. As you know, with normal repair, we just stream data 
for ranges detected as out of sync. For MVs, since the data isn’t in base 
partition order, the view data for an out of sync view range needs to be read 
out and streamed to every base replica that it’s detected a mismatch against. 
So in the example I gave with the 300 node cluster, you’re looking at reading 
and transmitting the same partition at least 100 times in the best case, and 
the cost of this keeps going up as the cluster increases in size. That's the 
part that doesn't scale well. 

This is also one of the benefits of the index design. Since it stores data in 
segments that roughly correspond to points on the grid, you’re not rereading 
the same data over and over. A repair for a given grid point only reads an 
amount of data proportional to the data in common for the base/view grid point, 
and it’s stored in a small enough granularity that the base can calculate what 
data needs to be sent to the view without having to read the entire view 
partition.
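[Editor's note] The scaling argument above can be put as back-of-envelope arithmetic. Assumed numbers come from the example in the thread (300 nodes at RF 3, so ~100 distinct base token ranges per view partition); the function names are illustrative, not Cassandra APIs:

```python
# Snapshot repair re-reads the same view partition once per mismatching base
# range, so its cost grows linearly with cluster size. The index/grid design
# reads each segment once, proportional only to the data in common.

def base_ranges(num_nodes, rf):
    # Approximate count of distinct base token ranges in the cluster.
    return num_nodes // rf

def snapshot_view_reads(num_nodes, rf):
    # One read/stream of the view partition per mismatching base range.
    return base_ranges(num_nodes, rf)

def index_view_reads(segments_in_common):
    # Each grid segment is read once, independent of cluster size.
    return segments_in_common

assert snapshot_view_reads(300, 3) == 100
assert snapshot_view_reads(600, 3) == 200  # cost doubles as the cluster doubles
```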

On Sat, Jun 7, 2025, at 7:42 PM, Runtian Liu wrote:
> Thanks, Blake. I’m open to iterating on the design, and hopefully we can come 
> up with a solution that everyone agrees on.
> 
> My main concern with the index-based solution is the overhead it adds to the 
> hot path, as well as having to build indexes periodically. As mentioned 
> earlier, this MV repair should be an infrequent operation, but the 
> index-based approach shifts some of the work to the hot path in order to 
> allow repairs that touch only a few nodes.
> 
> I’m wondering if it’s possible to enable or disable index building 
> dynamically so that we don’t always incur the cost for something that’s 
> rarely needed.
> 
> > it degrades operators ability to react to data problems by imposing a 
> > significant upfront processing burden on repair, and that it doe

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-07 Thread Runtian Liu
Thanks, Blake. I’m open to iterating on the design, and hopefully we can
come up with a solution that everyone agrees on.

My main concern with the index-based solution is the overhead it adds to
the hot path, as well as having to build indexes periodically. As mentioned
earlier, this MV repair should be an infrequent operation, but the
index-based approach shifts some of the work to the hot path in order to
allow repairs that touch only a few nodes.

I’m wondering if it’s possible to enable or disable index building
dynamically so that we don’t always incur the cost for something that’s
rarely needed.

> it degrades operators ability to react to data problems by imposing a
significant upfront processing burden on repair, and that it doesn’t scale
well with cluster size

I’m not sure what you mean by “data problems” here. Also, this does scale
with cluster size—I’ve compared it to full repair, and this MV repair
should behave similarly. That means as long as full repair works, this
repair should work as well.

For example, regardless of how large the cluster is, you can always enable
Merkle tree building on 10% of the nodes at a time until all the trees are
ready.
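[Editor's note] A minimal sketch of the rolling schedule described above — build Merkle trees on 10% of the nodes at a time until every node is done. The scheduling helper is hypothetical, not an existing Cassandra or CEP-37 API:

```python
# Partition the cluster into successive batches of roughly `percent` of the
# nodes; tree building would be enabled on one batch at a time.

def rolling_batches(nodes, percent=10):
    """Yield successive batches of nodes, `percent` of the cluster at a time."""
    size = max(1, len(nodes) * percent // 100)
    for i in range(0, len(nodes), size):
        yield nodes[i:i + size]

nodes = [f"node{i}" for i in range(300)]
batches = list(rolling_batches(nodes))  # 10 batches of 30 nodes each
```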

I understand that coordinating this type of repair is harder than what we
currently support, but with CEP-37, we should be able to handle this
coordination without adding too much burden on the operator side.


On Sat, Jun 7, 2025 at 8:28 AM Blake Eggleston  wrote:

> I don't see any outcome here that is good for the community though. Either
> Runtian caves and adopts your design that he (and I) consider inferior, or
> he is prevented from contributing this work.
>
>
> Hey Runtian, fwiw, these aren't the only 2 options. This isn’t a
> competition. We can collaborate and figure out the best approach to the
> problem. I’d like to keep discussing it if you’re open to iterating on the
> design.
>
> I’m not married to our proposal, it’s just the cleanest way we could think
> of to address what Jon and I both see as blockers in the current proposal.
> It’s not set in stone though.
>
> On Fri, Jun 6, 2025, at 1:32 PM, Benedict Elliott Smith wrote:
>
> Hmm, I am very surprised as I helped write that and I distinctly recall a
> specific goal was avoiding binding vetoes as they're so toxic.
>
> Ok, I guess you can block this work if you like.
>
> I don't see any outcome here that is good for the community though. Either
> Runtian caves and adopts your design that he (and I) consider inferior, or
> he is prevented from contributing this work. That isn't a functioning
> community in my mind, so I'll be noping out for a while, as I don't see
> much value here right now.
>
>
> On 2025/06/06 18:31:08 Blake Eggleston wrote:
> > Hi Benedict, that’s actually not true.
> >
> > Here’s a link to the project governance page: _https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Project+Governance_
> >
> > The CEP section says:
> >
> > “*Once the proposal is finalized and any major committer dissent
> reconciled, call a [VOTE] on the ML to have the proposal adopted. The
> criteria for acceptance is consensus (3 binding +1 votes and no binding
> vetoes). The vote should remain open for 72 hours.*”
> >
> > So they’re definitely vetoable.
> >
> > Also note the part about “*Once the proposal is finalized and any major
> committer dissent reconciled,*” being a prerequisite for moving a CEP to
> [VOTE]. Given the as yet unreconciled committer dissent, it wouldn’t even
> be appropriate to move to a VOTE until we get to the bottom of this repair
> discussion.
> >
> > On Fri, Jun 6, 2025, at 12:31 AM, Benedict Elliott Smith wrote:
> > > > but the snapshot repair design is not a viable path forward. It’s
> the first iteration of a repair design. We’ve proposed a second iteration,
> and we’re open to a third iteration.
> > >
> > > I shan't be participating further in discussion, but I want to make a
> point of order. The CEP process has no vetoes, so you are not empowered to
> declare that a design is not viable without the input of the wider
> community.
> > >
> > >
> > > On 2025/06/05 03:58:59 Blake Eggleston wrote:
> > > > You can detect and fix the mismatch in a single round of repair, but
> the amount of work needed to do it is _significantly_ higher with snapshot
> repair. Consider a case where we have a 300 node cluster w/ RF 3, where
> each view partition contains entries mapping to every token range in the
> cluster - so 100 ranges. If we lose a view sstable, it will affect an
> entire row/column of the grid. Repair is going to scan all data in the
> mismatching view token ranges 100 times, and each base range once. So
> you’re looking at 200 range scans.
> > > >
> > > > Now, you may argue that you can merge the duplicate view scans into
> a single scan while you repair all token ranges in parallel. I’m skeptical
> that’s going to be achievable in practice, but even if it is, we’re now
> talking about the view replica hypothetically doing a pairwise repair

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-06 Thread Blake Eggleston
> I don't see any outcome here that is good for the community though. Either 
> Runtian caves and adopts your design that he (and I) consider inferior, or he 
> is prevented from contributing this work.

Hey Runtian, fwiw, these aren't the only 2 options. This isn’t a competition. 
We can collaborate and figure out the best approach to the problem. I’d like to 
keep discussing it if you’re open to iterating on the design.

I’m not married to our proposal, it’s just the cleanest way we could think of 
to address what Jon and I both see as blockers in the current proposal. It’s 
not set in stone though.

On Fri, Jun 6, 2025, at 1:32 PM, Benedict Elliott Smith wrote:
> Hmm, I am very surprised as I helped write that and I distinctly recall a 
> specific goal was avoiding binding vetoes as they're so toxic.
> 
> Ok, I guess you can block this work if you like.
> 
> I don't see any outcome here that is good for the community though. Either 
> Runtian caves and adopts your design that he (and I) consider inferior, or he 
> is prevented from contributing this work. That isn't a functioning community 
> in my mind, so I'll be noping out for a while, as I don't see much value here 
> right now.
> 
> 
> On 2025/06/06 18:31:08 Blake Eggleston wrote:
> > Hi Benedict, that’s actually not true. 
> > 
> > Here’s a link to the project governance page: 
> > _https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Project+Governance_
> > 
> > The CEP section says:
> > 
> > “*Once the proposal is finalized and any major committer dissent 
> > reconciled, call a [VOTE] on the ML to have the proposal adopted. The 
> > criteria for acceptance is consensus (3 binding +1 votes and no binding 
> > vetoes). The vote should remain open for 72 hours.*”
> > 
> > So they’re definitely vetoable. 
> > 
> > Also note the part about “*Once the proposal is finalized and any major 
> > committer dissent reconciled,*” being a prerequisite for moving a CEP to 
> > [VOTE]. Given the as yet unreconciled committer dissent, it wouldn’t even 
> > be appropriate to move to a VOTE until we get to the bottom of this repair 
> > discussion.
> > 
> > On Fri, Jun 6, 2025, at 12:31 AM, Benedict Elliott Smith wrote:
> > > > but the snapshot repair design is not a viable path forward. It’s the 
> > > > first iteration of a repair design. We’ve proposed a second iteration, 
> > > > and we’re open to a third iteration.
> > > 
> > > I shan't be participating further in discussion, but I want to make a 
> > > point of order. The CEP process has no vetoes, so you are not empowered 
> > > to declare that a design is not viable without the input of the wider 
> > > community.
> > > 
> > > 
> > > On 2025/06/05 03:58:59 Blake Eggleston wrote:
> > > > You can detect and fix the mismatch in a single round of repair, but 
> > > > the amount of work needed to do it is _significantly_ higher with 
> > > > snapshot repair. Consider a case where we have a 300 node cluster w/ RF 
> > > > 3, where each view partition contains entries mapping to every token 
> > > > range in the cluster - so 100 ranges. If we lose a view sstable, it 
> > > > will affect an entire row/column of the grid. Repair is going to scan 
> > > > all data in the mismatching view token ranges 100 times, and each base 
> > > > range once. So you’re looking at 200 range scans.
> > > > 
> > > > Now, you may argue that you can merge the duplicate view scans into a 
> > > > single scan while you repair all token ranges in parallel. I’m 
> > > > skeptical that’s going to be achievable in practice, but even if it is, 
> > > > we’re now talking about the view replica hypothetically doing a 
> > > > pairwise repair with every other replica in the cluster at the same 
> > > > time. Neither of these options is workable.
> > > > 
> > > > Let’s take a step back though, because I think we’re getting lost in 
> > > > the weeds.
> > > > 
> > > > The repair design in the CEP has some high level concepts that make a 
> > > > lot of sense, the idea of repairing a grid is really smart. However, it 
> > > > has some significant drawbacks that remain unaddressed. I want this CEP 
> > > > to succeed, and I know Jon does too, but the snapshot repair design is 
> > > > not a viable path forward. It’s the first iteration of a repair design. 
> > > > We’ve proposed a second iteration, and we’re open to a third iteration. 
> > > > This part of the CEP process is meant to identify and address 
> > > > shortcomings, I don’t think that continuing to dissect the snapshot 
> > > > repair design is making progress in that direction.
> > > > 
> > > > On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote:
> > > > > >  We potentially have to do it several times on each node, depending 
> > > > > > on the size of the range. Smaller ranges increase the size of the 
> > > > > > board exponentially, larger ranges increase the number of SSTables 
> > > > > > that would be involved in each compaction.
> > > > > As described in the CEP exampl

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-06 Thread Benedict Elliott Smith
Hmm, I am very surprised as I helped write that and I distinctly recall a 
specific goal was avoiding binding vetoes as they're so toxic.

Ok, I guess you can block this work if you like.

I don't see any outcome here that is good for the community though. Either 
Runtian caves and adopts your design that he (and I) consider inferior, or he 
is prevented from contributing this work. That isn't a functioning community in 
my mind, so I'll be noping out for a while, as I don't see much value here 
right now.


On 2025/06/06 18:31:08 Blake Eggleston wrote:
> Hi Benedict, that’s actually not true. 
> 
> Here’s a link to the project governance page: 
> _https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Project+Governance_
> 
> The CEP section says:
> 
> “*Once the proposal is finalized and any major committer dissent reconciled, 
> call a [VOTE] on the ML to have the proposal adopted. The criteria for 
> acceptance is consensus (3 binding +1 votes and no binding vetoes). The vote 
> should remain open for 72 hours.*”
> 
> So they’re definitely vetoable. 
> 
> Also note the part about “*Once the proposal is finalized and any major 
> committer dissent reconciled,*” being a prerequisite for moving a CEP to 
> [VOTE]. Given the as yet unreconciled committer dissent, it wouldn’t even be 
> appropriate to move to a VOTE until we get to the bottom of this repair 
> discussion.
> 
> On Fri, Jun 6, 2025, at 12:31 AM, Benedict Elliott Smith wrote:
> > > but the snapshot repair design is not a viable path forward. It’s the 
> > > first iteration of a repair design. We’ve proposed a second iteration, 
> > > and we’re open to a third iteration.
> > 
> > I shan't be participating further in discussion, but I want to make a point 
> > of order. The CEP process has no vetoes, so you are not empowered to 
> > declare that a design is not viable without the input of the wider 
> > community.
> > 
> > 
> > On 2025/06/05 03:58:59 Blake Eggleston wrote:
> > > You can detect and fix the mismatch in a single round of repair, but the 
> > > amount of work needed to do it is _significantly_ higher with snapshot 
> > > repair. Consider a case where we have a 300 node cluster w/ RF 3, where 
> > > each view partition contains entries mapping to every token range in the 
> > > cluster - so 100 ranges. If we lose a view sstable, it will affect an 
> > > entire row/column of the grid. Repair is going to scan all data in the 
> > > mismatching view token ranges 100 times, and each base range once. So 
> > > you’re looking at 200 range scans.
> > > 
> > > Now, you may argue that you can merge the duplicate view scans into a 
> > > single scan while you repair all token ranges in parallel. I’m skeptical 
> > > that’s going to be achievable in practice, but even if it is, we’re now 
> > > talking about the view replica hypothetically doing a pairwise repair 
> > > with every other replica in the cluster at the same time. Neither of 
> > > these options is workable.
> > > 
> > > Let’s take a step back though, because I think we’re getting lost in the 
> > > weeds.
> > > 
> > > The repair design in the CEP has some high level concepts that make a lot 
> > > of sense, the idea of repairing a grid is really smart. However, it has 
> > > some significant drawbacks that remain unaddressed. I want this CEP to 
> > > succeed, and I know Jon does too, but the snapshot repair design is not a 
> > > viable path forward. It’s the first iteration of a repair design. We’ve 
> > > proposed a second iteration, and we’re open to a third iteration. This 
> > > part of the CEP process is meant to identify and address shortcomings, I 
> > > don’t think that continuing to dissect the snapshot repair design is 
> > > making progress in that direction.
> > > 
> > > On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote:
> > > > >  We potentially have to do it several times on each node, depending 
> > > > > on the size of the range. Smaller ranges increase the size of the 
> > > > > board exponentially, larger ranges increase the number of SSTables 
> > > > > that would be involved in each compaction.
> > > > As described in the CEP example, this can be handled in a single round 
> > > > of repair. We first identify all the points in the grid that require 
> > > > repair, then perform anti-compaction and stream data based on a second 
> > > > scan over those identified points. This applies to the snapshot-based 
> > > > solution—without an index, repairing a single point in that grid 
> > > > requires scanning the entire base table partition (token range). In 
> > > > contrast, with the index-based solution—as in the example you 
> > > > referenced—if a large block of data is corrupted, even though the index 
> > > > is used for comparison, many key mismatches may occur. This can lead to 
> > > > random disk access to the original data files, which could cause 
> > > > performance issues. For the case you mentioned for snapshot based 
> > > > solution, it 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-06 Thread Blake Eggleston
Thanks for the updated table, Runtian. I think it misses the point though. The 
problem with snapshot based repair is that it degrades operators’ ability to 
react to data problems by imposing a significant upfront processing burden on 
repair, and that it doesn’t scale well with cluster size, as I illustrated in 
my last email. These issues are non-starters, and until they’re worked out, 
comparisons are premature: you’re not making an apples-to-apples comparison.

The original MV was a huge embarrassment for the project. It was really a low 
point for the credibility of the C* dev community and our ability to deliver 
features without critical design flaws. I think it’s a useful feature, and I’m 
all for fixing it, but if we’re going to go to the community and tell people 
that MVs are fixed and they can use them now, it needs to be bulletproof.

I feel good about the query execution side of the proposal, clearly a lot of 
thought has gone into designing a really solid consensus based MV system. The 
proposed repair design is not yet at the same level of maturity. Not needing 
repair is the right goal for the query execution design, but the repair design 
needs to be prepared for the worst, and it’s not. 

Repair is where a lot of the original MV use cases completely fell over. It was 
bad enough that views became inconsistent so easily, but the fact that trying 
to repair them could take down clusters was where it really became a disaster. 
Personally, I think shelving MV v2 would be better for the project than moving 
forward with a repair mechanism with these flaws. Users are going to have 
consistency problems, both self-inflicted and from bugs, and we need a 
solid and reliable system for fixing them if we’re going to advertise MVs as 
ready for production.

On Fri, Jun 6, 2025, at 2:03 AM, Runtian Liu wrote:
> Thanks, Blake and Jon, for your feedback—I really appreciate your time on 
> this topic and your efforts to help make this CEP a success. Throughout this 
> discussion, we've explored many interesting problems, and your input helped 
> me better understand how the index-based solution would work. Now that both 
> approaches are clearly understood, here’s a cost comparison based on 
> real-world use cases, as illustrated in the chart below. Please let me know 
> if anything else is missing in the table below; I can help facilitate a 
> comparison between both approaches so we, as an Apache Cassandra community, 
> can make an informed decision.
> 
> Periodic inconsistency detection (full data set)
>   - CPU: index-based is higher (maintaining indexes increases CPU usage in 
>     the hot path and in compaction, requiring higher CPU provisioning); 
>     snapshot-based is lower.
>   - Memory: index-based is higher (maintaining indexes increases memory 
>     usage in the hot path and in compaction); snapshot-based is lower.
>   - Disk: index-based is lower; snapshot-based is higher (snapshots use 
>     extra disk space if SSTables are compacted away, with worst-case usage 
>     reaching 2×).
> 
> Ad hoc inconsistency detection (partial data set)
>   - CPU/Memory/Disk: index-based is supported, at the same cost as above; 
>     snapshot-based is not supported.
> 
> Data repair
>   - CPU: for the index-based solution, the cost depends on the number of 
>     rows to be repaired. If many mismatches are detected, such as a whole 
>     block of data missing in an outage-recovery scenario, the overhead of 
>     random data access can be high; for normal use cases where only a few 
>     rows in a token range need to be fixed, it requires significantly fewer 
>     resources due to its row-level insight into the mismatched rows. The 
>     snapshot-based solution must scan the entire partitions of both the base 
>     table and the MV table even when only a few rows need fixing, and it 
>     additionally requires anti-compaction, which can result in higher CPU 
>     usage; however, if a full token range needs to be rebuilt due to a 
>     hardware failure, it becomes more efficient because it avoids random 
>     disk access.
>   - Disk: index-based exchanges only index files; the data streamed is 
>     exactly the inconsistent data that needs to be repaired. Snapshot-based 
>     over-streams, since whole blocks of data are streamed to the MV node.
> 
> Topology changes (one painful complexity for each)
>   - Index-based: indexes need to be rebuilt from time to time to maintain 
>     their effectiveness.
>   - Snapshot-based: if all replicas are replaced during the detection phase, 
>     repairing 100% of the data is not feasible, so repair must wait for the 
>     next cycle, delaying the overall duration.
> 
> Overall: the resources required by the two approaches depend on the use 
> case. In general, the snapshot-based solution consumes more disk space, 
> while the index-based solution requires more CPU and memory.
> 
> 
> 
> I think that although this is called MV repair, it's quite different from 
> regular repairs in Cassandra. Standard repairs are 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-06 Thread Blake Eggleston
Hi Benedict, that’s actually not true. 

Here’s a link to the project governance page: 
https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Project+Governance

The CEP section says:

“Once the proposal is finalized and any major committer dissent reconciled, 
call a [VOTE] on the ML to have the proposal adopted. The criteria for 
acceptance is consensus (3 binding +1 votes and no binding vetoes). The vote 
should remain open for 72 hours.”

So they’re definitely vetoable. 

Also note the part about “Once the proposal is finalized and any major 
committer dissent reconciled,” being a prerequisite for moving a CEP to 
[VOTE]. Given the as yet unreconciled committer dissent, it wouldn’t even be 
appropriate to move to a VOTE until we get to the bottom of this repair 
discussion.

On Fri, Jun 6, 2025, at 12:31 AM, Benedict Elliott Smith wrote:
> > but the snapshot repair design is not a viable path forward. It’s the first 
> > iteration of a repair design. We’ve proposed a second iteration, and we’re 
> > open to a third iteration.
> 
> I shan't be participating further in discussion, but I want to make a point 
> of order. The CEP process has no vetoes, so you are not empowered to declare 
> that a design is not viable without the input of the wider community.
> 
> 
> On 2025/06/05 03:58:59 Blake Eggleston wrote:
> > You can detect and fix the mismatch in a single round of repair, but the 
> > amount of work needed to do it is _significantly_ higher with snapshot 
> > repair. Consider a case where we have a 300 node cluster w/ RF 3, where 
> > each view partition contains entries mapping to every token range in the 
> > cluster - so 100 ranges. If we lose a view sstable, it will affect an 
> > entire row/column of the grid. Repair is going to scan all data in the 
> > mismatching view token ranges 100 times, and each base range once. So 
> > you’re looking at 200 range scans.
> > 
> > Now, you may argue that you can merge the duplicate view scans into a 
> > single scan while you repair all token ranges in parallel. I’m skeptical 
> > that’s going to be achievable in practice, but even if it is, we’re now 
> > talking about the view replica hypothetically doing a pairwise repair with 
> > every other replica in the cluster at the same time. Neither of these 
> > options is workable.
> > 
> > Let’s take a step back though, because I think we’re getting lost in the 
> > weeds.
> > 
> > The repair design in the CEP has some high level concepts that make a lot 
> > of sense, the idea of repairing a grid is really smart. However, it has 
> > some significant drawbacks that remain unaddressed. I want this CEP to 
> > succeed, and I know Jon does too, but the snapshot repair design is not a 
> > viable path forward. It’s the first iteration of a repair design. We’ve 
> > proposed a second iteration, and we’re open to a third iteration. This part 
> > of the CEP process is meant to identify and address shortcomings, I don’t 
> > think that continuing to dissect the snapshot repair design is making 
> > progress in that direction.
> > 
> > On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote:
> > > >  We potentially have to do it several times on each node, depending on 
> > > > the size of the range. Smaller ranges increase the size of the board 
> > > > exponentially, larger ranges increase the number of SSTables that would 
> > > > be involved in each compaction.
> > > As described in the CEP example, this can be handled in a single round of 
> > > repair. We first identify all the points in the grid that require repair, 
> > > then perform anti-compaction and stream data based on a second scan over 
> > > those identified points. This applies to the snapshot-based 
> > > solution—without an index, repairing a single point in that grid requires 
> > > scanning the entire base table partition (token range). In contrast, with 
> > > the index-based solution—as in the example you referenced—if a large 
> > > block of data is corrupted, even though the index is used for comparison, 
> > > many key mismatches may occur. This can lead to random disk access to the 
> > > original data files, which could cause performance issues. For the case 
> > > you mentioned for snapshot based solution, it should not take months to 
> > > repair all the data, instead one round of repair should be enough. The 
> > > actual repair phase is split from the detection phase.
> > > 
> > > 
> > > On Thu, Jun 5, 2025 at 12:12 AM Jon Haddad  
> > > wrote:
> > >> > This isn’t really the whole story. The amount of wasted scans on index 
> > >> > repairs is negligible. If a difference is detected with snapshot 
> > >> > repairs though, you have to read the entire partition from both the 
> > >> > view and base table to calculate what needs to be fixed.
> > >> 
> > >> You nailed it.
> > >> 
> > >> When the base table is converted to a view, and sent to the view, the 
> > >> information we have is that one of the view's partition key

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-06 Thread Benedict Elliott Smith
Without intending to get into further discussion, I would note that I think we 
already have the necessary facilities to enable compaction of the sstables 
being used to construct a snapshot, so that additional disk space is not 
required for snapshots. We use this already today to compact incoming streams. 

We would need to temporarily move the sstables we are using for a snapshot into 
a separate compaction strategy for the duration of the work, *not* mark them 
compacting (i.e. not treat it as a validation compaction), and then to 
periodically refresh the sstables we're actually reading from so that expired 
sstables can be released.

This would only prevent us compacting new data with snapshot data for the 
duration of the snapshot, which should have minimal overhead implications for 
most workloads.
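A minimal sketch of the refresh behavior described above, using a hypothetical `SnapshotReadSet` model rather than Cassandra's real compaction-strategy or sstable-tracking APIs:

```python
# Toy model (hypothetical, not Cassandra's actual API) of the idea above:
# the repair job pins a set of sstables for reading without marking them
# compacting, and periodically refreshes that set so sstables that have
# since been compacted away can be released and their space reclaimed.
class SnapshotReadSet:
    def __init__(self, live_sstables):
        # pin the currently live sstables for the duration of the work
        self.referenced = set(live_sstables)

    def refresh(self, live_sstables):
        # switch over to the current live set; anything we were holding
        # that compaction has since replaced is now releasable
        released = self.referenced - set(live_sstables)
        self.referenced = set(live_sstables)
        return released

rs = SnapshotReadSet({"sst-1", "sst-2", "sst-3"})
# compaction replaces sst-1 and sst-2 with sst-4
released = rs.refresh({"sst-3", "sst-4"})
print(sorted(released))  # ['sst-1', 'sst-2']
```

The point of the model is only that a long-lived reader need not hold disk space indefinitely if it periodically re-resolves which sstables it reads from.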

On 2025/06/06 09:03:29 Runtian Liu wrote:
> Thanks, Blake and Jon, for your feedback—I really appreciate your time on
> this topic and your efforts to help make this CEP a success. Throughout
> this discussion, we've explored many interesting problems, and your input
> helped me better understand how the index-based solution would work. Now
> that both approaches are clearly understood, here’s a cost comparison based
> on real-world use cases, as illustrated in the chart below. Please let me
> know if anything else is missing in the table below; I can help facilitate
> a comparison between both approaches so we, as an Apache Cassandra
> community, can make an informed decision.
> 
> Periodic inconsistency detection (full data set)
>   - CPU: index-based is higher (maintaining indexes increases CPU usage in
>     the hot path and in compaction, requiring higher CPU provisioning);
>     snapshot-based is lower.
>   - Memory: index-based is higher (maintaining indexes increases memory
>     usage in the hot path and in compaction); snapshot-based is lower.
>   - Disk: index-based is lower; snapshot-based is higher (snapshots use
>     extra disk space if SSTables are compacted away, with worst-case usage
>     reaching 2×).
> 
> Ad hoc inconsistency detection (partial data set)
>   - CPU/Memory/Disk: index-based is supported, at the same cost as above;
>     snapshot-based is not supported.
> 
> Data repair
>   - CPU: for the index-based solution, the cost depends on the number of
>     rows to be repaired. If many mismatches are detected, such as a whole
>     block of data missing in an outage-recovery scenario, the overhead of
>     random data access can be high; for normal use cases where only a few
>     rows in a token range need to be fixed, it requires significantly fewer
>     resources due to its row-level insight into the mismatched rows. The
>     snapshot-based solution must scan the entire partitions of both the base
>     table and the MV table even when only a few rows need fixing, and it
>     additionally requires anti-compaction, which can result in higher CPU
>     usage; however, if a full token range needs to be rebuilt due to a
>     hardware failure, it becomes more efficient because it avoids random
>     disk access.
>   - Disk: index-based exchanges only index files; the data streamed is
>     exactly the inconsistent data that needs to be repaired. Snapshot-based
>     over-streams, since whole blocks of data are streamed to the MV node.
> 
> Topology changes (one painful complexity for each)
>   - Index-based: indexes need to be rebuilt from time to time to maintain
>     their effectiveness.
>   - Snapshot-based: if all replicas are replaced during the detection phase,
>     repairing 100% of the data is not feasible, so repair must wait for the
>     next cycle, delaying the overall duration.
> 
> Overall: the resources required by the two approaches depend on the use
> case. In general, the snapshot-based solution consumes more disk space,
> while the index-based solution requires more CPU and memory.
> 
> 
> I think that although this is called MV repair, it's quite different from
> regular repairs in Cassandra. Standard repairs are designed to compare
> replicas and update them to the latest version. That’s why they must
> complete within the tombstone gc_grace_seconds period to avoid data loss.
> However, in the case of MV repair, how do we define what’s “safe” in terms
> of data quality? Regardless of which solution we choose, this MV repair
> process doesn’t follow the same gc_grace_seconds requirement.
> 
> From my perspective, materialized views should always remain in sync with
> the base table, and ideally, no repair would be necessary. However, as
> outlined in the CEP, there are scenarios where we may need to monitor or
> repair inconsistencies between the two tables. While such repairs are
> necessary, they can—and should—remain infrequent. If our repair detection
> job consistently finds a large number of mismatches, I would prefer to
> address the root cause in the hot path rather than relying on repair.
> 
> As shown in the table above, there’s no silver bullet or universally simple
> solution. Ideally, we would suppor

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-06 Thread Runtian Liu
Thanks, Blake and Jon, for your feedback—I really appreciate your time on
this topic and your efforts to help make this CEP a success. Throughout
this discussion, we've explored many interesting problems, and your input
helped me better understand how the index-based solution would work. Now
that both approaches are clearly understood, here’s a cost comparison based
on real-world use cases, as illustrated in the chart below. Please let me
know if anything else is missing in the table below; I can help facilitate
a comparison between both approaches so we, as an Apache Cassandra
community, can make an informed decision.

Periodic inconsistency detection (full data set)
  - CPU: index-based is higher (maintaining indexes increases CPU usage in
    the hot path and in compaction, requiring higher CPU provisioning);
    snapshot-based is lower.
  - Memory: index-based is higher (maintaining indexes increases memory
    usage in the hot path and in compaction); snapshot-based is lower.
  - Disk: index-based is lower; snapshot-based is higher (snapshots use
    extra disk space if SSTables are compacted away, with worst-case usage
    reaching 2×).

Ad hoc inconsistency detection (partial data set)
  - CPU/Memory/Disk: index-based is supported, at the same cost as above;
    snapshot-based is not supported.

Data repair
  - CPU: for the index-based solution, the cost depends on the number of
    rows to be repaired. If many mismatches are detected, such as a whole
    block of data missing in an outage-recovery scenario, the overhead of
    random data access can be high; for normal use cases where only a few
    rows in a token range need to be fixed, it requires significantly fewer
    resources due to its row-level insight into the mismatched rows. The
    snapshot-based solution must scan the entire partitions of both the base
    table and the MV table even when only a few rows need fixing, and it
    additionally requires anti-compaction, which can result in higher CPU
    usage; however, if a full token range needs to be rebuilt due to a
    hardware failure, it becomes more efficient because it avoids random
    disk access.
  - Disk: index-based exchanges only index files; the data streamed is
    exactly the inconsistent data that needs to be repaired. Snapshot-based
    over-streams, since whole blocks of data are streamed to the MV node.

Topology changes (one painful complexity for each)
  - Index-based: indexes need to be rebuilt from time to time to maintain
    their effectiveness.
  - Snapshot-based: if all replicas are replaced during the detection phase,
    repairing 100% of the data is not feasible, so repair must wait for the
    next cycle, delaying the overall duration.

Overall: the resources required by the two approaches depend on the use
case. In general, the snapshot-based solution consumes more disk space,
while the index-based solution requires more CPU and memory.


I think that although this is called MV repair, it's quite different from
regular repairs in Cassandra. Standard repairs are designed to compare
replicas and update them to the latest version. That’s why they must
complete within the tombstone gc_grace_seconds period to avoid data loss.
However, in the case of MV repair, how do we define what’s “safe” in terms
of data quality? Regardless of which solution we choose, this MV repair
process doesn’t follow the same gc_grace_seconds requirement.

From my perspective, materialized views should always remain in sync with
the base table, and ideally, no repair would be necessary. However, as
outlined in the CEP, there are scenarios where we may need to monitor or
repair inconsistencies between the two tables. While such repairs are
necessary, they can—and should—remain infrequent. If our repair detection
job consistently finds a large number of mismatches, I would prefer to
address the root cause in the hot path rather than relying on repair.

As shown in the table above, there’s no silver bullet or universally simple
solution. Ideally, we would support both options and let operators choose
based on their needs. However, given the complexity of implementing either
approach, we need to select one for the initial CEP deliverables.

Overall, I’m leaning toward the snapshot-based approach. The
trade-off—additional disk usage during infrequent repairs—seems more
acceptable than adding CPU and memory overhead to the hot path, especially
given that MV repair is expected to be an infrequent or on-demand
operation, unlike full or incremental repairs which must run regularly.


On Fri, Jun 6, 2025 at 4:32 PM Benedict Elliott Smith 
wrote:

> > but the snapshot repair design is not a viable path forward. It’s the
> first iteration of a repair design. We’ve proposed a second iteration, and
> we’re open to a third iteration.
>
> I shan't be participating further in discussion, but I want to make a
> point of order. The CEP process has no vetoes, so you are not empowered to
> declare that a design is not viable without the input of the wider
> community.
>
>
> On 2025/06/05 03:58:59 Blake Eggleston wrote:
> > You can detect 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-06 Thread Benedict Elliott Smith
> but the snapshot repair design is not a viable path forward. It’s the first 
> iteration of a repair design. We’ve proposed a second iteration, and we’re 
> open to a third iteration.

I shan't be participating further in discussion, but I want to make a point of 
order. The CEP process has no vetoes, so you are not empowered to declare that 
a design is not viable without the input of the wider community.


On 2025/06/05 03:58:59 Blake Eggleston wrote:
> You can detect and fix the mismatch in a single round of repair, but the 
> amount of work needed to do it is _significantly_ higher with snapshot 
> repair. Consider a case where we have a 300 node cluster w/ RF 3, where each 
> view partition contains entries mapping to every token range in the cluster - 
> so 100 ranges. If we lose a view sstable, it will affect an entire row/column 
> of the grid. Repair is going to scan all data in the mismatching view token 
> ranges 100 times, and each base range once. So you’re looking at 200 range 
> scans.
> 
> Now, you may argue that you can merge the duplicate view scans into a single 
> scan while you repair all token ranges in parallel. I’m skeptical that’s 
> going to be achievable in practice, but even if it is, we’re now talking 
> about the view replica hypothetically doing a pairwise repair with every 
> other replica in the cluster at the same time. Neither of these options is 
> workable.
> 
> Let’s take a step back though, because I think we’re getting lost in the 
> weeds.
> 
> The repair design in the CEP has some high level concepts that make a lot of 
> sense, the idea of repairing a grid is really smart. However, it has some 
> significant drawbacks that remain unaddressed. I want this CEP to succeed, 
> and I know Jon does too, but the snapshot repair design is not a viable path 
> forward. It’s the first iteration of a repair design. We’ve proposed a second 
> iteration, and we’re open to a third iteration. This part of the CEP process 
> is meant to identify and address shortcomings, I don’t think that continuing 
> to dissect the snapshot repair design is making progress in that direction.
> 
> On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote:
> > >  We potentially have to do it several times on each node, depending on 
> > > the size of the range. Smaller ranges increase the size of the board 
> > > exponentially, larger ranges increase the number of SSTables that would 
> > > be involved in each compaction.
> > As described in the CEP example, this can be handled in a single round of 
> > repair. We first identify all the points in the grid that require repair, 
> > then perform anti-compaction and stream data based on a second scan over 
> > those identified points. This applies to the snapshot-based 
> > solution—without an index, repairing a single point in that grid requires 
> > scanning the entire base table partition (token range). In contrast, with 
> > the index-based solution—as in the example you referenced—if a large block 
> > of data is corrupted, even though the index is used for comparison, many 
> > key mismatches may occur. This can lead to random disk access to the 
> > original data files, which could cause performance issues. For the case you 
> > mentioned for snapshot based solution, it should not take months to repair 
> > all the data, instead one round of repair should be enough. The actual 
> > repair phase is split from the detection phase.
> > 
> > 
> > On Thu, Jun 5, 2025 at 12:12 AM Jon Haddad  wrote:
> >> > This isn’t really the whole story. The amount of wasted scans on index 
> >> > repairs is negligible. If a difference is detected with snapshot repairs 
> >> > though, you have to read the entire partition from both the view and 
> >> > base table to calculate what needs to be fixed.
> >> 
> >> You nailed it.
> >> 
> >> When the base table is converted to a view, and sent to the view, the 
> >> information we have is that one of the view's partition keys needs a 
> >> repair.  That's going to be different from the partition key of the base 
> >> table.  As a result, on the base table, for each affected range, we'd have 
> >> to issue another compaction across the entire set of sstables that could 
> >> have the data the view needs (potentially many GB), in order to send over 
> >> the corrected version of the partition, then send it over to the view.  
> >> Without an index in place, we have to do yet another scan, per-affected 
> >> range.  
> >> 
> >> Consider the case of a single corrupted SSTable on the view that's removed 
> >> from the filesystem, or the data is simply missing after being restored 
> >> from an inconsistent backup.  It presumably contains lots of partitions, 
> >> which maps to base partitions all over the cluster, in a lot of different 
> >> token ranges.  For every one of those ranges (hundreds, to tens of 
> >> thousands of them given the checkerboard design), when finding the missing 
> >> data in the base, you'll have to pe

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-04 Thread Blake Eggleston
You can detect and fix the mismatch in a single round of repair, but the amount 
of work needed to do it is _significantly_ higher with snapshot repair. 
Consider a case where we have a 300 node cluster w/ RF 3, where each view 
partition contains entries mapping to every token range in the cluster - so 100 
ranges. If we lose a view sstable, it will affect an entire row/column of the 
grid. Repair is going to scan all data in the mismatching view token ranges 100 
times, and each base range once. So you’re looking at 200 range scans.
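The arithmetic above can be sketched directly; all numbers are the assumptions stated in this email (300 nodes, RF 3, roughly 100 token ranges, one lost view sstable whose partitions map to every base range):

```python
# Back-of-envelope version of the scan-count argument above. Numbers are
# the email's stated assumptions, not measurements.
nodes, rf = 300, 3
ranges = nodes // rf        # ~100 distinct token ranges in the cluster

# a lost view sstable dirties an entire row/column of the repair grid:
view_scans = ranges         # the mismatching view ranges are rescanned once
                            # per base range they are paired with
base_scans = ranges         # each base range is scanned once
total_scans = view_scans + base_scans
print(total_scans)          # 200 range scans to repair one lost sstable
```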

Now, you may argue that you can merge the duplicate view scans into a single 
scan while you repair all token ranges in parallel. I’m skeptical that’s going 
to be achievable in practice, but even if it is, we’re now talking about the 
view replica hypothetically doing a pairwise repair with every other replica in 
the cluster at the same time. Neither of these options is workable.

Let’s take a step back though, because I think we’re getting lost in the weeds.

The repair design in the CEP has some high level concepts that make a lot of 
sense, the idea of repairing a grid is really smart. However, it has some 
significant drawbacks that remain unaddressed. I want this CEP to succeed, and 
I know Jon does too, but the snapshot repair design is not a viable path 
forward. It’s the first iteration of a repair design. We’ve proposed a second 
iteration, and we’re open to a third iteration. This part of the CEP process is 
meant to identify and address shortcomings, I don’t think that continuing to 
dissect the snapshot repair design is making progress in that direction.

On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote:
> >  We potentially have to do it several times on each node, depending on the 
> > size of the range. Smaller ranges increase the size of the board 
> > exponentially, larger ranges increase the number of SSTables that would be 
> > involved in each compaction.
> As described in the CEP example, this can be handled in a single round of 
> repair. We first identify all the points in the grid that require repair, 
> then perform anti-compaction and stream data based on a second scan over 
> those identified points. This applies to the snapshot-based solution—without 
> an index, repairing a single point in that grid requires scanning the entire 
> base table partition (token range). In contrast, with the index-based 
> solution—as in the example you referenced—if a large block of data is 
> corrupted, even though the index is used for comparison, many key mismatches 
> may occur. This can lead to random disk access to the original data files, 
> which could cause performance issues. For the case you mentioned for snapshot 
> based solution, it should not take months to repair all the data, instead one 
> round of repair should be enough. The actual repair phase is split from the 
> detection phase.
> 
> 
> On Thu, Jun 5, 2025 at 12:12 AM Jon Haddad  wrote:
>> > This isn’t really the whole story. The amount of wasted scans on index 
>> > repairs is negligible. If a difference is detected with snapshot repairs 
>> > though, you have to read the entire partition from both the view and base 
>> > table to calculate what needs to be fixed.
>> 
>> You nailed it.
>> 
>> When the base table is converted to a view, and sent to the view, the 
>> information we have is that one of the view's partition keys needs a repair. 
>>  That's going to be different from the partition key of the base table.  As 
>> a result, on the base table, for each affected range, we'd have to issue 
>> another compaction across the entire set of sstables that could have the 
>> data the view needs (potentially many GB), in order to send over the 
>> corrected version of the partition, then send it over to the view.  Without 
>> an index in place, we have to do yet another scan, per-affected range.  
>> 
>> Consider the case of a single corrupted SSTable on the view that's removed 
>> from the filesystem, or the data is simply missing after being restored from 
>> an inconsistent backup.  It presumably contains lots of partitions, which 
>> maps to base partitions all over the cluster, in a lot of different token 
>> ranges.  For every one of those ranges (hundreds, to tens of thousands of 
>> them given the checkerboard design), when finding the missing data in the 
>> base, you'll have to perform a compaction across all the SSTables that 
>> potentially contain the missing data just to rebuild the view-oriented 
>> partitions that need to be sent to the view.  The complexity of this 
>> operation can be looked at as O(N*M) where N and M are the number of ranges 
>> in the base table and the view affected by the corruption, respectively.  
>> Without an index in place, finding the missing data is very expensive.  We 
>> potentially have to do it several times on each node, depending on the size 
>> of the range.  Smaller ranges increase the size of the board exponentially, 
>> larger ranges in

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-04 Thread Runtian Liu
>  We potentially have to do it several times on each node, depending on
> the size of the range. Smaller ranges increase the size of the board
> exponentially, larger ranges increase the number of SSTables that would be
> involved in each compaction.
As described in the CEP example, this can be handled in a single round of
repair. We first identify all the points in the grid that require repair,
then perform anti-compaction and stream data based on a second scan over
those identified points. This applies to the snapshot-based
solution—without an index, repairing a single point in that grid requires
scanning the entire base table partition (token range). In contrast, with
the index-based solution—as in the example you referenced—if a large block
of data is corrupted, even though the index is used for comparison, many
key mismatches may occur. This can lead to random disk access to the
original data files, which could cause performance issues. For the case you
mentioned for the snapshot-based solution, it should not take months to repair
all the data; one round of repair should be enough, since the actual
repair phase is split from the detection phase.
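To make the cost asymmetry under discussion concrete, the O(N*M) work Jon describes for index-less repair (one scan per affected base-range/view-range pair) can be sketched; the numbers below are purely illustrative:

```python
# Illustrative sketch of the O(N * M) repair cost discussed in this thread:
# without an index, every affected (base range, view range) pair forces its
# own scan/compaction over the candidate sstables. Numbers are made up to
# show the scaling, not taken from a real cluster.
def repair_work_units(base_ranges_affected, view_ranges_affected):
    # one scan over candidate base sstables per affected pair
    return base_ranges_affected * view_ranges_affected

coarse = repair_work_units(100, 100)    # coarse split of the ring
fine = repair_work_units(1000, 1000)    # 10x finer ranges on each axis
print(coarse, fine, fine // coarse)     # 10000 1000000 100
```

Splitting the ring 10x more finely on each axis multiplies the pairwise work by 100, which is the "board grows exponentially" effect described in the quoted email.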


On Thu, Jun 5, 2025 at 12:12 AM Jon Haddad  wrote:

> > This isn’t really the whole story. The amount of wasted scans on index
> repairs is negligible. If a difference is detected with snapshot repairs
> though, you have to read the entire partition from both the view and base
> table to calculate what needs to be fixed.
>
> You nailed it.
>
> When the base table is converted to a view, and sent to the view, the
> information we have is that one of the view's partition keys needs a
> repair.  That's going to be different from the partition key of the base
> table.  As a result, on the base table, for each affected range, we'd have
> to issue another compaction across the entire set of sstables that could
> have the data the view needs (potentially many GB), in order to send over
> the corrected version of the partition, then send it over to the view.
> Without an index in place, we have to do yet another scan, per-affected
> range.
>
> Consider the case of a single corrupted SSTable on the view that's removed
> from the filesystem, or the data is simply missing after being restored
> from an inconsistent backup.  It presumably contains lots of partitions,
> which maps to base partitions all over the cluster, in a lot of different
> token ranges.  For every one of those ranges (hundreds, to tens of
> thousands of them given the checkerboard design), when finding the missing
> data in the base, you'll have to perform a compaction across all the
> SSTables that potentially contain the missing data just to rebuild the
> view-oriented partitions that need to be sent to the view.  The complexity
> of this operation can be looked at as O(N*M) where N and M are the number
> of ranges in the base table and the view affected by the corruption,
> respectively.  Without an index in place, finding the missing data is very
> expensive.  We potentially have to do it several times on each node,
> depending on the size of the range.  Smaller ranges increase the size of
> the board exponentially, larger ranges increase the number of SSTables that
> would be involved in each compaction.
>
> Then you send that data over to the view, the view does its
> anti-compaction thing, again, once per affected range.  So now the view has
> to do an anti-compaction once per block on the board that's affected by the
> missing data.
>
> Doing hundreds or thousands of these will add up pretty quickly.
>
> When I said that a repair could take months, this is what I had in mind.
>
>
>
>
> On Tue, Jun 3, 2025 at 11:10 AM Blake Eggleston 
> wrote:
>
>> > Adds overhead in the hot path due to maintaining indexes. Extra memory
>> needed during write path and compaction.
>>
>> I’d make the same argument about the overhead of maintaining the index
>> that Jon just made about the disk space required. The relatively
>> predictable overhead of maintaining the index as part of the write and
>> compaction paths is a pro, not a con. Although you’re not always paying the
>> cost of building a merkle tree with snapshot repair, it can impact the hot
>> path and you do have to plan for it.
>>
>> > Verifies index content, not actual data—may miss low-probability errors
>> like bit flips
>>
>> Presumably this could be handled by the views performing repair against
>> each other? You could also periodically rebuild the index or perform
>> checksums against the sstable content.
>>
>> > Extra data scan during inconsistency detection
>> > Index: Since the data covered by certain indexes is not guaranteed to
>> be fully contained within a single node as the topology changes, some data
>> scans may be wasted.
>> > Snapshots: No extra data scan
>>
>> This isn’t really the whole story. The amount of wasted scans on index
>> repairs is negligible. If a difference is detected with snapshot repairs
>> though, you have to read the entire partition from both the view and base
>> table to calculate what needs to be fixed.

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-04 Thread Jon Haddad
> This isn’t really the whole story. The amount of wasted scans on index
repairs is negligible. If a difference is detected with snapshot repairs
though, you have to read the entire partition from both the view and base
table to calculate what needs to be fixed.

You nailed it.

When the base table is converted to a view, and sent to the view, the
information we have is that one of the view's partition keys needs a
repair.  That's going to be different from the partition key of the base
table.  As a result, on the base table, for each affected range, we'd have
to issue another compaction across the entire set of sstables that could
have the data the view needs (potentially many GB), in order to send over
the corrected version of the partition, then send it over to the view.
Without an index in place, we have to do yet another scan, per-affected
range.

Consider the case of a single corrupted SSTable on the view that's removed
from the filesystem, or the data is simply missing after being restored
from an inconsistent backup.  It presumably contains lots of partitions,
which maps to base partitions all over the cluster, in a lot of different
token ranges.  For every one of those ranges (hundreds, to tens of
thousands of them given the checkerboard design), when finding the missing
data in the base, you'll have to perform a compaction across all the
SSTables that potentially contain the missing data just to rebuild the
view-oriented partitions that need to be sent to the view.  The complexity
of this operation can be looked at as O(N*M) where N and M are the number
of ranges in the base table and the view affected by the corruption,
respectively.  Without an index in place, finding the missing data is very
expensive.  We potentially have to do it several times on each node,
depending on the size of the range.  Smaller ranges increase the size of
the board exponentially, larger ranges increase the number of SSTables that
would be involved in each compaction.
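
A rough back-of-the-envelope for the O(N*M) argument above; the numbers
below are purely illustrative, not measurements from any real cluster:

```python
# Illustrative cost model for the O(N*M) argument above; every parameter
# here is made up for the example, it only shows how the cost scales.
N = 1000                      # affected base-table ranges
M = 1000                      # affected view ranges
gb_per_cell = 2               # GB compacted to rebuild one (base, view) cell
throughput_gb_per_min = 10    # assumed sustained compaction throughput

total_gb = N * M * gb_per_cell          # O(N*M) cells, each needing a rebuild
minutes = total_gb / throughput_gb_per_min
print(f"{total_gb} GB rewritten, ~{minutes / (60 * 24):.0f} days")
```

Even with generous throughput assumptions, the N*M multiplication is what
pushes a cluster-wide incident from hours into months of repair work.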

Then you send that data over to the view, the view does its
anti-compaction thing, again, once per affected range.  So now the view has
to do an anti-compaction once per block on the board that's affected by the
missing data.

Doing hundreds or thousands of these will add up pretty quickly.

When I said that a repair could take months, this is what I had in mind.




On Tue, Jun 3, 2025 at 11:10 AM Blake Eggleston 
wrote:

> > Adds overhead in the hot path due to maintaining indexes. Extra memory
> needed during write path and compaction.
>
> I’d make the same argument about the overhead of maintaining the index
> that Jon just made about the disk space required. The relatively
> predictable overhead of maintaining the index as part of the write and
> compaction paths is a pro, not a con. Although you’re not always paying the
> cost of building a merkle tree with snapshot repair, it can impact the hot
> path and you do have to plan for it.
>
> > Verifies index content, not actual data—may miss low-probability errors
> like bit flips
>
> Presumably this could be handled by the views performing repair against
> each other? You could also periodically rebuild the index or perform
> checksums against the sstable content.
>
> > Extra data scan during inconsistency detection
> > Index: Since the data covered by certain indexes is not guaranteed to be
> fully contained within a single node as the topology changes, some data
> scans may be wasted.
> > Snapshots: No extra data scan
>
> This isn’t really the whole story. The amount of wasted scans on index
> repairs is negligible. If a difference is detected with snapshot repairs
> though, you have to read the entire partition from both the view and base
> table to calculate what needs to be fixed.
>
> On Tue, Jun 3, 2025, at 10:27 AM, Jon Haddad wrote:
>
> One practical aspect that isn't immediately obvious is the disk space
> consideration for snapshots.
>
> When you have a table with a mixed workload using LCS or UCS with scaling
> parameters like L10 and initiate a repair, the disk usage will increase as
> long as the snapshot persists and the table continues to receive writes.
> This aspect is understood and factored into the design.
>
> However, a more nuanced point is the necessity to maintain sufficient disk
> headroom specifically for running repairs. This echoes the challenge with
> STCS compaction, where enough space must be available to accommodate the
> largest SSTables, even when they are not being actively compacted.
>
> For example, if a repair involves rewriting 100GB of SSTable data, you'll
> consistently need to reserve 100GB of free space to facilitate this.
>
> Therefore, while the snapshot-based approach leads to variable disk space
> utilization, operators must provision storage as if the maximum potential
> space will be used at all times to ensure repairs can be executed.
>
> This introduces a rate of churn dynamic, where the write throughput
> dictates the required extra disk space, rather than the existing on-disk
> data volume.

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-03 Thread Blake Eggleston
> Adds overhead in the hot path due to maintaining indexes. Extra memory needed 
> during write path and compaction.

I’d make the same argument about the overhead of maintaining the index that Jon 
just made about the disk space required. The relatively predictable overhead of 
maintaining the index as part of the write and compaction paths is a pro, not a 
con. Although you’re not always paying the cost of building a merkle tree with 
snapshot repair, it can impact the hot path and you do have to plan for it.

> Verifies index content, not actual data—may miss low-probability errors like 
> bit flips

Presumably this could be handled by the views performing repair against each 
other? You could also periodically rebuild the index or perform checksums 
against the sstable content.

> Extra data scan during inconsistency detection
> Index: Since the data covered by certain indexes is not guaranteed to be 
> fully contained within a single node as the topology changes, some data scans 
> may be wasted.
> Snapshots: No extra data scan

This isn’t really the whole story. The amount of wasted scans on index repairs 
is negligible. If a difference is detected with snapshot repairs though, you 
have to read the entire partition from both the view and base table to 
calculate what needs to be fixed.

On Tue, Jun 3, 2025, at 10:27 AM, Jon Haddad wrote:
> One practical aspect that isn't immediately obvious is the disk space 
> consideration for snapshots.
> 
> When you have a table with a mixed workload using LCS or UCS with scaling 
> parameters like L10 and initiate a repair, the disk usage will increase as 
> long as the snapshot persists and the table continues to receive writes. This 
> aspect is understood and factored into the design.
> 
> However, a more nuanced point is the necessity to maintain sufficient disk 
> headroom specifically for running repairs. This echoes the challenge with 
> STCS compaction, where enough space must be available to accommodate the 
> largest SSTables, even when they are not being actively compacted.
> 
> For example, if a repair involves rewriting 100GB of SSTable data, you'll 
> consistently need to reserve 100GB of free space to facilitate this.
> 
> Therefore, while the snapshot-based approach leads to variable disk space 
> utilization, operators must provision storage as if the maximum potential 
> space will be used at all times to ensure repairs can be executed.
> 
> This introduces a rate of churn dynamic, where the write throughput dictates 
> the required extra disk space, rather than the existing on-disk data volume.
> 
> If 50% of your SSTables are rewritten during a snapshot, you would need 50% 
> free disk space. Depending on the workload, the snapshot method could consume 
> significantly more disk space than an index-based approach. Conversely, for 
> relatively static workloads, the index method might require more space. It's 
> not as straightforward as stating "No extra disk space needed".
> 
> Jon
> 
> On Mon, Jun 2, 2025 at 2:49 PM Runtian Liu  wrote:
>> > Regarding your comparison between approaches, I think you also need to 
>> > take into account the other dimensions that have been brought up in this 
>> > thread. Things like minimum repair times and vulnerability to outages and 
>> > topology changes are the first that come to mind.
>> 
>> Sure, I added a few more points.
>> 
>> *Perspective*
>> 
>> *Index-Based Solution*
>> 
>> *Snapshot-Based Solution*
>> 
>> 1. Hot path overhead
>> 
>> Adds overhead in the hot path due to maintaining indexes. Extra memory 
>> needed during write path and compaction.
>> 
>> No impact on the hot path
>> 
>> 2. Extra disk usage when repair is not running
>> 
>> Requires additional disk space to store persistent indexes
>> 
>> No extra disk space needed
>> 
>> 3. Extra disk usage during repair
>> 
>> Minimal or no additional disk usage
>> 
>> Requires additional disk space for snapshots
>> 
>> 4. Fine-grained repair  to deal with emergency situations / topology changes
>> 
>> Supports fine-grained repairs by targeting specific index ranges. This 
>> allows repair to be retried on smaller data sets, enabling incremental 
>> progress when repairing the entire table. This is especially helpful when 
>> there are down nodes or topology changes during repair, which are common in 
>> day-to-day operations.
>> 
>> Coordination across all nodes is required over a long period of time. For 
>> each round of repair, if all replica nodes are down or if there is a 
>> topology change, the data ranges that were not covered will need to be 
>> repaired in the next round.
>> 
>> 
>> 5. Validating data used in reads directly
>> 
>> Verifies index content, not actual data—may miss low-probability errors like 
>> bit flips
>> 
>> Verifies actual data content, providing stronger correctness guarantees
>> 
>> 6. Extra data scan during inconsistency detection
>> 
>> Since the data covered by certain indexes is not guaranteed to be fully 
>> contained within a single node as the topology changes, some data scans 
>> may be wasted.
>> 
>> No extra data scan

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-27 Thread Benedict
Are you and Blake proposing to implement it?

On 27 May 2025, at 19:37, Jon Haddad wrote:

If the goal of this proposal is to move MVs to a state of being production
ready, then it shouldn't have glaring holes.  We've already identified that
the proposed global snapshots have these issues:

1. The data being compared is going to be out of date very quickly
2. Repair operations can be interrupted and invalidated by cluster topology
changes
3. The process is incredibly brittle, requiring a lot of global coordination
to take place over a long period of time.
4. Reconciling the view's inconsistencies is incredibly expensive.  Using
incremental repair can result in multiple GB of anti-compaction to take
place just to resolve a single row.

I think the part of the proposal that moves coordination to Paxos could
move forward on its own, as it's a significant incremental improvement.  To
me though, if we're going to call something prod ready, it needs a repair
process that can be run concurrently with topology changes.

*** No internal C* process for resolving data consistency should block
cluster expansion for weeks at a time or vice versa ***

Blake and I have been working on a proposal which solves these core
problems.

Jon

On Tue, May 27, 2025 at 10:07 AM Blake Eggleston wrote:

Sure I understand that part, it's kind of beside the point though. The
problem is that when a user notices something is wrong with their data,
they typically try to repair it, often with a targeted subrange repair. If
the cluster has to do a ton of work before that's possible, then that trick
isn't really useful anymore.

On Sat, May 24, 2025, at 1:28 PM, Jaydeep Chovatia wrote:

> I think we reach the halfway complete point once 70% of the time has
> elapsed, vs 50% for regular repair.
> Do check my working though, this was put together without much validation.

The formula seems correct to me—it is quadratic in that the rate of
increase gets faster and faster over time. I have verified this with the
following example: If we take k=2 and N=5, the grid will be 10x10, so at
the 7th time unit, we can compare 49 grid cells.

Time | Cells Intersecting | Usability (%)
  1  |   1/100            |   1
  2  |   4/100            |   4
  3  |   9/100            |   9
  4  |  16/100            |  16
  5  |  25/100            |  25
 ... |  ...               | ...
  7  |  49/100            |  49
 ... |  ...               | ...
 10  | 100/100            | 100

Jaydeep

On Fri, May 23, 2025 at 2:17 PM Benedict Elliott Smith wrote:

I think a small number of overlapping points in the grid should occur
relatively quickly, and the rate of overlapping grid points should
accelerate as the snapshot process continues.

To expand on this a little, let's say you split a cluster of N shards up
into kN slices, so that there are (kN)^2 merkle trees in the grid. Each
replica then participates in (k^2)N merkle trees. Merkle trees are produced
at a rate of 1 per time t on each replica, so that the snapshot finishes at
t=(k^2)N.

Base table and view replicas process their data orthogonally, but every kN
merkle trees produced by a replica represents a complete column or row.
That is, at time t=kN we have produced kN^2 merkle trees, of which N^2 are
ready to process (N rows intersecting N columns); but at t=2kN we have
2kN^2 trees but 4N^2 are ready to process (2N rows intersecting 2N
columns).

I think we reach the halfway complete point once 70% of the time has
elapsed, vs 50% for regular repair. This doesn't sound so bad to me. Do
check my working though, this was put together without much validation.

On 22 May 2025, at 20:32, Benedict Elliott Smith wrote:

> That's assuming there isn't actually repair work that needs to be done.
> As you go around the cluster doing repairs, the average age gets older. By
> virtue of running on an active cluster, there will be partitions that are
> repaired even if everything is working properly.

That assumes the snapshot is out of sync. I don't know how this is spelled
out in the proposal, but you would want to compare data around the cluster
as of the same point in time - ideally, probably filtered to include
everything with a timestamp below some point that occurs shortly after the
snapshot is initiated, so that every replica should match.

> If correcting a similar MV issue is going to be blocked by a
> snapshot/tree build that's going to take days - longer if nodes are down -
> that's a non-starter for real world use cases.

This statement has three assumptions, that I do not necessarily agree with:

1) That it takes days to build even in an emergency. The cost to build is
an empirical question we can answer once the solution exists, but I don't
see this as obviously true. We could perhaps estimate this by seeing how
quickly a validation cycle could be run today, given some CPU budget (say
15%).

2) That we can't start work until the whole snapshot is built. I don't
think this is true. I think a small number of overlapping points in the
grid should occur relatively quickly, and the rate of overlapping grid
points should accelerate as the snapshot process continues.

3) That this is a blocker for "real world use cases". In the real world,
resolutions to
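
Jaydeep's k=2, N=5 example can be reproduced numerically, and it also
confirms the "halfway complete at ~70% of elapsed time" claim: usable cells
grow as t^2, assuming t completed rows and t completed columns at time t.
A short sketch:

```python
import math

# Reproduce the grid-usability example: k=2, N=5 gives a 10x10 grid.
k, N = 2, 5
side = k * N                 # 10 slices per dimension
total_cells = side * side    # 100 merkle-tree intersections

# At time t, t rows and t columns are complete, so t*t cells are comparable.
for t in range(1, side + 1):
    usable = t * t
    print(f"t={t}: {usable}/{total_cells} cells ({100 * usable // total_cells}%)")

# Half the grid is usable when t*t = total/2, i.e. t ~ 70.7% of the way.
half_t = math.sqrt(total_cells / 2)
print(f"half the grid usable at t={half_t:.2f} of {side}")
```

At t=7 this reports 49/100 cells, matching the table, and the halfway point
lands at t≈7.07 out of 10, i.e. roughly 70% of the elapsed time.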

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-21 Thread Benedict
It's an additional piece of work. If you need to be able to rebuild this
data, then you need the original proposal either way. This proposal to
maintain a live updating snapshot is therefore an additional feature on top
of the MVP proposed.

I don't think this new proposal is fully fleshed out, I have a lot more
questions about it than I do about the original proposal. I also don't
think it is healthy to weigh down other contributors' proposals that move
the state of the database forwards with our own design goals, without very
strong justification. I think it is fine to say to users: repair exists, it
shouldn't ordinarily need to be run, but if you do it will for now require
a snapshotting process. This would be acceptable to me as an operator
(workload dependent), and I'm sure it would be acceptable to others,
including the contributors undertaking the work.

In the meantime there's time to sketch out how such an online update
process would work in more detail. I think it is achievable but much less
obvious than building a snapshot.

On 21 May 2025, at 21:00, Blake Eggleston wrote:

I don't think it's trivial, but I also don't think it's any more difficult
than adding a mechanism to snapshot and build this merkle tree grid.
Remember, we can't just start a full cluster wide table scan at full blast
every time we want to start a new repair cycle. There's going to need to be
some gradual build coordination.

I also don't think this belongs in a follow on task. I think that the
original proposal is incomplete without a better repair story and I'm not
sure I'd support the CEP if it proceeded as is.

On Wed, May 21, 2025, at 12:22 PM, Benedict wrote:

Depending how long the grid structure takes to build, there is perhaps
anyway value in being able to update the snapshot after construction, so
that when the repair is performed it is as up to date as possible. But, I
don't think this is trivial? I have some ideas how this might be done but
they aren't ideal, and could be costly or error prone. Have you sketched
out a mechanism for this?

I like the idea that the same mechanism could be used to build a one-off
snapshot as maintain a live one, leaving the operator to decide what they
prefer. Since this seems like an extension to the original proposal, I
would suggest the original proposal is advanced and live updates to the
snapshot is developed in follow up work.

On 21 May 2025, at 17:45, Blake Eggleston wrote:

> 1. Isn't this hybrid approach conceptually similar to the grid structure
> described in the proposal? The main distinction is that the original
> proposal involves recomputing the entire grid during each repair cycle. In
> contrast, the approach outlined below optimizes this by only
> reconstructing individual cells marked as dirty due to recent updates.

Yes, it is. FWIW I think the grid structure is the right approach
conceptually, but it has some drawbacks as proposed that I think we should
try to improve. The simple index approach takes the same high level
approach with a different set of tradeoffs.

> 2. If we adopt the dirty marker approach, it may not account for cases
> where there were no user writes, but inconsistencies still arose between
> the base and the view—such as those caused by SSTable bit rot, streaming
> anomalies, or other low-level issues.

That's true. You could force a rebuild of the marker table if you knew
there was a problem, and it might even be a good idea to have a process
that slowly rebuilds the table in the background. The important part is
that there's a low upfront cost to starting a repair, and that individual
range pairs can be repaired quickly (by repair standards).

Another advantage to having a more lightweight repair mechanism is that we
can tolerate riskier client patterns. For instance, if we went with a
paxos based approach, I think LOCAL_SERIAL writes would become acceptable,
whereas I think we'd only be able to allow SERIAL with the heavier repair.

On Tue, May 20, 2025, at 8:37 PM, Jaydeep Chovatia wrote:

1. Isn't this hybrid approach conceptually similar to the grid structure
described in the proposal? The main distinction is that the original
proposal involves recomputing the entire grid during each repair cycle. In
contrast, the approach outlined below optimizes this by only reconstructing
individual cells marked as dirty due to recent updates.

2. If we adopt the dirty marker approach, it may not account for cases
where there were no user writes, but inconsistencies still arose between
the base and the view—such as those caused by SSTable bit rot, streaming
anomalies, or other low-level issues.

Jaydeep

On Tue, May 20, 2025 at 5:31 PM Blake Eggleston wrote:

I had an idea that's a kind of a hybrid between the index approach and the
merkle tree approach. Basically we keep something kind of like an index
that only contains a hash of the data between a base partition and view
partition intersection. So it would structure data like this:

view_range -> base_token -> view_token -> contents_hash
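
The hybrid structure Blake describes could look roughly like the sketch
below. All names, the `IntersectionIndex` class, and the dirty-marker
bookkeeping are guesses at the idea for illustration, not a concrete
design:

```python
import hashlib

# Sketch of the hybrid structure: view_range -> base_token -> view_token
# -> contents_hash, with a dirty marker per (base, view) token combo
# instead of keeping the hash always current like a full index.

class IntersectionIndex:
    def __init__(self):
        self.hashes = {}    # (view_range, base_token, view_token) -> hash
        self.dirty = set()  # combos touched by writes since the last repair

    def mark_write(self, key):
        self.dirty.add(key)

    def refresh_dirty(self, read_contents):
        # At repair time, recompute hashes only for the dirty combos.
        for key in list(self.dirty):
            self.hashes[key] = hashlib.sha256(read_contents(key)).hexdigest()
        self.dirty.clear()

    def mismatches(self, other):
        keys = self.hashes.keys() | other.hashes.keys()
        return sorted(k for k in keys if self.hashes.get(k) != other.hashes.get(k))

# Toy usage: one intersection diverges between base and view replicas.
base_data = {("vr0", 1, 10): b"row-a", ("vr0", 2, 11): b"row-b"}
view_data = {("vr0", 1, 10): b"row-a", ("vr0", 2, 11): b"row-corrupt"}

base, view = IntersectionIndex(), IntersectionIndex()
for k in base_data:
    base.mark_write(k)
    view.mark_write(k)
base.refresh_dirty(lambda k: base_data[k])
view.refresh_dirty(lambda k: view_data[k])
print(base.mismatches(view))  # -> [('vr0', 2, 11)]
```

Comparing the two hash maps is a sequential scan, which is what makes
detecting a divergent base/view intersection cheap by repair standards.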

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-21 Thread Jon Haddad
I agree with Blake.  These are perfectly reasonable discussions to have up
front.

Snapshot based repair has a huge downside in that you're repairing data
that's days or weeks old.  There's going to be issues that arise from that
especially since the deletes that are recorded on the MV aren't going to be
stored anywhere else.  Jeff's brought up the tombstone consistency problem
a few times, I can't recall having seen an answer to that.  If we're trying
to make these production ready, these need to actually work correctly.  The
time to complete a repair *has* to be reasonable as well, because if it
does take weeks, then we're basically saying backup / restore and MVs can't
be used together due to time to restore.  Either that or we need to solve
the problem of eventually consistent backups.

To repeat myself from an earlier message in the thread:

"Add acceptance criteria added to the CEP that this has been tested at a
reasonably large scale, preferably with a base table dataset of *at least*
100TB (preferably more) & significant variance in both base & MV partition
size, prior to merge. "

Having *some* acceptance criteria, for me, is non-negotiable.

I still don't think building anything new on Paxos makes any sense at all,
given that this would go into C* 7.0, not 6.0.  If Accord isn't ready for
something like this (I think it's kind of the perfect use case) then we
should be having a different discussion - one that involves removing
Accord.

If we were to bring this to a vote today, I'm afraid I'd be deciding
between a -.9 (extreme disapproval but non-blocking) and a -1 for my
concerns around correctness, use of global, definitely stale snapshots, and
the choice of Paxos over Accord.  I think there's a lot of really good
stuff in the proposal, but what I'm seeing in there doesn't look like
something I would recommend people use.

Jon

On Wed, May 21, 2025 at 1:00 PM Blake Eggleston 
wrote:

> I don’t think it’s trivial, but I also don’t think it’s any more difficult
> than adding a mechanism to snapshot and build this merkle tree grid.
> Remember, we can’t just start a full cluster wide table scan at full blast
> every time we want to start a new repair cycle. There’s going to need to be
> some gradual build coordination.
>
> I also don’t think this belongs in a follow on task. I think that the
> original proposal is incomplete without a better repair story and I’m not
> sure I’d support the CEP if it proceeded as is.
>
> On Wed, May 21, 2025, at 12:22 PM, Benedict wrote:
>
>
> Depending how long the grid structure takes to build, there is perhaps
> anyway value in being able to update the snapshot after construction, so
> that when the repair is performed it is as up to date as possible. But, I
> don’t think this is trivial? I have some ideas how this might be done but
> they aren’t ideal, and could be costly or error prone. Have you sketched
> out a mechanism for this?
>
> I like the idea that the same mechanism could be used to build a one-off
> snapshot as maintain a live one, leaving the operator to decide what they
> prefer.
>
> Since this seems like an extension to the original proposal, I would
> suggest the original proposal is advanced and live updates to the snapshot
> is developed in follow up work.
>
>
> On 21 May 2025, at 17:45, Blake Eggleston  wrote:
>
> 
>
> 1. Isn't this hybrid approach conceptually similar to the grid structure
> described in the proposal? The main distinction is that the original
> proposal involves recomputing the entire grid during each repair cycle. In
> contrast, the approach outlined below optimizes this by only reconstructing
> individual cells marked as dirty due to recent updates.
>
>
> Yes, it is. FWIW I think the grid structure is the right approach
> conceptually, but it has some drawbacks as proposed that I think we should
> try to improve. The simple index approach takes the same high level
> approach with a different set of tradeoffs.
>
> 2. If we adopt the dirty marker approach, it may not account for cases
> where there were no user writes, but inconsistencies still arose between
> the base and the view—such as those caused by SSTable bit rot, streaming
> anomalies, or other low-level issues.
>
>
> That's true. You could force a rebuild of the marker table if you knew
> there was a problem, and it might even be a good idea to have a process
> that slowly rebuilds the table in the background. The important part is
> that there's a low upfront cost to starting a repair, and that individual
> range pairs can be repaired quickly (by repair standards).
>
> Another advantage to having a more lightweight repair mechanism is that we
> can tolerate riskier client patterns. For instance, if we went with a paxos
> based approach, I think LOCAL_SERIAL writes would become acceptable,
> whereas I think we'd only be able to allow SERIAL with the heavier repair.
>
> On Tue, May 20, 2025, at 8:37 PM, Jaydeep Chovatia wrote:
>
> 1. Isn't this hybrid approach conceptually

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-21 Thread Benedict
Depending how long the grid structure takes to build, there is perhaps anyway value in being able to update the snapshot after construction, so that when the repair is performed it is as up to date as possible. But, I don’t think this is trivial? I have some ideas how this might be done but they aren’t ideal, and could be costly or error prone. Have you sketched out a mechanism for this?I like the idea that the same mechanism could be used to build a one-off snapshot as maintain a live on, leaving the operator to decide what they prefer. Since this seems like an extension to the original proposal, I would suggest the original proposal is advanced and live updates to the snapshot is developed in follow up work.On 21 May 2025, at 17:45, Blake Eggleston  wrote:1. Isn't this hybrid approach conceptually similar to the grid structure described in the proposal? The main distinction is that the original proposal involves recomputing the entire grid during each repair cycle. In contrast, the approach outlined below optimizes this by only reconstructing individual cells marked as dirty due to recent updates.Yes, it is. FWIW I think the grid structure is the right approach conceptually, but it has some drawbacks as proposed that I think we should try to improve. The simple index approach takes the same high level approach with a different set of tradeoffs2. If we adopt the dirty marker approach, it may not account for cases where there were no user writes, but inconsistencies still arose between the base and the view—such as those caused by SSTable bit rot, streaming anomalies, or other low-level issues.That's true. You could force a rebuild of the marker table if you knew there was a problem, and it might even be a good idea to have a process that slowly rebuilds the table in the background. 
The important part is that there's a low upfront cost to starting a repair, and that individual range pairs can be repaired quickly (by repair standards).Another advantage to having a more lightweight repair mechanism is that we can tolerate riskier client patterns. For instance, if we went with a paxos based approach, I think LOCAL_SERIAL writes would become acceptable, whereas I think we'd only be able to allow SERIAL with the heavier repair.On Tue, May 20, 2025, at 8:37 PM, Jaydeep Chovatia wrote:1. Isn't this hybrid approach conceptually similar to the grid structure described in the proposal? The main distinction is that the original proposal involves recomputing the entire grid during each repair cycle. In contrast, the approach outlined below optimizes this by only reconstructing individual cells marked as dirty due to recent updates.2. If we adopt the dirty marker approach, it may not account for cases where there were no user writes, but inconsistencies still arose between the base and the view—such as those caused by SSTable bit rot, streaming anomalies, or other low-level issues.JaydeepOn Tue, May 20, 2025 at 5:31 PM Blake Eggleston  wrote:I had an idea that’s a kind of a hybrid between the index approach and the merkle tree approach. Basically we keep something kind of like an index that only contains a hash of the data between a base partition and view partition intersection. So it would structure data like this:view_range -> base_token -> view_token -> contents_hashBoth the base and view would maintain identical structures, and instead of trying to keep the hash always up to date with the data on disk, like an index, we would just mark base/view token combos as dirty when we get a write to a given base/view token combo. When we do a repair on a base/view range intersection, we recompute the content hashes for any dirty entries. 
Possibly with a background job that makes sure we don’t accumulate too many
dirty entries if repair isn’t running often or something.

So that’s 3 longs for each base/view intersection, comparable via
sequential reads, and would allow us to quickly detect any inconsistencies
between the base and view.

On Tue, May 20, 2025, at 2:33 PM, Jaydeep Chovatia wrote:

> * Consistency question: In the case where a base table gets a corrupt
> SSTable and is scrubbed, when it repairs against the view, without
> tracking the deletes against the secondary table, do we end up pushing
> the lack of data into the MV?
> I think we'd still need to combine the output from the other replicas so
> that doesn't happen.

SSTable corruption is one potential issue, but there are additional
scenarios to consider—such as a node missing data. For this reason, the
example in the proposal includes all replicas when performing materialized
view (MV) repair, as a precaution to ensure better Base<->MV consistency.
Here's a snippet...

Jaydeep

On Tue, May 20, 2025 at 1:55 PM Blake Eggleston  wrote:

* Consistency question: In the case where a base table gets a corrupt
SSTable and is scrubbed, when it repairs against the view, without
tracking the deletes against the secondary table, do we end up push
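[Editor's note] The dirty-marker structure Blake describes (view_range -> base_token -> view_token -> contents_hash, with dirty combos recomputed at repair time) can be sketched as a toy model. All names below are hypothetical illustration, not Cassandra internals:

```java
import java.util.*;

// Toy model of the hybrid "dirty marker" index: for every base-token /
// view-token intersection we keep only a content hash plus a dirty flag,
// instead of a full secondary index.
class MarkerTable {
    // (baseToken, viewToken) -> content hash; dirty combos are recomputed lazily
    private final Map<Long, Map<Long, Long>> hashes = new HashMap<>();
    private final Set<List<Long>> dirty = new HashSet<>();

    // On a write to a base row that maps to a view token, just mark the combo dirty.
    void onWrite(long baseToken, long viewToken) {
        dirty.add(List.of(baseToken, viewToken));
    }

    // At repair time, recompute hashes only for dirty entries by re-reading
    // the actual cell data (simulated here by a rehash callback).
    void refresh(java.util.function.BiFunction<Long, Long, Long> rehash) {
        for (List<Long> key : dirty) {
            hashes.computeIfAbsent(key.get(0), k -> new HashMap<>())
                  .put(key.get(1), rehash.apply(key.get(0), key.get(1)));
        }
        dirty.clear();
    }

    // Compare against a replica's table; returns the (baseToken, viewToken)
    // combos whose hashes disagree, i.e. the range pairs to repair.
    List<List<Long>> diff(MarkerTable other) {
        List<List<Long>> mismatches = new ArrayList<>();
        for (Map.Entry<Long, Map<Long, Long>> base : hashes.entrySet())
            for (Map.Entry<Long, Long> view : base.getValue().entrySet()) {
                Long theirs = other.hashes
                        .getOrDefault(base.getKey(), Map.of())
                        .get(view.getKey());
                if (!view.getValue().equals(theirs))
                    mismatches.add(List.of(base.getKey(), view.getKey()));
            }
        return mismatches;
    }
}

public class MarkerTableDemo {
    public static void main(String[] args) {
        MarkerTable base = new MarkerTable();
        MarkerTable view = new MarkerTable();
        base.onWrite(10, 99); view.onWrite(10, 99);
        base.refresh((b, v) -> 12345L);      // both sides see the same contents
        view.refresh((b, v) -> 12345L);
        System.out.println(base.diff(view)); // [] -> in sync

        base.onWrite(10, 99);
        base.refresh((b, v) -> 777L);        // base changed, view missed the write
        System.out.println(base.diff(view)); // [[10, 99]] -> repair this range pair
    }
}
```

The point of the sketch is the cost model from the thread: writes only set a flag, and a repair pass touches hashes rather than data, so comparisons stay cheap and sequential.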

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-20 Thread Jon Haddad
More questions and thoughts...

* Consistency question: In the case where a base table gets a corrupt
SSTable and is scrubbed, when it repairs against the view, without tracking
the deletes against the secondary table, do we end up pushing the lack of
data into the MV?

* I threw out the idea earlier in the thread that we could track tombstones
in something attached to the SSTable (tombstone index?).  I'm curious what
you all think about this.  Without it, we don't have a way of knowing
what's a delete and what's missing.  With it, we simply record all the
deletes that happened in the MV as a consequence of base table updates, and
we store it as an SSTable component.

* Is there a case where with Accord, 2 transactions can get the same
timestamp?  If the MVs are managed with Accord (or Paxos), can we just rely
on a calculation from the most recent cell's timestamps to determine if
we've got missed writes?

* If the above is true, then could the index simply be the view's
orientation of the data + the timestamp of the last cell write?  That
should compress exceptionally well especially if we are using tries. Since
the cells would be written and merged in the MV's partition order, we could
still create a Merkle tree and it could maintain all of its current
insertion properties, we'd just have a different scheme for generating the
hashes - just using cell's timestamps.

* With the two SSTable components, a tombstone index & a MV index, (if all
my assumptions above are correct) I think we should have all the data we
need to detect inconsistencies.

* Seems like we should be repairing against the base table before we repair
the MV, or at least do in conjunction when we repair a range segment.

If we can do the 2 components it should significantly cut down on repair time
and give us better consistency.  The downside is that it doesn't give us a
good path to fix existing tables.
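[Editor's note] The two SSTable components proposed above can be sketched as a toy model. Everything below is hypothetical illustration under Jon's assumptions (Accord/Paxos-managed writes with authoritative, non-colliding timestamps), not a Cassandra API:

```java
import java.util.*;

// Toy sketch of the two proposed per-SSTable components:
//  - a "tombstone index" recording view-row deletes implied by base updates,
//    so repair can tell "deliberately deleted" apart from "missing";
//  - an "MV index" of view-ordered keys with the timestamp of the last cell
//    write, hashable per range without reading data values.
public class MvSstableComponents {
    // Tombstone index: view keys deleted as a consequence of base updates.
    final Set<String> viewDeletes = new TreeSet<>();
    // MV index: view key -> timestamp of the most recent cell write.
    final SortedMap<String, Long> lastWrite = new TreeMap<>();

    void onBaseUpdate(String oldViewKey, String newViewKey, long tsMicros) {
        if (oldViewKey != null) viewDeletes.add(oldViewKey); // update moved the row
        lastWrite.put(newViewKey, tsMicros);
    }

    // Range hash over the MV index: if timestamps are authoritative, replicas
    // agree on a range iff they saw the same (key, latest timestamp) pairs.
    long rangeHash() {
        long h = 17;
        for (Map.Entry<String, Long> e : lastWrite.entrySet())
            h = 31 * (31 * h + e.getKey().hashCode()) + Long.hashCode(e.getValue());
        return h;
    }

    public static void main(String[] args) {
        MvSstableComponents a = new MvSstableComponents();
        MvSstableComponents b = new MvSstableComponents();
        a.onBaseUpdate(null, "v1", 100);
        b.onBaseUpdate(null, "v1", 100);
        System.out.println(a.rangeHash() == b.rangeHash()); // true: in sync

        a.onBaseUpdate("v1", "v2", 200); // replica b missed this update
        System.out.println(a.rangeHash() == b.rangeHash()); // false: mismatch detected
        System.out.println(a.viewDeletes.contains("v1"));   // true: delete, not data loss
    }
}
```

Since entries are kept in view key order, the hash can be computed in a single sequential pass, which is what would let this double as Merkle tree leaf input.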

Thoughts?

Jon


On Mon, May 19, 2025 at 4:20 PM Blake Eggleston 
wrote:

> Right, we can’t literally use merkle trees. What I mean is that it’s worth
> looking into alternate index schemes that could work for our use case
> though. There are a lot of encoding schemes out there designed to detect
> errors. Even something probabilistic that caused us to over-repair by some
> amount might be ok.
>
> On Mon, May 19, 2025, at 2:15 PM, Runtian Liu wrote:
>
> > You don’t need to duplicate the full data set in the index, you just
> need enough info to detect that something is missing.
> Could you please explain how this would work?
> If we build Merkle trees or compute hashes at the SSTable level, how would
> this case be handled?
> For example, consider the following table schema:
> CREATE TABLE t (
>   pk int PRIMARY KEY,
>   v1 int,
>   v2 int,
>   v3 int
> );
>
> Suppose the data is stored as follows:
> *Node 1*
>
>    - SSTable1: (1, 1, null, 1)
>    - SSTable2: (1, null, 1, 1)
>
> *Node 2*
>
>    - SSTable3: (1, 1, 1, 1)
>
> How can we ensure that a hash or Merkle tree computed at the SSTable level
> would produce the same result on both nodes for this row?
>
> On Mon, May 19, 2025 at 1:54 PM Jon Haddad 
> wrote:
>
> We could also track the deletes that need to be made to the view, in
> another SSTable component on the base.  That way you can actually do repair
> with tombstones.
>
>
>
>
> On Mon, May 19, 2025 at 11:37 AM Blake Eggleston 
> wrote:
>
>
> If we went the storage attached route then I think you’d just need more
> memory for the memtable, compaction would just be combining 2 sorted sets,
> though there would probably be some additional work related to deletes,
> overwrites, and tombstone purging.
>
> Regarding the size of the index, I think Jon was on the right track with
> his sstable attached merkle tree idea. You don’t need to duplicate the full
> data set in the index, you just need enough info to detect that something
> is missing. If you can detect that view partition x is missing data from
> base partition y, then you could start comparing the actual partition data
> and figure out who’s missing what.
>
> On Sun, May 18, 2025, at 9:20 PM, Runtian Liu wrote:
>
> > If you had a custom SAI index or something, this isn’t something you’d
> need to worry about
> This is what I missed.
>
> I think this could be a potential solution, but comparing indexes alone
> isn’t sufficient—it only handles cases where the MV has extra or missing
> rows. It doesn’t catch data mismatches for rows that exist in both the base
> table and MV. To address that, we may need to extend SAI for MV to store
> the entire selected dataset in the index file, applying the same approach
> to MV as we do for the base table. This would increase storage to roughly
> 4x per MV, compared to the current 2x, but it would help avoid random disk
> access during repair. I’m not sure if this would introduce any memory
> issues during compaction.
>
>
>
> On Sun, May 18, 2025 at 8:09 PM Blake Eggleston 
> wrote:
>
>
> It *might* be more efficient, 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-19 Thread Runtian Liu
> You don’t need to duplicate the full data set in the index, you just need
enough info to detect that something is missing.
Could you please explain how this would work?
If we build Merkle trees or compute hashes at the SSTable level, how would
this case be handled?
For example, consider the following table schema:
CREATE TABLE t (
  pk int PRIMARY KEY,
  v1 int,
  v2 int,
  v3 int
);

Suppose the data is stored as follows:
*Node 1*

   - SSTable1: (1, 1, null, 1)
   - SSTable2: (1, null, 1, 1)

*Node 2*

   - SSTable3: (1, 1, 1, 1)

How can we ensure that a hash or Merkle tree computed at the SSTable level
would produce the same result on both nodes for this row?
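[Editor's note] Runtian's example can be made concrete with an illustrative sketch (plain Java standing in for Cassandra's read path, not actual Cassandra code): hashing each SSTable independently gives different results on the two nodes, while reconciling the row across a node's SSTables before hashing does not.

```java
import java.util.*;

// Toy illustration: Node 1 holds row pk=1 split across two SSTables,
// Node 2 holds the fully-written row in one SSTable. Per-SSTable hashes
// disagree across nodes, but hashing the reconciled row agrees.
public class MergeThenHash {
    // Merge partial versions of the same row: non-null cells win (a simple
    // stand-in for per-cell timestamp reconciliation).
    static List<Integer> reconcile(List<List<Integer>> sstableRows) {
        List<Integer> merged = new ArrayList<>(sstableRows.get(0));
        for (List<Integer> row : sstableRows)
            for (int i = 0; i < row.size(); i++)
                if (row.get(i) != null) merged.set(i, row.get(i));
        return merged;
    }

    public static void main(String[] args) {
        List<List<Integer>> node1 = List.of(
                Arrays.asList(1, 1, null, 1),   // SSTable1
                Arrays.asList(1, null, 1, 1));  // SSTable2
        List<List<Integer>> node2 = List.of(
                Arrays.asList(1, 1, 1, 1));     // SSTable3

        // Per-SSTable hashes disagree across the nodes...
        System.out.println(node1.get(0).hashCode() == node2.get(0).hashCode()); // false
        // ...but the reconciled rows (and hence their hashes) agree.
        System.out.println(reconcile(node1).equals(reconcile(node2)));          // true
    }
}
```

This is exactly the objection raised later in this thread: the hash input has to be the merged row, so Merkle trees cannot simply be attached per SSTable when compaction states differ.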

On Mon, May 19, 2025 at 1:54 PM Jon Haddad  wrote:

> We could also track the deletes that need to be made to the view, in
> another SSTable component on the base.  That way you can actually do repair
> with tombstones.
>
>
>
>
> On Mon, May 19, 2025 at 11:37 AM Blake Eggleston 
> wrote:
>
>> If we went the storage attached route then I think you’d just need more
>> memory for the memtable, compaction would just be combining 2 sorted sets,
>> though there would probably be some additional work related to deletes,
>> overwrites, and tombstone purging.
>>
>> Regarding the size of the index, I think Jon was on the right track with
>> his sstable attached merkle tree idea. You don’t need to duplicate the full
>> data set in the index, you just need enough info to detect that something
>> is missing. If you can detect that view partition x is missing data from
>> base partition y, then you could start comparing the actual partition data
>> and figure out who’s missing what.
>>
>> On Sun, May 18, 2025, at 9:20 PM, Runtian Liu wrote:
>>
>> > If you had a custom SAI index or something, this isn’t something you’d
>> need to worry about
>> This is what I missed.
>>
>> I think this could be a potential solution, but comparing indexes alone
>> isn’t sufficient—it only handles cases where the MV has extra or missing
>> rows. It doesn’t catch data mismatches for rows that exist in both the base
>> table and MV. To address that, we may need to extend SAI for MV to store
>> the entire selected dataset in the index file, applying the same approach
>> to MV as we do for the base table. This would increase storage to roughly
>> 4x per MV, compared to the current 2x, but it would help avoid random disk
>> access during repair. I’m not sure if this would introduce any memory
>> issues during compaction.
>>
>>
>>
>> On Sun, May 18, 2025 at 8:09 PM Blake Eggleston 
>> wrote:
>>
>>
>> It *might* be more efficient, but it’s also more brittle. I think it
>> would be more fault tolerant and less trouble overall to repair
>> intersecting token ranges. So you’re not repairing a view partition, you’re
>> repairing the parts of a view partition that intersect with a base table
>> token range.
>>
>> The issues I see with the global snapshot are:
>>
>> 1. Requiring a global snapshot means that you can’t start a new repair
>> cycle if there’s a node down.
>> 2. These merkle trees can’t all be calculated at once, so we’ll need a
>> coordination mechanism to spread out scans of the snapshots
>> 3. By requiring a global snapshot and then building merkle trees from
>> that snapshot, you’re introducing a delay of however long it takes you to
>> do a full scan of both tables. So if you’re repairing your cluster every 3
>> days, it means the last range to get repaired is repairing based on a state
>> that’s now 3 days old. This makes your repair horizon 2x your scheduling
>> cadence and puts an upper bound on how up to date you can keep your view.
>>
>> With an index based approach, much of the work is just built into the
>> write and compaction paths and repair is just a scan of the intersecting
>> index segments from the base and view tables. You’re also repairing from
>> the state that existed when you started your repair, so your repair horizon
>> matches your scheduling cadence.
>>
>> On Sun, May 18, 2025, at 7:45 PM, Jaydeep Chovatia wrote:
>>
>> >Isn’t the reality here is that repairing a single partition in the base
>> table is potentially a full cluster-wide scan of the MV if you also want to
>> detect rows in the MV that don’t exist in the base table (eg resurrection
>> or a missed delete)
>> Exactly. Since materialized views (MVs) are partitioned differently from
>> their base tables, there doesn’t appear to be a more efficient way to
>> repair them in a targeted manner—meaning we can’t restrict the repair to
>> only a small portion of the data.
>>
>> Jaydeep
>>
>> On Sun, May 18, 2025 at 5:57 PM Jeff Jirsa  wrote:
>>
>>
>> Isn’t the reality here is that repairing a single partition in the base
>> table is potentially a full cluster-wide scan of the MV if you also want to
>> detect rows in the MV that don’t exist in the base table (eg resurrection
>> or a missed delete)
>>
>> There’s no getting around that. Keeping an extra index doesn’t avoid that
>> scan, it jus

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-19 Thread Jon Haddad
We could also track the deletes that need to be made to the view, in
another SSTable component on the base.  That way you can actually do repair
with tombstones.




On Mon, May 19, 2025 at 11:37 AM Blake Eggleston 
wrote:

> If we went the storage attached route then I think you’d just need more
> memory for the memtable, compaction would just be combining 2 sorted sets,
> though there would probably be some additional work related to deletes,
> overwrites, and tombstone purging.
>
> Regarding the size of the index, I think Jon was on the right track with
> his sstable attached merkle tree idea. You don’t need to duplicate the full
> data set in the index, you just need enough info to detect that something
> is missing. If you can detect that view partition x is missing data from
> base partition y, then you could start comparing the actual partition data
> and figure out who’s missing what.
>
> On Sun, May 18, 2025, at 9:20 PM, Runtian Liu wrote:
>
> > If you had a custom SAI index or something, this isn’t something you’d
> need to worry about
> This is what I missed.
>
> I think this could be a potential solution, but comparing indexes alone
> isn’t sufficient—it only handles cases where the MV has extra or missing
> rows. It doesn’t catch data mismatches for rows that exist in both the base
> table and MV. To address that, we may need to extend SAI for MV to store
> the entire selected dataset in the index file, applying the same approach
> to MV as we do for the base table. This would increase storage to roughly
> 4x per MV, compared to the current 2x, but it would help avoid random disk
> access during repair. I’m not sure if this would introduce any memory
> issues during compaction.
>
>
>
> On Sun, May 18, 2025 at 8:09 PM Blake Eggleston 
> wrote:
>
>
> It *might* be more efficient, but it’s also more brittle. I think it
> would be more fault tolerant and less trouble overall to repair
> intersecting token ranges. So you’re not repairing a view partition, you’re
> repairing the parts of a view partition that intersect with a base table
> token range.
>
> The issues I see with the global snapshot are:
>
> 1. Requiring a global snapshot means that you can’t start a new repair
> cycle if there’s a node down.
> 2. These merkle trees can’t all be calculated at once, so we’ll need a
> coordination mechanism to spread out scans of the snapshots
> 3. By requiring a global snapshot and then building merkle trees from that
> snapshot, you’re introducing a delay of however long it takes you to do a
> full scan of both tables. So if you’re repairing your cluster every 3 days,
> it means the last range to get repaired is repairing based on a state
> that’s now 3 days old. This makes your repair horizon 2x your scheduling
> cadence and puts an upper bound on how up to date you can keep your view.
>
> With an index based approach, much of the work is just built into the
> write and compaction paths and repair is just a scan of the intersecting
> index segments from the base and view tables. You’re also repairing from
> the state that existed when you started your repair, so your repair horizon
> matches your scheduling cadence.
>
> On Sun, May 18, 2025, at 7:45 PM, Jaydeep Chovatia wrote:
>
> >Isn’t the reality here is that repairing a single partition in the base
> table is potentially a full cluster-wide scan of the MV if you also want to
> detect rows in the MV that don’t exist in the base table (eg resurrection
> or a missed delete)
> Exactly. Since materialized views (MVs) are partitioned differently from
> their base tables, there doesn’t appear to be a more efficient way to
> repair them in a targeted manner—meaning we can’t restrict the repair to
> only a small portion of the data.
>
> Jaydeep
>
> On Sun, May 18, 2025 at 5:57 PM Jeff Jirsa  wrote:
>
>
> Isn’t the reality here is that repairing a single partition in the base
> table is potentially a full cluster-wide scan of the MV if you also want to
> detect rows in the MV that don’t exist in the base table (eg resurrection
> or a missed delete)
>
> There’s no getting around that. Keeping an extra index doesn’t avoid that
> scan, it just moves the problem around to another tier.
>
>
>
> On May 18, 2025, at 4:59 PM, Blake Eggleston  wrote:
>
> 
> Whether it’s index based repair or another mechanism, I think the proposed
> repair design needs to be refined. The requirement of a global snapshot and
> merkle tree build before we can start detecting and fixing problems is a
> pretty big limitation.
>
> > Data scans during repair would become random disk accesses instead of
> sequential ones, which can degrade performance.
>
> You’d only be reading and comparing the index files, not the sstable
> contents. Reads would still be sequential.
>
> > Most importantly, I decided against this approach due to the complexity
> of ensuring index consistency. Introducing secondary indexes opens up new
> challenges, such as keeping them in sync with the actual data.
>
> I

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-18 Thread Runtian Liu
> If you had a custom SAI index or something, this isn’t something you’d
need to worry about
This is what I missed.

I think this could be a potential solution, but comparing indexes alone
isn’t sufficient—it only handles cases where the MV has extra or missing
rows. It doesn’t catch data mismatches for rows that exist in both the base
table and MV. To address that, we may need to extend SAI for MV to store
the entire selected dataset in the index file, applying the same approach
to MV as we do for the base table. This would increase storage to roughly
4x per MV, compared to the current 2x, but it would help avoid random disk
access during repair. I’m not sure if this would introduce any memory
issues during compaction.


On Sun, May 18, 2025 at 8:09 PM Blake Eggleston 
wrote:

> It *might* be more efficient, but it’s also more brittle. I think it
> would be more fault tolerant and less trouble overall to repair
> intersecting token ranges. So you’re not repairing a view partition, you’re
> repairing the parts of a view partition that intersect with a base table
> token range.
>
> The issues I see with the global snapshot are:
>
> 1. Requiring a global snapshot means that you can’t start a new repair
> cycle if there’s a node down.
> 2. These merkle trees can’t all be calculated at once, so we’ll need a
> coordination mechanism to spread out scans of the snapshots
> 3. By requiring a global snapshot and then building merkle trees from that
> snapshot, you’re introducing a delay of however long it takes you to do a
> full scan of both tables. So if you’re repairing your cluster every 3 days,
> it means the last range to get repaired is repairing based on a state
> that’s now 3 days old. This makes your repair horizon 2x your scheduling
> cadence and puts an upper bound on how up to date you can keep your view.
>
> With an index based approach, much of the work is just built into the
> write and compaction paths and repair is just a scan of the intersecting
> index segments from the base and view tables. You’re also repairing from
> the state that existed when you started your repair, so your repair horizon
> matches your scheduling cadence.
>
> On Sun, May 18, 2025, at 7:45 PM, Jaydeep Chovatia wrote:
>
> >Isn’t the reality here is that repairing a single partition in the base
> table is potentially a full cluster-wide scan of the MV if you also want to
> detect rows in the MV that don’t exist in the base table (eg resurrection
> or a missed delete)
> Exactly. Since materialized views (MVs) are partitioned differently from
> their base tables, there doesn’t appear to be a more efficient way to
> repair them in a targeted manner—meaning we can’t restrict the repair to
> only a small portion of the data.
>
> Jaydeep
>
> On Sun, May 18, 2025 at 5:57 PM Jeff Jirsa  wrote:
>
>
> Isn’t the reality here is that repairing a single partition in the base
> table is potentially a full cluster-wide scan of the MV if you also want to
> detect rows in the MV that don’t exist in the base table (eg resurrection
> or a missed delete)
>
> There’s no getting around that. Keeping an extra index doesn’t avoid that
> scan, it just moves the problem around to another tier.
>
>
>
> On May 18, 2025, at 4:59 PM, Blake Eggleston  wrote:
>
> 
> Whether it’s index based repair or another mechanism, I think the proposed
> repair design needs to be refined. The requirement of a global snapshot and
> merkle tree build before we can start detecting and fixing problems is a
> pretty big limitation.
>
> > Data scans during repair would become random disk accesses instead of
> sequential ones, which can degrade performance.
>
> You’d only be reading and comparing the index files, not the sstable
> contents. Reads would still be sequential.
>
> > Most importantly, I decided against this approach due to the complexity
> of ensuring index consistency. Introducing secondary indexes opens up new
> challenges, such as keeping them in sync with the actual data.
>
> I think this is mostly a solved problem in C*? If you had a custom SAI
> index or something, this isn’t something you’d need to worry about AFAIK.
>
> On Sat, May 17, 2025, at 4:57 PM, Runtian Liu wrote:
>
> > I think you could exploit this to improve your MV repair design. Instead
> of taking global snapshots and persisting merkle trees, you could implement
> a set of secondary indexes on the base and view tables that you could
> quickly compare the contents of for repair.
>
> We actually considered this approach while designing the MV repair.
> However, there are several downsides:
>
>    1. It requires additional storage for the index files.
>    2. Data scans during repair would become random disk accesses instead of
>    sequential ones, which can degrade performance.
>    3. Most importantly, I decided against this approach due to the
>    complexity of ensuring index consistency. Introducing secondary indexes
>    opens up new challenges, such as keeping them 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-18 Thread Jaydeep Chovatia
>Isn’t the reality here is that repairing a single partition in the base
table is potentially a full cluster-wide scan of the MV if you also want to
detect rows in the MV that don’t exist in the base table (eg resurrection
or a missed delete)
Exactly. Since materialized views (MVs) are partitioned differently from
their base tables, there doesn’t appear to be a more efficient way to
repair them in a targeted manner—meaning we can’t restrict the repair to
only a small portion of the data.

Jaydeep

On Sun, May 18, 2025 at 5:57 PM Jeff Jirsa  wrote:

> Isn’t the reality here is that repairing a single partition in the base
> table is potentially a full cluster-wide scan of the MV if you also want to
> detect rows in the MV that don’t exist in the base table (eg resurrection
> or a missed delete)
>
> There’s no getting around that. Keeping an extra index doesn’t avoid that
> scan, it just moves the problem around to another tier.
>
>
>
> On May 18, 2025, at 4:59 PM, Blake Eggleston  wrote:
>
> 
> Whether it’s index based repair or another mechanism, I think the proposed
> repair design needs to be refined. The requirement of a global snapshot and
> merkle tree build before we can start detecting and fixing problems is a
> pretty big limitation.
>
> > Data scans during repair would become random disk accesses instead of
> sequential ones, which can degrade performance.
>
> You’d only be reading and comparing the index files, not the sstable
> contents. Reads would still be sequential.
>
> > Most importantly, I decided against this approach due to the complexity
> of ensuring index consistency. Introducing secondary indexes opens up new
> challenges, such as keeping them in sync with the actual data.
>
> I think this is mostly a solved problem in C*? If you had a custom SAI
> index or something, this isn’t something you’d need to worry about AFAIK.
>
> On Sat, May 17, 2025, at 4:57 PM, Runtian Liu wrote:
>
> > I think you could exploit this to improve your MV repair design. Instead
> of taking global snapshots and persisting merkle trees, you could implement
> a set of secondary indexes on the base and view tables that you could
> quickly compare the contents of for repair.
>
> We actually considered this approach while designing the MV repair.
> However, there are several downsides:
>
>    1. It requires additional storage for the index files.
>    2. Data scans during repair would become random disk accesses instead of
>    sequential ones, which can degrade performance.
>    3. Most importantly, I decided against this approach due to the
>    complexity of ensuring index consistency. Introducing secondary indexes
>    opens up new challenges, such as keeping them in sync with the actual data.
>
> The goal of the design is to provide a catch-all mismatch detection
> mechanism that targets the dataset users query during the online path. I
> did consider adding indexes at the SSTable level to guarantee consistency
> between indexes and data.
> > sorted by base table partition order, but segmented by view partition
> ranges
> If the indexes are at the SSTable level, they will be less flexible: we
> would need to rewrite the SSTables if we decide to change the view
> partition ranges.
> I didn’t explore this direction further due to the issues listed above.
>
> > The transformative repair could be done against the local index, and the
> local index can repair against the global index. It opens up a lot of
> possibilities, query wise, as well.
> This is something I’m not entirely sure about—how exactly do we use the
> local index to support the global index (i.e., the MV)? If the MV relies on
> local indexes during the query path, we can definitely dig deeper into how
> repair could work with that design.
>
> The proposed design in this CEP aims to treat the base table and its MV
> like any other regular tables, so that operations such as compaction and
> repair can be handled in the same way in most cases.
>
> On Sat, May 17, 2025 at 2:42 PM Jon Haddad 
> wrote:
>
> Yeah, this is exactly what I suggested in a different part of the thread.
> The transformative repair could be done against the local index, and the
> local index can repair against the global index. It opens up a lot of
> possibilities, query wise, as well.
>
>
> On Sat, May 17, 2025 at 1:47 PM Blake Eggleston 
> wrote:
>
>
> > They are not two unordered sets, but rather two sets ordered by
> different keys.
>
> I think you could exploit this to improve your MV repair design. Instead
> of taking global snapshots and persisting merkle trees, you could implement
> a set of secondary indexes on the base and view tables that you could
> quickly compare the contents of for repair.
>
> The indexes would have their contents sorted by base table partition
> order, but segmented by view partition ranges. Then any view <-> base
> repair would compare the intersecting index slices. That would allow you to
> repair data more quickly and with le

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-18 Thread Jeff Jirsa
Isn’t the reality here that repairing a single partition in the base table
is potentially a full cluster-wide scan of the MV, if you also want to
detect rows in the MV that don’t exist in the base table (e.g. resurrection
or a missed delete)?

There’s no getting around that. Keeping an extra index doesn’t avoid that
scan, it just moves the problem around to another tier.

On May 18, 2025, at 4:59 PM, Blake Eggleston  wrote:

Whether it’s index based repair or another mechanism, I think the proposed
repair design needs to be refined. The requirement of a global snapshot and
merkle tree build before we can start detecting and fixing problems is a
pretty big limitation.

> Data scans during repair would become random disk accesses instead of
> sequential ones, which can degrade performance.

You’d only be reading and comparing the index files, not the sstable
contents. Reads would still be sequential.

> Most importantly, I decided against this approach due to the complexity
> of ensuring index consistency. Introducing secondary indexes opens up new
> challenges, such as keeping them in sync with the actual data.

I think this is mostly a solved problem in C*? If you had a custom SAI
index or something, this isn’t something you’d need to worry about AFAIK.

On Sat, May 17, 2025, at 4:57 PM, Runtian Liu wrote:

> I think you could exploit this to improve your MV repair design. Instead
> of taking global snapshots and persisting merkle trees, you could
> implement a set of secondary indexes on the base and view tables that you
> could quickly compare the contents of for repair.

We actually considered this approach while designing the MV repair.
However, there are several downsides:

   1. It requires additional storage for the index files.
   2. Data scans during repair would become random disk accesses instead
   of sequential ones, which can degrade performance.
   3. Most importantly, I decided against this approach due to the
   complexity of ensuring index consistency. Introducing secondary indexes
   opens up new challenges, such as keeping them in sync with the actual
   data.

The goal of the design is to provide a catch-all mismatch detection
mechanism that targets the dataset users query during the online path. I
did consider adding indexes at the SSTable level to guarantee consistency
between indexes and data.

> sorted by base table partition order, but segmented by view partition
> ranges

If the indexes are at the SSTable level, they will be less flexible: we
would need to rewrite the SSTables if we decide to change the view
partition ranges. I didn’t explore this direction further due to the
issues listed above.

> The transformative repair could be done against the local index, and the
> local index can repair against the global index. It opens up a lot of
> possibilities, query wise, as well.

This is something I’m not entirely sure about—how exactly do we use the
local index to support the global index (i.e., the MV)? If the MV relies on
local indexes during the query path, we can definitely dig deeper into how
repair could work with that design.

The proposed design in this CEP aims to treat the base table and its MV
like any other regular tables, so that operations such as compaction and
repair can be handled in the same way in most cases.

On Sat, May 17, 2025 at 2:42 PM Jon Haddad  wrote:

Yeah, this is exactly what I suggested in a different part of the thread.
The transformative repair could be done against the local index, and the
local index can repair against the global index. It opens up a lot of
possibilities, query wise, as well.

On Sat, May 17, 2025 at 1:47 PM Blake Eggleston  wrote:

> They are not two unordered sets, but rather two sets ordered by different
> keys.

I think you could exploit this to improve your MV repair design. Instead of
taking global snapshots and persisting merkle trees, you could implement a
set of secondary indexes on the base and view tables that you could quickly
compare the contents of for repair.

The indexes would have their contents sorted by base table partition order,
but segmented by view partition ranges. Then any view <-> base repair would
compare the intersecting index slices. That would allow you to repair data
more quickly and with less operational complexity.

On Fri, May 16, 2025, at 12:32 PM, Runtian Liu wrote:

For example, in the chart above, each cell represents a Merkle tree that
covers data belonging to a specific base table range and a specific MV
range. When we scan a base table range, we can generate the Merkle trees
marked in red. When we scan an MV range, we can generate the Merkle trees
marked in green. The cells that can be compared are marked in blue.

To save time and CPU resources, we persist the Merkle trees created during
a scan so we don’t need to regenerate them later. This way, when other
nodes scan and build Merkle trees based on the same “frozen” snapshot, we
can reuse the existing Merkle trees for comparison.

On Fri, May 16, 2025 at 12:22 PM Runtian Liu 
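[Editor's note] The repair grid Runtian describes (one Merkle tree per base-range × view-range cell; a base scan fills the red cells, a view scan the green cells, and only cells covered by both are comparable) can be modeled as a toy, with a single hash standing in for a full Merkle tree and all names purely illustrative:

```java
import java.util.*;

// Toy model of the (base range x view range) repair grid: each cell holds
// a persisted hash produced by a base-range scan and one produced by a
// view-range scan; a cell is comparable once both sides have scanned it.
public class RepairGrid {
    final long[][] baseHashes;              // filled by base-range scans ("red" cells)
    final long[][] viewHashes;              // filled by view-range scans ("green" cells)
    final boolean[][] baseDone, viewDone;

    RepairGrid(int baseRanges, int viewRanges) {
        baseHashes = new long[baseRanges][viewRanges];
        viewHashes = new long[baseRanges][viewRanges];
        baseDone = new boolean[baseRanges][viewRanges];
        viewDone = new boolean[baseRanges][viewRanges];
    }

    // One sequential scan of base range b yields a hash per view range.
    void scanBaseRange(int b, long[] hashesPerViewRange) {
        baseHashes[b] = hashesPerViewRange;
        Arrays.fill(baseDone[b], true);
    }

    // One sequential scan of view range v yields a hash per base range.
    void scanViewRange(int v, long[] hashesPerBaseRange) {
        for (int b = 0; b < viewHashes.length; b++) {
            viewHashes[b][v] = hashesPerBaseRange[b];
            viewDone[b][v] = true;
        }
    }

    // Comparable ("blue") cells: both scans done; report those that disagree.
    List<int[]> mismatches() {
        List<int[]> out = new ArrayList<>();
        for (int b = 0; b < baseHashes.length; b++)
            for (int v = 0; v < baseHashes[b].length; v++)
                if (baseDone[b][v] && viewDone[b][v]
                        && baseHashes[b][v] != viewHashes[b][v])
                    out.add(new int[]{b, v});
        return out;
    }

    public static void main(String[] args) {
        RepairGrid grid = new RepairGrid(2, 2);
        grid.scanBaseRange(0, new long[]{11, 22}); // persisted once, reusable later
        grid.scanBaseRange(1, new long[]{77, 88});
        grid.scanViewRange(0, new long[]{11, 99}); // disagrees for (base 1, view 0)
        for (int[] cell : grid.mismatches())
            System.out.println("repair base range " + cell[0] + " x view range " + cell[1]);
        // prints: repair base range 1 x view range 0
    }
}
```

Persisting each cell's result after its scan is what lets later scans on other nodes compare against it without regenerating the earlier trees, at the cost of the snapshot-staleness concern raised elsewhere in the thread.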

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-17 Thread Runtian Liu
For each row, when calculating its hash, we first need to merge all the
SSTables that contain that row. We cannot attach a Merkle tree directly to
each SSTable, because merged Merkle trees would produce different hash
values for the same data if the compaction states differ.

On Sat, May 17, 2025 at 5:48 PM Jon Haddad  wrote:

> Could we do that for regular repair as well? That would make a
> validation possible with barely any IO?
>
> Sstable attached merkle trees?
>
>
>
>
> On Sat, May 17, 2025 at 5:36 PM Jon Haddad 
> wrote:
>
>> What if you built the merkle tree for each sstable as a storage attached
>> index?
>>
>> Then your repair is merging merkle trees.
>>
>>
>> On Sat, May 17, 2025 at 4:57 PM Runtian Liu  wrote:
>>
>>> > I think you could exploit this to improve your MV repair design.
>>> Instead of taking global snapshots and persisting merkle trees, you could
>>> implement a set of secondary indexes on the base and view tables that you
>>> could quickly compare the contents of for repair.
>>>
>>> We actually considered this approach while designing the MV repair.
>>> However, there are several downsides:
>>>
>>>1.
>>>
>>>It requires additional storage for the index files.
>>>2.
>>>
>>>Data scans during repair would become random disk accesses instead
>>>of sequential ones, which can degrade performance.
>>>3.
>>>
>>>Most importantly, I decided against this approach due to the
>>>complexity of ensuring index consistency. Introducing secondary indexes
>>>opens up new challenges, such as keeping them in sync with the actual 
>>> data.
>>>
>>> The goal of the design is to provide a catch-all mismatch detection
>>> mechanism that targets the dataset users query during the online path. I
>>> did consider adding indexes at the SSTable level to guarantee consistency
>>> between indexes and data.
>>> > sorted by base table partition order, but segmented by view partition
>>> ranges
>>> If the indexes are at the SSTable level, they will be less flexible; we
>>> would need to rewrite the SSTables if we decide to change the view
>>> partition ranges.
>>> I didn’t explore this direction further due to the issues listed above.
>>>
>>> > The transformative repair could be done against the local index, and
>>> the local index can repair against the global index. It opens up a lot of
>>> possibilities, query wise, as well.
>>> This is something I’m not entirely sure about—how exactly do we use the
>>> local index to support the global index (i.e., the MV)? If the MV relies on
>>> local indexes during the query path, we can definitely dig deeper into how
>>> repair could work with that design.
>>>
>>> The proposed design in this CEP aims to treat the base table and its MV
>>> like any other regular tables, so that operations such as compaction and
>>> repair can be handled in the same way in most cases.
>>>
>>> On Sat, May 17, 2025 at 2:42 PM Jon Haddad 
>>> wrote:
>>>
 Yeah, this is exactly what i suggested in a different part of the
 thread. The transformative repair could be done against the local index,
 and the local index can repair against the global index. It opens up a lot
 of possibilities, query wise, as well.



 On Sat, May 17, 2025 at 1:47 PM Blake Eggleston 
 wrote:

> > They are not two unordered sets, but rather two sets ordered by
> different keys.
>
> I think you could exploit this to improve your MV repair design.
> Instead of taking global snapshots and persisting merkle trees, you could
> implement a set of secondary indexes on the base and view tables that you
> could quickly compare the contents of for repair.
>
> The indexes would have their contents sorted by base table partition
> order, but segmented by view partition ranges. Then any view <-> base
> repair would compare the intersecting index slices. That would allow you 
> to
> repair data more quickly and with less operational complexity.
>
> On Fri, May 16, 2025, at 12:32 PM, Runtian Liu wrote:
>
> [chart from the original message omitted in the archive]
>
> For example, in the chart above, each cell represents a Merkle tree
> that covers data belonging to a specific base table range and a specific 
> MV
> range. When we scan a base table range, we can generate the Merkle trees
> marked in red. When we scan an MV range, we can generate 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-17 Thread Jon Haddad
Could we do that for regular repair as well? That would make a
validation possible with barely any IO?

Sstable attached merkle trees?




On Sat, May 17, 2025 at 5:36 PM Jon Haddad  wrote:

> What if you built the merkle tree for each sstable as a storage attached
> index?
>
> Then your repair is merging merkle trees.
>
>
> On Sat, May 17, 2025 at 4:57 PM Runtian Liu  wrote:
>
>> > I think you could exploit this to improve your MV repair design.
>> Instead of taking global snapshots and persisting merkle trees, you could
>> implement a set of secondary indexes on the base and view tables that you
>> could quickly compare the contents of for repair.
>>
>> We actually considered this approach while designing the MV repair.
>> However, there are several downsides:
>>
>>1.
>>
>>It requires additional storage for the index files.
>>2.
>>
>>Data scans during repair would become random disk accesses instead of
>>sequential ones, which can degrade performance.
>>3.
>>
>>Most importantly, I decided against this approach due to the
>>complexity of ensuring index consistency. Introducing secondary indexes
>>opens up new challenges, such as keeping them in sync with the actual 
>> data.
>>
>> The goal of the design is to provide a catch-all mismatch detection
>> mechanism that targets the dataset users query during the online path. I
>> did consider adding indexes at the SSTable level to guarantee consistency
>> between indexes and data.
>> > sorted by base table partition order, but segmented by view partition
>> ranges
>> If the indexes are at the SSTable level, they will be less flexible; we
>> would need to rewrite the SSTables if we decide to change the view
>> partition ranges.
>> I didn’t explore this direction further due to the issues listed above.
>>
>> > The transformative repair could be done against the local index, and
>> the local index can repair against the global index. It opens up a lot of
>> possibilities, query wise, as well.
>> This is something I’m not entirely sure about—how exactly do we use the
>> local index to support the global index (i.e., the MV)? If the MV relies on
>> local indexes during the query path, we can definitely dig deeper into how
>> repair could work with that design.
>>
>> The proposed design in this CEP aims to treat the base table and its MV
>> like any other regular tables, so that operations such as compaction and
>> repair can be handled in the same way in most cases.
>>
>> On Sat, May 17, 2025 at 2:42 PM Jon Haddad 
>> wrote:
>>
>>> Yeah, this is exactly what i suggested in a different part of the
>>> thread. The transformative repair could be done against the local index,
>>> and the local index can repair against the global index. It opens up a lot
>>> of possibilities, query wise, as well.
>>>
>>>
>>>
>>> On Sat, May 17, 2025 at 1:47 PM Blake Eggleston 
>>> wrote:
>>>
 > They are not two unordered sets, but rather two sets ordered by
 different keys.

 I think you could exploit this to improve your MV repair design.
 Instead of taking global snapshots and persisting merkle trees, you could
 implement a set of secondary indexes on the base and view tables that you
 could quickly compare the contents of for repair.

 The indexes would have their contents sorted by base table partition
 order, but segmented by view partition ranges. Then any view <-> base
 repair would compare the intersecting index slices. That would allow you to
 repair data more quickly and with less operational complexity.

 On Fri, May 16, 2025, at 12:32 PM, Runtian Liu wrote:

 [chart from the original message omitted in the archive]

 For example, in the chart above, each cell represents a Merkle tree
 that covers data belonging to a specific base table range and a specific MV
 range. When we scan a base table range, we can generate the Merkle trees
 marked in red. When we scan an MV range, we can generate the Merkle trees
 marked in green. The cells that can be compared are marked in blue.

 To save time and CPU resources, we persist the Merkle trees created
 during a scan so we don’t need to regenerate them later. This way, when
 other nodes scan and build Merkle trees based on the same “frozen”
 snapshot, we can reuse the existing Merkle trees for comparison.
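
 [An editor's sketch of the bookkeeping described above, with illustrative names not taken from the CEP: a base-range scan fills one row of the per-(base range, MV range) grid, an MV-range scan fills one column, and a cell becomes comparable once both sides have persisted a tree.]

```python
# Grid of persisted Merkle trees keyed by (base_range, mv_range). Names,
# types, and the string "trees" are all illustrative placeholders.
class TreeGrid:
    def __init__(self):
        self.from_base = {}  # (base_range, mv_range) -> tree from a base scan
        self.from_view = {}  # (base_range, mv_range) -> tree from an MV scan

    def record_base_scan(self, base_range, trees_by_mv_range):
        # One pass over a base range yields a tree per MV range it touches.
        for mv_range, tree in trees_by_mv_range.items():
            self.from_base[(base_range, mv_range)] = tree

    def record_view_scan(self, mv_range, trees_by_base_range):
        # One pass over an MV range yields a tree per base range it touches.
        for base_range, tree in trees_by_base_range.items():
            self.from_view[(base_range, mv_range)] = tree

    def comparable_cells(self):
        # Cells where both scans have persisted a tree (the "blue" cells).
        return sorted(self.from_base.keys() & self.from_view.keys())

grid = TreeGrid()
grid.record_base_scan("base-r1", {"mv-r1": "tree-a", "mv-r2": "tree-b"})
grid.record_view_scan("mv-r1", {"base-r1": "tree-c", "base-r2": "tree-d"})
print(grid.comparable_cells())  # [('base-r1', 'mv-r1')]
```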

 On Fri, May 16, 2025 at 12:22 PM Runtian Liu 
 wrote:

 Unfortunately, no. When building Merkle trees for small token ranges in
 the

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-17 Thread Jon Haddad
What if you built the merkle tree for each sstable as a storage attached
index?

Then your repair is merging merkle trees.


On Sat, May 17, 2025 at 4:57 PM Runtian Liu  wrote:

> > I think you could exploit this to improve your MV repair design. Instead
> of taking global snapshots and persisting merkle trees, you could implement
> a set of secondary indexes on the base and view tables that you could
> quickly compare the contents of for repair.
>
> We actually considered this approach while designing the MV repair.
> However, there are several downsides:
>
>1.
>
>It requires additional storage for the index files.
>2.
>
>Data scans during repair would become random disk accesses instead of
>sequential ones, which can degrade performance.
>3.
>
>Most importantly, I decided against this approach due to the
>complexity of ensuring index consistency. Introducing secondary indexes
>opens up new challenges, such as keeping them in sync with the actual data.
>
> The goal of the design is to provide a catch-all mismatch detection
> mechanism that targets the dataset users query during the online path. I
> did consider adding indexes at the SSTable level to guarantee consistency
> between indexes and data.
> > sorted by base table partition order, but segmented by view partition
> ranges
> If the indexes are at the SSTable level, they will be less flexible; we
> would need to rewrite the SSTables if we decide to change the view
> partition ranges.
> I didn’t explore this direction further due to the issues listed above.
>
> > The transformative repair could be done against the local index, and the
> local index can repair against the global index. It opens up a lot of
> possibilities, query wise, as well.
> This is something I’m not entirely sure about—how exactly do we use the
> local index to support the global index (i.e., the MV)? If the MV relies on
> local indexes during the query path, we can definitely dig deeper into how
> repair could work with that design.
>
> The proposed design in this CEP aims to treat the base table and its MV
> like any other regular tables, so that operations such as compaction and
> repair can be handled in the same way in most cases.
>
> On Sat, May 17, 2025 at 2:42 PM Jon Haddad 
> wrote:
>
>> Yeah, this is exactly what i suggested in a different part of the thread.
>> The transformative repair could be done against the local index, and the
>> local index can repair against the global index. It opens up a lot of
>> possibilities, query wise, as well.
>>
>>
>>
>> On Sat, May 17, 2025 at 1:47 PM Blake Eggleston 
>> wrote:
>>
>>> > They are not two unordered sets, but rather two sets ordered by
>>> different keys.
>>>
>>> I think you could exploit this to improve your MV repair design. Instead
>>> of taking global snapshots and persisting merkle trees, you could implement
>>> a set of secondary indexes on the base and view tables that you could
>>> quickly compare the contents of for repair.
>>>
>>> The indexes would have their contents sorted by base table partition
>>> order, but segmented by view partition ranges. Then any view <-> base
>>> repair would compare the intersecting index slices. That would allow you to
>>> repair data more quickly and with less operational complexity.
>>>
>>> On Fri, May 16, 2025, at 12:32 PM, Runtian Liu wrote:
>>>
>>> [chart from the original message omitted in the archive]
>>>
>>> For example, in the chart above, each cell represents a Merkle tree that
>>> covers data belonging to a specific base table range and a specific MV
>>> range. When we scan a base table range, we can generate the Merkle trees
>>> marked in red. When we scan an MV range, we can generate the Merkle trees
>>> marked in green. The cells that can be compared are marked in blue.
>>>
>>> To save time and CPU resources, we persist the Merkle trees created
>>> during a scan so we don’t need to regenerate them later. This way, when
>>> other nodes scan and build Merkle trees based on the same “frozen”
>>> snapshot, we can reuse the existing Merkle trees for comparison.
>>>
>>> On Fri, May 16, 2025 at 12:22 PM Runtian Liu  wrote:
>>>
>>> Unfortunately, no. When building Merkle trees for small token ranges in
>>> the base table, those ranges may span the entire MV token range. As a
>>> result, we need to scan the entire MV to generate all the necessary Merkle
>>> trees. For efficiency, we perform this as a single pass over the entire
>>> table rather than scanning a small range of the base or MV table
>>> individually. As you mentioned, with storage becoming increasingly
>>> affordable, this approach helps us save t

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-17 Thread Runtian Liu
> I think you could exploit this to improve your MV repair design. Instead
of taking global snapshots and persisting merkle trees, you could implement
a set of secondary indexes on the base and view tables that you could
quickly compare the contents of for repair.

We actually considered this approach while designing the MV repair.
However, there are several downsides:

   1. It requires additional storage for the index files.
   2. Data scans during repair would become random disk accesses instead of
   sequential ones, which can degrade performance.
   3. Most importantly, I decided against this approach due to the complexity
   of ensuring index consistency. Introducing secondary indexes opens up new
   challenges, such as keeping them in sync with the actual data.

The goal of the design is to provide a catch-all mismatch detection
mechanism that targets the dataset users query during the online path. I
did consider adding indexes at the SSTable level to guarantee consistency
between indexes and data.
> sorted by base table partition order, but segmented by view partition
ranges
If the indexes are at the SSTable level, they will be less flexible; we
would need to rewrite the SSTables if we decide to change the view
partition ranges.
I didn’t explore this direction further due to the issues listed above.

> The transformative repair could be done against the local index, and the
local index can repair against the global index. It opens up a lot of
possibilities, query wise, as well.
This is something I’m not entirely sure about—how exactly do we use the
local index to support the global index (i.e., the MV)? If the MV relies on
local indexes during the query path, we can definitely dig deeper into how
repair could work with that design.

The proposed design in this CEP aims to treat the base table and its MV
like any other regular tables, so that operations such as compaction and
repair can be handled in the same way in most cases.

On Sat, May 17, 2025 at 2:42 PM Jon Haddad  wrote:

> Yeah, this is exactly what i suggested in a different part of the thread.
> The transformative repair could be done against the local index, and the
> local index can repair against the global index. It opens up a lot of
> possibilities, query wise, as well.
>
>
>
> On Sat, May 17, 2025 at 1:47 PM Blake Eggleston 
> wrote:
>
>> > They are not two unordered sets, but rather two sets ordered by
>> different keys.
>>
>> I think you could exploit this to improve your MV repair design. Instead
>> of taking global snapshots and persisting merkle trees, you could implement
>> a set of secondary indexes on the base and view tables that you could
>> quickly compare the contents of for repair.
>>
>> The indexes would have their contents sorted by base table partition
>> order, but segmented by view partition ranges. Then any view <-> base
>> repair would compare the intersecting index slices. That would allow you to
>> repair data more quickly and with less operational complexity.
>>
>> On Fri, May 16, 2025, at 12:32 PM, Runtian Liu wrote:
>>
>> [chart from the original message omitted in the archive]
>>
>> For example, in the chart above, each cell represents a Merkle tree that
>> covers data belonging to a specific base table range and a specific MV
>> range. When we scan a base table range, we can generate the Merkle trees
>> marked in red. When we scan an MV range, we can generate the Merkle trees
>> marked in green. The cells that can be compared are marked in blue.
>>
>> To save time and CPU resources, we persist the Merkle trees created
>> during a scan so we don’t need to regenerate them later. This way, when
>> other nodes scan and build Merkle trees based on the same “frozen”
>> snapshot, we can reuse the existing Merkle trees for comparison.
>>
>> On Fri, May 16, 2025 at 12:22 PM Runtian Liu  wrote:
>>
>> Unfortunately, no. When building Merkle trees for small token ranges in
>> the base table, those ranges may span the entire MV token range. As a
>> result, we need to scan the entire MV to generate all the necessary Merkle
>> trees. For efficiency, we perform this as a single pass over the entire
>> table rather than scanning a small range of the base or MV table
>> individually. As you mentioned, with storage becoming increasingly
>> affordable, this approach helps us save time and CPU resources.
>>
>> On Fri, May 16, 2025 at 12:11 PM Jon Haddad 
>> wrote:
>>
>> I spoke too soon - endless questions are not over :)
>>
>> Since the data that's going to be repaired only covers a range, I wonder
>> if it makes sense to have the ability to issue a minimalist snapshot that
>> only hardlinks SSTables that are in a token range.  Based on what you
>> (Runtian) have said above, only a 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-17 Thread Jon Haddad
Yeah, this is exactly what i suggested in a different part of the thread.
The transformative repair could be done against the local index, and the
local index can repair against the global index. It opens up a lot of
possibilities, query wise, as well.



On Sat, May 17, 2025 at 1:47 PM Blake Eggleston 
wrote:

> > They are not two unordered sets, but rather two sets ordered by
> different keys.
>
> I think you could exploit this to improve your MV repair design. Instead
> of taking global snapshots and persisting merkle trees, you could implement
> a set of secondary indexes on the base and view tables that you could
> quickly compare the contents of for repair.
>
> The indexes would have their contents sorted by base table partition
> order, but segmented by view partition ranges. Then any view <-> base
> repair would compare the intersecting index slices. That would allow you to
> repair data more quickly and with less operational complexity.
>
> On Fri, May 16, 2025, at 12:32 PM, Runtian Liu wrote:
>
> [chart from the original message omitted in the archive]
>
> For example, in the chart above, each cell represents a Merkle tree that
> covers data belonging to a specific base table range and a specific MV
> range. When we scan a base table range, we can generate the Merkle trees
> marked in red. When we scan an MV range, we can generate the Merkle trees
> marked in green. The cells that can be compared are marked in blue.
>
> To save time and CPU resources, we persist the Merkle trees created during
> a scan so we don’t need to regenerate them later. This way, when other
> nodes scan and build Merkle trees based on the same “frozen” snapshot, we
> can reuse the existing Merkle trees for comparison.
>
> On Fri, May 16, 2025 at 12:22 PM Runtian Liu  wrote:
>
> Unfortunately, no. When building Merkle trees for small token ranges in
> the base table, those ranges may span the entire MV token range. As a
> result, we need to scan the entire MV to generate all the necessary Merkle
> trees. For efficiency, we perform this as a single pass over the entire
> table rather than scanning a small range of the base or MV table
> individually. As you mentioned, with storage becoming increasingly
> affordable, this approach helps us save time and CPU resources.
>
> On Fri, May 16, 2025 at 12:11 PM Jon Haddad 
> wrote:
>
> I spoke too soon - endless questions are not over :)
>
> Since the data that's going to be repaired only covers a range, I wonder
> if it makes sense to have the ability to issue a minimalist snapshot that
> only hardlinks SSTables that are in a token range.  Based on what you
> (Runtian) have said above, only a small percentage of the data would
> actually be repaired at any given time.
>
> Just a thought to save a little filesystem churn.
>
>
> On Fri, May 16, 2025 at 10:55 AM Jon Haddad 
> wrote:
>
> Nevermind about the height thing, I guess it's the same property.
>
> I’m done for now :)
>
> Thanks for entertaining my endless questions. My biggest concerns about
> repair have been alleviated.
>
> Jon
>
> On Fri, May 16, 2025 at 10:34 AM Jon Haddad 
> wrote:
>
> Thats the critical bit i was missing, thank you Blake.
>
> I guess we’d need to have unlimited height trees then, since you’d need to
> be able to update the hashes of individual partitions, and we’d also need
> to propagate the hashes up every time as well. I’m curious what the cost
> will look like with that.
>
> At least it’s a cpu problem not an I/O one.
>
> Jon
>
>
> On Fri, May 16, 2025 at 10:04 AM Blake Eggleston 
> wrote:
>
>
> The merkle tree xor's the individual row hashes together, which is
> commutative. So you should be able to build a tree in the view token order
> while reading in base table token order and vice versa.
>
> On Fri, May 16, 2025, at 9:54 AM, Jon Haddad wrote:
>
> Thanks for the explanation, I appreciate it.  I think you might still be
> glossing over an important point - which I'll make singularly here.
> There's a number of things I'm concerned about, but this is a big one.
>
> Calculating the hash of a partition for a Merkle tree needs to be done on
> the fully materialized, sorted partition.
>
> The examples you're giving are simple, to the point where they hide the
> problem.  Here's a better example, where the MV has a clustering column. In
> the MV's partition it'll have multiple rows, but in the base table it'll be
> stored in different pages or different SSTables entirely:
>
> CREATE TABLE test.t1 (
> id int PRIMARY KEY,
> v1 int
> );
>
> CREATE MATERIALIZED VIEW test.test_mv AS
> SELECT v1, id
> FROM test.t1
> WHERE id IS NOT NULL AND v1 IS NOT NULL
> PRIMARY KEY (v1, id)
>  WITH CLUSTERING ORDER BY (id ASC);
>
>
> Let's say we have some test data:
>
> cqlsh:test> select id, v1 from t1;
>
>  id | v1
> +
>  10 | 11
>   1 | 14
>  19 | 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-16 Thread Runtian Liu
Unfortunately, no. When building Merkle trees for small token ranges in the
base table, those ranges may span the entire MV token range. As a result,
we need to scan the entire MV to generate all the necessary Merkle trees.
For efficiency, we perform this as a single pass over the entire table
rather than scanning a small range of the base or MV table individually. As
you mentioned, with storage becoming increasingly affordable, this approach
helps us save time and CPU resources.

On Fri, May 16, 2025 at 12:11 PM Jon Haddad  wrote:

> I spoke too soon - endless questions are not over :)
>
> Since the data that's going to be repaired only covers a range, I wonder
> if it makes sense to have the ability to issue a minimalist snapshot that
> only hardlinks SSTables that are in a token range.  Based on what you
> (Runtian) have said above, only a small percentage of the data would
> actually be repaired at any given time.
>
> Just a thought to save a little filesystem churn.
>
>
> On Fri, May 16, 2025 at 10:55 AM Jon Haddad 
> wrote:
>
>> Nevermind about the height thing, I guess it's the same property.
>>
>> I’m done for now :)
>>
>> Thanks for entertaining my endless questions. My biggest concerns about
>> repair have been alleviated.
>>
>> Jon
>>
>> On Fri, May 16, 2025 at 10:34 AM Jon Haddad 
>> wrote:
>>
>>> Thats the critical bit i was missing, thank you Blake.
>>>
>>> I guess we’d need to have unlimited height trees then, since you’d need
>>> to be able to update the hashes of individual partitions, and we’d also
>>> need to propagate the hashes up every time as well. I’m curious what the
>>> cost will look like with that.
>>>
>>> At least it’s a cpu problem not an I/O one.
>>>
>>> Jon
>>>
>>>
>>> On Fri, May 16, 2025 at 10:04 AM Blake Eggleston 
>>> wrote:
>>>
 The merkle tree xor's the individual row hashes together, which is
 commutative. So you should be able to build a tree in the view token order
 while reading in base table token order and vice versa.

 On Fri, May 16, 2025, at 9:54 AM, Jon Haddad wrote:

 Thanks for the explanation, I appreciate it.  I think you might still
 be glossing over an important point - which I'll make singularly here.
 There's a number of things I'm concerned about, but this is a big one.

 Calculating the hash of a partition for a Merkle tree needs to be done
 on the fully materialized, sorted partition.

 The examples you're giving are simple, to the point where they hide the
 problem.  Here's a better example, where the MV has a clustering column. In
 the MV's partition it'll have multiple rows, but in the base table it'll be
 stored in different pages or different SSTables entirely:

 CREATE TABLE test.t1 (
 id int PRIMARY KEY,
 v1 int
 );

 CREATE MATERIALIZED VIEW test.test_mv AS
 SELECT v1, id
 FROM test.t1
 WHERE id IS NOT NULL AND v1 IS NOT NULL
 PRIMARY KEY (v1, id)
  WITH CLUSTERING ORDER BY (id ASC);


 Let's say we have some test data:

 cqlsh:test> select id, v1 from t1;

  id | v1
 +
  10 | 11
   1 | 14
  19 | 10
   2 | 14
   3 | 14

 When we transform the data by iterating over the base table, we get
 this representation (note v1=14):

 cqlsh:test> select v1, id from t1;

  v1 | id
 +
  11 | 10
  14 |  1   <--
  10 | 19
  14 |  2 <--
  14 |  3  <--


 The partition key in the new table is v1.  If you simply iterate and
 transform and calculate merkle trees on the fly, you'll hit v1=14 with
 id=1, but you'll miss id=2 and id=3.  You need to get them all up front,
 and in sorted order, before you calculate the hash.  You actually need to
 transform the data to this, prior to calculating the tree:

 v1 | id
 +
  11 | 10
  14 |  1, 2, 3
  10 | 19

 Without an index you need to do one of the following over a dataset
 that's hundreds of GB:

 * for each partition, scan the entire range for all the data, then sort
 that partition in memory, then calculate the hash
 * collect the entire dataset in memory, transform and sort it
 * use a local index which has the keys already sorted

 A similar problem exists when trying to resolve the mismatches.

 Unless I'm missing some critical detail, I can't see how this will work
 without requiring nodes have hundreds of GB of RAM or we do several orders
 of magnitude more I/O than a normal repair.

 Jon
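
 [An editor's sketch of the regrouping Jon describes, using the thread's sample data; the hash scheme is illustrative. Doing this grouping in memory over the whole table is exactly the cost he objects to at hundreds of GB.]

```python
import hashlib
from collections import defaultdict

# The base rows from the example above, in base-table (id) iteration order.
base_rows = [(10, 11), (1, 14), (19, 10), (2, 14), (3, 14)]  # (id, v1)

# Group by the MV partition key v1 and sort each group by the clustering
# key id, so every MV partition is fully materialized before hashing.
partitions = defaultdict(list)
for row_id, v1 in base_rows:
    partitions[v1].append(row_id)

def mv_partition_hash(v1, ids):
    # Hash the complete, sorted MV partition; the scheme is illustrative.
    return hashlib.sha256(f"{v1}:{sorted(ids)}".encode()).hexdigest()

hashes = {v1: mv_partition_hash(v1, ids) for v1, ids in partitions.items()}
print(sorted(partitions[14]))  # [1, 2, 3] -- all rows for v1=14, not just id=1
```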



 On Thu, May 15, 2025 at 9:09 PM Runtian Liu  wrote:

 Thank you for the thoughtful questions, Jon. I really appreciate
 them—let me go through them one by one.
 * Do you intend on building all the Merkle trees in parallel?

 Since we take a snapshot to "freeze" the dataset, we don’t need to
>

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-16 Thread Jon Haddad
I spoke too soon - endless questions are not over :)

Since the data that's going to be repaired only covers a range, I wonder if
it makes sense to have the ability to issue a minimalist snapshot that only
hardlinks SSTables that are in a token range.  Based on what you (Runtian)
have said above, only a small percentage of the data would actually be
repaired at any given time.

Just a thought to save a little filesystem churn.


On Fri, May 16, 2025 at 10:55 AM Jon Haddad  wrote:

> Nevermind about the height thing, I guess it's the same property.
>
> I’m done for now :)
>
> Thanks for entertaining my endless questions. My biggest concerns about
> repair have been alleviated.
>
> Jon
>
> On Fri, May 16, 2025 at 10:34 AM Jon Haddad 
> wrote:
>
>> Thats the critical bit i was missing, thank you Blake.
>>
>> I guess we’d need to have unlimited height trees then, since you’d need
>> to be able to update the hashes of individual partitions, and we’d also
>> need to propagate the hashes up every time as well. I’m curious what the
>> cost will look like with that.
>>
>> At least it’s a cpu problem not an I/O one.
>>
>> Jon
>>
>>
>> On Fri, May 16, 2025 at 10:04 AM Blake Eggleston 
>> wrote:
>>
>>> The merkle tree xor's the individual row hashes together, which is
>>> commutative. So you should be able to build a tree in the view token order
>>> while reading in base table token order and vice versa.
>>>
>>> On Fri, May 16, 2025, at 9:54 AM, Jon Haddad wrote:
>>>
>>> Thanks for the explanation, I appreciate it.  I think you might still be
>>> glossing over an important point - which I'll make singularly here.
>>> There's a number of things I'm concerned about, but this is a big one.
>>>
>>> Calculating the hash of a partition for a Merkle tree needs to be done
>>> on the fully materialized, sorted partition.
>>>
>>> The examples you're giving are simple, to the point where they hide the
>>> problem.  Here's a better example, where the MV has a clustering column. In
>>> the MV's partition it'll have multiple rows, but in the base table it'll be
>>> stored in different pages or different SSTables entirely:
>>>
>>> CREATE TABLE test.t1 (
>>> id int PRIMARY KEY,
>>> v1 int
>>> );
>>>
>>> CREATE MATERIALIZED VIEW test.test_mv AS
>>> SELECT v1, id
>>> FROM test.t1
>>> WHERE id IS NOT NULL AND v1 IS NOT NULL
>>> PRIMARY KEY (v1, id)
>>>  WITH CLUSTERING ORDER BY (id ASC);
>>>
>>>
>>> Let's say we have some test data:
>>>
>>> cqlsh:test> select id, v1 from t1;
>>>
>>>  id | v1
>>> +
>>>  10 | 11
>>>   1 | 14
>>>  19 | 10
>>>   2 | 14
>>>   3 | 14
>>>
>>> When we transform the data by iterating over the base table, we get this
>>> representation (note v1=14):
>>>
>>> cqlsh:test> select v1, id from t1;
>>>
>>>  v1 | id
>>> +
>>>  11 | 10
>>>  14 |  1   <--
>>>  10 | 19
>>>  14 |  2 <--
>>>  14 |  3  <--
>>>
>>>
>>> The partition key in the new table is v1.  If you simply iterate and
>>> transform and calculate merkle trees on the fly, you'll hit v1=14 with
>>> id=1, but you'll miss id=2 and id=3.  You need to get them all up front,
>>> and in sorted order, before you calculate the hash.  You actually need to
>>> transform the data to this, prior to calculating the tree:
>>>
>>> v1 | id
>>> +
>>>  11 | 10
>>>  14 |  1, 2, 3
>>>  10 | 19
>>>
>>> Without an index you need to do one of the following over a dataset
>>> that's hundreds of GB:
>>>
>>> * for each partition, scan the entire range for all the data, then sort
>>> that partition in memory, then calculate the hash
>>> * collect the entire dataset in memory, transform and sort it
>>> * use a local index which has the keys already sorted
>>>
>>> A similar problem exists when trying to resolve the mismatches.
>>>
>>> Unless I'm missing some critical detail, I can't see how this will work
>>> without requiring nodes have hundreds of GB of RAM or we do several orders
>>> of magnitude more I/O than a normal repair.
>>>
>>> Jon
>>>
>>>
>>>
>>> On Thu, May 15, 2025 at 9:09 PM Runtian Liu  wrote:
>>>
>>> Thank you for the thoughtful questions, Jon. I really appreciate
>>> them—let me go through them one by one.
>>> * Do you intend on building all the Merkle trees in parallel?
>>>
>>> Since we take a snapshot to "freeze" the dataset, we don’t need to build
>>> all Merkle trees in parallel.
>>>
>>>
>>> * Will there be hundreds of files doing random IO to persist the trees
>>> to disk, in addition to the sequential IO from repair?
>>>
>>> The Merkle tree will only be persisted after the entire range scan is
>>> complete.
>>>
>>>
>>> * Is the intention of persisting the trees to disk to recover from
>>> failure, or just to limit memory usage?
>>>
>>> This is primarily to limit memory usage. As you may have noticed, MV
>>> repair needs to coordinate across the entire cluster rather than just a few
>>> nodes. This process may take a very long time, and a node may restart or
>>> perform other operations during that period.
>>>
>>>
>>

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-16 Thread Jon Haddad
That's the critical bit I was missing, thank you Blake.

I guess we’d need to have unlimited height trees then, since you’d need to
be able to update the hashes of individual partitions, and we’d also need
to propagate the hashes up every time as well. I’m curious what the cost
will look like with that.

At least it's a CPU problem, not an I/O one.
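Blake's observation can be sketched in a few lines: if each leaf simply XORs the digests of the rows that land in it, the resulting root is independent of insertion order. This is a toy sketch (hypothetical names, MD5 digests, a tiny token space), not Cassandra's actual MerkleTree/Validator code:

```python
import hashlib

def row_hash(token, value):
    """Digest for a single row (hypothetical row encoding)."""
    return hashlib.md5(f"{token}:{value}".encode()).digest()

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

class XorMerkleTree:
    """Toy Merkle tree whose leaves XOR row digests together.

    Because XOR is commutative and associative, rows can be added in
    any order (base-table token order, view token order, ...) and the
    root comes out identical."""

    def __init__(self, depth, token_range=(0, 2 ** 16)):
        self.lo, self.hi = token_range
        self.leaves = [bytes(16)] * (2 ** depth)

    def add(self, token, digest):
        # Fold the row digest into the leaf owning this token.
        width = (self.hi - self.lo) / len(self.leaves)
        idx = min(int((token - self.lo) / width), len(self.leaves) - 1)
        self.leaves[idx] = xor_bytes(self.leaves[idx], digest)

    def root(self):
        # Hash pairs of children up to a single root.
        level = self.leaves
        while len(level) > 1:
            level = [hashlib.md5(level[i] + level[i + 1]).digest()
                     for i in range(0, len(level), 2)]
        return level[0]
```

Adding the same rows in two different orders produces the same root. If the interior hashes were maintained eagerly instead of recomputed in root(), each add would touch d nodes on the leaf-to-root path, which is the per-row O(d) cost discussed elsewhere in this thread.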

Jon


On Fri, May 16, 2025 at 10:04 AM Blake Eggleston 
wrote:

> The merkle tree xor's the individual row hashes together, which is
> commutative. So you should be able to build a tree in the view token order
> while reading in base table token order and vice versa.
>
> On Fri, May 16, 2025, at 9:54 AM, Jon Haddad wrote:
>
> Thanks for the explanation, I appreciate it.  I think you might still be
> glossing over an important point - which I'll make singularly here.
> There's a number of things I'm concerned about, but this is a big one.
>
> Calculating the hash of a partition for a Merkle tree needs to be done on
> the fully materialized, sorted partition.
>
> The examples you're giving are simple, to the point where they hide the
> problem.  Here's a better example, where the MV has a clustering column. In
> the MV's partition it'll have multiple rows, but in the base table it'll be
> stored in different pages or different SSTables entirely:
>
> CREATE TABLE test.t1 (
> id int PRIMARY KEY,
> v1 int
> );
>
> CREATE MATERIALIZED VIEW test.test_mv AS
> SELECT v1, id
> FROM test.t1
> WHERE id IS NOT NULL AND v1 IS NOT NULL
> PRIMARY KEY (v1, id)
>  WITH CLUSTERING ORDER BY (id ASC);
>
>
> Let's say we have some test data:
>
> cqlsh:test> select id, v1 from t1;
>
>  id | v1
> ----+----
>  10 | 11
>   1 | 14
>  19 | 10
>   2 | 14
>   3 | 14
>
> When we transform the data by iterating over the base table, we get this
> representation (note v1=14):
>
> cqlsh:test> select v1, id from t1;
>
>  v1 | id
> ----+----
>  11 | 10
>  14 |  1   <--
>  10 | 19
>  14 |  2 <--
>  14 |  3  <--
>
>
> The partition key in the new table is v1.  If you simply iterate and
> transform and calculate merkle trees on the fly, you'll hit v1=14 with
> id=1, but you'll miss id=2 and id=3.  You need to get them all up front,
> and in sorted order, before you calculate the hash.  You actually need to
> transform the data to this, prior to calculating the tree:
>
> v1 | id
> ----+----
>  11 | 10
>  14 |  1, 2, 3
>  10 | 19
>
> Without an index you need to do one of the following over a dataset that's
> hundreds of GB:
>
> * for each partition, scan the entire range for all the data, then sort
> that partition in memory, then calculate the hash
> * collect the entire dataset in memory, transform and sort it
> * use a local index which has the keys already sorted
>
> A similar problem exists when trying to resolve the mismatches.
>
> Unless I'm missing some critical detail, I can't see how this will work
> without requiring nodes have hundreds of GB of RAM or we do several orders
> of magnitude more I/O than a normal repair.
>
> Jon
>
>
>
> On Thu, May 15, 2025 at 9:09 PM Runtian Liu  wrote:
>
> Thank you for the thoughtful questions, Jon. I really appreciate them—let
> me go through them one by one.
> * Do you intend on building all the Merkle trees in parallel?
>
> Since we take a snapshot to "freeze" the dataset, we don’t need to build
> all Merkle trees in parallel.
>
>
> * Will there be hundreds of files doing random IO to persist the trees to
> disk, in addition to the sequential IO from repair?
>
> The Merkle tree will only be persisted after the entire range scan is
> complete.
>
>
> * Is the intention of persisting the trees to disk to recover from
> failure, or just to limit memory usage?
>
> This is primarily to limit memory usage. As you may have noticed, MV
> repair needs to coordinate across the entire cluster rather than just a few
> nodes. This process may take a very long time, and a node may restart or
> perform other operations during that period.
>
>
> * Have you calculated the Merkle tree space requirements?
> This is a very good question—I'll add it to the CEP as well. Each leaf
> node stores a 32-byte hash. With a tree depth of 15 (which is on the higher
> end—smaller datasets might use fewer than 10 levels), a single Merkle tree
> would be approximately 32 × 2¹⁵ bytes, or 1 MB. If we split the tokens into
> 10 ranges per node, we’ll end up with around 100 Merkle trees per node,
> totaling roughly 100 MB.
> * When do we build the Merkle trees for the view?  Is that happening in
> parallel with the base table?  Do we have the computational complexity of 2
> full cluster repairs running simultaneously, or does it take twice as long?
>
> As mentioned earlier, this can be done in parallel with the base table or
> after building the base table’s Merkle tree, since we’re using a snapshot
> to “freeze” the data.
>
> > I'm very curious to hear if anyone has run a full cluster repair
> recently on a non-trivial dataset.  Every cluster I work with only d

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Jon Haddad
One last thing.  I'm pretty sure building the tree requires the keys be
added in token order:
https://github.com/apache/cassandra/blob/08946652434edbce38a6395e71d4068898ea13fa/src/java/org/apache/cassandra/repair/Validator.java#L173

Which definitely introduces a bit of a problem, given that the tree would
be constructed from the transformed v1, which is a value unpredictable
enough to be considered random.

The only way I can think of to address this would be to maintain a local
index on v1.  See my previous email where I mentioned this.

Base Table -> Local Index -> Global Index

Still a really hard problem.

Jon



On Thu, May 15, 2025 at 6:12 PM Jon Haddad  wrote:

> There's a lot here that's still confusing to me.  Maybe you can help me
> understand it better?  Apologies in advance for the text wall :)
>
> I'll use this schema as an example:
>
> -
> CREATE TABLE test.t1 (
> id int PRIMARY KEY,
> v1 int
> );
>
> create MATERIALIZED VIEW  test_mv as
> SELECT v1, id from test.t1 where id is not null and v1 is not null primary
> key (v1, id);
> -
>
> We've got (id, v1) in the base table and (v1, id) in the MV.
>
> During the repair, we snapshot, and construct a whole bunch of merkle
> trees.  CEP-48 says they will be persisted to disk.
>
> * Do you intend on building all the Merkle trees in parallel?
> * Will there be hundreds of files doing random IO to persist the trees to
> disk, in addition to the sequential IO from repair?
> * Is the intention of persisting the trees to disk to recover from
> failure, or just to limit memory usage?
> * Have you calculated the Merkle tree space requirements?
> * When do we build the Merkle trees for the view?  Is that happening in
> parallel with the base table?  Do we have the computational complexity of 2
> full cluster repairs running simultaneously, or does it take twice as long?
>
> I'm very curious to hear if anyone has run a full cluster repair recently
> on a non-trivial dataset.  Every cluster I work with only does subrange
> repair.  I can't even recall the last time I did a full repair on a large
> cluster.  I may never have, now that I think about it.  Every time I've
> done this in the past it's been plagued with issues, both in terms of
> performance and reliability.  Subrange repair works because it can make
> progress in 15-30 minute increments.
>
> Anyways - moving on...
>
> You suggest we read the base table and construct the Merkle trees based on
> the transformed rows. Using my schema above, we take the v1 field and use
> token(v1), to build the tree.  Assuming that a value for v1 appears many
> times throughout the dataset across many partitions, how do you intend on
> calculating its hash?  If you look at Validator.rowHash [1] and
> Validator.add, you'll see it's taking an UnfilteredRowIterator for an
> entire partition and calculates the hash based on that.  Here's the comment:
>
>  /**
>  * Called (in order) for every row present in the CF.
>  * Hashes the row, and adds it to the tree being built.
>  *
>  * @param partition Partition to add hash
>  */
> public void add(UnfilteredRowIterator partition)
>
> So it seems to me like you need to have the entire partition materialized
> in memory before adding to the tree. Doing that per value v1 without an
> index is pretty much impossible - we'd have to scan the entire dataset once
> per partition to pull out all the matching v1 values, or you'd need to
> materialize the entire dataset into a local version of the MV for that
> range. I don't know how you could do this.  Do you have a workaround for
> this planned?  Maybe someone that knows the Merkle tree code better can
> chime in.
>
> Maybe there's something else here I'm not aware of - please let me know
> what I'm missing here if I am, it would be great to see this in the doc if
> you have a solution.
>
> For the sake of discussion, let's assume we've moved past this and we have
> our trees for hundreds of ranges built from the base table & built for the
> MV, now we move onto the comparison.
>
> In the doc at this point, we delete the snapshot because we have the tree
> structures and we compare Merkle trees.  Then we stream mismatched data.
>
> So let's say we find a mismatch in a hash.  That indicates that there's
> some range of data where we have an issue.  For some token range calculated
> from the v1 field, we have a mismatch, right?  What do we do with that
> information?
>
> * Do we tell the node that owned the base table - hey, stream the data
> from base where token(v1) is in range [X,Y) to me?
> * That means we have to scan through the base again for all rows where
> token(v1) in [X,Y) range, right?  Because without an index on the hashes of
> v1, we're doing a full table scan and hashing every v1 value to find out if
> it needs to be streamed back to the MV.
> * Are we doing this concurrently on all nodes?
> * Will there be coordination between all nodes in the cluster to ensu

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Jon Haddad
There's a lot here that's still confusing to me.  Maybe you can help me
understand it better?  Apologies in advance for the text wall :)

I'll use this schema as an example:

-
CREATE TABLE test.t1 (
id int PRIMARY KEY,
v1 int
);

create MATERIALIZED VIEW  test_mv as
SELECT v1, id from test.t1 where id is not null and v1 is not null primary
key (v1, id);
-

We've got (id, v1) in the base table and (v1, id) in the MV.

During the repair, we snapshot, and construct a whole bunch of merkle
trees.  CEP-48 says they will be persisted to disk.

* Do you intend on building all the Merkle trees in parallel?
* Will there be hundreds of files doing random IO to persist the trees to
disk, in addition to the sequential IO from repair?
* Is the intention of persisting the trees to disk to recover from failure,
or just to limit memory usage?
* Have you calculated the Merkle tree space requirements?
* When do we build the Merkle trees for the view?  Is that happening in
parallel with the base table?  Do we have the computational complexity of 2
full cluster repairs running simultaneously, or does it take twice as long?

I'm very curious to hear if anyone has run a full cluster repair recently
on a non-trivial dataset.  Every cluster I work with only does subrange
repair.  I can't even recall the last time I did a full repair on a large
cluster.  I may never have, now that I think about it.  Every time I've
done this in the past it's been plagued with issues, both in terms of
performance and reliability.  Subrange repair works because it can make
progress in 15-30 minute increments.

Anyways - moving on...

You suggest we read the base table and construct the Merkle trees based on
the transformed rows. Using my schema above, we take the v1 field and use
token(v1), to build the tree.  Assuming that a value for v1 appears many
times throughout the dataset across many partitions, how do you intend on
calculating its hash?  If you look at Validator.rowHash [1] and
Validator.add, you'll see it's taking an UnfilteredRowIterator for an
entire partition and calculates the hash based on that.  Here's the comment:

 /**
 * Called (in order) for every row present in the CF.
 * Hashes the row, and adds it to the tree being built.
 *
 * @param partition Partition to add hash
 */
public void add(UnfilteredRowIterator partition)

So it seems to me like you need to have the entire partition materialized
in memory before adding to the tree. Doing that per value v1 without an
index is pretty much impossible - we'd have to scan the entire dataset once
per partition to pull out all the matching v1 values, or you'd need to
materialize the entire dataset into a local version of the MV for that
range. I don't know how you could do this.  Do you have a workaround for
this planned?  Maybe someone that knows the Merkle tree code better can
chime in.
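To make the concern concrete, here is a sketch of the "collect the entire dataset in memory, transform and sort it" option in Python (hypothetical helper, MD5 digests; an illustration, not a proposed implementation). It mirrors how Validator.add consumes one whole partition at a time, and it makes the cost visible: the partitions map holds every base row until its MV partition is complete:

```python
import hashlib
from collections import defaultdict

def mv_partition_hashes(base_rows):
    """Regroup base-table rows (id, v1) under the MV partition key v1,
    sort each group by the MV clustering key id, then hash each fully
    materialized MV partition in one pass."""
    partitions = defaultdict(list)
    for id_, v1 in base_rows:          # base table arrives in token(id) order,
        partitions[v1].append(id_)     # so rows for one v1 arrive scattered
    hashes = {}
    for v1, ids in partitions.items():
        h = hashlib.md5()
        for id_ in sorted(ids):        # MV clustering order: id ASC
            h.update(f"{v1}:{id_}".encode())
        hashes[v1] = h.hexdigest()
    return hashes
```

With the sample data above, v1=14 must gather ids 1, 2, and 3 before any hashing can happen; holding that grouping for a dataset of hundreds of GB is exactly the memory problem being raised here.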

Maybe there's something else here I'm not aware of - please let me know
what I'm missing here if I am, it would be great to see this in the doc if
you have a solution.

For the sake of discussion, let's assume we've moved past this and we have
our trees for hundreds of ranges built from the base table & built for the
MV, now we move onto the comparison.

In the doc at this point, we delete the snapshot because we have the tree
structures and we compare Merkle trees.  Then we stream mismatched data.

So let's say we find a mismatch in a hash.  That indicates that there's
some range of data where we have an issue.  For some token range calculated
from the v1 field, we have a mismatch, right?  What do we do with that
information?

* Do we tell the node that owned the base table - hey, stream the data from
base where token(v1) is in range [X,Y) to me?
* That means we have to scan through the base again for all rows where
token(v1) in [X,Y) range, right?  Because without an index on the hashes of
v1, we're doing a full table scan and hashing every v1 value to find out if
it needs to be streamed back to the MV.
* Are we doing this concurrently on all nodes?
* Will there be coordination between all nodes in the cluster to ensure you
don't have to do multiple scans?

I realized there's a lot of questions here, but unfortunately I'm having a
hard time seeing how we can workaround some of the core assumptions around
constructing Merkle trees and using them to resolve the differences in a
way that matches up with what's in the doc.  I have quite a few more things
to discuss, but I'll save them for a follow up once all these have been
sorted out.

Thanks in advance!
Jon

[1]
https://github.com/apache/cassandra/blob/08946652434edbce38a6395e71d4068898ea13fa/src/java/org/apache/cassandra/repair/Validator.java#L209



On Thu, May 15, 2025 at 10:10 AM Runtian Liu  wrote:

> The previous table compared the complexity of full repair and MV repair
> when reconciling one dataset with another. In production, we typically use
> a replication factor of 3 in one datacen

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Runtian Liu
The previous table compared the complexity of full repair and MV repair
when reconciling one dataset with another. In production, we typically use
a replication factor of 3 in one datacenter. This means full repair
involves 3n rows, while MV repair involves comparing 6n rows (base + MV).
Below is an updated comparison table reflecting this scenario.

n: number of rows to repair (Total rows in the table)

d: depth of one Merkle tree for MV repair

r: number of split ranges

p: data compacted away


This comparison focuses on the complexities of one round of full repair
with a replication factor of 3 versus repairing a single MV based on one
base table with replication factor 3.

- Extra disk used: full repair 0; MV repair O(2*p). Since we take a
  snapshot at the beginning of the repair, any disk space that would
  normally be freed by compaction will remain occupied until the Merkle
  trees are successfully built and the snapshot is cleared.
- Data scan complexity: full repair O(3*n); MV repair O(6*n). Full
  repair scans n rows from the primary and 2n from replicas. MV repair
  scans 3n rows from the base table and 3n from the MV.
- Merkle tree building time complexity: full repair O(3*n); MV repair
  O(6*n*d). In full repair, Merkle tree building is O(1) per row: each
  hash is added sequentially to the leaf nodes. In MV repair, each hash
  is inserted from the root, making it O(d) per row. Since d is
  typically small (less than 20 and often smaller than in full repair),
  this isn’t a major concern.
- Total Merkle tree count: full repair O(3*r); MV repair O(6*r^2). MV
  repair needs to generate more, smaller Merkle trees, but this isn’t a
  concern as they can be persisted to disk during the repair process.
- Merkle tree comparison complexity: full repair O(3*n); MV repair
  O(3*n). Assuming one row maps to one leaf node, both repairs are
  equivalent.
- Stream time complexity: full repair O(3*n); MV repair O(3*n).
  Assuming all rows need to be streamed, both repairs are equivalent.

In short: even for production use cases with RF=3 in one data center, MV
repair consumes temporary disk space and a small, usually negligible
amount of extra CPU for tree construction; other costs match full repair.

Additionally, with the online path proposed in this CEP, we expect
mismatches to be rare, which can lower the frequency of running this repair
process compared to full repair.
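For reference, the space figures quoted earlier in this thread (32-byte leaf hashes, tree depth 15, roughly 100 trees per node after splitting ranges) work out as follows; a quick sanity check on the arithmetic, not an implementation:

```python
# Back-of-envelope estimate for persisted Merkle trees, using the
# hypothetical figures discussed in this thread (not Cassandra defaults):
# 32-byte leaf hashes, depth 15, ~100 trees per node.
LEAF_HASH_BYTES = 32

def merkle_leaf_bytes(depth):
    """Space for the leaf level of one tree: 32 * 2^depth bytes."""
    return LEAF_HASH_BYTES * 2 ** depth

per_tree_mb = merkle_leaf_bytes(15) / 2 ** 20   # 32 * 2^15 bytes = 1 MiB
per_node_mb = 100 * per_tree_mb                 # ~100 trees -> ~100 MiB
```

So a single depth-15 tree is about 1 MB, and a node holding around 100 such trees needs on the order of 100 MB, matching the numbers upthread.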


On Thu, May 15, 2025 at 9:53 AM Jon Haddad  wrote:

> > They are not two unordered sets, but rather two sets ordered by
> different keys.
>
> I think this is a distinction without a difference. Merkle tree repair
> works because the ordering of the data is mostly the same across nodes.
>
>
> On Thu, May 15, 2025 at 9:27 AM Runtian Liu  wrote:
>
>> > what we're trying to achieve here is comparing two massive unordered
>> sets.
>>
>> They are not two unordered sets, but rather two sets ordered by different
>> keys. This means that when building Merkle trees for the base table and the
>> materialized view (MV), we need to use different strategies to ensure the
>> trees can be meaningfully compared.
>>
>> To address scalability concerns for MV repair, I’ve included a comparison
>> between one round of full repair and MV repair in the table below. This
>> comparison is also added to the CEP.
>>
>> n: number of rows to repair (Total rows in the table)
>>
>> d: depth of one Merkle tree for MV repair
>>
>> r: number of split ranges
>>
>> p: data compacted away
>>
>>
>> This comparison focuses on the complexities of one round of full repair
>> with a replication factor of 2 versus repairing a single MV based on one
>> base table replica.
>>
>> - Extra disk used: full repair 0; MV repair O(2*p). Since we take a
>>   snapshot at the beginning of the repair, any disk space that would
>>   normally be freed by compaction will remain occupied until the Merkle
>>   trees are successfully built and the snapshot is cleared.
>> - Data scan complexity: full repair O(2*n); MV repair O(2*n). Full
>>   repair scans n rows from the primary and n from replicas. MV repair
>>   scans n rows from the base table primary replica only, and n from the
>>   MV primary replica only.
>> - Merkle tree building time complexity: full repair O(n); MV repair
>>   O(n*d). In full repair, Merkle tree building is O(1) per row: each
>>   hash is added sequentially to the leaf nodes. In MV repair, each hash
>>   is inserted from the root, making it O(d) per row. Since d is
>>   typically small (less than 20 and often smaller than in full repair),
>>   this isn’t a major concern.
>> - Total Merkle tree count: full repair O(2*r); MV repair O(2*r^2). MV
>>   repair needs to generate more, smaller Merkle trees, but this isn’t a
>>   concern as they can be persisted to disk during the repair process.
>> - Merkle tree comparison complexity: full repair O(n); MV repair O(n).
>>   Assuming one row maps to one leaf node, both repairs are equivalent.
>> - Stream time complexity: full repair O(n); MV repair O(n). Assuming
>>   all rows need to be streamed, both repairs are equivalent.
>>
>> In short: MV repair consumes temporary disk sp

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Jon Haddad
> They are not two unordered sets, but rather two sets ordered by different
keys.

I think this is a distinction without a difference. Merkle tree repair
works because the ordering of the data is mostly the same across nodes.


On Thu, May 15, 2025 at 9:27 AM Runtian Liu  wrote:

> > what we're trying to achieve here is comparing two massive unordered
> sets.
>
> They are not two unordered sets, but rather two sets ordered by different
> keys. This means that when building Merkle trees for the base table and the
> materialized view (MV), we need to use different strategies to ensure the
> trees can be meaningfully compared.
>
> To address scalability concerns for MV repair, I’ve included a comparison
> between one round of full repair and MV repair in the table below. This
> comparison is also added to the CEP.
>
> n: number of rows to repair (Total rows in the table)
>
> d: depth of one Merkle tree for MV repair
>
> r: number of split ranges
>
> p: data compacted away
>
>
> This comparison focuses on the complexities of one round of full repair
> with a replication factor of 2 versus repairing a single MV based on one
> base table replica.
>
> - Extra disk used: full repair 0; MV repair O(2*p). Since we take a
>   snapshot at the beginning of the repair, any disk space that would
>   normally be freed by compaction will remain occupied until the Merkle
>   trees are successfully built and the snapshot is cleared.
> - Data scan complexity: full repair O(2*n); MV repair O(2*n). Full
>   repair scans n rows from the primary and n from replicas. MV repair
>   scans n rows from the base table primary replica only, and n from the
>   MV primary replica only.
> - Merkle tree building time complexity: full repair O(n); MV repair
>   O(n*d). In full repair, Merkle tree building is O(1) per row: each
>   hash is added sequentially to the leaf nodes. In MV repair, each hash
>   is inserted from the root, making it O(d) per row. Since d is
>   typically small (less than 20 and often smaller than in full repair),
>   this isn’t a major concern.
> - Total Merkle tree count: full repair O(2*r); MV repair O(2*r^2). MV
>   repair needs to generate more, smaller Merkle trees, but this isn’t a
>   concern as they can be persisted to disk during the repair process.
> - Merkle tree comparison complexity: full repair O(n); MV repair O(n).
>   Assuming one row maps to one leaf node, both repairs are equivalent.
> - Stream time complexity: full repair O(n); MV repair O(n). Assuming
>   all rows need to be streamed, both repairs are equivalent.
>
> In short: MV repair consumes temporary disk space and a small, usually
> negligible amount of extra CPU for tree construction; other costs match
> full repair.
>
> The core idea behind the proposed MV repair is as follows:
>
> 1. Take a snapshot to “freeze” the current state of both the base table
>    and its MV.
> 2. Gradually scan the data from both tables to build Merkle trees.
> 3. Identify the token ranges where inconsistencies exist.
> 4. Rebuild only the mismatched ranges rather than the entire MV.
>
> With transaction-backed MVs, step 4 should rarely be necessary.
>
> On Thu, May 15, 2025 at 7:54 AM Josh McKenzie 
> wrote:
>
>> I think in order to address this, the view should be propagated to the
>> base replicas *after* it's accepted by all or a majority of base replicas.
>> This is where I think mutation tracking could probably help.
>>
>> Yeah, the idea of "don't reflect in the MV until you hit the CL the user
>> requested for the base table". Introduces disjoint risk if you have
>> coordinator death mid-write where replicas got base-data but that 2nd step
>> didn't take place; think that's why Runtien et. al are looking at paxos
>> repair picking up those pieces for you after the fact to get you back into
>> consistency. Mutation tracking and Accord both have similar guarantees in
>> this space.
>>
>> I think this would ensure that as long as there's no data loss or
>> bit-rot, the base and view can be repaired independently. When there is
>> data loss or bit-rot in either the base table or the view, then it is the
>> same as 2i today: rebuild is required.
>>
>> And the repair as proposed in the CEP should resolve the bitrot and bug
>> dataloss case I think. Certainly has much higher time complexity but the
>> bounding of memory complexity to be comparable with regular repair doesn't
>> strike me as a dealbreaker.
>>
>> On Thu, May 15, 2025, at 10:24 AM, Paulo Motta wrote:
>>
>> >  I think requiring a rebuild is a deal breaker for most teams.  In most
>> instances it would be having to also expand the cluster to handle the
>> additional disk requirements.  It turns an inconsistency problem into a
>> major operational headache that can take weeks to resolve.
>>
>> Agreed. The rebuild would not be required during normal operations when
>> the cluster is properly maintained (ie. regular repair) - only in
>> catastrophic situations.   This is also the case for ordinary tables
>> currently: if there's data loss, 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Runtian Liu
> what we're trying to achieve here is comparing two massive unordered
sets.

They are not two unordered sets, but rather two sets ordered by different
keys. This means that when building Merkle trees for the base table and the
materialized view (MV), we need to use different strategies to ensure the
trees can be meaningfully compared.

To address scalability concerns for MV repair, I’ve included a comparison
between one round of full repair and MV repair in the table below. This
comparison is also added to the CEP.

n: number of rows to repair (Total rows in the table)

d: depth of one Merkle tree for MV repair

r: number of split ranges

p: data compacted away


This comparison focuses on the complexities of one round of full repair
with a replication factor of 2 versus repairing a single MV based on one
base table replica.

- Extra disk used: full repair 0; MV repair O(2*p). Since we take a
  snapshot at the beginning of the repair, any disk space that would
  normally be freed by compaction will remain occupied until the Merkle
  trees are successfully built and the snapshot is cleared.
- Data scan complexity: full repair O(2*n); MV repair O(2*n). Full
  repair scans n rows from the primary and n from replicas. MV repair
  scans n rows from the base table primary replica only, and n from the
  MV primary replica only.
- Merkle tree building time complexity: full repair O(n); MV repair
  O(n*d). In full repair, Merkle tree building is O(1) per row: each
  hash is added sequentially to the leaf nodes. In MV repair, each hash
  is inserted from the root, making it O(d) per row. Since d is
  typically small (less than 20 and often smaller than in full repair),
  this isn’t a major concern.
- Total Merkle tree count: full repair O(2*r); MV repair O(2*r^2). MV
  repair needs to generate more, smaller Merkle trees, but this isn’t a
  concern as they can be persisted to disk during the repair process.
- Merkle tree comparison complexity: full repair O(n); MV repair O(n).
  Assuming one row maps to one leaf node, both repairs are equivalent.
- Stream time complexity: full repair O(n); MV repair O(n). Assuming
  all rows need to be streamed, both repairs are equivalent.

In short: MV repair consumes temporary disk space and a small, usually
negligible amount of extra CPU for tree construction; other costs match
full repair.

The core idea behind the proposed MV repair is as follows:

   1. Take a snapshot to “freeze” the current state of both the base
      table and its MV.
   2. Gradually scan the data from both tables to build Merkle trees.
   3. Identify the token ranges where inconsistencies exist.
   4. Rebuild only the mismatched ranges rather than the entire MV.

With transaction-backed MVs, step 4 should rarely be necessary.
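Step 3, reduced to its essence, is a per-range comparison of Merkle roots. A minimal sketch (hypothetical names, with hex strings standing in for real tree root hashes):

```python
def mismatched_ranges(base_roots, view_roots):
    """Given per-split-range Merkle roots built from the base table's
    transformed rows and from the MV itself, return only the token
    ranges whose roots disagree; only these need rebuilding, instead
    of the entire view."""
    return sorted(rng for rng in base_roots
                  if base_roots[rng] != view_roots.get(rng))
```

Whatever ranges come back are the only ones step 4 has to rebuild, which is what keeps repair cost proportional to the inconsistency rather than to the view size.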

On Thu, May 15, 2025 at 7:54 AM Josh McKenzie  wrote:

> I think in order to address this, the view should be propagated to the
> base replicas *after* it's accepted by all or a majority of base replicas.
> This is where I think mutation tracking could probably help.
>
> Yeah, the idea of "don't reflect in the MV until you hit the CL the user
> requested for the base table". Introduces disjoint risk if you have
> coordinator death mid-write where replicas got base-data but that 2nd step
> didn't take place; think that's why Runtian et al. are looking at Paxos
> repair picking up those pieces for you after the fact to get you back into
> consistency. Mutation tracking and Accord both have similar guarantees in
> this space.
>
> I think this would ensure that as long as there's no data loss or bit-rot,
> the base and view can be repaired independently. When there is data loss or
> bit-rot in either the base table or the view, then it is the same as 2i
> today: rebuild is required.
>
> And the repair as proposed in the CEP should resolve the bitrot and bug
> dataloss case I think. Certainly has much higher time complexity but the
> bounding of memory complexity to be comparable with regular repair doesn't
> strike me as a dealbreaker.
>
> On Thu, May 15, 2025, at 10:24 AM, Paulo Motta wrote:
>
> >  I think requiring a rebuild is a deal breaker for most teams.  In most
> instances it would be having to also expand the cluster to handle the
> additional disk requirements.  It turns an inconsistency problem into a
> major operational headache that can take weeks to resolve.
>
> Agreed. The rebuild would not be required during normal operations when
> the cluster is properly maintained (ie. regular repair) - only in
> catastrophic situations.   This is also the case for ordinary tables
> currently: if there's data loss, then restoring from a backup is needed.
> This could be a possible alternative to not require a rebuild in this
> extraordinary scenario.
>
> On Thu, May 15, 2025 at 10:14 AM Jon Haddad 
> wrote:
>
> I think requiring a rebuild is a deal breaker for most teams.  In most
> instances it would be having to also expand the cluster to handle the
> additional disk requirements.  It turns an inconsistency problem into a
> major operational headache that can take weeks to resolve.
>
>
>
>
>
> On Thu,

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Josh McKenzie
> I think in order to address this, the view should be propagated to the base 
> replicas *after* it's accepted by all or a majority of base replicas. This is 
> where I think mutation tracking could probably help.
Yeah, the idea of "don't reflect in the MV until you hit the CL the user 
requested for the base table". Introduces disjoint risk if you have coordinator 
death mid-write where replicas got base-data but that 2nd step didn't take 
place; think that's why Runtien et al. are looking at paxos repair picking up 
those pieces for you after the fact to get you back into consistency. Mutation 
tracking and Accord both have similar guarantees in this space.

> I think this would ensure that as long as there's no data loss or bit-rot, 
> the base and view can be repaired independently. When there is data loss or 
> bit-rot in either the base table or the view, then it is the same as 2i 
> today: rebuild is required.
And the repair as proposed in the CEP should resolve the bitrot and bug 
dataloss case I think. Certainly has much higher time complexity but the 
bounding of memory complexity to be comparable with regular repair doesn't 
strike me as a dealbreaker.

On Thu, May 15, 2025, at 10:24 AM, Paulo Motta wrote:
> >  I think requiring a rebuild is a deal breaker for most teams.  In most 
> > instances it would be having to also expand the cluster to handle the 
> > additional disk requirements.  It turns an inconsistency problem into a 
> > major operational headache that can take weeks to resolve.
> 
> Agreed. The rebuild would not be required during normal operations when the 
> cluster is properly maintained (i.e. regular repair) - only in catastrophic 
> situations.   This is also the case for ordinary tables currently: if there's 
> data loss, then restoring from a backup is needed. This could be a possible 
> alternative to not require a rebuild in this extraordinary scenario.
> 
> On Thu, May 15, 2025 at 10:14 AM Jon Haddad  wrote:
>> I think requiring a rebuild is a deal breaker for most teams.  In most 
>> instances it would be having to also expand the cluster to handle the 
>> additional disk requirements.  It turns an inconsistency problem into a 
>> major operational headache that can take weeks to resolve.
>> 
>> 
>> 
>> 
>> 
>> On Thu, May 15, 2025 at 7:02 AM Paulo Motta  wrote:
>>> > There's bi-directional entropy issues with MV's - either orphaned view 
>>> > data or missing view data; that's why you kind of need a "bi-directional 
>>> > ETL" to make sure the 2 agree with each other. While normal repair would 
>>> > resolve the "missing data in MV" case, it wouldn't resolve the "data in 
>>> > MV that's not in base table anymore" case, which afaict all base 
>>> > consistency approaches (status quo, PaxosV2, Accord, Mutation Tracking) 
>>> > are vulnerable to.
>>> 
>>> I don't think that bi-directional reconciliation should be a requirement, 
>>> when the base table is assumed to be the source of truth as stated in the 
>>> CEP doc.
>>> 
>>> I think the main issue with the current MV implementation is that each view 
>>> replica is independently replicated by the base replica, before the base 
>>> write is acknowledged.
>>> 
>>> This creates a correctness issue in the write path, because a view update 
>>> can be created for a write that was not accepted by the coordinator in the 
>>> following scenario:
>>> 
>>> N=RF=3
>>> CL=ONE
>>> - Update U is propagated to view replica V, coordinator that is also base 
>>> replica B dies before accepting base table write request to client. Now U 
>>> exists in V but not in B.
>>> 
>>> I think in order to address this, the view should be propagated to the base 
>>> replicas *after* it's accepted by all or a majority of base replicas. This 
>>> is where I think mutation tracking could probably help.
>>> 
>>> I think this would ensure that as long as there's no data loss or bit-rot, 
>>> the base and view can be repaired independently. When there is data loss or 
>>> bit-rot in either the base table or the view, then it is the same as 2i 
>>> today: rebuild is required.
>>> 
>>> >  It'd be correct (if operationally disappointing) to be able to just say 
>>> > "if you have data loss in your base table you need to rebuild the 
>>> > corresponding MV's", but the problem is operators aren't always going to 
>>> > know when that data loss occurs. Not everything is as visible as a lost 
>>> > quorum of replicas or blown up SSTables.
>>> 
>>> I think there are opportunities to improve rebuild speed, assuming the base 
>>> table as a source of truth. For example, rebuild only subranges when 
>>> data-loss is detected.
>>> 
>>> On Thu, May 15, 2025 at 8:07 AM Josh McKenzie  wrote:
 There's bi-directional entropy issues with MV's - either orphaned view 
 data or missing view data; that's why you kind of need a "bi-directional 
 ETL" to make sure the 2 agree with each other. While normal repair would 
 resolve the "missing data in MV" 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Paulo Motta
>  I think requiring a rebuild is a deal breaker for most teams.  In most
instances it would be having to also expand the cluster to handle the
additional disk requirements.  It turns an inconsistency problem into a
major operational headache that can take weeks to resolve.

Agreed. The rebuild would not be required during normal operations when the
cluster is properly maintained (i.e. regular repair) - only in catastrophic
situations.   This is also the case for ordinary tables currently: if
there's data loss, then restoring from a backup is needed. This could be a
possible alternative to not require a rebuild in this extraordinary
scenario.

On Thu, May 15, 2025 at 10:14 AM Jon Haddad  wrote:

> I think requiring a rebuild is a deal breaker for most teams.  In most
> instances it would be having to also expand the cluster to handle the
> additional disk requirements.  It turns an inconsistency problem into a
> major operational headache that can take weeks to resolve.
>
>
>
>
>
> On Thu, May 15, 2025 at 7:02 AM Paulo Motta 
> wrote:
>
>> > There's bi-directional entropy issues with MV's - either orphaned view
>> data or missing view data; that's why you kind of need a "bi-directional
>> ETL" to make sure the 2 agree with each other. While normal repair would
>> resolve the "missing data in MV" case, it wouldn't resolve the "data in MV
>> that's not in base table anymore" case, which afaict all base consistency
>> approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable
>> to.
>>
>> I don't think that bi-directional reconciliation should be a requirement,
>> when the base table is assumed to be the source of truth as stated in the
>> CEP doc.
>>
>> I think the main issue with the current MV implementation is that each
>> view replica is independently replicated by the base replica, before the
>> base write is acknowledged.
>>
>> This creates a correctness issue in the write path, because a view update
>> can be created for a write that was not accepted by the coordinator in the
>> following scenario:
>>
>> N=RF=3
>> CL=ONE
>> - Update U is propagated to view replica V, coordinator that is also base
>> replica B dies before accepting base table write request to client. Now U
>> exists in V but not in B.
>>
>> I think in order to address this, the view should be propagated to the
>> base replicas *after* it's accepted by all or a majority of base replicas.
>> This is where I think mutation tracking could probably help.
>>
>> I think this would ensure that as long as there's no data loss or
>> bit-rot, the base and view can be repaired independently. When there is
>> data loss or bit-rot in either the base table or the view, then it is the
>> same as 2i today: rebuild is required.
>>
>> >  It'd be correct (if operationally disappointing) to be able to just
>> say "if you have data loss in your base table you need to rebuild the
>> corresponding MV's", but the problem is operators aren't always going to
>> know when that data loss occurs. Not everything is as visible as a lost
>> quorum of replicas or blown up SSTables.
>>
>> I think there are opportunities to improve rebuild speed, assuming the
>> base table as a source of truth. For example, rebuild only subranges when
>> data-loss is detected.
>>
>> On Thu, May 15, 2025 at 8:07 AM Josh McKenzie 
>> wrote:
>>
>>> There's bi-directional entropy issues with MV's - either orphaned view
>>> data or missing view data; that's why you kind of need a "bi-directional
>>> ETL" to make sure the 2 agree with each other. While normal repair would
>>> resolve the "missing data in MV" case, it wouldn't resolve the "data in MV
>>> that's not in base table anymore" case, which afaict all base consistency
>>> approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable
>>> to.
>>>
>>> It'd be correct (if operationally disappointing) to be able to just say
>>> "if you have data loss in your base table you need to rebuild the
>>> corresponding MV's", but the problem is operators aren't always going to
>>> know when that data loss occurs. Not everything is as visible as a lost
>>> quorum of replicas or blown up SSTables.
>>>
>>> On Wed, May 14, 2025, at 2:38 PM, Blake Eggleston wrote:
>>>
>>> Maybe, I’m not really familiar enough with how “classic” MV repair works
>>> to say. You can’t mix normal repair and mutation reconciliation in the
>>> current incarnation of mutation tracking though, so I wouldn’t assume it
>>> would work with MVs.
>>>
>>> On Wed, May 14, 2025, at 11:29 AM, Jon Haddad wrote:
>>>
>>> In the case of bitrot / losing an SSTable, wouldn't a normal repair
>>> (just the MV against the other nodes) resolve the issue?
>>>
>>> On Wed, May 14, 2025 at 11:27 AM Blake Eggleston 
>>> wrote:
>>>
>>>
>>> Mutation tracking is definitely an approach you could take for MVs.
>>> Mutation reconciliation could be extended to ensure all changes have been
>>> replicated to the views. When a base table received a mutation w/ an id it
>>> would gener

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Jon Haddad
I think requiring a rebuild is a deal breaker for most teams.  In most
instances it would be having to also expand the cluster to handle the
additional disk requirements.  It turns an inconsistency problem into a
major operational headache that can take weeks to resolve.





On Thu, May 15, 2025 at 7:02 AM Paulo Motta 
wrote:

> > There's bi-directional entropy issues with MV's - either orphaned view
> data or missing view data; that's why you kind of need a "bi-directional
> ETL" to make sure the 2 agree with each other. While normal repair would
> resolve the "missing data in MV" case, it wouldn't resolve the "data in MV
> that's not in base table anymore" case, which afaict all base consistency
> approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable
> to.
>
> I don't think that bi-directional reconciliation should be a requirement,
> when the base table is assumed to be the source of truth as stated in the
> CEP doc.
>
> I think the main issue with the current MV implementation is that each
> view replica is independently replicated by the base replica, before the
> base write is acknowledged.
>
> This creates a correctness issue in the write path, because a view update
> can be created for a write that was not accepted by the coordinator in the
> following scenario:
>
> N=RF=3
> CL=ONE
> - Update U is propagated to view replica V, coordinator that is also base
> replica B dies before accepting base table write request to client. Now U
> exists in V but not in B.
>
> I think in order to address this, the view should be propagated to the
> base replicas *after* it's accepted by all or a majority of base replicas.
> This is where I think mutation tracking could probably help.
>
> I think this would ensure that as long as there's no data loss or bit-rot,
> the base and view can be repaired independently. When there is data loss or
> bit-rot in either the base table or the view, then it is the same as 2i
> today: rebuild is required.
>
> >  It'd be correct (if operationally disappointing) to be able to just say
> "if you have data loss in your base table you need to rebuild the
> corresponding MV's", but the problem is operators aren't always going to
> know when that data loss occurs. Not everything is as visible as a lost
> quorum of replicas or blown up SSTables.
>
> I think there are opportunities to improve rebuild speed, assuming the
> base table as a source of truth. For example, rebuild only subranges when
> data-loss is detected.
>
> On Thu, May 15, 2025 at 8:07 AM Josh McKenzie 
> wrote:
>
>> There's bi-directional entropy issues with MV's - either orphaned view
>> data or missing view data; that's why you kind of need a "bi-directional
>> ETL" to make sure the 2 agree with each other. While normal repair would
>> resolve the "missing data in MV" case, it wouldn't resolve the "data in MV
>> that's not in base table anymore" case, which afaict all base consistency
>> approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable
>> to.
>>
>> It'd be correct (if operationally disappointing) to be able to just say
>> "if you have data loss in your base table you need to rebuild the
>> corresponding MV's", but the problem is operators aren't always going to
>> know when that data loss occurs. Not everything is as visible as a lost
>> quorum of replicas or blown up SSTables.
>>
>> On Wed, May 14, 2025, at 2:38 PM, Blake Eggleston wrote:
>>
>> Maybe, I’m not really familiar enough with how “classic” MV repair works
>> to say. You can’t mix normal repair and mutation reconciliation in the
>> current incarnation of mutation tracking though, so I wouldn’t assume it
>> would work with MVs.
>>
>> On Wed, May 14, 2025, at 11:29 AM, Jon Haddad wrote:
>>
>> In the case of bitrot / losing an SSTable, wouldn't a normal repair (just
>> the MV against the other nodes) resolve the issue?
>>
>> On Wed, May 14, 2025 at 11:27 AM Blake Eggleston 
>> wrote:
>>
>>
>> Mutation tracking is definitely an approach you could take for MVs.
>> Mutation reconciliation could be extended to ensure all changes have been
>> replicated to the views. When a base table received a mutation w/ an id it
>> would generate a view update. If you block marking a given mutation id as
>> reconciled until it’s been fully replicated to the base table and its view
>> updates have been fully replicated to the views, then all view updates will
>> eventually be applied as part of the log reconciliation process.
>>
>> A mutation tracking implementation would also allow you to be more
>> flexible with the types of consistency levels you can work with, allowing
>> users to do things like use LOCAL_QUORUM without leaving themselves open to
>> introducing view inconsistencies.
>>
>> That would more or less eliminate the need for any MV repair in normal
>> usage, but wouldn't address how to repair issues caused by bugs or data
>> loss, though you may be able to do something with comparing the latest
>> mutation ids for the 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Paulo Motta
> There's bi-directional entropy issues with MV's - either orphaned view
data or missing view data; that's why you kind of need a "bi-directional
ETL" to make sure the 2 agree with each other. While normal repair would
resolve the "missing data in MV" case, it wouldn't resolve the "data in MV
that's not in base table anymore" case, which afaict all base consistency
approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable
to.

I don't think that bi-directional reconciliation should be a requirement,
when the base table is assumed to be the source of truth as stated in the
CEP doc.

I think the main issue with the current MV implementation is that each view
replica is independently replicated by the base replica, before the base
write is acknowledged.

This creates a correctness issue in the write path, because a view update
can be created for a write that was not accepted by the coordinator in the
following scenario:

N=RF=3
CL=ONE
- Update U is propagated to view replica V, coordinator that is also base
replica B dies before accepting base table write request to client. Now U
exists in V but not in B.

I think in order to address this, the view should be propagated to the base
replicas *after* it's accepted by all or a majority of base replicas. This
is where I think mutation tracking could probably help.
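
The ordering difference described above can be sketched as a toy model (hypothetical Python, not Cassandra code; replica sets and quorum handling are simplified assumptions):

```python
# Sketch: why pushing the view update *before* the base write is acknowledged
# can orphan a view row, and how gating on a base quorum avoids it.

def write_view_first(base_rows, view_rows, update, coordinator_dies=True):
    """Current behavior: the base replica eagerly replicates the view update."""
    view_rows.add(update)              # view replica V applies U first
    if coordinator_dies:               # coordinator (also base replica B) dies
        return base_rows, view_rows    # before B persists or acks the write
    base_rows.add(update)
    return base_rows, view_rows

def write_view_after_quorum(base_rows, view_rows, update, base_acks, quorum=2):
    """Proposed ordering: propagate to views only after a base quorum accepted."""
    if base_acks >= quorum:
        base_rows.add(update)
        view_rows.add(update)
    return base_rows, view_rows

base, view = write_view_first(set(), set(), "U")
assert "U" in view and "U" not in base   # orphaned view row: U in V, not in B

base, view = write_view_after_quorum(set(), set(), "U", base_acks=1)
assert "U" not in view                   # no view update without a base quorum
```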

I think this would ensure that as long as there's no data loss or bit-rot,
the base and view can be repaired independently. When there is data loss or
bit-rot in either the base table or the view, then it is the same as 2i
today: rebuild is required.

>  It'd be correct (if operationally disappointing) to be able to just say
"if you have data loss in your base table you need to rebuild the
corresponding MV's", but the problem is operators aren't always going to
know when that data loss occurs. Not everything is as visible as a lost
quorum of replicas or blown up SSTables.

I think there are opportunities to improve rebuild speed, assuming the base
table as a source of truth. For example, rebuild only subranges when
data-loss is detected.
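
The subrange-rebuild idea can be sketched as follows (hypothetical helper names and a toy view derivation; assumes the base table is the source of truth, as the CEP states):

```python
# Sketch: rebuild only the token subranges where the view diverges from what
# the base table would derive, instead of rebuilding the whole view.

import hashlib

def digest(rows):
    h = hashlib.sha256()
    for key in sorted(rows):
        h.update(repr((key, rows[key])).encode())
    return h.hexdigest()

def subranges_to_rebuild(base_by_range, view_by_range, derive_view_rows):
    """Compare, per token subrange, the view rows derived from the base
    (the source of truth) against the actual view contents."""
    damaged = []
    for rng, base_rows in base_by_range.items():
        expected = derive_view_rows(base_rows)
        if digest(expected) != digest(view_by_range.get(rng, {})):
            damaged.append(rng)
    return damaged

# Toy data: the view swaps key and value; subrange "r2" lost a view row.
base = {"r1": {1: "a"}, "r2": {2: "b"}}
view = {"r1": {"a": 1}, "r2": {}}
derive = lambda rows: {v: k for k, v in rows.items()}
assert subranges_to_rebuild(base, view, derive) == ["r2"]
```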

On Thu, May 15, 2025 at 8:07 AM Josh McKenzie  wrote:

> There's bi-directional entropy issues with MV's - either orphaned view
> data or missing view data; that's why you kind of need a "bi-directional
> ETL" to make sure the 2 agree with each other. While normal repair would
> resolve the "missing data in MV" case, it wouldn't resolve the "data in MV
> that's not in base table anymore" case, which afaict all base consistency
> approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable
> to.
>
> It'd be correct (if operationally disappointing) to be able to just say
> "if you have data loss in your base table you need to rebuild the
> corresponding MV's", but the problem is operators aren't always going to
> know when that data loss occurs. Not everything is as visible as a lost
> quorum of replicas or blown up SSTables.
>
> On Wed, May 14, 2025, at 2:38 PM, Blake Eggleston wrote:
>
> Maybe, I’m not really familiar enough with how “classic” MV repair works
> to say. You can’t mix normal repair and mutation reconciliation in the
> current incarnation of mutation tracking though, so I wouldn’t assume it
> would work with MVs.
>
> On Wed, May 14, 2025, at 11:29 AM, Jon Haddad wrote:
>
> In the case of bitrot / losing an SSTable, wouldn't a normal repair (just
> the MV against the other nodes) resolve the issue?
>
> On Wed, May 14, 2025 at 11:27 AM Blake Eggleston 
> wrote:
>
>
> Mutation tracking is definitely an approach you could take for MVs.
> Mutation reconciliation could be extended to ensure all changes have been
> replicated to the views. When a base table received a mutation w/ an id it
> would generate a view update. If you block marking a given mutation id as
> reconciled until it’s been fully replicated to the base table and its view
> updates have been fully replicated to the views, then all view updates will
> eventually be applied as part of the log reconciliation process.
>
> A mutation tracking implementation would also allow you to be more
> flexible with the types of consistency levels you can work with, allowing
> users to do things like use LOCAL_QUORUM without leaving themselves open to
> introducing view inconsistencies.
>
> That would more or less eliminate the need for any MV repair in normal
> usage, but wouldn't address how to repair issues caused by bugs or data
> loss, though you may be able to do something with comparing the latest
> mutation ids for the base tables and its view ranges.
>
> On Wed, May 14, 2025, at 10:19 AM, Paulo Motta wrote:
>
> I don't see mutation tracking [1] mentioned in this thread or in the
> CEP-48 description. Not sure this would fit into the scope of this
> initial CEP, but I have a feeling that mutation tracking could be
> potentially helpful to reconcile base tables and views ?
>
> For example, when both base and view updates are acknowledged then this
> could be somehow persiste

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Josh McKenzie
There's bi-directional entropy issues with MV's - either orphaned view data or 
missing view data; that's why you kind of need a "bi-directional ETL" to make 
sure the 2 agree with each other. While normal repair would resolve the 
"missing data in MV" case, it wouldn't resolve the "data in MV that's not in 
base table anymore" case, which afaict all base consistency approaches (status 
quo, PaxosV2, Accord, Mutation Tracking) are vulnerable to.

It'd be correct (if operationally disappointing) to be able to just say "if you 
have data loss in your base table you need to rebuild the corresponding MV's", 
but the problem is operators aren't always going to know when that data loss 
occurs. Not everything is as visible as a lost quorum of replicas or blown up 
SSTables.

On Wed, May 14, 2025, at 2:38 PM, Blake Eggleston wrote:
> Maybe, I’m not really familiar enough with how “classic” MV repair works to 
> say. You can’t mix normal repair and mutation reconciliation in the current 
> incarnation of mutation tracking though, so I wouldn’t assume it would work 
> with MVs.
> 
> On Wed, May 14, 2025, at 11:29 AM, Jon Haddad wrote:
>> In the case of bitrot / losing an SSTable, wouldn't a normal repair (just 
>> the MV against the other nodes) resolve the issue?
>> 
>> On Wed, May 14, 2025 at 11:27 AM Blake Eggleston  
>> wrote:
>>> __
>>> Mutation tracking is definitely an approach you could take for MVs. 
>>> Mutation reconciliation could be extended to ensure all changes have been 
>>> replicated to the views. When a base table received a mutation w/ an id it 
>>> would generate a view update. If you block marking a given mutation id as 
>>> reconciled until it’s been fully replicated to the base table and its view 
>>> updates have been fully replicated to the views, then all view updates will 
>>> eventually be applied as part of the log reconciliation process.
>>> 
>>> A mutation tracking implementation would also allow you to be more flexible 
>>> with the types of consistency levels you can work with, allowing users to 
>>> do things like use LOCAL_QUORUM without leaving themselves open to 
>>> introducing view inconsistencies.
>>> 
>>> That would more or less eliminate the need for any MV repair in normal 
>>> usage, but wouldn't address how to repair issues caused by bugs or data 
>>> loss, though you may be able to do something with comparing the latest 
>>> mutation ids for the base tables and its view ranges.
>>> 
>>> On Wed, May 14, 2025, at 10:19 AM, Paulo Motta wrote:
 I don't see mutation tracking [1] mentioned in this thread or in the 
 CEP-48 description. Not sure this would fit into the scope of this initial 
 CEP, but I have a feeling that mutation tracking could be potentially 
 helpful to reconcile base tables and views ?
 
 For example, when both base and view updates are acknowledged then this 
 could be somehow persisted in the view sstables mutation tracking 
 summary[2] or similar metadata ? Then these updates would be skipped 
 during view repair, considerably reducing the amount of work needed, since 
 only un-acknowledged views updates would need to be reconciled.
 
 [1] - 
 https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking|
  
 
 [2] - https://issues.apache.org/jira/browse/CASSANDRA-20336
 
 On Wed, May 14, 2025 at 12:59 PM Paulo Motta  
 wrote:
> > - The first thing I notice is that we're talking about repairing the 
> > entire table across the entire cluster all in one go.  It's been a 
> > *long* time since I tried to do a full repair of an entire table 
> > without using sub-ranges.  Is anyone here even doing that with clusters 
> > of non-trivial size?  How long does a full repair of a 100 node cluster 
> > with 5TB / node take even in the best case scenario?
> 
> I haven't checked the CEP yet so I may be missing something but I 
> think this effort doesn't need to be conflated with dense node support, 
> to make this more approachable. I think prospective users would be OK 
> with overprovisioning to make this feasible if needed. We could perhaps 
> have size guardrails that limit the maximum table size per node when MVs 
> are enabled. Ideally we should make it work for dense nodes if possible, 
> but this shouldn't be a reason not to support the feature if it can be 
> made to work reasonably with more resources.
> 
> I think the main issue with the current MV is about correctness, and the 
> ultimate goal of the CEP must be to provide correctness guarantees, even 
> if it has an inevitable performance hit. I think that the performance of 
> the repair process is definitely an important consideration and it would 
> be helpful to have some benchmarks to have an idea of how long this 
> repair 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-14 Thread Blake Eggleston
Maybe, I’m not really familiar enough with how “classic” MV repair works to 
say. You can’t mix normal repair and mutation reconciliation in the current 
incarnation of mutation tracking though, so I wouldn’t assume it would work 
with MVs.

On Wed, May 14, 2025, at 11:29 AM, Jon Haddad wrote:
> In the case of bitrot / losing an SSTable, wouldn't a normal repair (just the 
> MV against the other nodes) resolve the issue?
> 
> On Wed, May 14, 2025 at 11:27 AM Blake Eggleston  wrote:
>> __
>> Mutation tracking is definitely an approach you could take for MVs. Mutation 
>> reconciliation could be extended to ensure all changes have been replicated 
>> to the views. When a base table received a mutation w/ an id it would 
>> generate a view update. If you block marking a given mutation id as 
>> reconciled until it’s been fully replicated to the base table and its view 
>> updates have been fully replicated to the views, then all view updates will 
>> eventually be applied as part of the log reconciliation process.
>> 
>> A mutation tracking implementation would also allow you to be more flexible 
>> with the types of consistency levels you can work with, allowing users to do 
>> things like use LOCAL_QUORUM without leaving themselves open to introducing 
>> view inconsistencies.
>> 
>> That would more or less eliminate the need for any MV repair in normal 
>> usage, but wouldn't address how to repair issues caused by bugs or data 
>> loss, though you may be able to do something with comparing the latest 
>> mutation ids for the base tables and its view ranges.
>> 
>> On Wed, May 14, 2025, at 10:19 AM, Paulo Motta wrote:
>>> I don't see mutation tracking [1] mentioned in this thread or in the CEP-48 
>>> description. Not sure this would fit into the scope of this initial CEP, 
>>> but I have a feeling that mutation tracking could be potentially helpful to 
>>> reconcile base tables and views ?
>>> 
>>> For example, when both base and view updates are acknowledged then this 
>>> could be somehow persisted in the view sstables mutation tracking 
>>> summary[2] or similar metadata ? Then these updates would be skipped during 
>>> view repair, considerably reducing the amount of work needed, since only 
>>> un-acknowledged views updates would need to be reconciled.
>>> 
>>> [1] - 
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking|
>>>  
>>> 
>>> [2] - https://issues.apache.org/jira/browse/CASSANDRA-20336
>>> 
>>> On Wed, May 14, 2025 at 12:59 PM Paulo Motta  
>>> wrote:
 > - The first thing I notice is that we're talking about repairing the 
 > entire table across the entire cluster all in one go.  It's been a 
 > *long* time since I tried to do a full repair of an entire table without 
 > using sub-ranges.  Is anyone here even doing that with clusters of 
 > non-trivial size?  How long does a full repair of a 100 node cluster 
 > with 5TB / node take even in the best case scenario?
 
 I haven't checked the CEP yet so I may be missing something but I 
 think this effort doesn't need to be conflated with dense node support, to 
 make this more approachable. I think prospective users would be OK with 
 overprovisioning to make this feasible if needed. We could perhaps have 
 size guardrails that limit the maximum table size per node when MVs are 
 enabled. Ideally we should make it work for dense nodes if possible, but 
 this shouldn't be a reason not to support the feature if it can be made to 
 work reasonably with more resources.
 
 I think the main issue with the current MV is about correctness, and the 
 ultimate goal of the CEP must be to provide correctness guarantees, even 
 if it has an inevitable performance hit. I think that the performance of 
 the repair process is definitely an important consideration and it would 
 be helpful to have some benchmarks to have an idea of how long this repair 
 process would take for lightweight and denser tables.
 
 On Wed, May 14, 2025 at 7:28 AM Jon Haddad  
 wrote:
> I've got several concerns around this repair process.
> 
> - The first thing I notice is that we're talking about repairing the 
> entire table across the entire cluster all in one go.  It's been a *long* 
> time since I tried to do a full repair of an entire table without using 
> sub-ranges.  Is anyone here even doing that with clusters of non-trivial 
> size?  How long does a full repair of a 100 node cluster with 5TB / node 
> take even in the best case scenario?
> 
> - Even in a scenario where sub-range repair is supported, you'd have to 
> scan *every* sstable on the base table in order to construct a merkle 
> tree, as we don't know in advance which SSTables contain the ranges that 
> the MV will.  That means a subrange r
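
The scan cost concern raised above can be illustrated with a toy model (hypothetical Python, not Cassandra internals): because a base partition in any sstable may produce view rows landing in a given view token subrange, no base sstable can be skipped when building a Merkle leaf for that subrange.

```python
# Sketch: building a Merkle leaf for one view token subrange still requires
# reading every base sstable, since base rows map to view tokens arbitrarily.

import hashlib

def merkle_leaf(rows):
    h = hashlib.sha256()
    for r in sorted(rows):
        h.update(repr(r).encode())
    return h.hexdigest()

def view_leaf_for_range(sstables, to_view_token, rng):
    lo, hi = rng
    hits, scanned = [], 0
    for sst in sstables:             # every base sstable must be scanned
        scanned += 1
        for row in sst:
            if lo <= to_view_token(row) < hi:
                hits.append(row)
    return merkle_leaf(hits), scanned

sstables = [[(1, "x"), (7, "y")], [(3, "z")]]
to_view_token = lambda row: hash(row[1]) % 100   # toy view partitioner
_, scanned = view_leaf_for_range(sstables, to_view_token, (0, 100))
assert scanned == len(sstables)                  # no sstable can be skipped
```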

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-14 Thread Jon Haddad
In the case of bitrot / losing an SSTable, wouldn't a normal repair (just
the MV against the other nodes) resolve the issue?

On Wed, May 14, 2025 at 11:27 AM Blake Eggleston 
wrote:

> Mutation tracking is definitely an approach you could take for MVs.
> Mutation reconciliation could be extended to ensure all changes have been
> replicated to the views. When a base table received a mutation w/ an id it
> would generate a view update. If you block marking a given mutation id as
> reconciled until it’s been fully replicated to the base table and its view
> updates have been fully replicated to the views, then all view updates will
> eventually be applied as part of the log reconciliation process.
>
> A mutation tracking implementation would also allow you to be more
> flexible with the types of consistency levels you can work with, allowing
> users to do things like use LOCAL_QUORUM without leaving themselves open to
> introducing view inconsistencies.
>
> That would more or less eliminate the need for any MV repair in normal
> usage, but wouldn't address how to repair issues caused by bugs or data
> loss, though you may be able to do something with comparing the latest
> mutation ids for the base tables and its view ranges.
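
The gating scheme described in the quoted message above can be sketched as a toy bookkeeping model (assumed data model, not the CEP-45 implementation): a mutation id is marked reconciled only once the base write is fully replicated *and* all derived view updates are fully replicated.

```python
# Sketch: block marking a mutation id as reconciled until both the base table
# and the view replicas have fully applied it.

class MutationLog:
    def __init__(self, base_replicas, view_replicas):
        self.base_needed = set(base_replicas)
        self.view_needed = set(view_replicas)
        self.base_acks = {}   # mutation id -> base replicas that applied it
        self.view_acks = {}   # mutation id -> view replicas that applied it

    def ack_base(self, mid, replica):
        self.base_acks.setdefault(mid, set()).add(replica)

    def ack_view(self, mid, replica):
        self.view_acks.setdefault(mid, set()).add(replica)

    def reconciled(self, mid):
        return (self.base_acks.get(mid, set()) == self.base_needed
                and self.view_acks.get(mid, set()) == self.view_needed)

log = MutationLog({"b1", "b2", "b3"}, {"v1", "v2", "v3"})
for r in ("b1", "b2", "b3"):
    log.ack_base("m1", r)
log.ack_view("m1", "v1")
assert not log.reconciled("m1")   # view updates not fully replicated yet
for r in ("v2", "v3"):
    log.ack_view("m1", r)
assert log.reconciled("m1")       # now safe to mark m1 reconciled
```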
>
> On Wed, May 14, 2025, at 10:19 AM, Paulo Motta wrote:
>
> I don't see mutation tracking [1] mentioned in this thread or in the
> CEP-48 description. Not sure this would fit into the scope of this
> initial CEP, but I have a feeling that mutation tracking could be
> potentially helpful to reconcile base tables and views ?
>
> For example, when both base and view updates are acknowledged then this
> could be somehow persisted in the view sstables mutation tracking
> summary[2] or similar metadata ? Then these updates would be skipped during
> view repair, considerably reducing the amount of work needed, since only
> un-acknowledged views updates would need to be reconciled.
>
> [1] -
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking|
> 
> [2] - https://issues.apache.org/jira/browse/CASSANDRA-20336
>
> On Wed, May 14, 2025 at 12:59 PM Paulo Motta 
> wrote:
>
> > - The first thing I notice is that we're talking about repairing the
> entire table across the entire cluster all in one go.  It's been a *long*
> time since I tried to do a full repair of an entire table without using
> sub-ranges.  Is anyone here even doing that with clusters of non-trivial
> size?  How long does a full repair of a 100 node cluster with 5TB / node
> take even in the best case scenario?
>
> I haven't checked the CEP yet so I may be missing out something but I
> think this effort doesn't need to be conflated with dense node support, to
> make this more approachable. I think prospective users would be OK with
> overprovisioning to make this feasible if needed. We could perhaps have
> size guardrails that limit the maximum table size per node when MVs are
> enabled. Ideally we should make it work for dense nodes if possible, but
> this shouldn't be a reason not to support the feature if it can be made to
> work reasonably with more resources.
>
> I think the main issue with the current MV is about correctness, and the
> ultimate goal of the CEP must be to provide correctness guarantees, even if
> it has an inevitable performance hit. I think that the performance of the
> repair process is definitely an important consideration and it would be
> helpful to have some benchmarks to have an idea of how long this repair
> process would take for lightweight and denser tables.
>
> On Wed, May 14, 2025 at 7:28 AM Jon Haddad 
> wrote:
>
> I've got several concerns around this repair process.
>
> - The first thing I notice is that we're talking about repairing the
> entire table across the entire cluster all in one go.  It's been a *long*
> time since I tried to do a full repair of an entire table without using
> sub-ranges.  Is anyone here even doing that with clusters of non trivial
> size?  How long does a full repair of a 100 node cluster with 5TB / node
> take even in the best case scenario?
>
> - Even in a scenario where sub-range repair is supported, you'd have to
> scan *every* sstable on the base table in order to construct the a merkle
> tree, as we don't know in advance which SSTables contain the ranges that
> the MV will.  That means a subrange repair would have to do a *ton* of IO.
> Anyone who's mis-configured a sub-range incremental repair to use too many
> ranges will probably be familiar with how long it can take to anti-compact
> a bunch of SSTables.  With MV sub-range repair, we'd have even more
> overhead, because we'd have to read in every SSTable, every time.  If we do
> 10 subranges, we'll do 10x the IO of a normal repair.  I don't think this
> is practical.
>
> - Merkle trees make sense when you're comparing tables with the same
> partition key, but I don't think they do when you're

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-14 Thread Blake Eggleston
Mutation tracking is definitely an approach you could take for MVs. Mutation 
reconciliation could be extended to ensure all changes have been replicated to 
the views. When a base table received a mutation w/ an id it would generate a 
view update. If you block marking a given mutation id as reconciled until it’s 
been fully replicated to the base table and its view updates have been fully 
replicated to the views, then all view updates will eventually be applied as 
part of the log reconciliation process.

A mutation tracking implementation would also allow you to be more flexible 
with the types of consistency levels you can work with, allowing users to do 
things like use LOCAL_QUORUM without leaving themselves open to introducing 
view inconsistencies.

That would more or less eliminate the need for any MV repair in normal usage, 
but wouldn't address how to repair issues caused by bugs or data loss, though 
you may be able to do something with comparing the latest mutation ids for the 
base tables and its view ranges.
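A minimal sketch of the gating Blake describes, with invented names (this is not Cassandra's actual mutation-tracking API): a mutation id is only marked reconciled once the base write and every derived view update have been acknowledged, so log reconciliation eventually applies any straggling view updates.

```python
from dataclasses import dataclass, field

@dataclass
class MutationState:
    base_replicated: bool = False
    view_acks: set = field(default_factory=set)

class ReconciliationLog:
    def __init__(self, view_replicas):
        self.view_replicas = set(view_replicas)
        self.pending = {}  # mutation id -> MutationState

    def on_base_mutation(self, mut_id):
        # A base mutation arriving with an id generates the view update(s).
        self.pending[mut_id] = MutationState()

    def on_base_replicated(self, mut_id):
        self.pending[mut_id].base_replicated = True

    def on_view_ack(self, mut_id, replica):
        self.pending[mut_id].view_acks.add(replica)

    def reconciled(self, mut_id):
        # Reconciled only when the base write AND all view updates landed.
        st = self.pending.get(mut_id)
        return (st is not None and st.base_replicated
                and st.view_acks >= self.view_replicas)

log = ReconciliationLog(view_replicas={"n1", "n2", "n3"})
log.on_base_mutation("m1")
log.on_base_replicated("m1")
log.on_view_ack("m1", "n1")
log.on_view_ack("m1", "n2")
assert not log.reconciled("m1")   # still waiting on n3
log.on_view_ack("m1", "n3")
assert log.reconciled("m1")
```

The point of the gate is that nothing upstream can declare the mutation durable while a view replica is still owed its update.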


Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-14 Thread Jon Haddad
I was thinking along the lines of mutation tracking too, but I have to
admit I haven't spent much time reading through it - it's probably time I
did.  I'll read up on it; thanks for bringing it up.

One thing to consider is that 5TB is not particularly dense anymore, in the
world of Cassandra 5+.  10TB is even reasonable.  I'd consider 20TB dense -
but really, 5TB is really not that big of a deal.  If you're already
running 2-3TB / node with 4.0, you can easily run 5TB with 5.0.

Shout out to everyone that worked on UCS & Trie Memtables!!

I've probably spent 500 hours this year alone working to understand every
aspect of this and have tested clusters with over 10TB / node.  Consumer
grade NVMe drives are available at 4TB for under $300 and at the enterprise
level 10's of SSD TB are pretty common.  We should also be forward looking
and not release something that *can't possibly scale* as hardware scales.
We've been doing such a great job of improving our cost profile over the
last few years and I'd really hate to see the project get set back by a
feature that works against the economics of density scaling.

For example, if you have to scale 2x for the additional storage of the MV,
then double or quadruple again to reduce the density and address the
additional performance overhead, that's not going to be a particularly great
end-user experience.  For an operator to have to 4-8x their cluster size &
cost just to *prepare* for MVs, that's not great.  Imagine going from 100
nodes to 800 because you want to use a single materialized view.  Want to
use 2?  Now you're looking at what, a 16-32x increase in cluster size from
the original, just to ensure consistency?

I would *love* to be wrong about all my concerns, because I think usable
MVs would be a killer feature.  I also want to be realistic and not ship
another broken iteration that not only fails to work properly but is the
root cause of outages.

On the flip side, I don't want to spend so much time clutching my pearls
that I prevent good work from getting done, so I propose we update the CEP
a bit more with the following criteria:

1. As a first step, make MV repair pluggable.  I think it would be great if
the repair solution could be developed outside of C*, and could be dropped
in as an optional experimental add on for those willing to take a huge
risk.  This is a *major* endeavor that carries huge risk, let's try to
mitigate that risk a bit.
2. Add acceptance criteria to the CEP that this has been tested at a
reasonably large scale, preferably with a base table dataset of *at least*
100TB (preferably more) & significant variance in both base & MV partition
size, prior to merge.

I'd be happy to donate some time to do the evaluation if someone with
deeper pockets than me is willing to pick up the AWS bill and buy me a few
beers.  (I'd be even happier to have someone sponsor my contribution)

If that's in there, then none of my above concerns matter, because the
proof will be in the successful implementation.  It also gives us room to
experiment with mutation tracking or other competing ideas that might come
up.  If we can't do point 2 then I can't see any way we should merge it
into the project so hopefully this is a small ask.

Thoughts?
Jon




Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-14 Thread Paulo Motta
I don't see mutation tracking [1] mentioned in this thread or in the CEP-48
description. Not sure this would fit into the scope of this initial CEP,
but I have a feeling that mutation tracking could potentially be helpful to
reconcile base tables and views?

For example, when both base and view updates are acknowledged then this
could be somehow persisted in the view sstables' mutation tracking
summary[2] or similar metadata? Then these updates would be skipped during
view repair, considerably reducing the amount of work needed, since only
un-acknowledged view updates would need to be reconciled.

[1] -
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
[2] - https://issues.apache.org/jira/browse/CASSANDRA-20336
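A toy sketch of Paulo's idea, with all names invented (this is not an existing Cassandra structure): if a per-sstable summary records which mutation ids were fully acknowledged on both base and view, repair only has to reconcile the un-acknowledged remainder.

```python
def updates_needing_reconciliation(view_updates, acked_summary):
    """view_updates: {mutation_id: update payload};
    acked_summary: ids whose base+view writes were both acknowledged
    and persisted in the (hypothetical) sstable summary metadata."""
    return {mid: upd for mid, upd in view_updates.items()
            if mid not in acked_summary}

# Only m2 was never fully acknowledged, so only m2 needs repair work.
pending = updates_needing_reconciliation(
    {"m1": "row a", "m2": "row b", "m3": "row c"},
    acked_summary={"m1", "m3"},
)
assert pending == {"m2": "row b"}
```

In the common case the acknowledged set covers almost everything, which is where the "considerably reducing the amount of work" claim comes from.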


Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-14 Thread Paulo Motta
> - The first thing I notice is that we're talking about repairing the
entire table across the entire cluster all in one go.  It's been a *long*
time since I tried to do a full repair of an entire table without using
sub-ranges.  Is anyone here even doing that with clusters of non-trivial
size?  How long does a full repair of a 100 node cluster with 5TB / node
take even in the best case scenario?

I haven't checked the CEP yet so I may be missing something, but I think
this effort doesn't need to be conflated with dense node support, to make
this more approachable. I think prospective users would be OK with
overprovisioning to make this feasible if needed. We could perhaps have
size guardrails that limit the maximum table size per node when MVs are
enabled. Ideally we should make it work for dense nodes if possible, but
this shouldn't be a reason not to support the feature if it can be made to
work reasonably with more resources.

I think the main issue with the current MV is about correctness, and the
ultimate goal of the CEP must be to provide correctness guarantees, even if
it has an inevitable performance hit. I think that the performance of the
repair process is definitely an important consideration and it would be
helpful to have some benchmarks to have an idea of how long this repair
process would take for lightweight and denser tables.


Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-14 Thread Jon Haddad
Putting this another way - what we're trying to achieve here is comparing
two massive unordered sets.  I believe the worst case is O(N^2)
complexity, and N can be in the billions.  I really think we need a
completely different approach here than how repair currently works.

For me to be a +1 on this, I'd need to see some evidence that this
implementation won't completely destroy a cluster of any reasonable size.
I realize this puts quite a bit on you, but I think this has the potential
to be one of the biggest foot guns we could possibly merge in.

I could be missing something here - happy to be proven wrong.

Tangentially, I'd love to see the option for MVs to be local indexes
instead of global - essentially an evolution of legacy 2i.  We'd still have
to solve all the repair related problems, but at least we could do it
without hammering the network.  Add the ability to slap a SAI index on the
local MV while we're at it, that's a killer feature.

If we can't figure out how to repair a local MV, then there's no way we can
do it globally.

Jon
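For concreteness, a toy sketch of the comparison being described (all names invented): the view is the base table re-keyed, so finding mismatches means transforming and comparing two large unordered sets. With in-memory hashing the comparison itself is linear, but it still requires scanning every base row, and a range-restricted (merkle-style) comparison can't narrow that scan, because one view range maps to arbitrary base partitions.

```python
def view_key(base_row):
    pk, ck, v = base_row
    return ((ck, pk), v)          # the MV re-keys (pk, ck) -> (ck, pk)

def mismatches(base_rows, view_rows):
    # Both sides require a full scan: every base row must be transformed
    # before we can tell whether any view range is consistent.
    expected = dict(view_key(r) for r in base_rows)      # full base scan
    actual = {(ck, pk): v for ck, pk, v in view_rows}    # full view scan
    missing = {k: v for k, v in expected.items() if actual.get(k) != v}
    orphans = {k: v for k, v in actual.items() if k not in expected}
    return missing, orphans

base = [(1, 1, 1), (1, 2, 2), (2, 1, 5)]
view = [(1, 1, 1), (1, 2, 5)]     # view row for base (1, 2, 2) is missing
missing, orphans = mismatches(base, view)
assert missing == {(2, 1): 2}
assert orphans == {}
```

The quadratic worst case appears when you cannot hold one side in memory and fall back to pairwise range comparisons.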




Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-14 Thread Jon Haddad
I've got several concerns around this repair process.

- The first thing I notice is that we're talking about repairing the entire
table across the entire cluster all in one go.  It's been a *long* time
since I tried to do a full repair of an entire table without using
sub-ranges.  Is anyone here even doing that with clusters of non-trivial
size?  How long does a full repair of a 100 node cluster with 5TB / node
take even in the best case scenario?

- Even in a scenario where sub-range repair is supported, you'd have to
scan *every* sstable on the base table in order to construct a merkle
tree, as we don't know in advance which SSTables contain the ranges that
the MV will need.  That means a subrange repair would have to do a *ton* of IO.
Anyone who's mis-configured a sub-range incremental repair to use too many
ranges will probably be familiar with how long it can take to anti-compact
a bunch of SSTables.  With MV sub-range repair, we'd have even more
overhead, because we'd have to read in every SSTable, every time.  If we do
10 subranges, we'll do 10x the IO of a normal repair.  I don't think this
is practical.

- Merkle trees make sense when you're comparing tables with the same
partition key, but I don't think they do when you're transforming a base
table to a view.  When there's a mis-match, what's transferred?  We have a
range of data in the MV, but now we have to go find that from the base
table.  That means the merkle tree needs to not just track the hashes and
ranges, but the original keys it was transformed from, in order to go find
all of the matching partitions in that mis-matched range.  Either that or
we end up rescanning the entire dataset in order to find the mismatches.

Jon
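A toy illustration of the augmentation described above (structure invented, not from the CEP): if each view-range merkle leaf also records the base partition keys that produced its rows, a mismatched leaf identifies exactly which base partitions to re-read, instead of rescanning the entire dataset.

```python
import hashlib

def build_leaf(view_rows):
    """view_rows: iterable of (view_key, base_pk, value) within one
    view token range. Returns (leaf hash, originating base keys)."""
    h = hashlib.sha256()
    base_keys = set()
    for vk, base_pk, v in sorted(view_rows):
        h.update(repr((vk, v)).encode())  # hash only view-visible content
        base_keys.add(base_pk)            # but remember where it came from
    return h.hexdigest(), base_keys

leaf_a, keys_a = build_leaf([((1, 1), 1, "x"), ((1, 2), 2, "y")])
leaf_b, keys_b = build_leaf([((1, 1), 1, "x"), ((1, 2), 2, "z")])
assert leaf_a != leaf_b        # mismatch detected on this view range
assert keys_a == {1, 2}        # ...and we know which base partitions to re-read
```

The cost, as noted, is that the tree now carries key material per leaf rather than just hashes and range bounds.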




On Tue, May 13, 2025 at 10:29 AM Runtian Liu  wrote:

> > Looking at the details of the CEP it seems to describe Paxos as
> PaxosV1, but PaxosV2 works slightly differently (it can read during the
> prepare phase). I assume that supporting Paxos means supporting both V1 and
> V2 for materialized views?
> We are going to support Paxos V2. The CEP is not clear on that; we will
> update it to clarify.
>
> It looks like the online portion is now fairly well understood.  For the
> offline repair part, I see two main concerns: one around the scalability of
> the proposed approach, and another regarding how it handles tombstones.
>
> Scalability:
> I have added a section in the CEP with an example to compare full repair
> and the proposed MV repair; the overall scalability should not be a problem.
>
> Consider a dataset with tokens from 1 to 4 and a cluster of 4 nodes, where
> each node owns one token. The base table uses (pk, ck) as its primary key,
> while the materialized view (MV) uses (ck, pk) as its primary key. Both
> tables include a value column v, which allows us to correlate rows between
> them. The dataset consists of 16 records, distributed as follows:
>
> *Base table*
> (pk, ck, v)
> (1, 1, 1), (1, 2, 2), (1, 3, 3), (1, 4, 4) // N1
> (2, 1, 5), (2, 2, 6), (2, 3, 7), (2, 4, 8) // N2
> (3, 1, 9), (3, 2, 10), (3, 3, 11), (3, 4, 12) // N3
> (4, 1, 13), (4, 2, 14), (4, 3, 15), (4, 4, 16) // N4
>
> *Materialized view*
> (ck, pk, v)
> (1, 1, 1), (1, 2, 5), (1, 3, 9), (1, 4, 13) // N1
> (2, 1, 2), (2, 2, 6), (2, 3, 10), (2, 4, 14) // N2
> (3, 1, 3), (3, 2, 7), (3, 3, 11), (3, 4, 15) // N3
> (4, 1, 4), (4, 2, 8), (4, 3, 12), (4, 4, 16) // N4
>
> The chart below compares one round of full repair with one round of MV
> repair. As shown, both scan the same total number of rows. However, MV
> repair has higher time complexity because its Merkle tree processes each
> row more intensively. To avoid all nodes scanning the entire table
> simultaneously, MV repair should use a snapshot-based approach, similar to
> normal repair with the --sequential option. The increase in time complexity
> compared to full repair can be found in the "Complexity and Memory
> Management" section.
>
> n: number of rows
>
> d: depth of one Merkle tree for MV repair
>
> d': depth of one Merkle tree for full repair
>
> r: number of split ranges
>
> Assuming each leaf node covers the same number of rows, 2^d' = (2^d) * r.
>
> We can see that the space complexity is the same, while MV repair has
> higher time complexity. However, this should not pose a significant issue
> in production, as the Merkle tree depth and the number of split ranges are
> typically not large.
>
> One round of Merkle tree building complexity:
>
>                    Full Repair    MV Repair
>  Time complexity   O(n)           O(n*d*log(r))
>  Space complexity  O((2^d')*r)    O((2^d)*r^2) = O((2^d')*r)
>
> Tombstone:
>
> The current proposal focuses on rebuilding the MV for a granular token
> range where a mismatch is detected, rather than rebuilding the entire MV
> token range. Since the MV is treated as a regular table, standard full or
> incremental repair

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-13 Thread Runtian Liu
> Looking at the details of the CEP it seems to describe Paxos as PaxosV1,
but PaxosV2 works slightly differently (it can read during the prepare
phase). I assume that supporting Paxos means supporting both V1 and V2 for
materialized views?
We are going to support Paxos V2. The CEP is not clear on that; we will
update it to clarify this.

It looks like the online portion is now fairly well understood.  For the
offline repair part, I see two main concerns: one around the scalability of
the proposed approach, and another regarding how it handles tombstones.

Scalability:
I have added a section

in the CEP with an example comparing full repair and the proposed MV
repair; the overall scalability should not be a problem.

Consider a dataset with tokens from 1 to 4 and a cluster of 4 nodes, where
each node owns one token. The base table uses (pk, ck) as its primary key,
while the materialized view (MV) uses (ck, pk) as its primary key. Both
tables include a value column v, which allows us to correlate rows between
them. The dataset consists of 16 records, distributed as follows:

*Base table*
(pk, ck, v)
(1, 1, 1), (1, 2, 2), (1, 3, 3), (1, 4, 4) // N1
(2, 1, 5), (2, 2, 6), (2, 3, 7), (2, 4, 8) // N2
(3, 1, 9), (3, 2, 10), (3, 3, 11), (3, 4, 12) // N3
(4, 1, 13), (4, 2, 14), (4, 3, 15), (4, 4, 16) // N4

*Materialized view*
(ck, pk, v)
(1, 1, 1), (1, 2, 5), (1, 3, 9), (1, 4, 13) // N1
(2, 1, 2), (2, 2, 6), (2, 3, 10), (2, 4, 14) // N2
(3, 1, 3), (3, 2, 7), (3, 3, 11), (3, 4, 15) // N3
(4, 1, 4), (4, 2, 8), (4, 3, 12), (4, 4, 16) // N4
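The placement above can be reproduced with a short Python sketch (illustrative only, not Cassandra code; it assumes each key value is its own token and that node Nk owns token k):

```python
# Illustrative sketch (not Cassandra code): reproduce the example's row
# placement, assuming each key value is its own token and node Nk owns token k.
base = [(pk, ck, 4 * (pk - 1) + ck) for pk in range(1, 5) for ck in range(1, 5)]

def node_for(token):
    return f"N{token}"

# The base table partitions on pk; the MV partitions on ck.
base_placement = {}
mv_placement = {}
for pk, ck, v in base:
    base_placement.setdefault(node_for(pk), []).append((pk, ck, v))
    mv_placement.setdefault(node_for(ck), []).append((ck, pk, v))

# Every base row lands on exactly one MV node, so one round of MV repair
# scans the same 16 rows in total that full repair does.
assert sum(len(rows) for rows in base_placement.values()) == 16
assert sum(len(rows) for rows in mv_placement.values()) == 16
print(sorted(mv_placement["N1"]))  # [(1, 1, 1), (1, 2, 5), (1, 3, 9), (1, 4, 13)]
```

N1's MV rows match the first line of the materialized view listing above.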

The chart below compares one round of full repair with one round of MV
repair. As shown, both scan the same total number of rows. However, MV
repair has higher time complexity because its Merkle tree processes each
row more intensively. To avoid all nodes scanning the entire table
simultaneously, MV repair should use a snapshot-based approach, similar to
normal repair with the --sequential option. The increase in time complexity
compared to full repair can be found in the "Complexity and Memory
Management" section.

n: number of rows

d: depth of one Merkle tree for MV repair

d': depth of one Merkle tree for full repair

r: number of split ranges

Assuming each leaf node covers the same number of rows, 2^d' = (2^d) * r.

We can see that the space complexity is the same, while MV repair has
higher time complexity. However, this should not pose a significant issue
in production, as the Merkle tree depth and the number of split ranges are
typically not large.

One round of Merkle tree building complexity:

                   Full Repair    MV Repair
 Time complexity   O(n)           O(n*d*log(r))
 Space complexity  O((2^d')*r)    O((2^d)*r^2) = O((2^d')*r)

Tombstone:

The current proposal focuses on rebuilding the MV for a granular token
range where a mismatch is detected, rather than rebuilding the entire MV
token range. Since the MV is treated as a regular table, standard full or
incremental repair processes should still apply to both the base and MV
tables to keep their replicas in sync.

Regarding tombstones, if we introduce special tombstone types or handling
mechanisms for the MV table, we may be able to support tombstone
synchronization between the base table and the MV. I plan to spend more
time exploring whether we can introduce changes to the base table that
enable this synchronization.



On Mon, May 12, 2025 at 11:35 AM Jaydeep Chovatia <
[email protected]> wrote:

> >Like something doesn't add up here because if it always includes the
> base table's primary key columns that means
>
> The requirement for materialized views (MVs) to include the base table's
> primary key appears to be primarily a syntactic constraint specific to
> Apache Cassandra. For instance, in DynamoDB, the DDL for defining a Global
> Secondary Index does not mandate inclusion of the base table's primary key.
> This suggests that the syntax requirement in Cassandra could potentially be
> relaxed in the future (outside the scope of this CEP). As Benedict noted,
> the base table's primary key is optional when querying a materialized view.
>
> Jaydeep
>
> On Mon, May 12, 2025 at 10:45 AM Jon Haddad 
> wrote:
>
>>
>> > Or compaction hasn’t made a mistake, or cell merge reconciliation
>> hasn’t made a mistake, or volume bitrot hasn’t caused you to lose a file.
>> > Repair isn’t just about “have all transaction commits landed”. It’s “is
>> the data correct N days after it’s written”.
>>
>> Don't forget about restoring from a backup.
>>
>> Is there a way we could do some sort of hybrid compaction + incremental
>> repair?  Maybe have the MV verify its view while it's compacting, and when
>> it's done, mark the view's SSTable as repaired?  Then the repair process
>> would only need to do a MV to MV repair.
>>
>> Jon
>>
>>
>> On Mon, May 12, 2025 at 9:37 AM Benedict Elliott Smith <
>> bene

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-12 Thread Jaydeep Chovatia
>Like something doesn't add up here because if it always includes the base
table's primary key columns that means

The requirement for materialized views (MVs) to include the base table's
primary key appears to be primarily a syntactic constraint specific to
Apache Cassandra. For instance, in DynamoDB, the DDL for defining a Global
Secondary Index does not mandate inclusion of the base table's primary key.
This suggests that the syntax requirement in Cassandra could potentially be
relaxed in the future (outside the scope of this CEP). As Benedict noted,
the base table's primary key is optional when querying a materialized view.

Jaydeep

On Mon, May 12, 2025 at 10:45 AM Jon Haddad  wrote:

>
> > Or compaction hasn’t made a mistake, or cell merge reconciliation
> hasn’t made a mistake, or volume bitrot hasn’t caused you to lose a file.
> > Repair isn’t just about “have all transaction commits landed”. It’s “is
> the data correct N days after it’s written”.
>
> Don't forget about restoring from a backup.
>
> Is there a way we could do some sort of hybrid compaction + incremental
> repair?  Maybe have the MV verify its view while it's compacting, and when
> it's done, mark the view's SSTable as repaired?  Then the repair process
> would only need to do a MV to MV repair.
>
> Jon
>
>
> On Mon, May 12, 2025 at 9:37 AM Benedict Elliott Smith <
> [email protected]> wrote:
>
>> Like something doesn't add up here because if it always includes the base
>> table's primary key columns that means they could be storage attached by
>> just forbidding additional columns and there doesn't seem to be much
>> utility in including additional columns in the primary key?
>>
>>
>> You can re-order the keys, and they only need to be a part of the primary
>> key not the partition key. I think you can specify an arbitrary order to
>> the keys also, so you can change the effective sort order. So, the basic
>> idea is you stipulate something like PRIMARY KEY ((v1),(ck1,pk1)).
>>
>> This is basically a global index, with the restriction on single columns
>> as keys only because we cannot cheaply read-before-write for eventually
>> consistent operations. This restriction can easily be relaxed for Paxos and
>> Accord based implementations, which can also safely include additional keys.
>>
>> That said, I am not at all sure why they are called materialised views if
>> we don’t support including any other data besides the lookup column and the
>> primary key. We should really rename them once they work, both to make some
>> sense and to break with the historical baggage.
>>
>> I think this can be represented as a tombstone which can always be
>> fetched from the base table on read or maybe some other arrangement? I
>> agree it can't feasibly be represented as an enumeration of the deletions
>> at least not synchronously and doing it async has its own problems.
>>
>>
>> If the base table must be read on read of an index/view, then I think
>> this proposal is approximately linearizable for the view as well (though, I
>> do not at all warrant this statement). You still need to propagate this
>> eventually so that the views can cleanup. This also makes reads 2RT on
>> read, which is rather costly.
>>
>> On 12 May 2025, at 16:10, Ariel Weisberg  wrote:
>>
>> Hi,
>>
>> I think it's worth taking a step back and looking at the current MV
>> restrictions which are pretty onerous.
>>
>> A view must have a primary key and that primary key must conform to the
>> following restrictions:
>>
>>- it must contain all the primary key columns of the base table. This
>>    ensures that every row of the view corresponds to exactly one row of the
>>base table.
>>- it can only contain a single column that is not a primary key
>>column in the base table.
>>
>> At that point what exactly is the value in including anything except the
>> original primary key in the MV's primary key columns unless you are using
>> an ordered partitioner so you can iterate based on the leading primary key
>> columns?
>>
>> Like something doesn't add up here because if it always includes the base
>> table's primary key columns that means they could be storage attached by
>> just forbidding additional columns and there doesn't seem to be much
>> utility in including additional columns in the primary key?
>>
>> I'm not that clear on how much better it is to look something up in the
>> MV vs just looking at the base table or some non-materialized view of it.
>> How exactly are these MVs supposed to be used and what value do they
>> provide?
>>
>> Jeff Jirsa wrote:
>>
>> There’s 2 things in this proposal that give me a lot of pause.
>>
>>
>> Runtian Liu pointed out that the CEP is sort of divided into two parts.
>> The first is the online part which is making reads/writes to MVs safer and
>> more reliable using a transaction system. The second is offline which is
>> repair.
>>
>> The story for the online portion I think is quite strong and worth
>> considering on its own

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-12 Thread Jon Haddad
> Or compaction hasn’t made a mistake, or cell merge reconciliation hasn’t
made a mistake, or volume bitrot hasn’t caused you to lose a file.
> Repair isn’t just about “have all transaction commits landed”. It’s “is
the data correct N days after it’s written”.

Don't forget about restoring from a backup.

Is there a way we could do some sort of hybrid compaction + incremental
repair?  Maybe have the MV verify its view while it's compacting, and when
it's done, mark the view's SSTable as repaired?  Then the repair process
would only need to do a MV to MV repair.

Jon


On Mon, May 12, 2025 at 9:37 AM Benedict Elliott Smith 
wrote:

> Like something doesn't add up here because if it always includes the base
> table's primary key columns that means they could be storage attached by
> just forbidding additional columns and there doesn't seem to be much
> utility in including additional columns in the primary key?
>
>
> You can re-order the keys, and they only need to be a part of the primary
> key not the partition key. I think you can specify an arbitrary order to
> the keys also, so you can change the effective sort order. So, the basic
> idea is you stipulate something like PRIMARY KEY ((v1),(ck1,pk1)).
>
> This is basically a global index, with the restriction on single columns
> as keys only because we cannot cheaply read-before-write for eventually
> consistent operations. This restriction can easily be relaxed for Paxos and
> Accord based implementations, which can also safely include additional keys.
>
> That said, I am not at all sure why they are called materialised views if
> we don’t support including any other data besides the lookup column and the
> primary key. We should really rename them once they work, both to make some
> sense and to break with the historical baggage.
>
> I think this can be represented as a tombstone which can always be fetched
> from the base table on read or maybe some other arrangement? I agree it
> can't feasibly be represented as an enumeration of the deletions at least
> not synchronously and doing it async has its own problems.
>
>
> If the base table must be read on read of an index/view, then I think this
> proposal is approximately linearizable for the view as well (though, I do
> not at all warrant this statement). You still need to propagate this
> eventually so that the views can cleanup. This also makes reads 2RT on
> read, which is rather costly.
>
> On 12 May 2025, at 16:10, Ariel Weisberg  wrote:
>
> Hi,
>
> I think it's worth taking a step back and looking at the current MV
> restrictions which are pretty onerous.
>
> A view must have a primary key and that primary key must conform to the
> following restrictions:
>
>- it must contain all the primary key columns of the base table. This
>    ensures that every row of the view corresponds to exactly one row of the
>base table.
>- it can only contain a single column that is not a primary key column
>in the base table.
>
> At that point what exactly is the value in including anything except the
> original primary key in the MV's primary key columns unless you are using
> an ordered partitioner so you can iterate based on the leading primary key
> columns?
>
> Like something doesn't add up here because if it always includes the base
> table's primary key columns that means they could be storage attached by
> just forbidding additional columns and there doesn't seem to be much
> utility in including additional columns in the primary key?
>
> I'm not that clear on how much better it is to look something up in the MV
> vs just looking at the base table or some non-materialized view of it. How
> exactly are these MVs supposed to be used and what value do they provide?
>
> Jeff Jirsa wrote:
>
> There’s 2 things in this proposal that give me a lot of pause.
>
>
> Runtian Liu pointed out that the CEP is sort of divided into two parts.
> The first is the online part which is making reads/writes to MVs safer and
> more reliable using a transaction system. The second is offline which is
> repair.
>
> The story for the online portion I think is quite strong and worth
> considering on its own merits.
>
> The offline portion (repair) sounds a little less feasible to run in
> production, but I also think that MVs without any mechanism for checking
> their consistency are not viable to run in production. So it's kind of pay
> for what you use in terms of the feature?
>
> It's definitely worth thinking through if there is a way to fix one side
> of this equation so it works better.
>
> David Capwell wrote:
>
> As far as I can tell, being based off Accord means you don’t need to care
> about repair, as Accord will manage the consistency for you; you can’t get
> out of sync.
>
> I think a baseline requirement in C* for something to be in production is
> to be able to run preview repair and validate that the transaction system
> or any other part of Cassandra hasn't made a mistake. Divergence can have
> many sourc

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-12 Thread Benedict Elliott Smith
> Like something doesn't add up here because if it always includes the base 
> table's primary key columns that means they could be storage attached by just 
> forbidding additional columns and there doesn't seem to be much utility in 
> including additional columns in the primary key?

You can re-order the keys, and they only need to be a part of the primary key 
not the partition key. I think you can specify an arbitrary order to the keys 
also, so you can change the effective sort order. So, the basic idea is you 
stipulate something like PRIMARY KEY ((v1),(ck1,pk1)).
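To make the re-ordering concrete, here is a small Python model (purely illustrative; the column names follow the PRIMARY KEY ((v1),(ck1,pk1)) example, and the partitioner is treated as order-preserving for clarity, which Murmur3 is not):

```python
# Illustrative only: model how PRIMARY KEY ((v1), (ck1, pk1)) changes the
# effective sort order. Column names are hypothetical, and the partitioner
# is treated as order-preserving for clarity.
rows = [
    {"pk1": 1, "ck1": 2, "v1": "b"},
    {"pk1": 2, "ck1": 1, "v1": "a"},
    {"pk1": 1, "ck1": 1, "v1": "b"},
]

# Base table: partition by pk1, cluster by ck1.
base_order = sorted(rows, key=lambda r: (r["pk1"], r["ck1"]))

# View: partition by v1, cluster by (ck1, pk1); all rows with a given v1
# are now contiguous, so a lookup by value becomes a single-partition read.
view_order = sorted(rows, key=lambda r: (r["v1"], r["ck1"], r["pk1"]))

print([(r["v1"], r["ck1"], r["pk1"]) for r in view_order])
# [('a', 1, 2), ('b', 1, 1), ('b', 2, 1)]
```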

This is basically a global index, with the restriction on single columns as 
keys only because we cannot cheaply read-before-write for eventually consistent 
operations. This restriction can easily be relaxed for Paxos and Accord based 
implementations, which can also safely include additional keys.

That said, I am not at all sure why they are called materialised views if we 
don’t support including any other data besides the lookup column and the 
primary key. We should really rename them once they work, both to make some 
sense and to break with the historical baggage.

> I think this can be represented as a tombstone which can always be fetched 
> from the base table on read or maybe some other arrangement? I agree it can't 
> feasibly be represented as an enumeration of the deletions at least not 
> synchronously and doing it async has its own problems.

If the base table must be read on read of an index/view, then I think this 
proposal is approximately linearizable for the view as well (though, I do not 
at all warrant this statement). You still need to propagate this eventually so 
that the views can cleanup. This also makes reads 2RT on read, which is rather 
costly.

> On 12 May 2025, at 16:10, Ariel Weisberg  wrote:
> 
> Hi,
> 
> I think it's worth taking a step back and looking at the current MV 
> restrictions which are pretty onerous.
> 
> A view must have a primary key and that primary key must conform to the 
> following restrictions:
> it must contain all the primary key columns of the base table. This ensures 
> that every row of the view corresponds to exactly one row of the base table.
> it can only contain a single column that is not a primary key column in the 
> base table.
> At that point what exactly is the value in including anything except the 
> original primary key in the MV's primary key columns unless you are using an 
> ordered partitioner so you can iterate based on the leading primary key 
> columns?
> 
> Like something doesn't add up here because if it always includes the base 
> table's primary key columns that means they could be storage attached by just 
> forbidding additional columns and there doesn't seem to be much utility in 
> including additional columns in the primary key?
> 
> I'm not that clear on how much better it is to look something up in the MV vs 
> just looking at the base table or some non-materialized view of it. How 
> exactly are these MVs supposed to be used and what value do they provide?
> 
> Jeff Jirsa wrote:
>> There’s 2 things in this proposal that give me a lot of pause.
> 
> Runtian Liu pointed out that the CEP is sort of divided into two parts. The 
> first is the online part which is making reads/writes to MVs safer and more 
> reliable using a transaction system. The second is offline which is repair.
> 
> The story for the online portion I think is quite strong and worth 
> considering on its own merits.
> 
> The offline portion (repair) sounds a little less feasible to run in 
> production, but I also think that MVs without any mechanism for checking 
> their consistency are not viable to run in production. So it's kind of pay 
> for what you use in terms of the feature?
> 
> It's definitely worth thinking through if there is a way to fix one side of 
> this equation so it works better.
> 
> David Capwell wrote:
>> As far as I can tell, being based off Accord means you don’t need to care 
>> about repair, as Accord will manage the consistency for you; you can’t get 
>> out of sync.
> I think a baseline requirement in C* for something to be in production is to 
> be able to run preview repair and validate that the transaction system or any 
> other part of Cassandra hasn't made a mistake. Divergence can have many 
> sources including Accord.
> 
> Runtian Liu wrote:
>> For the example David mentioned, LWT cannot support. Since LWTs operate on a 
>> single token, we’ll need to restrict base-table updates to one partition—and 
>> ideally one row—at a time. A current MV base-table command can delete an 
>> entire partition, but doing so might touch hundreds of MV partitions, making 
>> consistency guarantees impossible. 
> I think this can be represented as a tombstone which can always be fetched 
> from the base table on read or maybe some other arrangement? I agree it can't 
> feasibly be represented as an enumeration of the deletions at least not 
> synchronously and doi

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-12 Thread Jeff Jirsa


> On May 12, 2025, at 8:10 AM, Ariel Weisberg  wrote:
> 
> Hi,
> 
> I think it's worth taking a step back and looking at the current MV 
> restrictions which are pretty onerous.
> 
> A view must have a primary key and that primary key must conform to the 
> following restrictions:
> it must contain all the primary key columns of the base table. This ensures 
> that every row of the view corresponds to exactly one row of the base table.
> it can only contain a single column that is not a primary key column in the 
> base table.
> At that point what exactly is the value in including anything except the 
> original primary key in the MV's primary key columns unless you are using an 
> ordered partitioner so you can iterate based on the leading primary key 
> columns?
> 
> Like something doesn't add up here because if it always includes the base 
> table's primary key columns that means they could be storage attached by just 
> forbidding additional columns and there doesn't seem to be much utility in 
> including additional columns in the primary key?
> 
> I'm not that clear on how much better it is to look something up in the MV vs 
> just looking at the base table or some non-materialized view of it. How 
> exactly are these MVs supposed to be used and what value do they provide?
> 
> Jeff Jirsa wrote:
>> There’s 2 things in this proposal that give me a lot of pause.
> 
> Runtian Liu pointed out that the CEP is sort of divided into two parts. The 
> first is the online part which is making reads/writes to MVs safer and more 
> reliable using a transaction system. The second is offline which is repair.
> 
> The story for the online portion I think is quite strong and worth 
> considering on its own merits.
> 
> The offline portion (repair) sounds a little less feasible to run in 
> production, but I also think that MVs without any mechanism for checking 
> their consistency are not viable to run in production. So it's kind of pay 
> for what you use in terms of the feature?
> 
> It's definitely worth thinking through if there is a way to fix one side of 
> this equation so it works better.

Agree that we need a solution. I just don’t think a massive number of merkle 
trees without tombstones is actually going to be materially better (or rather, 
it’s a massive foot gun, it’s going to blow up people who read the CEP as “now 
it’s safe to use”). 

> 
> David Capwell wrote:
>> As far as I can tell, being based off Accord means you don’t need to care 
>> about repair, as Accord will manage the consistency for you; you can’t get 
>> out of sync.
> I think a baseline requirement in C* for something to be in production is to 
> be able to run preview repair and validate that the transaction system or any 
> other part of Cassandra hasn't made a mistake. Divergence can have many 
> sources including Accord.

Or compaction hasn’t made a mistake, or cell merge reconciliation hasn’t made a 
mistake, or volume bitrot hasn’t caused you to lose a file.

Repair isn’t just about “have all transaction commits landed”. It’s “is the 
data correct N days after it’s written”. 

> 
> Runtian Liu wrote:
>> For the example David mentioned, LWT cannot support. Since LWTs operate on a 
>> single token, we’ll need to restrict base-table updates to one partition—and 
>> ideally one row—at a time. A current MV base-table command can delete an 
>> entire partition, but doing so might touch hundreds of MV partitions, making 
>> consistency guarantees impossible. 
> I think this can be represented as a tombstone which can always be fetched 
> from the base table on read or maybe some other arrangement? I agree it can't 
> feasibly be represented as an enumeration of the deletions at least not 
> synchronously and doing it async has its own problems.
> 
> Ariel
> 
> On Fri, May 9, 2025, at 4:03 PM, Jeff Jirsa wrote:
>> 
>> 
>>> On May 9, 2025, at 12:59 PM, Ariel Weisberg  wrote:
>>> 
>>> 
>>> I am *big* fan of getting repair really working with MVs. It does seem 
>>> problematic that the number of merkle trees will be equal to the number of 
>>> ranges in the cluster and repair of MVs would become an all node operation. 
>>>  How would down nodes be handled and how many nodes would simultaneously 
>>>  be working to validate a given base table range at once? How many base table 
>>> ranges could simultaneously be repairing MVs?
>>> 
>>> If a row containing a column that creates an MV partition is deleted, and 
>>> the MV isn't updated, then how does the merkle tree approach propagate the 
>>> deletion to the MV? The CEP says that anti-compaction would remove extra 
>>> rows, but I am not clear on how that works. When is anti-compaction 
>>> performed in the repair process and what is/isn't included in the outputs?
>> 
>> 
>> 
>> I thought about these two points last night after I sent my email.
>> 
>> There’s 2 things in this proposal that give me a lot of pause.
>> 
>> One is the lack of tombstones / deletions in the merkle trees, which makes 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-12 Thread Ariel Weisberg
Hi,

I think it's worth taking a step back and looking at the current MV 
restrictions which are pretty onerous.

A view must have a primary key and that primary key must conform to the 
following restrictions:
 • it must contain all the primary key columns of the base table. This ensures 
that every row of the view corresponds to exactly one row of the base table.
 • it can only contain a single column that is not a primary key column in the 
base table.
At that point what exactly is the value in including anything except the 
original primary key in the MV's primary key columns unless you are using an 
ordered partitioner so you can iterate based on the leading primary key columns?

Like something doesn't add up here because if it always includes the base 
table's primary key columns that means they could be storage attached by just 
forbidding additional columns and there doesn't seem to be much utility in 
including additional columns in the primary key?

I'm not that clear on how much better it is to look something up in the MV vs 
just looking at the base table or some non-materialized view of it. How exactly 
are these MVs supposed to be used and what value do they provide?

Jeff Jirsa wrote:
> There’s 2 things in this proposal that give me a lot of pause.

Runtian Liu pointed out that the CEP is sort of divided into two parts. The 
first is the online part which is making reads/writes to MVs safer and more 
reliable using a transaction system. The second is offline which is repair.

The story for the online portion I think is quite strong and worth considering 
on its own merits.

The offline portion (repair) sounds a little less feasible to run in 
production, but I also think that MVs without any mechanism for checking their 
consistency are not viable to run in production. So it's kind of pay for what 
you use in terms of the feature?

It's definitely worth thinking through if there is a way to fix one side of 
this equation so it works better.

David Capwell wrote:
> As far as I can tell, being based off Accord means you don’t need to care 
> about repair, as Accord will manage the consistency for you; you can’t get 
> out of sync.
I think a baseline requirement in C* for something to be in production is to be 
able to run preview repair and validate that the transaction system or any 
other part of Cassandra hasn't made a mistake. Divergence can have many sources 
including Accord.

Runtian Liu wrote:
> For the example David mentioned, LWT cannot support. Since LWTs operate on a 
> single token, we’ll need to restrict base-table updates to one partition—and 
> ideally one row—at a time. A current MV base-table command can delete an 
> entire partition, but doing so might touch hundreds of MV partitions, making 
> consistency guarantees impossible. 
I think this can be represented as a tombstone which can always be fetched from 
the base table on read or maybe some other arrangement? I agree it can't 
feasibly be represented as an enumeration of the deletions at least not 
synchronously and doing it async has its own problems.

Ariel

On Fri, May 9, 2025, at 4:03 PM, Jeff Jirsa wrote:
> 
> 
>> On May 9, 2025, at 12:59 PM, Ariel Weisberg  wrote:
>> 
>> 
>> I am *big* fan of getting repair really working with MVs. It does seem 
>> problematic that the number of merkle trees will be equal to the number of 
>> ranges in the cluster and repair of MVs would become an all node operation.  
>> How would down nodes be handled and how many nodes would simultaneously 
>> be working to validate a given base table range at once? How many base table 
>> ranges could simultaneously be repairing MVs?
>> 
>> If a row containing a column that creates an MV partition is deleted, and 
>> the MV isn't updated, then how does the merkle tree approach propagate the 
>> deletion to the MV? The CEP says that anti-compaction would remove extra 
>> rows, but I am not clear on how that works. When is anti-compaction 
>> performed in the repair process and what is/isn't included in the outputs?
> 
> 
> I thought about these two points last night after I sent my email.
> 
> There’s 2 things in this proposal that give me a lot of pause.
> 
> One is the lack of tombstones / deletions in the merkle trees, which makes 
> properly dealing with writes/deletes/inconsistency very hard (afaict)
> 
> The second is the reality that repairing a single partition in the base table 
> may repair all hosts/ranges in the MV table, and vice versa. Basically 
> scanning either base or MV is effectively scanning the whole cluster (modulo 
> what you can avoid in the clean/dirty repaired sets). This makes me really, 
> really concerned with how it scales, and how likely it is to be able to 
> schedule automatically without blowing up. 
> 
> The paxos vs accord comments so far are interesting in that I think both 
> could be made to work, but I am very concerned about how the merkle tree 
> comparisons are likely to work with wide partitions leading t

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-11 Thread Josh McKenzie
> This makes me really, really concerned with how it scales, and how likely it 
> is to be able to schedule automatically without blowing up. 
It seems to me that resource-aware throttling would be the solution here, or 
from a more primitive case, just hard bounding threadpool size, throughput 
rate, etc. Worst-case you end up in a situation where your MV anti-entropy 
can't keep up and you surface that to the operator rather than bogging down 
and/or killing your entire cluster due to touching all the nodes.

This isn't a problem isolated to MVs; our lack of robust resource scheduling 
and balancing between operations is a long-standing problem that just scales 
linearly w/the number of nodes in the cluster you have to hit. Same problem 
would arise from a global index; there's no deep technical reason querying all 
the nodes in the cluster should be a third rail, except that we currently have no 
way to prevent that from compounding and bringing the cluster down. 

On Fri, May 9, 2025, at 9:11 PM, Runtian Liu wrote:
> I’ve added a new section on isolation and consistency.
>  In our current design, materialized-view tables stay eventually consistent, 
> while the base table offers linearizability. Here, “strict consistency” 
> refers to linearizable base-table updates, with every successful write 
> ensuring that the corresponding MV change is applied and visible.
> 
> >Why mandate `LOCAL_QUORUM` instead of using the consistency level requested 
> >by the application? If they want to use `LOCAL_QUORUM` they can always 
> >request it.
> 
> I think you meant LOCAL_SERIAL? Right, LOCAL_SERIAL should not be mandatory 
> and users should be able to select which consistency to use. Updated the page 
> for this one.
> 
> For the example David mentioned, LWT cannot support it. Since LWTs operate on a 
> single token, we’ll need to restrict base-table updates to one partition—and 
> ideally one row—at a time. A current MV base-table command can delete an 
> entire partition, but doing so might touch hundreds of MV partitions, making 
> consistency guarantees impossible. Limiting each operation’s scope lets us 
> ensure that every successful base-table write is accurately propagated to its 
> MV. Even with Accord backed MV, I think we will need to limit the number of 
> rows that get modified each time.
> 
> Regarding repair, due to bugs, operator errors, or hardware faults, MVs can 
> become out of sync with their base tables—regardless of the chosen 
> synchronization method during writes. The purpose of MV repair is to detect 
> and resolve these mismatches using the base table as the source of truth. As 
> a result, if data resurrection occurs in the base table, the repair process 
> will propagate that resurrected data to the MV.
> 
> >One is the lack of tombstones / deletions in the merkle trees, which makes 
> >properly dealing with writes/deletes/inconsistency very hard (afaict)
> 
> Tombstones are excluded because a base table update can produce a tombstone 
> in the MV—for example, when the updated cell is part of the MV's primary key. 
> Since such tombstones may not exist in the base table, we can only compare 
> live data during MV repair.
> 
> 
> 
> > repairing a single partition in the base table may repair all hosts/ranges 
> > in the MV table,
> That’s correct. To avoid repeatedly scanning both tables, the proposed 
> solution is for all nodes to take a snapshot first. Then, each node scans the 
> base table once and the MV table once, generating a list of Merkle trees from 
> each scan. These lists are then compared to identify mismatches. This means 
> MV repair must be performed at the table level rather than one token range at 
> a time to be efficient.
> 
> >If a row containing a column that creates an MV partition is deleted, and 
> >the MV isn't updated, then how does the merkle tree approach propagate the 
> >deletion to the MV? The CEP says that anti-compaction would remove extra 
> >rows, but I am not clear on how that works. When is anti-compaction 
> >performed in the repair process and what is/isn't included in the outputs?
> 
> Let me illustrate this with an example:
> 
> We have the following base table and MV:
> 
> CREATE TABLE base (pk int, ck int, v int, PRIMARY KEY (pk, ck));
> CREATE MATERIALIZED VIEW mv AS SELECT * FROM base PRIMARY KEY (ck, pk);
> Assume there are 100 rows in the base table (e.g., (1,1), (2,2), ..., 
> (100,100)), and accordingly, the MV also has 100 rows. Now, suppose the row 
> (55,55) is deleted from the base table, but due to some issue, it still 
> exists in the MV.
> 
> Let's say each Merkle tree covers 20 rows in both the base and MV tables, so 
> we have a 5x5 grid—25 Merkle tree comparisons in total. Suppose the repair 
> job detects a mismatch in the range base(40–59) vs MV(40–59).
> 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-09 Thread Runtian Liu
I’ve added a new section on isolation and consistency.
In our current design, materialized-view tables stay eventually consistent,
while the base table offers linearizability. Here, “strict consistency”
refers to linearizable base-table updates, with every successful write
ensuring that the corresponding MV change is applied and visible.

>Why mandate `LOCAL_QUORUM` instead of using the consistency level
requested by the application? If they want to use `LOCAL_QUORUM` they can
always request it.

I think you meant LOCAL_SERIAL? Right, LOCAL_SERIAL should not be mandatory
and users should be able to select which consistency to use. Updated the
page for this one.

For the example David mentioned, LWT cannot support it. Since LWTs operate on
a single token, we’ll need to restrict base-table updates to one
partition—and ideally one row—at a time. A current MV base-table command
can delete an entire partition, but doing so might touch hundreds of MV
partitions, making consistency guarantees impossible. Limiting each
operation’s scope lets us ensure that every successful base-table write is
accurately propagated to its MV. Even with Accord backed MV, I think we
will need to limit the number of rows that get modified each time.

Regarding repair, due to bugs, operator errors, or hardware faults, MVs can
become out of sync with their base tables—regardless of the chosen
synchronization method during writes. The purpose of MV repair is to detect
and resolve these mismatches using the base table as the source of truth.
As a result, if data resurrection occurs in the base table, the repair
process will propagate that resurrected data to the MV.

>One is the lack of tombstones / deletions in the merkle trees, which makes
properly dealing with writes/deletes/inconsistency very hard (afaict)

Tombstones are excluded because a base table update can produce a tombstone
in the MV—for example, when the updated cell is part of the MV's primary
key. Since such tombstones may not exist in the base table, we can only
compare live data during MV repair.
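A small sketch of the asymmetry being described (hypothetical helper, not CEP code): when the updated cell is part of the MV primary key, the view update implies a deletion that the base table never records, which is why live-data comparison cannot simply include tombstones:

```python
def mv_updates_for(base_old, base_new, mv_key="v"):
    """Hypothetical: derive MV mutations for a base-table row update.
    Rows are dicts; mv_key is the base column used as the MV partition key."""
    updates = []
    # If the MV key column changed, the old view row must be tombstoned --
    # a deletion that exists only in the MV, never in the base table.
    if base_old is not None and base_old[mv_key] != base_new[mv_key]:
        updates.append(("DELETE", {mv_key: base_old[mv_key], "pk": base_old["pk"]}))
    updates.append(("UPSERT", {mv_key: base_new[mv_key], "pk": base_new["pk"]}))
    return updates

# UPDATE base SET v = 20 WHERE pk = 1 (previously v = 10): the MV row
# keyed by v = 10 must be deleted even though the base saw no delete.
muts = mv_updates_for({"pk": 1, "v": 10}, {"pk": 1, "v": 20})
```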

> repairing a single partition in the base table may repair all
hosts/ranges in the MV table,

That’s correct. To avoid repeatedly scanning both tables, the proposed
solution is for all nodes to take a snapshot first. Then, each node scans
the base table once and the MV table once, generating a list of Merkle
trees from each scan. These lists are then compared to identify mismatches.
This means MV repair must be performed at the table level rather than one
token range at a time to be efficient.

>If a row containing a column that creates an MV partition is deleted, and
the MV isn't updated, then how does the merkle tree approach propagate the
deletion to the MV? The CEP says that anti-compaction would remove extra
rows, but I am not clear on how that works. When is anti-compaction
performed in the repair process and what is/isn't included in the outputs?

Let me illustrate this with an example:

We have the following base table and MV:

CREATE TABLE base (pk int, ck int, v int, PRIMARY KEY (pk, ck));
CREATE MATERIALIZED VIEW mv AS SELECT * FROM base PRIMARY KEY (ck, pk);

Assume there are 100 rows in the base table (e.g., (1,1), (2,2), ...,
(100,100)), and accordingly, the MV also has 100 rows. Now, suppose the row
(55,55) is deleted from the base table, but due to some issue, it still
exists in the MV.

Let's say each Merkle tree covers 20 rows in both the base and MV tables,
so we have a 5x5 grid—25 Merkle tree comparisons in total. Suppose the
repair job detects a mismatch in the range base(40–59) vs MV(40–59).

On the node that owns the MV range (40–59), anti-compaction will be
triggered. If all 100 rows were in a single SSTable, it would be split into
two SSTables: one containing the 20 rows in the (40–59) range, and the
other containing the remaining 80 rows.

On the base table side, the node will scan the (40–59) range, identify all
rows that map to the MV range (40–59)—which in this example would be 19
rows—and stream them to the MV node. Once streaming completes, the MV node
can safely mark the 20-row SSTable as obsolete. In this way, the extra row
in MV is removed.

The core idea is to reconstruct the MV data for base range (40–59) and MV
range (40–59) using the corresponding base table range as the source of
truth.
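The 5x5 grid walked through above can be sketched as follows (illustrative Python only; simple per-(base range, MV range) digests stand in for real Merkle trees, and the bucketing is a stand-in for token ranges):

```python
import hashlib

def bucket(key):
    """Map keys 1..100 into five 20-row ranges; the third is 40-59."""
    return min(key // 20, 4)

def digest(rows):
    """Stand-in for a Merkle tree: one digest over the cell's live rows."""
    h = hashlib.sha256()
    for row in sorted(rows):
        h.update(repr(row).encode())
    return h.hexdigest()

def build_grid(rows):
    """One digest per (base-range, MV-range) cell that has any rows."""
    cells = {}
    for pk, ck in rows:
        cells.setdefault((bucket(pk), bucket(ck)), []).append((pk, ck))
    return {cell: digest(rs) for cell, rs in cells.items()}

# 100 rows (1,1)..(100,100); (55,55) was deleted from base but lingers in the MV.
base_rows = [(k, k) for k in range(1, 101) if k != 55]
mv_rows = [(k, k) for k in range(1, 101)]

base_grid, mv_grid = build_grid(base_rows), build_grid(mv_rows)
mismatches = {cell for cell in base_grid.keys() | mv_grid.keys()
              if base_grid.get(cell) != mv_grid.get(cell)}
# Only the base(40-59) x MV(40-59) cell differs, so only that MV range
# needs anti-compaction and re-streaming from the matching base range.
```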


On Fri, May 9, 2025 at 2:26 PM David Capwell  wrote:

> The MV repair tool in Cassandra is intended to address inconsistencies
> that may occur in materialized views due to various factors. This component
> is the most complex and demanding part of the development effort,
> representing roughly 70% of the overall work.
>
>
> but I am very concerned about how the merkle tree comparisons are likely
> to work with wide partitions leading to massive fanout in ranges.
>
>
> As far as I can tell, bein

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-09 Thread David Capwell
> The MV repair tool in Cassandra is intended to address inconsistencies that 
> may occur in materialized views due to various factors. This component is the 
> most complex and demanding part of the development effort, representing 
> roughly 70% of the overall work.

> but I am very concerned about how the merkle tree comparisons are likely to 
> work with wide partitions leading to massive fanout in ranges. 

As far as I can tell, being based off Accord means you don’t need to care about 
repair, as Accord will manage the consistency for you; you can’t get out of 
sync.

Being based off Accord also means you can deal with multiple partitions/tokens, 
whereas LWT is limited to a single token. I am not sure how the following 
would work with the proposed design and LWT:

CREATE TABLE tbl (pk int, ck int, v int, PRIMARY KEY (pk, ck));
CREATE MATERIALIZED VIEW tbl2
AS SELECT * FROM tbl WHERE ck > 42 PRIMARY KEY(pk, ck)

— mutations
UPDATE tbl SET v=42 WHERE pk IN (0, 1) AND ck IN (50, 74); — this touches 2 
partition keys
BEGIN BATCH — also touches 2 partition keys
  INSERT INTO tbl (pk, ck, v) VALUES (0, 47, 0);
  INSERT INTO tbl (pk, ck, v) VALUES (1, 48, 0);
END BATCH
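As a sketch of why these mutations exceed LWT's scope (illustrative only; real token assignment hashes the serialized partition key with Murmur3), expanding the IN-clause update shows it writes two distinct partition keys, and hence two tokens:

```python
from itertools import product

# UPDATE tbl SET v=42 WHERE pk IN (0, 1) AND ck IN (50, 74)
# expands to one written row per (pk, ck) combination.
rows_written = [(pk, ck) for pk, ck in product((0, 1), (50, 74))]
partitions = sorted({pk for pk, _ in rows_written})

# An LWT serializes on a single token (one partition key); this statement
# touches two partitions, so it cannot be expressed as one LWT.
```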



> On May 9, 2025, at 1:03 PM, Jeff Jirsa  wrote:
> 
> 
> 
>> On May 9, 2025, at 12:59 PM, Ariel Weisberg  wrote:
>> 
>> 
>> I am *big* fan of getting repair really working with MVs. It does seem 
>> problematic that the number of merkle trees will be equal to the number of 
>> ranges in the cluster and repair of MVs would become an all node operation.  
>> How would down nodes be handled, and how many nodes would simultaneously be 
>> working to validate a given base table range at once? How many base table 
>> ranges could simultaneously be repairing MVs?
>> 
>> If a row containing a column that creates an MV partition is deleted, and 
>> the MV isn't updated, then how does the merkle tree approach propagate the 
>> deletion to the MV? The CEP says that anti-compaction would remove extra 
>> rows, but I am not clear on how that works. When is anti-compaction 
>> performed in the repair process and what is/isn't included in the outputs?
> 
> 
> I thought about these two points last night after I sent my email.
> 
> There’s 2 things in this proposal that give me a lot of pause.
> 
> One is the lack of tombstones / deletions in the merkle trees, which makes 
> properly dealing with writes/deletes/inconsistency very hard (afaict)
> 
> The second is the reality that repairing a single partition in the base table 
> may repair all hosts/ranges in the MV table, and vice versa. Basically 
> scanning either base or MV is effectively scanning the whole cluster (modulo 
> what you can avoid in the clean/dirty repaired sets). This makes me really, 
> really concerned with how it scales, and how likely it is to be able to 
> schedule automatically without blowing up. 
> 
> The paxos vs accord comments so far are interesting in that I think both 
> could be made to work, but I am very concerned about how the merkle tree 
> comparisons are likely to work with wide partitions leading to massive fanout 
> in ranges. 
> 
> 



Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-09 Thread Jeff Jirsa


> On May 9, 2025, at 12:59 PM, Ariel Weisberg  wrote:
> 
> 
> I am *big* fan of getting repair really working with MVs. It does seem 
> problematic that the number of merkle trees will be equal to the number of 
> ranges in the cluster and repair of MVs would become an all node operation.  
> How would down nodes be handled, and how many nodes would simultaneously be 
> working to validate a given base table range at once? How many base table 
> ranges could simultaneously be repairing MVs?
> 
> If a row containing a column that creates an MV partition is deleted, and the 
> MV isn't updated, then how does the merkle tree approach propagate the 
> deletion to the MV? The CEP says that anti-compaction would remove extra 
> rows, but I am not clear on how that works. When is anti-compaction performed 
> in the repair process and what is/isn't included in the outputs?


I thought about these two points last night after I sent my email.

There’s 2 things in this proposal that give me a lot of pause.

One is the lack of tombstones / deletions in the merkle trees, which makes 
properly dealing with writes/deletes/inconsistency very hard (afaict)

The second is the reality that repairing a single partition in the base table 
may repair all hosts/ranges in the MV table, and vice versa. Basically scanning 
either base or MV is effectively scanning the whole cluster (modulo what you 
can avoid in the clean/dirty repaired sets). This makes me really, really 
concerned with how it scales, and how likely it is to be able to schedule 
automatically without blowing up. 

The paxos vs accord comments so far are interesting in that I think both could 
be made to work, but I am very concerned about how the merkle tree comparisons 
are likely to work with wide partitions leading to massive fanout in ranges. 




Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-09 Thread Ariel Weisberg
Hi,

Great to see MVs getting some attention and it's a good time to start 
addressing their shortcomings.

Looking at the details of the CEP it seems to describe Paxos as PaxosV1, but 
PaxosV2 works slightly differently (it can read during the prepare phase). I 
assume that supporting Paxos means supporting both V1 and V2 for materialized 
views?
 
As has been mentioned Accord doesn't have many restrictions on what you can do 
logically in a transaction. The only significant restrictions are that you must 
know the keys that are going to be read/written to when the transaction starts 
(known in this case) and you can only read once, process the result of the 
read, and then generate a single set of writes to apply. Here are some docs 
explaining how CQL is implemented on Accord 
https://github.com/aweisberg/cassandra-website/blob/20637/content/doc/trunk/cassandra/architecture/cql-on-accord.html

Why mandate `LOCAL_QUORUM` instead of using the consistency level requested by 
the application? If they want to use `LOCAL_QUORUM` they can always request it.

Using a transaction system as a better batch log for a key seems like a pretty 
reasonable way to make reads/writes for materialized views safer. A really good 
implementation of this would overlap the MV read with the read from the 
transaction system for unfinished updates and then discard or augment it if the 
MV read is incomplete. Accord will do this for you automatically just by virtue 
of supporting multi-key transactions.

I am *big* fan of getting repair really working with MVs. It does seem 
problematic that the number of merkle trees will be equal to the number of 
ranges in the cluster and repair of MVs would become an all node operation.  
How would down nodes be handled, and how many nodes would simultaneously be working 
to validate a given base table range at once? How many base table ranges could 
simultaneously be repairing MVs?

If a row containing a column that creates an MV partition is deleted, and the 
MV isn't updated, then how does the merkle tree approach propagate the deletion 
to the MV? The CEP says that anti-compaction would remove extra rows, but I am 
not clear on how that works. When is anti-compaction performed in the repair 
process and what is/isn't included in the outputs?

Thanks,
Ariel

On Tue, May 6, 2025, at 6:51 PM, Runtian Liu wrote:
> Hi everyone,
> 
> We’d like to propose a new Cassandra Enhancement Proposal: CEP-48: 
> First-Class Materialized View Support.
> 
> This CEP focuses on addressing the long-standing consistency issues in the 
> current Materialized View (MV) implementation by introducing a new 
> architecture that keeps base tables and MVs reliably in sync. It also adds a 
> new validation and repair type to Cassandra’s repair process to support MV 
> repair based on the base table. The goal is to make MV a first-class, 
> production-ready feature that users can depend on—without relying on external 
> reconciliation tools or custom workarounds.
> 
> We’d really appreciate your feedback—please keep the discussion on this 
> mailing list thread.
> 
> 
> Thanks,
> Runtian
> 


Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-09 Thread Benedict
I should add that I’m in favour of this proposal in principle, and support the 
proposal to utilise Paxos.

On 9 May 2025, at 08:21, Benedict Elliott Smith  wrote:

I’d also like to explore a bit further the isolation guarantees we’re promising 
with "Strict Consistency Mode” - and the protocol details. By strict, do we 
mean linearizable? Either way, we should state the guarantees explicitly so we 
can evaluate whether the protocol can meet them. Also, if the protocol is not 
linearisable, we should leave design space for a genuinely strict mode later.

It isn’t clearly stated in the design document, but it seems to me that safety 
with this approach requires a SERIAL base-table read for every MV read to 
ensure the view is consistent with the base table. This means the MV cannot 
meaningfully replicate any data, only keys that can be used to consult the base 
table. Is that a reasonable inference for “strict" mode?

Using LOCAL_SERIAL for this purpose (as it seems the document proposes) cannot 
provide strict guarantees, and mixing LOCAL_SERIAL with SERIAL is generally 
considered unsafe - so we need to explore a bit more in the design document 
what this means, but once we understand the isolation guarantees we're 
promising that will be easier.

On 9 May 2025, at 02:13, Jeff Jirsa  wrote:

Setting aside the paxos vs accord conversation (though admittedly my first 
question would have been “why not accord”), I’m curious from folks who have 
thought about this how you’re thinking about correctness of repair.

I ask because I have seen far more data resurrection cases than I have lost 
write cases, so repair here propagates that resurrection? Is that the expected 
primary behavior? I know repair also propagates resurrection in many cases 
(once tombstones purge), but has anyone running MVs in real life seen 
mismatches caused by lost writes instead of by something else (like 
resurrection)?

On May 8, 2025, at 5:44 PM, Runtian Liu  wrote:

Here’s my perspective:

#1 Accord vs. LWT round trips

Based on the insights shared by the Accord experts, it appears that 
implementing MV using Accord can achieve a comparable number of round trips as 
the LWT solution proposed in CEP-48. Additionally, it seems that the number of 
WAN RTTs might be fewer than the LWT solution through Accord. This suggests 
that Accord is either equivalent or better in terms of performance for CEP-48.

Given this, it seems appropriate to set aside performance as a deciding factor 
when evaluating LWT versus Accord. I've also updated the CEP-48 page to 
reflect this clarification.

#2 Accord vs. LWT current state

Accord

Accord is poised to significantly reshape Apache Cassandra's future and stands 
out as one of the most impactful developments on the horizon. The community is 
genuinely excited about its potential.

That said, the recent mailing list update on Accord (CEP-15) highlights that 
substantial work remains to mature the protocol entirely. In addition, 
real-world testing is still needed to validate its readiness. Beyond that, 
users will require additional time to evaluate and adopt Cassandra 6.x in 
their environments.

LWT

On the other hand, LWT has been proven and has been hitting production at 
scale for many years.

#3 Dev work for CEP-48

The CEP-48 design has two major components.

Online path (CQL Mutations)

This section focuses on the LWT code path where any mutation to a base table 
(via CQL insert, update, or delete) reliably triggers the corresponding 
materialized view (MV) update. The development effort required for this part 
is relatively limited, accounting for approximately 30% of the total work.

If we need to implement this on Accord, this would be a similar effort as the 
LWT.

Offline path (MV Data Repair)

The MV repair tool in Cassandra is intended to address inconsistencies that 
may occur in materialized views due to various factors. This component is the 
most complex and demanding part of the development effort, representing 
roughly 70% of the overall work.

#4 Accord is mentioned as a Future Alternative in CEP-48

Accord has always been top of mind, and we genuinely appreciate the thought 
and effort that has gone into its design and implementation - We’re excited 
about the changes, and if you look at the CEP-48 proposal, Accord is listed as 
a 'Future Alternative' — not as a 'Rejected Alternative' — to make clear that 
we continue to see value in its approach and are not opposed to it.

Based on #1, #2, #3, and #4, here is my thinking:

Scenario#1: CEP-15 prod takes longer than CEP-48 merge

Since we're starting with LWT, there is no dependency on the progress of 
CEP-15. This means the community can benefit from CEP-48 independently of 
CEP-15's timeline. Additionally, it's possible to backport the changes from 
trunk to the current broadly adopted Cassandra release (4.1.x), enabling 
adoption before upgrading to 6.x.

Scenario#2: CEP-15 prod qualified before CEP-48 merge

As noted in #3, developing on top of Accord is a relatively small effort of 
the overall CEP-48 scope. Therefore, we can imple

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-09 Thread Benedict Elliott Smith
I’d also like to explore a bit further the isolation guarantees we’re promising 
with "Strict Consistency Mode” - and the protocol details. By strict, do we 
mean linearizable? Either way, we should state the guarantees explicitly so we 
can evaluate whether the protocol can meet them. Also, if the protocol is not 
linearisable, we should leave design space for a genuinely strict mode later.

It isn’t clearly stated in the design document, but it seems to me that safety 
with this approach requires a SERIAL base-table read for every MV read to 
ensure the view is consistent with the base table. This means the MV cannot 
meaningfully replicate any data, only keys that can be used to consult the base 
table. Is that a reasonable inference for “strict" mode?

Using LOCAL_SERIAL for this purpose (as it seems the document proposes) cannot 
provide strict guarantees, and mixing LOCAL_SERIAL with SERIAL is generally 
considered unsafe - so we need to explore a bit more in the design document 
what this means, but once we understand the isolation guarantees we're 
promising that will be easier.


> On 9 May 2025, at 02:13, Jeff Jirsa  wrote:
> 
> Setting aside the paxos vs accord conversation (though admittedly my first 
> question would have been “why not accord”), I’m curious from folks who have 
> thought about this how you’re thinking about correctness of repair
> 
> I ask because I have seen far more data resurrection cases than I have lost 
> write cases, so repair here propagates that resurrection? Is that the 
> expected primary behavior? I know repair also propagates resurrection in many 
> cases (once tombstones purge), but has anyone running MVs in real life seen 
> mismatches caused by lost writes instead of by something else (like 
> resurrection)?
> 
> 
>> On May 8, 2025, at 5:44 PM, Runtian Liu  wrote:
>> 
>> 
>> Here’s my perspective:
>> 
>> #1 Accord vs. LWT round trips
>> 
>> Based on the insights shared by the Accord experts, it appears that 
>> implementing MV using Accord can achieve a comparable number of round trips 
>> as the LWT solution proposed in CEP-48. Additionally, it seems that the 
>> number of WAN RTTs might be fewer than the LWT solution through Accord. This 
>> suggests that Accord is either equivalent or better in terms of performance 
>> for CEP-48.
>> 
>> Given this, it seems appropriate to set aside performance as a deciding 
>> factor when evaluating LWT versus Accord. I've also updated the CEP-48 page 
>> to reflect this clarification.
>> 
>> #2 Accord vs. LWT current state
>> 
>> Accord 
>> 
>> Accord is poised to significantly reshape Apache Cassandra's future and 
>> stands out as one of the most impactful developments on the horizon. The 
>> community is genuinely excited about its potential.
>> 
>> That said, the recent mailing list update on 
>> Accord (CEP-15) highlights that substantial work remains to mature the 
>> protocol entirely. In addition, real-world testing is still needed to 
>> validate its readiness. Beyond that, users will require additional time to 
>> evaluate and adopt Cassandra 6.x in their environments.
>> 
>> LWT
>> 
>> On the other hand, LWT has been proven and has been hitting production at 
>> scale for many years.
>> 
>> #3 Dev work for CEP-48
>> 
>> The CEP-48 design has two major components.
>> 
>> Online path (CQL Mutations)
>> 
>> This section focuses on the LWT code path where any mutation to a base table 
>> (via CQL insert, update, or delete) reliably triggers the corresponding 
>> materialized view (MV) update. The development effort required for this part 
>> is relatively limited, accounting for approximately 30% of the total work.
>> 
>> If we need to implement this on Accord, this would be a similar effort as 
>> the LWT.
>> 
>> Offline path (MV Data Repair)
>> 
>> The MV repair tool in Cassandra is intended to address inconsistencies that 
>> may occur in materialized views due to various factors. This component is 
>> the most complex and demanding part of the development effort, representing 
>> roughly 70% of the overall work.
>> 
>> #4 Accord is mentioned as a Future Alternative in CEP-48
>> 
>> Accord has always been top of mind, and we genuinely appreciate the thought 
>> and effort that has gone into its design and implementation -  We’re excited 
>> about the changes, and if you look at the CEP-48 proposal, Accord is listed 
>> as a 'Future Alternative' — not as a 'Rejected Alternative' — to make clear 
>> that we continue to see value in its approach and are not opposed to it.
>> 
>> 
>> 
>> Based on #1, #2, #3, and #4, here is my thinking:
>> 
>> Scenario#1: CEP-15 prod takes longer than CEP-48 merge
>> 
>> Since we're starting with LWT, there is no dependency on the progress of 
>> CEP-15. This means the community can benefit from CEP-48 independently of 
>> CEP-15's timeline. Additionally, it's possible to backport the changes from 
>> trunk t

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-08 Thread Jeff Jirsa
Setting aside the paxos vs accord conversation (though admittedly my first 
question would have been “why not accord”), I’m curious from folks who have 
thought about this how you’re thinking about correctness of repair.

I ask because I have seen far more data resurrection cases than I have lost 
write cases, so repair here propagates that resurrection? Is that the expected 
primary behavior? I know repair also propagates resurrection in many cases 
(once tombstones purge), but has anyone running MVs in real life seen 
mismatches caused by lost writes instead of by something else (like 
resurrection)?

On May 8, 2025, at 5:44 PM, Runtian Liu  wrote:

Here’s my perspective:

#1 Accord vs. LWT round trips

Based on the insights shared by the Accord experts, it appears that 
implementing MV using Accord can achieve a comparable number of round trips as 
the LWT solution proposed in CEP-48. Additionally, it seems that the number of 
WAN RTTs might be fewer than the LWT solution through Accord. This suggests 
that Accord is either equivalent or better in terms of performance for CEP-48.

Given this, it seems appropriate to set aside performance as a deciding factor 
when evaluating LWT versus Accord. I've also updated the CEP-48 page to 
reflect this clarification.

#2 Accord vs. LWT current state

Accord

Accord is poised to significantly reshape Apache Cassandra's future and stands 
out as one of the most impactful developments on the horizon. The community is 
genuinely excited about its potential.

That said, the recent mailing list update on Accord (CEP-15) highlights that 
substantial work remains to mature the protocol entirely. In addition, 
real-world testing is still needed to validate its readiness. Beyond that, 
users will require additional time to evaluate and adopt Cassandra 6.x in 
their environments.

LWT

On the other hand, LWT has been proven and has been hitting production at 
scale for many years.

#3 Dev work for CEP-48

The CEP-48 design has two major components.

Online path (CQL Mutations)

This section focuses on the LWT code path where any mutation to a base table 
(via CQL insert, update, or delete) reliably triggers the corresponding 
materialized view (MV) update. The development effort required for this part 
is relatively limited, accounting for approximately 30% of the total work.

If we need to implement this on Accord, this would be a similar effort as the 
LWT.

Offline path (MV Data Repair)

The MV repair tool in Cassandra is intended to address inconsistencies that 
may occur in materialized views due to various factors. This component is the 
most complex and demanding part of the development effort, representing 
roughly 70% of the overall work.

#4 Accord is mentioned as a Future Alternative in CEP-48

Accord has always been top of mind, and we genuinely appreciate the thought 
and effort that has gone into its design and implementation - We’re excited 
about the changes, and if you look at the CEP-48 proposal, Accord is listed as 
a 'Future Alternative' — not as a 'Rejected Alternative' — to make clear that 
we continue to see value in its approach and are not opposed to it.

Based on #1, #2, #3, and #4, here is my thinking:

Scenario#1: CEP-15 prod takes longer than CEP-48 merge

Since we're starting with LWT, there is no dependency on the progress of 
CEP-15. This means the community can benefit from CEP-48 independently of 
CEP-15's timeline. Additionally, it's possible to backport the changes from 
trunk to the current broadly adopted Cassandra release (4.1.x), enabling 
adoption before upgrading to 6.x.

Scenario#2: CEP-15 prod qualified before CEP-48 merge

As noted in #3, developing on top of Accord is a relatively small effort of 
the overall CEP-48 scope. Therefore, we can implement using Accord before 
merging CEP-48 into trunk, allowing us to forgo the LWT-based approach.

Given that the work required to support Accord is relatively limited and that 
it would eliminate a dependency on a feature that is still maturing, 
proceeding with LWT is the most reliable path forward. Please feel free to 
share your thoughts.

On Thu, May 8, 2025 at 9:00 AM Jon Haddad  wrote:

Based on David and Blake’s responses, it sounds like we don’t need to block on 
anything. I realize you may be making a broader point, but in this instance it 
sounds like there’s nothing here preventing an accord based MV implementation. 
Now that I understand more about how it would be done, it also sounds a lot 
simpler.

On Thu, May 8, 2025 at 8:50 AM Josh McKenzie  wrote:

IMHO, focus should be on accord-based MVs. Even if that means it's blocked on 
first adding support for multiple conditions.

Strongly disagree here. We should develop features to be as loosely coupled 
w/one another as possible w/an eye towards future compatibility and leverage 
but not block development of one functionality on something else unless 
absolutely required for the feature to work (I'm defining "work" here as "hits 
user requirements with affordances consistent w/the re

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-08 Thread Runtian Liu
Here’s my perspective:

#1 Accord vs. LWT round trips

Based on the insights shared by the Accord experts, it appears that
implementing MVs using Accord can achieve a number of round trips
comparable to that of the LWT solution proposed in CEP-48. Additionally, it
seems that Accord might require even fewer WAN RTTs than the LWT solution.
This suggests that Accord is either equivalent or better in terms of
performance for CEP-48.

Given this, it seems appropriate to set aside performance as a deciding
factor when evaluating LWT versus Accord. I've also updated the CEP-48 page
to reflect this clarification.

#2 Accord vs. LWT current state

Accord

Accord is poised to significantly reshape Apache Cassandra's future and
stands out as one of the most impactful developments on the horizon. The
community is genuinely excited about its potential.

That said, the recent mailing list update on Accord (CEP-15) highlights
that substantial work remains to fully mature the protocol. In addition,
real-world testing is still needed to
validate its readiness. Beyond that, users will require additional time to
evaluate and adopt Cassandra 6.x in their environments.

LWT

On the other hand, LWT has been proven in production at scale for many
years.

#3 Dev work for CEP-48

The CEP-48 design has two major components.

   1. Online path (CQL Mutations)

This section focuses on the LWT code path where any mutation to a base
table (via CQL insert, update, or delete) reliably triggers the
corresponding materialized view (MV) update. The development effort
required for this part is relatively limited, accounting for approximately
30% of the total work.

If we need to implement this on Accord, it would require a similar effort
to the LWT approach.

   2. Offline path (MV Data Repair)

The MV repair tool in Cassandra is intended to address inconsistencies that
may occur in materialized views due to various factors. This component is
the most complex and demanding part of the development effort, representing
roughly 70% of the overall work.

#4 Accord is mentioned as a Future Alternative in CEP-48

Accord has always been top of mind, and we genuinely appreciate the thought
and effort that has gone into its design and implementation. We're excited
about the changes, and if you look at the CEP-48 proposal, Accord is listed
as a 'Future Alternative', not as a 'Rejected Alternative', to make clear
that we continue to see value in its approach and are not opposed to it.


Based on #1, #2, #3, and #4, here is my thinking:

*Scenario #1*: CEP-15 production readiness takes longer than the CEP-48 merge

Since we're starting with LWT, there is no dependency on the progress of
CEP-15. This means the community can benefit from CEP-48 independently of
CEP-15's timeline. Additionally, it's possible to backport the changes from
trunk to the current broadly adopted Cassandra release (4.1.x), enabling
adoption before upgrading to 6.x.

*Scenario #2*: CEP-15 is production-qualified before the CEP-48 merge

As noted in #3, developing on top of Accord is a relatively small effort
within the overall CEP-48 scope. Therefore, we can implement on Accord
before merging CEP-48 into trunk, allowing us to forgo the LWT-based
approach.

Given that the work required to later move to Accord is relatively limited,
and that starting with LWT removes a dependency on a feature that is still
maturing, proceeding with LWT is the most reliable path forward. Please
feel free to share your thoughts.


On Thu, May 8, 2025 at 9:00 AM Jon Haddad  wrote:

> Based on David and Blake’s responses, it sounds like we don’t need to
> block on anything.
>
> I realize you may be making a broader point, but in this instance it
> sounds like there’s nothing here preventing an accord based MV
> implementation. Now that i understand more about how it would be done, it
> also sounds a lot simpler.
>
>
>
>
> On Thu, May 8, 2025 at 8:50 AM Josh McKenzie  wrote:
>
>> IMHO, focus should be on accord-based MVs.  Even if that means it's
>> blocked on first adding support for multiple conditions.
>>
>> Strongly disagree here. We should develop features to be as loosely
>> coupled w/one another as possible w/an eye towards future compatibility and
>> leverage but not block development of one functionality on something else
>> unless absolutely required for the feature to work (I'm defining "work"
>> here as "hits user requirements with affordances consistent w/the rest of
>> our ecosystem").
>>
>> With the logic of deferring to another feature, it would have been quite
>> reasonable for someone to make this same statement back in fall of '23 when
>> we were discussing delaying 5.0 for Accord's merge. But things come up, the
>> space we're in is complex, and cutting edge distributed things are Hard.
>>
>>
>> On Thu, May 8, 2025, at 11:13 AM, Mick Semb Wever wrote:
>>
>>
>>
>>
>> Curious what others think though.  I'm +1 on the spirit of getting MVs to
>>

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-08 Thread Jon Haddad
Based on David and Blake’s responses, it sounds like we don’t need to block
on anything.

I realize you may be making a broader point, but in this instance it sounds
like there’s nothing here preventing an accord based MV implementation. Now
that i understand more about how it would be done, it also sounds a lot
simpler.




On Thu, May 8, 2025 at 8:50 AM Josh McKenzie  wrote:

> IMHO, focus should be on accord-based MVs.  Even if that means it's
> blocked on first adding support for multiple conditions.
>
> Strongly disagree here. We should develop features to be as loosely
> coupled w/one another as possible w/an eye towards future compatibility and
> leverage but not block development of one functionality on something else
> unless absolutely required for the feature to work (I'm defining "work"
> here as "hits user requirements with affordances consistent w/the rest of
> our ecosystem").
>
> With the logic of deferring to another feature, it would have been quite
> reasonable for someone to make this same statement back in fall of '23 when
> we were discussing delaying 5.0 for Accord's merge. But things come up, the
> space we're in is complex, and cutting edge distributed things are Hard.
>
>
> On Thu, May 8, 2025, at 11:13 AM, Mick Semb Wever wrote:
>
>
>
>
> Curious what others think though.  I'm +1 on the spirit of getting MVs to
> a stable point, but not convinced this is the best approach.
>
>
>
>
> IMHO, focus should be on accord-based MVs.  Even if that means it's
> blocked on first adding support for multiple conditions.
>
>
>


Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-08 Thread Josh McKenzie
> IMHO, focus should be on accord-based MVs.  Even if that means it's blocked 
> on first adding support for multiple conditions.
> 
Strongly disagree here. We should develop features to be as loosely coupled 
w/one another as possible w/an eye towards future compatibility and leverage 
but not block development of one functionality on something else unless 
absolutely required for the feature to work (I'm defining "work" here as "hits 
user requirements with affordances consistent w/the rest of our ecosystem").

With the logic of deferring to another feature, it would have been quite 
reasonable for someone to make this same statement back in fall of '23 when we 
were discussing delaying 5.0 for Accord's merge. But things come up, the space 
we're in is complex, and cutting edge distributed things are Hard.


On Thu, May 8, 2025, at 11:13 AM, Mick Semb Wever wrote:
> 
>  
>> Curious what others think though.  I'm +1 on the spirit of getting MVs to a 
>> stable point, but not convinced this is the best approach.
>> 
> 
> 
> 
> 
> 
> IMHO, focus should be on accord-based MVs.  Even if that means it's blocked 
> on first adding support for multiple conditions.
> 


Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-08 Thread Mick Semb Wever
> Curious what others think though.  I'm +1 on the spirit of getting MVs to
> a stable point, but not convinced this is the best approach.
>


IMHO, focus should be on accord-based MVs.  Even if that means it's blocked
on first adding support for multiple conditions.


Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-07 Thread David Capwell
> I think the primary argument *against* Accord is that the syntax isn't 
> expressive enough to be able to address multiple conditions in MVs.  For each 
> field that's updated, you'll need to know if you want to add that update into 
> the transaction, and you'd need to check if it was modified.  Currently 
> Accord only supports a single conditional on the entire transaction.

What Blake said, we limited the CQL API for the first drop to get feedback on 
things (such as what the default result return should be).  But internal logic is free 
to do w/e it wants, so if you need:

If (isUpdated(field1))
  INSERT
If (isUpdated(field2))
  INSERT
If (isUpdated(field4))
  INSERT

Nothing should get in the way, this is internal so we don’t have to worry about 
the public API.
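Sketched outside Cassandra, the per-field conditional logic above could look like the following minimal Python model. The `is_updated` helper, the field names, and the dict row representation are illustrative assumptions for this sketch, not Cassandra internals:

```python
def is_updated(old_row, new_row, field):
    """Illustrative stand-in for isUpdated: did this field's value change?"""
    return old_row.get(field) != new_row.get(field)

def view_inserts(old_row, new_row, view_fields):
    """Emit one view INSERT per tracked field that actually changed,
    mirroring the chain of If (isUpdated(...)) INSERT blocks above."""
    inserts = []
    for field in view_fields:
        if is_updated(old_row, new_row, field):
            inserts.append((field, new_row.get(field)))
    return inserts

old = {"field1": "a", "field2": "b", "field4": "c"}
new = {"field1": "a", "field2": "B", "field4": "C"}
print(view_inserts(old, new, ["field1", "field2", "field4"]))
# [('field2', 'B'), ('field4', 'C')]
```

The point being made is that this branching lives in internal code, so the public CQL syntax never has to express it.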

> On May 7, 2025, at 9:31 AM, Blake Eggleston  wrote:
> 
> > Yes, you need to read the original row before the transaction begins in 
> > order to get the initial state, but could be done at local one by the 
> > coordinator, reading itself.  The performance overhead of an additional, 
> > local one read should be significantly less than a Paxos transaction that 
> > has to do additional round trips.  
> 
> Assuming the trick of overloading the paxos propose phase with a view update 
> mutation doesn’t have correctness edge cases, you may be able to get away 
> with doing something similar at the Accord apply phase, eliminating the need 
> for 2 operations.
> 
> > I think the primary argument *against* Accord is that the syntax isn't 
> > expressive enough to be able to address multiple conditions in MVs.  For 
> > each field that's updated, you'll need to know if you want to add that 
> > update into the transaction, and you'd need to check if it was modified.  
> > Currently Accord only supports a single conditional on the entire 
> > transaction.
> 
> That's a limitation of the CQL syntax, but not of accord itself. There's 
> nothing preventing internal features from providing their own Txn 
> implementation to be coordinated.
> 
> On Wed, May 7, 2025, at 9:00 AM, Jon Haddad wrote:
>> Glad to see folks are looking to improve MVs.  Definitely one of the areas 
>> we need some attention paid to.
>> 
>> Do you have a patch already for this?  We haven't had a discussion yet about 
>> winding down new development in trunk but IMO we should probably stop 
>> merging big things in soon and focus on getting the release out.
>> 
>> Since you're planning on making this opt-in, I think it might be better to 
>> leverage accord transactions.  The code should be quite a bit simpler on the 
>> write path.  Accord should require fewer round trips and work 
>> *significantly* better when there's multiple data centers.
>> 
>> > While this design enables atomic updates across base tables and MVs, it is 
>> > inefficient—even in the common (happy path) case—because it involves 
>> > reading the base table row twice: once before the transaction begins and 
>> > again during the transaction itself. This introduces unnecessary overhead.
>> Yes, you need to read the original row before the transaction begins in 
>> order to get the initial state, but could be done at local one by the 
>> coordinator, reading itself.  The performance overhead of an additional, 
>> local one read should be significantly less than a Paxos transaction that 
>> has to do additional round trips.  So at this point, I'm not convinced that 
>> performance of Accord transactions will be any more than Paxos ones, in fact 
>> I think it's probably the opposite.
>> 
>> From a reliability perspective, you've made a good point that Accord is new, 
>> but you're also proposing this be opt-in.  I think if you're going to make 
>> it opt-in anyways, we might as well go with something that isn't going to be 
>> considered tech debt as soon as it's merged.
>> 
>> I think the primary argument *against* Accord is that the syntax isn't 
>> expressive enough to be able to address multiple conditions in MVs.  For 
>> each field that's updated, you'll need to know if you want to add that 
>> update into the transaction, and you'd need to check if it was modified.  
>> Currently Accord only supports a single conditional on the entire transaction.
>> 
>> From a project perspective, I'd *much* rather see improvements to the 
>> expressiveness of transactions that give us a long-term solution, than 
>> something that we will immediately want to migrate off of after a single 
>> version. 
>> 
>> Curious what others think though.  I'm +1 on the spirit of getting MVs to a 
>> stable point, but not convinced this is the best approach.
>> 
>> Jon
>> 
>> 
>> On Wed, May 7, 2025 at 1:45 AM guo Maxwell wrote:
>> After thinking about it, if you want to use accord for synchronization in 
>> the future, you need to modify the base table attribute " transactional_mode 
>> = 'full' ".
>> If the user's base table does not want to use accord, do you plan to force 
>> the modification of this attribute?

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-07 Thread Blake Eggleston
> Yes, you need to read the original row before the transaction begins in order 
> to get the initial state, but could be done at local one by the coordinator, 
> reading itself.  The performance overhead of an additional, local one read 
> should be significantly less than a Paxos transaction that has to do 
> additional round trips.  

Assuming the trick of overloading the paxos propose phase with a view update 
mutation doesn’t have correctness edge cases, you may be able to get away with 
doing something similar at the Accord apply phase, eliminating the need for 2 
operations.

> I think the primary argument *against* Accord is that the syntax isn't 
> expressive enough to be able to address multiple conditions in MVs.  For each 
> field that's updated, you'll need to know if you want to add that update into 
> the transaction, and you'd need to check if it was modified.  Currently 
> Accord only supports a single conditional on the entire transaction.

That's a limitation of the CQL syntax, but not of accord itself. There's 
nothing preventing internal features from providing their own Txn 
implementation to be coordinated.

On Wed, May 7, 2025, at 9:00 AM, Jon Haddad wrote:
> Glad to see folks are looking to improve MVs.  Definitely one of the areas we 
> need some attention paid to.
> 
> Do you have a patch already for this?  We haven't had a discussion yet about 
> winding down new development in trunk but IMO we should probably stop merging 
> big things in soon and focus on getting the release out.
> 
> Since you're planning on making this opt-in, I think it might be better to 
> leverage accord transactions.  The code should be quite a bit simpler on the 
> write path.  Accord should require fewer round trips and work *significantly* 
> better when there's multiple data centers.
> 
> > While this design enables atomic updates across base tables and MVs, it is 
> > inefficient—even in the common (happy path) case—because it involves 
> > reading the base table row twice: once before the transaction begins and 
> > again during the transaction itself. This introduces unnecessary overhead.
> Yes, you need to read the original row before the transaction begins in order 
> to get the initial state, but could be done at local one by the coordinator, 
> reading itself.  The performance overhead of an additional, local one read 
> should be significantly less than a Paxos transaction that has to do 
> additional round trips.  So at this point, I'm not convinced that performance 
> of Accord transactions will be any more than Paxos ones, in fact I think it's 
> probably the opposite.
> 
> From a reliability perspective, you've made a good point that Accord is new, 
> but you're also proposing this be opt-in.  I think if you're going to make it 
> opt-in anyways, we might as well go with something that isn't going to be 
> considered tech debt as soon as it's merged.
> 
> I think the primary argument *against* Accord is that the syntax isn't 
> expressive enough to be able to address multiple conditions in MVs.  For each 
> field that's updated, you'll need to know if you want to add that update into 
> the transaction, and you'd need to check if it was modified.  Currently 
> Accord only supports a single conditional on the entire transaction.
> 
> From a project perspective, I'd *much* rather see improvements to the 
> expressiveness of transactions that give us a long-term solution, than 
> something that we will immediately want to migrate off of after a single 
> version. 
> 
> Curious what others think though.  I'm +1 on the spirit of getting MVs to a 
> stable point, but not convinced this is the best approach.
> 
> Jon
> 
> 
> On Wed, May 7, 2025 at 1:45 AM guo Maxwell  wrote:
>> After thinking about it, if you want to use accord for synchronization in 
>> the future, you need to modify the base table attribute " transactional_mode 
>> = 'full' ".
>> If the user's base table does not want to use accord, do you plan to force 
>> the modification of this attribute?
>> 
>> Runtian Liu  于2025年5月7日周三 12:08写道:
>>> Thanks for the questions. A few clarifications:
>>> 
>>>  • *Performance impact & opt-in model:* The new MV synchronization 
>>> mechanism is fully opt-in. We understand that LWT-backed writes may 
>>> introduce performance overhead, so users who prefer higher throughput over 
>>> strict consistency can continue using the existing MV implementation. The 
>>> new strict consistency mode can be toggled via a table-level option.
>>> 
>>>  • *Support for both implementations:* Even if this CEP is accepted, the 
>>> current MV behavior will remain available. Users will have the flexibility 
>>> to enable or disable the new mode as needed.
>>> 
>>>  • *Repair frequency:* MV inconsistency detection and repair is integrated 
>>> with Cassandra’s existing repair framework. It can be triggered manually 
>>> via `nodetool` or scheduled using the auto-repair infrastructure (per 
>>> CEP-37), allowing operators to control how frequently repairs run.

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-07 Thread Jon Haddad
Glad to see folks are looking to improve MVs.  Definitely one of the areas
we need some attention paid to.

Do you have a patch already for this?  We haven't had a discussion yet
about winding down new development in trunk but IMO we should probably stop
merging big things in soon and focus on getting the release out.

Since you're planning on making this opt-in, I think it might be better to
leverage accord transactions.  The code should be quite a bit simpler on
the write path.  Accord should require fewer round trips and work
*significantly* better when there's multiple data centers.

> While this design enables atomic updates across base tables and MVs, it
is inefficient—even in the common (happy path) case—because it involves
reading the base table row twice: once before the transaction begins and
again during the transaction itself. This introduces unnecessary overhead.

Yes, you need to read the original row before the transaction begins in
order to get the initial state, but could be done at local one by the
coordinator, reading itself.  The performance overhead of an additional,
local one read should be significantly less than a Paxos transaction that
has to do additional round trips.  So at this point, I'm not convinced that
the cost of Accord transactions will be any higher than Paxos ones; in
fact, I think it's probably the opposite.
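
As a rough model of the flow described here (an illustrative toy simulation, not Cassandra code; the table shapes and key names are assumptions): the coordinator first does a cheap local read to capture the prior base-row state, then a single conditional CAS-style update applies the base write together with the derived view tombstone and insert:

```python
base = {"user1": {"email": "old@example.com"}}   # base table: pk -> row
view = {"old@example.com": "user1"}              # MV keyed by email -> base pk

def local_read(pk):
    """Cheap LOCAL_ONE-style read by the coordinator to capture prior state."""
    return dict(base.get(pk, {}))

def cas_update(pk, expected, new_row):
    """Conditional update: apply the base write plus the derived view changes
    only if the row still matches the state we read. Returns False on a
    lost race so the caller can re-read and retry."""
    if base.get(pk, {}) != expected:
        return False
    old_email = expected.get("email")
    if old_email is not None and old_email != new_row["email"]:
        view.pop(old_email, None)                # tombstone the stale view row
    view[new_row["email"]] = pk                  # insert the new view row
    base[pk] = new_row
    return True

prior = local_read("user1")                      # one extra, local read
ok = cas_update("user1", prior, {"email": "new@example.com"})
print(ok, view)  # True {'new@example.com': 'user1'}
```

The extra `local_read` is the "read the original row before the transaction begins" step; the argument above is that this local read is far cheaper than adding round trips inside the Paxos transaction itself.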

From a reliability perspective, you've made a good point that Accord is
new, but you're also proposing this be opt-in.  I think if you're going to
make it opt-in anyways, we might as well go with something that isn't going
to be considered tech debt as soon as it's merged.
I think the primary argument *against* Accord is that the syntax isn't
expressive enough to be able to address multiple conditions in MVs.  For
each field that's updated, you'll need to know if you want to add that
update into the transaction, and you'd need to check if it was modified.
Currently Accord only supports a single conditional on the entire
transaction.

From a project perspective, I'd *much* rather see improvements to the
expressiveness of transactions that give us a long-term solution, than
something that we will immediately want to migrate off of after a single
version.

Curious what others think though.  I'm +1 on the spirit of getting MVs to a
stable point, but not convinced this is the best approach.

Jon


On Wed, May 7, 2025 at 1:45 AM guo Maxwell  wrote:

> After thinking about it, if you want to use accord for synchronization in
> the future, you need to modify the base table attribute "
> transactional_mode = 'full' ".
> If the user's base table does not want to use accord, do you plan to force
> the modification of this attribute?
>
> Runtian Liu  于2025年5月7日周三 12:08写道:
>
>> Thanks for the questions. A few clarifications:
>>
>>-
>>
>>*Performance impact & opt-in model:* The new MV synchronization
>>mechanism is fully opt-in. We understand that LWT-backed writes may
>>introduce performance overhead, so users who prefer higher throughput over
>>strict consistency can continue using the existing MV implementation. The
>>new strict consistency mode can be toggled via a table-level option.
>>-
>>
>>*Support for both implementations:* Even if this CEP is accepted, the
>>current MV behavior will remain available. Users will have the flexibility
>>to enable or disable the new mode as needed.
>>-
>>
>>*Repair frequency:* MV inconsistency detection and repair is
>>integrated with Cassandra’s existing repair framework. It can be triggered
>>manually via nodetool or scheduled using the auto-repair
>>infrastructure (per CEP-37), allowing operators to control how frequently
>>repairs run.
>>
>>
>> On Tue, May 6, 2025 at 7:09 PM guo Maxwell  wrote:
>>
>>> If the entire write operation involves additional LWTs to change the MV,
>>> it is uncertain whether users can accept the performance loss of such write
>>> operations.
>>>
>>> If this CEP is finally accepted, I think users should at least be given
>>> the choice of whether to use the old method or the new method, because
>>> after all, some users pursue performance rather than strict data
>>> consistency (we can provide the ability to disable or enable the new MV
>>> synchronization mechanism).
>>>
>>> Another question: what is the frequency of inconsistency detection and
>>> repair for the MV and base table?
>>>
>>> Runtian Liu  于2025年5月7日周三 06:51写道:
>>>
 Hi everyone,

 We’d like to propose a new Cassandra Enhancement Proposal: CEP-48:
 First-Class Materialized View Support
 
 .

 This CEP focuses on addressing the long-standing consistency issues in
 the current Materialized View (MV) implementation by introducing a new
 architecture that keeps base tables and MVs reliably in sync. It also adds
 a new validation and repair type to Cassandra’s repair process to support MV repair based on the base table.

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-07 Thread guo Maxwell
After thinking about it, if you want to use accord for synchronization in
the future, you need to modify the base table attribute "
transactional_mode = 'full' ".
If the user's base table does not want to use accord, do you plan to force
the modification of this attribute?

Runtian Liu  于2025年5月7日周三 12:08写道:

> Thanks for the questions. A few clarifications:
>
>-
>
>*Performance impact & opt-in model:* The new MV synchronization
>mechanism is fully opt-in. We understand that LWT-backed writes may
>introduce performance overhead, so users who prefer higher throughput over
>strict consistency can continue using the existing MV implementation. The
>new strict consistency mode can be toggled via a table-level option.
>-
>
>*Support for both implementations:* Even if this CEP is accepted, the
>current MV behavior will remain available. Users will have the flexibility
>to enable or disable the new mode as needed.
>-
>
>*Repair frequency:* MV inconsistency detection and repair is
>integrated with Cassandra’s existing repair framework. It can be triggered
>manually via nodetool or scheduled using the auto-repair
>infrastructure (per CEP-37), allowing operators to control how frequently
>repairs run.
>
>
> On Tue, May 6, 2025 at 7:09 PM guo Maxwell  wrote:
>
>> If the entire write operation involves additional LWTs to change the MV,
>> it is uncertain whether users can accept the performance loss of such write
>> operations.
>>
>> If this CEP is finally accepted, I think users should at least be given
>> the choice of whether to use the old method or the new method, because
>> after all, some users pursue performance rather than strict data
>> consistency (we can provide the ability to disable or enable the new MV
>> synchronization mechanism).
>>
>> Another question: what is the frequency of inconsistency detection and
>> repair for the MV and base table?
>>
>> Runtian Liu  于2025年5月7日周三 06:51写道:
>>
>>> Hi everyone,
>>>
>>> We’d like to propose a new Cassandra Enhancement Proposal: CEP-48:
>>> First-Class Materialized View Support
>>> 
>>> .
>>>
>>> This CEP focuses on addressing the long-standing consistency issues in
>>> the current Materialized View (MV) implementation by introducing a new
>>> architecture that keeps base tables and MVs reliably in sync. It also adds
>>> a new validation and repair type to Cassandra’s repair process to support
>>> MV repair based on the base table. The goal is to make MV a first-class,
>>> production-ready feature that users can depend on—without relying on
>>> external reconciliation tools or custom workarounds.
>>>
>>> We’d really appreciate your feedback—please keep the discussion on this
>>> mailing list thread.
>>>
>>> Thanks,
>>> Runtian
>>>
>>


Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-06 Thread Runtian Liu
Thanks for the questions. A few clarifications:

   - *Performance impact & opt-in model:* The new MV synchronization
     mechanism is fully opt-in. We understand that LWT-backed writes may
     introduce performance overhead, so users who prefer higher throughput over
     strict consistency can continue using the existing MV implementation. The
     new strict consistency mode can be toggled via a table-level option.
   - *Support for both implementations:* Even if this CEP is accepted, the
     current MV behavior will remain available. Users will have the flexibility
     to enable or disable the new mode as needed.
   - *Repair frequency:* MV inconsistency detection and repair is integrated
     with Cassandra’s existing repair framework. It can be triggered manually
     via nodetool or scheduled using the auto-repair infrastructure (per
     CEP-37), allowing operators to control how frequently repairs run.
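
Conceptually, base-driven MV validation boils down to deriving the expected view rows from the base table and diffing them against the actual view. The sketch below is an illustrative toy model of that idea (real repair would compare digests per token range rather than materializing full rows; the column names are assumptions):

```python
def expected_view(base_rows, view_key):
    """Derive the view rows a consistent MV should contain from base rows."""
    return {row[view_key]: row for row in base_rows}

def diff_view(base_rows, actual_view, view_key):
    """Return view keys missing from the MV and orphaned keys with no base row."""
    expected = expected_view(base_rows, view_key)
    missing = sorted(k for k in expected if k not in actual_view)
    orphaned = sorted(k for k in actual_view if k not in expected)
    return missing, orphaned

base_rows = [{"id": 1, "email": "a@x"}, {"id": 2, "email": "b@x"}]
actual = {"a@x": {"id": 1, "email": "a@x"}, "c@x": {"id": 3, "email": "c@x"}}
print(diff_view(base_rows, actual, "email"))  # (['b@x'], ['c@x'])
```

Missing keys would be repaired by rebuilding the view row from the base table; orphaned keys would be removed from the view.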


On Tue, May 6, 2025 at 7:09 PM guo Maxwell  wrote:

> If the entire write operation involves additional LWTs to change the MV,
> it is uncertain whether users can accept the performance loss of such write
> operations.
>
> If this CEP is finally accepted, I think users should at least be given
> the choice of whether to use the old method or the new method, because
> after all, some users pursue performance rather than strict data
> consistency (we can provide the ability to disable or enable the new MV
> synchronization mechanism).
>
> Another question: what is the frequency of inconsistency detection and
> repair for the MV and base table?
>
> Runtian Liu  于2025年5月7日周三 06:51写道:
>
>> Hi everyone,
>>
>> We’d like to propose a new Cassandra Enhancement Proposal: CEP-48:
>> First-Class Materialized View Support
>> 
>> .
>>
>> This CEP focuses on addressing the long-standing consistency issues in
>> the current Materialized View (MV) implementation by introducing a new
>> architecture that keeps base tables and MVs reliably in sync. It also adds
>> a new validation and repair type to Cassandra’s repair process to support
>> MV repair based on the base table. The goal is to make MV a first-class,
>> production-ready feature that users can depend on—without relying on
>> external reconciliation tools or custom workarounds.
>>
>> We’d really appreciate your feedback—please keep the discussion on this
>> mailing list thread.
>>
>> Thanks,
>> Runtian
>>
>


Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-06 Thread guo Maxwell
If the entire write operation involves additional LWTs to change the MV, it
is uncertain whether users can accept the performance loss of such write
operations.

If this CEP is finally accepted, I think users should at least be given the
choice of whether to use the old method or the new method, because after
all, some users pursue performance rather than strict data consistency (we
can provide the ability to disable or enable the new MV synchronization
mechanism).

Another question: what is the frequency of inconsistency detection and
repair for the MV and base table?

Runtian Liu  于2025年5月7日周三 06:51写道:

> Hi everyone,
>
> We’d like to propose a new Cassandra Enhancement Proposal: CEP-48:
> First-Class Materialized View Support
> 
> .
>
> This CEP focuses on addressing the long-standing consistency issues in the
> current Materialized View (MV) implementation by introducing a new
> architecture that keeps base tables and MVs reliably in sync. It also adds
> a new validation and repair type to Cassandra’s repair process to support
> MV repair based on the base table. The goal is to make MV a first-class,
> production-ready feature that users can depend on—without relying on
> external reconciliation tools or custom workarounds.
>
> We’d really appreciate your feedback—please keep the discussion on this
> mailing list thread.
>
> Thanks,
> Runtian
>