[jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)

2016-10-04 Thread sridhar nemani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546906#comment-15546906
 ] 

sridhar nemani edited comment on CASSANDRA-8844 at 10/4/16 10:46 PM:
-

I am fairly new to Cassandra. I have a requirement to be able to read any 
changes to tables, as in inserts deletes or updates from a given timestamp. I 
believe the new implementation of CDC should help me with this.
However, with CDC enabled, I want to know if there is yet a way to read the 
data inserts,updates or deletes to a table through CQL. I do see 
implementations of CommitLogReader. But, I want to know if it is possible to 
read the changes using CQL. If yes, how?

Please advise.

Thanks.


was (Author: sridhar.nem...@whamtech.com):
I am fairly new to Cassandra. I have a requirement to be able to read any 
changes to tables, as in inserts deletes or updates from a given timestamp. I 
believe the new implementation of CDC should help me with this.
However, I want to know if there is yet a way to read the data inserts,updates 
or deleted to a tables for which CDC has been enabled through CQL. I do see 
implementations of CommitLogReader. But, I want to know if it is possible to 
read the changes using CQL. 

Please advise.

Thanks.

> Change Data Capture (CDC)
> -
>
> Key: CASSANDRA-8844
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8844
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Coordination, Local Write-Read Paths
>Reporter: Tupshin Harper
>Assignee: Joshua McKenzie
>Priority: Critical
> Fix For: 3.8
>
>
> "In databases, change data capture (CDC) is a set of software design patterns 
> used to determine (and track) the data that has changed so that action can be 
> taken using the changed data. Also, Change data capture (CDC) is an approach 
> to data integration that is based on the identification, capture and delivery 
> of the changes made to enterprise data sources."
> -Wikipedia
> As Cassandra is increasingly being used as the Source of Record (SoR) for 
> mission critical data in large enterprises, it is increasingly being called 
> upon to act as the central hub of traffic and data flow to other systems. In 
> order to try to address the general need, we (cc [~brianmhess]), propose 
> implementing a simple data logging mechanism to enable per-table CDC patterns.
> h2. The goals:
> # Use CQL as the primary ingestion mechanism, in order to leverage its 
> Consistency Level semantics, and in order to treat it as the single 
> reliable/durable SoR for the data.
> # To provide a mechanism for implementing good and reliable 
> (deliver-at-least-once with possible mechanisms for deliver-exactly-once ) 
> continuous semi-realtime feeds of mutations going into a Cassandra cluster.
> # To eliminate the developmental and operational burden of users so that they 
> don't have to do dual writes to other systems.
> # For users that are currently doing batch export from a Cassandra system, 
> give them the opportunity to make that realtime with a minimum of coding.
> h2. The mechanism:
> We propose a durable logging mechanism that functions similar to a commitlog, 
> with the following nuances:
> - Takes place on every node, not just the coordinator, so RF number of copies 
> are logged.
> - Separate log per table.
> - Per-table configuration. Only tables that are specified as CDC_LOG would do 
> any logging.
> - Per DC. We are trying to keep the complexity to a minimum to make this an 
> easy enhancement, but most likely use cases would prefer to only implement 
> CDC logging in one (or a subset) of the DCs that are being replicated to
> - In the critical path of ConsistencyLevel acknowledgment. Just as with the 
> commitlog, failure to write to the CDC log should fail that node's write. If 
> that means the requested consistency level was not met, then clients *should* 
> experience UnavailableExceptions.
> - Be written in a Row-centric manner such that it is easy for consumers to 
> reconstitute rows atomically.
> - Written in a simple format designed to be consumed *directly* by daemons 
> written in non JVM languages
> h2. Nice-to-haves
> I strongly suspect that the following features will be asked for, but I also 
> believe that they can be deferred for a subsequent release, and to guage 
> actual interest.
> - Multiple logs per table. This would make it easy to have multiple 
> "subscribers" to a single table's changes. A workaround would be to create a 
> forking daemon listener, but that's not a great answer.
> - Log filtering. Being able to apply filters, including UDF-based filters 
> would make Casandra a much more versatile feeder into other systems, and 
> again, reduce complexity that would otherwise need to be built into the 
> 

[jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)

2016-06-16 Thread Joshua McKenzie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333832#comment-15333832
 ] 

Joshua McKenzie edited comment on CASSANDRA-8844 at 6/16/16 2:03 PM:
-

Switching between C# and Java everyday has its costs.

Fixed that, tidied up NEWS.txt (spacing and ordering on Upgrading and 
Deprecation), and 
[committed|https://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=commit;h=e31e216234c6b57a531cae607e0355666007deb2].

Thanks for the assist [~carlyeks] and [~blambov]!

I'll be creating a follow-up meta ticket w/subtasks from all the stuff that 
came up here that we deferred and link that to this ticket, as well as moving 
the link to CASSANDRA-11957 over there.


was (Author: joshuamckenzie):
Switching between C# and Java everyday has its costs.

Fixed that, tidied up NEWS.txt (spacing and ordering on Upgrading and 
Deprecation), and 
[committed|https://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=commit;h=5dcab286ca0fcd9a71e28dad805f028362572e21].

Thanks for the assist [~carlyeks] and [~blambov]!

I'll be creating a follow-up meta ticket w/subtasks from all the stuff that 
came up here that we deferred and link that to this ticket, as well as moving 
the link to CASSANDRA-11957 over there.

> Change Data Capture (CDC)
> -
>
> Key: CASSANDRA-8844
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8844
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Coordination, Local Write-Read Paths
>Reporter: Tupshin Harper
>Assignee: Joshua McKenzie
>Priority: Critical
> Fix For: 3.8
>
>
> "In databases, change data capture (CDC) is a set of software design patterns 
> used to determine (and track) the data that has changed so that action can be 
> taken using the changed data. Also, Change data capture (CDC) is an approach 
> to data integration that is based on the identification, capture and delivery 
> of the changes made to enterprise data sources."
> -Wikipedia
> As Cassandra is increasingly being used as the Source of Record (SoR) for 
> mission critical data in large enterprises, it is increasingly being called 
> upon to act as the central hub of traffic and data flow to other systems. In 
> order to try to address the general need, we (cc [~brianmhess]), propose 
> implementing a simple data logging mechanism to enable per-table CDC patterns.
> h2. The goals:
> # Use CQL as the primary ingestion mechanism, in order to leverage its 
> Consistency Level semantics, and in order to treat it as the single 
> reliable/durable SoR for the data.
> # To provide a mechanism for implementing good and reliable 
> (deliver-at-least-once with possible mechanisms for deliver-exactly-once ) 
> continuous semi-realtime feeds of mutations going into a Cassandra cluster.
> # To eliminate the developmental and operational burden of users so that they 
> don't have to do dual writes to other systems.
> # For users that are currently doing batch export from a Cassandra system, 
> give them the opportunity to make that realtime with a minimum of coding.
> h2. The mechanism:
> We propose a durable logging mechanism that functions similar to a commitlog, 
> with the following nuances:
> - Takes place on every node, not just the coordinator, so RF number of copies 
> are logged.
> - Separate log per table.
> - Per-table configuration. Only tables that are specified as CDC_LOG would do 
> any logging.
> - Per DC. We are trying to keep the complexity to a minimum to make this an 
> easy enhancement, but most likely use cases would prefer to only implement 
> CDC logging in one (or a subset) of the DCs that are being replicated to
> - In the critical path of ConsistencyLevel acknowledgment. Just as with the 
> commitlog, failure to write to the CDC log should fail that node's write. If 
> that means the requested consistency level was not met, then clients *should* 
> experience UnavailableExceptions.
> - Be written in a Row-centric manner such that it is easy for consumers to 
> reconstitute rows atomically.
> - Written in a simple format designed to be consumed *directly* by daemons 
> written in non JVM languages
> h2. Nice-to-haves
> I strongly suspect that the following features will be asked for, but I also 
> believe that they can be deferred for a subsequent release, and to guage 
> actual interest.
> - Multiple logs per table. This would make it easy to have multiple 
> "subscribers" to a single table's changes. A workaround would be to create a 
> forking daemon listener, but that's not a great answer.
> - Log filtering. Being able to apply filters, including UDF-based filters 
> would make Casandra a much more versatile feeder into other systems, and 
> again, reduce complexity that would otherwise need to be built into the 
> 

[jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)

2016-05-16 Thread Joshua McKenzie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285499#comment-15285499
 ] 

Joshua McKenzie edited comment on CASSANDRA-8844 at 5/17/16 12:29 AM:
--

If either of you ([~carlyeks] / [~blambov]) have started reviewing, there are a 
couple of logical flaws in the way I've used the value of "combined un-flushed 
cdc-containing segment size and cdc_raw". Currently a non-cdc mutation could 
fail an allocation, advance in allocatingFrom(), and allow subsequent cdc-based 
mutations to succeed since the new code doesn't check for {{atCapacity}} until 
the case of allocation failure. The current logic strictly precludes allocation 
of a new CommitLogSegment by a cdc-containing Mutation allocation so it works 
when tested on cdc-only streams of Mutations but not mixed; I'll be writing a 
unit test to prove that shortly. Problem #2: if we track un-flushed full 
cdc-containing segment size in {{atCapacity}} and use that as part of metric to 
reject CDC-containing Mutations *before* that allocation attempt, we would then 
prematurely reject cdc mutations in the final CommitLogSegment created in the 
chain before filling it.

I'm going to need to spend some more time thinking about this. My initial hunch 
is that we may be unable to track un-flushed segment size w/CDC data in them as 
a meaningful marker of future CDC-data, thus meaning we cannot guarantee 
adherence to the user-specified disk-space restrictions for CDC due to 
in-flight data not yet being counted.  As new segment allocation takes place in 
the management thread and the current logic is strongly coupled to the 
invariant that new segment allocation always succeeds (even if back-pressured 
by compression buffer usage), the approach of forcibly failing is less 
palatable to me than us being a little loose with our interpretation of 
cdc_total_space_in_mb by 1-N segment units, assuming N tends to be low single 
digits leading to <5% violation in the default case. This should hold true 
unless flushing gets wildly backed up relative to ingest of writes; I don't 
know enough about that code to speak to that but will likely read into it a bit.

Anyway - figured I'd point that out in case either of you came across it and 
registered it mentally as a concern or if either of you have any immediate 
ideas on this topic.

(edit x2: Thinking is hard.)


was (Author: joshuamckenzie):
If either of you ([~carlyeks] / [~blambov]) have started reviewing, there are a 
couple of logical flaws in the way I've used the value of "combined un-flushed 
cdc-containing segment size and cdc_raw". Currently a non-cdc mutation could 
succeed in an allocation during a full CDC time-frame while cdc allocations 
fail, advance in allocatingFrom(), and allow subsequent cdc-based mutations to 
succeed since the new code doesn't check for {{atCapacity}} until the case of 
allocation failure. The current logic strictly precludes allocation of a new 
CommitLogSegment by a cdc-containing Mutation allocation so it works when 
tested on cdc-only streams of Mutations but not mixed; I'll be writing a unit 
test to prove that shortly. Problem #2: if we track un-flushed full 
cdc-containing segment size in {{atCapacity}} and use that as part of metric to 
reject CDC-containing Mutations *before* that allocation attempt, we would then 
prematurely reject cdc mutations in the final CommitLogSegment created in the 
chain before filling it.

I'm going to need to spend some more time thinking about this. My initial hunch 
is that we may be unable to track un-flushed segment size w/CDC data in them as 
a meaningful marker of future CDC-data, thus meaning we cannot guarantee 
adherence to the user-specified disk-space restrictions for CDC due to 
in-flight data not yet being counted.  As new segment allocation takes place in 
the management thread and the current logic is strongly coupled to the 
invariant that new segment allocation always succeeds (even if back-pressured 
by compression buffer usage), the approach of forcibly failing is less 
palatable to me than us being a little loose with our interpretation of 
cdc_total_space_in_mb by 1-N segment units, assuming N tends to be low single 
digits leading to <5% violation in the default case. This should hold true 
unless flushing gets wildly backed up relative to ingest of writes; I don't 
know enough about that code to speak to that but will likely read into it a bit.

Anyway - figured I'd point that out in case either of you came across it and 
registered it mentally as a concern or if either of you have any immediate 
ideas on this topic.

> Change Data Capture (CDC)
> -
>
> Key: CASSANDRA-8844
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8844
> Project: Cassandra
>  Issue Type: New Feature
>  

[jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)

2016-05-16 Thread Joshua McKenzie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285499#comment-15285499
 ] 

Joshua McKenzie edited comment on CASSANDRA-8844 at 5/17/16 12:28 AM:
--

If either of you ([~carlyeks] / [~blambov]) have started reviewing, there are a 
couple of logical flaws in the way I've used the value of "combined un-flushed 
cdc-containing segment size and cdc_raw". Currently a non-cdc mutation could 
succeed in an allocation during a full CDC time-frame while cdc allocations 
fail, advance in allocatingFrom(), and allow subsequent cdc-based mutations to 
succeed since the new code doesn't check for {{atCapacity}} until the case of 
allocation failure. The current logic strictly precludes allocation of a new 
CommitLogSegment by a cdc-containing Mutation allocation so it works when 
tested on cdc-only streams of Mutations but not mixed; I'll be writing a unit 
test to prove that shortly. Problem #2: if we track un-flushed full 
cdc-containing segment size in {{atCapacity}} and use that as part of metric to 
reject CDC-containing Mutations *before* that allocation attempt, we would then 
prematurely reject cdc mutations in the final CommitLogSegment created in the 
chain before filling it.

I'm going to need to spend some more time thinking about this. My initial hunch 
is that we may be unable to track un-flushed segment size w/CDC data in them as 
a meaningful marker of future CDC-data, thus meaning we cannot guarantee 
adherence to the user-specified disk-space restrictions for CDC due to 
in-flight data not yet being counted.  As new segment allocation takes place in 
the management thread and the current logic is strongly coupled to the 
invariant that new segment allocation always succeeds (even if back-pressured 
by compression buffer usage), the approach of forcibly failing is less 
palatable to me than us being a little loose with our interpretation of 
cdc_total_space_in_mb by 1-N segment units, assuming N tends to be low single 
digits leading to <5% violation in the default case. This should hold true 
unless flushing gets wildly backed up relative to ingest of writes; I don't 
know enough about that code to speak to that but will likely read into it a bit.

Anyway - figured I'd point that out in case either of you came across it and 
registered it mentally as a concern or if either of you have any immediate 
ideas on this topic.


was (Author: joshuamckenzie):
If either of you ([~carlyeks] / [~blambov]) have started reviewing, there are a 
couple of logical flaws in the way I've used the value of "combined un-flushed 
cdc-containing segment size and cdc_raw". Currently a non-cdc mutation could 
fail an allocation, advance in allocatingFrom(), and allow subsequent cdc-based 
mutations to succeed since the new code doesn't check for {{atCapacity}} until 
the case of allocation failure. The current logic strictly precludes allocation 
of a new CommitLogSegment by a cdc-containing Mutation allocation so it works 
when tested on cdc-only streams of Mutations but not mixed; I'll be writing a 
unit test to prove that shortly. Problem #2: if we track un-flushed full 
cdc-containing segment size in {{atCapacity}} and use that as part of metric to 
reject CDC-containing Mutations *before* that allocation attempt, we would then 
prematurely reject cdc mutations in the final CommitLogSegment created in the 
chain before filling it.

I'm going to need to spend some more time thinking about this. My initial hunch 
is that we may be unable to track un-flushed segment size w/CDC data in them as 
a meaningful marker of future CDC-data, thus meaning we cannot guarantee 
adherence to the user-specified disk-space restrictions for CDC due to 
in-flight data not yet being counted.  As new segment allocation takes place in 
the management thread and the current logic is strongly coupled to the 
invariant that new segment allocation always succeeds (even if back-pressured 
by compression buffer usage), the approach of forcibly failing is less 
palatable to me than us being a little loose with our interpretation of 
cdc_total_space_in_mb by 1-N segment units, assuming N tends to be low single 
digits leading to <5% violation in the default case. This should hold true 
unless flushing gets wildly backed up relative to ingest of writes; I don't 
know enough about that code to speak to that but will likely read into it a bit.

Anyway - figured I'd point that out in case either of you came across it and 
registered it mentally as a concern or if either of you have any immediate 
ideas on this topic.

> Change Data Capture (CDC)
> -
>
> Key: CASSANDRA-8844
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8844
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Coordination, Local 

[jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)

2016-05-02 Thread Joshua McKenzie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266676#comment-15266676
 ] 

Joshua McKenzie edited comment on CASSANDRA-8844 at 5/2/16 9:15 PM:


The issue we're discussing here is less about policy for handling CDC failures, 
and more about that policy impacting both CDC and non-CDC writes unless we 
distinguish whether a Mutation contains CDC-enabled CF in them at write-time or 
not.

If we treat all Mutations equally, we would apply that policy to both CDC and 
non-CDC enabled writes, so CDC space being filled / backpressure would reject 
all writes on the node.

edit: On re-reading my comment, I want to make sure you don't think I'm 
dismissing the "CDC failure policy" portion of the discussion. We don't have 
that in the v1 spec but it should be a relatively easy addition after we get 
the basic framework in.


was (Author: joshuamckenzie):
The issue we're discussing here is less about policy for handling CDC failures, 
and more about that policy impacting both CDC and non-CDC writes unless we 
distinguish whether a Mutation contains CDC-enabled CF in them at write-time or 
not.

If we treat all Mutations equally, we would apply that policy to both CDC and 
non-CDC enabled writes, so CDC space being filled / backpressure would reject 
all writes on the node.

> Change Data Capture (CDC)
> -
>
> Key: CASSANDRA-8844
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8844
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Coordination, Local Write-Read Paths
>Reporter: Tupshin Harper
>Assignee: Joshua McKenzie
>Priority: Critical
> Fix For: 3.x
>
>
> "In databases, change data capture (CDC) is a set of software design patterns 
> used to determine (and track) the data that has changed so that action can be 
> taken using the changed data. Also, Change data capture (CDC) is an approach 
> to data integration that is based on the identification, capture and delivery 
> of the changes made to enterprise data sources."
> -Wikipedia
> As Cassandra is increasingly being used as the Source of Record (SoR) for 
> mission critical data in large enterprises, it is increasingly being called 
> upon to act as the central hub of traffic and data flow to other systems. In 
> order to try to address the general need, we (cc [~brianmhess]), propose 
> implementing a simple data logging mechanism to enable per-table CDC patterns.
> h2. The goals:
> # Use CQL as the primary ingestion mechanism, in order to leverage its 
> Consistency Level semantics, and in order to treat it as the single 
> reliable/durable SoR for the data.
> # To provide a mechanism for implementing good and reliable 
> (deliver-at-least-once with possible mechanisms for deliver-exactly-once ) 
> continuous semi-realtime feeds of mutations going into a Cassandra cluster.
> # To eliminate the developmental and operational burden of users so that they 
> don't have to do dual writes to other systems.
> # For users that are currently doing batch export from a Cassandra system, 
> give them the opportunity to make that realtime with a minimum of coding.
> h2. The mechanism:
> We propose a durable logging mechanism that functions similar to a commitlog, 
> with the following nuances:
> - Takes place on every node, not just the coordinator, so RF number of copies 
> are logged.
> - Separate log per table.
> - Per-table configuration. Only tables that are specified as CDC_LOG would do 
> any logging.
> - Per DC. We are trying to keep the complexity to a minimum to make this an 
> easy enhancement, but most likely use cases would prefer to only implement 
> CDC logging in one (or a subset) of the DCs that are being replicated to
> - In the critical path of ConsistencyLevel acknowledgment. Just as with the 
> commitlog, failure to write to the CDC log should fail that node's write. If 
> that means the requested consistency level was not met, then clients *should* 
> experience UnavailableExceptions.
> - Be written in a Row-centric manner such that it is easy for consumers to 
> reconstitute rows atomically.
> - Written in a simple format designed to be consumed *directly* by daemons 
> written in non JVM languages
> h2. Nice-to-haves
> I strongly suspect that the following features will be asked for, but I also 
> believe that they can be deferred for a subsequent release, and to guage 
> actual interest.
> - Multiple logs per table. This would make it easy to have multiple 
> "subscribers" to a single table's changes. A workaround would be to create a 
> forking daemon listener, but that's not a great answer.
> - Log filtering. Being able to apply filters, including UDF-based filters 
> would make Casandra a much more versatile feeder into other systems, and 
> 

[jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)

2016-04-29 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264753#comment-15264753
 ] 

DOAN DuyHai edited comment on CASSANDRA-8844 at 4/29/16 9:11 PM:
-

bq. I don't see reject all being viable since it'll shut down your cluster, and 
reject none on a slow consumer scenario would just lead to disk space 
exhaustion and lost CDC data.

 Didn't we agree in the past on the idea of having a parameter in Yaml that 
works similar to disk_failure_policy and let users decide which behavior they 
want on CDC overflow (reject writes/stop creating CDC/ ) ?


was (Author: doanduyhai):
> I don't see reject all being viable since it'll shut down your cluster, and 
> reject none on a slow consumer scenario would just lead to disk space 
> exhaustion and lost CDC data.

 Didn't we agreed in the past on the idea of having a parameter in Yaml that 
works similar to disk_failure_policy and let users decide which behavior they 
want on CDC overflow (reject writes/stop creating CDC/ ) ?

> Change Data Capture (CDC)
> -
>
> Key: CASSANDRA-8844
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8844
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Coordination, Local Write-Read Paths
>Reporter: Tupshin Harper
>Assignee: Joshua McKenzie
>Priority: Critical
> Fix For: 3.x
>
>
> "In databases, change data capture (CDC) is a set of software design patterns 
> used to determine (and track) the data that has changed so that action can be 
> taken using the changed data. Also, Change data capture (CDC) is an approach 
> to data integration that is based on the identification, capture and delivery 
> of the changes made to enterprise data sources."
> -Wikipedia
> As Cassandra is increasingly being used as the Source of Record (SoR) for 
> mission critical data in large enterprises, it is increasingly being called 
> upon to act as the central hub of traffic and data flow to other systems. In 
> order to try to address the general need, we (cc [~brianmhess]), propose 
> implementing a simple data logging mechanism to enable per-table CDC patterns.
> h2. The goals:
> # Use CQL as the primary ingestion mechanism, in order to leverage its 
> Consistency Level semantics, and in order to treat it as the single 
> reliable/durable SoR for the data.
> # To provide a mechanism for implementing good and reliable 
> (deliver-at-least-once with possible mechanisms for deliver-exactly-once ) 
> continuous semi-realtime feeds of mutations going into a Cassandra cluster.
> # To eliminate the developmental and operational burden of users so that they 
> don't have to do dual writes to other systems.
> # For users that are currently doing batch export from a Cassandra system, 
> give them the opportunity to make that realtime with a minimum of coding.
> h2. The mechanism:
> We propose a durable logging mechanism that functions similar to a commitlog, 
> with the following nuances:
> - Takes place on every node, not just the coordinator, so RF number of copies 
> are logged.
> - Separate log per table.
> - Per-table configuration. Only tables that are specified as CDC_LOG would do 
> any logging.
> - Per DC. We are trying to keep the complexity to a minimum to make this an 
> easy enhancement, but most likely use cases would prefer to only implement 
> CDC logging in one (or a subset) of the DCs that are being replicated to
> - In the critical path of ConsistencyLevel acknowledgment. Just as with the 
> commitlog, failure to write to the CDC log should fail that node's write. If 
> that means the requested consistency level was not met, then clients *should* 
> experience UnavailableExceptions.
> - Be written in a Row-centric manner such that it is easy for consumers to 
> reconstitute rows atomically.
> - Written in a simple format designed to be consumed *directly* by daemons 
> written in non JVM languages
> h2. Nice-to-haves
> I strongly suspect that the following features will be asked for, but I also 
> believe that they can be deferred for a subsequent release, and to guage 
> actual interest.
> - Multiple logs per table. This would make it easy to have multiple 
> "subscribers" to a single table's changes. A workaround would be to create a 
> forking daemon listener, but that's not a great answer.
> - Log filtering. Being able to apply filters, including UDF-based filters 
> would make Casandra a much more versatile feeder into other systems, and 
> again, reduce complexity that would otherwise need to be built into the 
> daemons.
> h2. Format and Consumption
> - Cassandra would only write to the CDC log, and never delete from it. 
> - Cleaning up consumed logfiles would be the client daemon's responibility
> - Logfile size should probably be 

[jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)

2016-04-18 Thread Branimir Lambov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245413#comment-15245413
 ] 

Branimir Lambov edited comment on CASSANDRA-8844 at 4/18/16 10:12 AM:
--

bq. the former couples the name with an intended usage / implementation whereas 
the latter is strictly a statement of what the object is without usage context

This sounds a lot like you would prefer to use a {{PairOfDoubles}} class to 
store complex numbers or planar coordinates. I wouldn't say that's wrong, just 
very _not_ useful and a missed opportunity.

If it were the case that we used {{ReplayPosition}} for another purpose than to 
specify the point in the commit log stream we can safely start replay from, I 
would agree with you, but that's not the case. Such a position happens to be a 
segment id plus offset; should the architecture change, a replay position may 
become something else and the modularity and abstraction is achieved by _not_ 
specifying what the object contains, but rather what it specifies. 

With your changes, a replay position actually does become something different, 
and to make it as clean and transparent as possible you may need both 
{{ReplayPosition}} and {{CommitLogPosition}} classes. The commit log now 
becomes _two_ continuous streams of data, each with its own _unrelated_ 
position. This means that they need to be treated separately, and both need to 
be accounted for. In particular, 
[{{CommitLogReplayer.construct}}|https://github.com/apache/cassandra/compare/trunk...josh-mckenzie:8844_review#diff-348a1347dacf897385fb0a97116a1b5eR113]
 is incorrect. Since you have two commit logs, each with their independent 
position, and you don't account for the log type when calculating the replay 
position to start from, you will fail to replay necessary records in one of the 
logs.

This needs more work and tests, especially around switching CDC status. We 
probably need to store the log type in the sstable (or some other kind of id 
that does not change when CDC is turned on/off).

bq. Regarding CommitLogPosition vs. CommitLogSegmentPosition, the class itself 
contains 2 instance variables: a segmentId and a position

A good analogy is an address in memory: it is composed of a 20/52-bit page id 
and a 12-bit offset within that page, yet it is still a memory address, while a 
memory page address would denote something very different. A commit log is a 
continuous stream of records. The fact that we split the stream in segments is 
an implementation detail.


was (Author: blambov):
bq. the former couples the name with an intended usage / implementation whereas 
the latter is strictly a statement of what the object is without usage context

This sounds a lot like you are would prefer to use a {{PairOfDoubles}} class to 
store complex numbers or planar coordinates. I wouldn't say that's wrong, just 
very _not_ useful and a missed opportunity.

If it were the case that we used {{ReplayPosition}} for another purpose than to 
specify the point in the commit log stream we can safely start replay from, I 
would agree with you, but that's not the case. Such a position happens to be a 
segment id plus offset; should the architecture change, a replay position may 
become something else and the modularity and abstraction is achieved by _not_ 
specifying what the object contains, but rather what it specifies. 

With your changes, a replay position actually does become something different, 
and to make it as clean and transparent as possible you may need both 
{{ReplayPosition}} and {{CommitLogPosition}} classes. The commit log now 
becomes _two_ continuous streams of data, each with its own _unrelated_ 
position. This means that they need to be treated separately, and both need to 
be accounted for. In particular, 
[{{CommitLogReplayer.construct}}|https://github.com/apache/cassandra/compare/trunk...josh-mckenzie:8844_review#diff-348a1347dacf897385fb0a97116a1b5eR113]
 is incorrect. Since you have two commit logs, each with their independent 
position, and you don't account for the log type when calculating the replay 
position to start from, you will fail to replay necessary records in one of the 
logs.

This needs more work and tests, especially around switching CDC status. We 
probably need to store the log type in the sstable (or some other kind of id 
that does not change when CDC is turned on/off).

bq. Regarding CommitLogPosition vs. CommitLogSegmentPosition, the class itself 
contains 2 instance variables: a segmentId and a position

A good analogy is an address in memory: it is composed of a 20/52-bit page id 
and a 12-bit offset within that page, yet it is still a memory address, while a 
memory page address would denote something very different. A commit log is a 
continuous stream of records. The fact that we split the stream in segments is 
an implementation detail.

> Change Data 

[jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)

2016-04-07 Thread Joshua McKenzie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15224998#comment-15224998
 ] 

Joshua McKenzie edited comment on CASSANDRA-8844 at 4/7/16 9:40 PM:


v1 is ready for review.

h5. General outline of changes in the patch
* CQL syntax changes to support CDC:
** CREATE KEYSPACE ks WITH replication... AND cdc_datacenters={'dc1','dc2'...}
** ALTER KEYSPACE ks DROP CDCLOG;
*** Cannot drop keyspaces w/CDC enabled without first disabling CDC.
* Changes to Parser.g to support sets being converted into maps. Reference 
normalizeSetOrMapLiteral, cleanMap, cleanSet
* Statement changes to support new keyspace param for Option.CDC_DATACENTERS
* Refactored {{CommitLogReplayer}} into {{CommitLogReplayer}}, 
{{CommitLogReader}}, and {{ICommitLogReadHandler}} in preparation for having a 
CDC consumer that needs to read commit log segments.
* Refactored commit log versioned deltas from various read* methods into 
{{CommitLogReader.CommitLogFormat}}
* Renamed {{ReplayPosition}} to {{CommitLogSegmentPosition}} (this is 
responsible for quite a bit of noise in the diff - sorry)
* Refactored {{CommitLogSegmentManager}} into:
** {{AbstractCommitLogSegmentManager}}
** {{CommitLogSegmentManagerStandard}}
*** Old logic for alloc (always succeed, block on allocate)
*** discard (delete if true)
*** unusedCapacity check (CL directory only)
** {{CommitLogSegmentManagerCDC}}
*** Fail alloc if atCapacity. We have an extra couple of atomic checks on the 
critical path for CDC-enabled (size + cdc overflow) and fail allocation if 
we're at limit. CommitLog now throws WriteTimeoutException for allocations 
returned null from CommitLog, which the standard should never do as it infinite 
loops in {{advanceAllocatingFrom}}.
*** Move files to cdc overflow folder as configured in yaml on discard
*** unusedCapacity includes lazy calculated size of CDC overflow as well. See 
DirectorySizerBench.java for why I went w/separate thread to lazy calculate 
size of overflow instead of doing it sync on failed allocation
*** Separate size limit configured in cassandra.yaml for CDC and CommitLog so 
they each have their own unusedCapacity checks. Went with 1/8th disk or 4096 on 
CDC as default, putting it at 1/2 the size of CommitLog.
* Refactored buffer management portions of {{FileDirectSegment}} into 
{{SimpleCachedBufferPool}}, owned by a {{CommitLogSegmentManager}} instance
** There's considerable logical overlap between this and BufferPool in general, 
though this is considerably simpler and purpose-built. I'm personally ok 
leaving it separate for now given it's simplicity.
* Some other various changes and movements around the code-base related to this 
patch ({{DirectorySizeCalculator}}, some javadoccing, typos I came across in 
comments or variable names while working on this, etc)

h5. What's not yet done:
* Consider running all / relevant CommitLog related unit tests against a 
CDC-based keyspace
* Performance testing (want to confirm that added determination of which 
{{CommitLogSegmentManager}} during write path is negligable impact along w/2 
atomic checks on CDC write-path)
* dtests specific to CDC
* fallout testing on CDC
* Any code-changes to specifically target supporting a consumer following a CDC 
log as it's being written in CommitLogReader / ICommitLogReader. A requester 
should be able to trivially handle that with the 
{{CommitLogReader.readCommitLogSegment}} signature supporting 
{{CommitLogSegmentPosition}} and {{mutationLimit}}, however, so I'm leaning 
towards not further polluting CommitLogReader / C* and keeping that in the 
scope of a consumption daemon

h5. Special point of concern:
* This patch changes us from an implicit singleton view of 
{{CommitLogSegmentManager}} to having multiple CommitLogSegmentManagers managed 
under the CommitLog. There have been quite a few places where I've come across 
undocumented assumptions that we only ever have 1 logical object allocating 
segments (the latest being FileDirectSegment uncovered by 
CommitLogSegmentManagerTest). I plan on again checking the code to make sure 
the new "calculate off multiple segment managers" view of some of the things 
exposed in the CommitLog interface don't violate their contract now that 
there's no longer single CLSM-atomicity on those results.

h5. Known issues:
* dtest is showing a pretty consistent error w/an inability to find a cdc 
CommitLogSegment during recovery that looks to be unique to the dtest env
* a few failures left in testall
* intermittent failure in the new {{CommitLogSegmentManagerCDCTest}} (3/150 
runs - on Windows, so I haven't yet ruled out an env. issue w/the testing)

[~blambov]: while [~carlyeks] is primary reviewer on this and quite familiar 
with the changes as he worked w/me on the design process, I'd also appreciate 
it if you could provide a backup pair of eyes and look over the 

[jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)

2016-04-05 Thread Joshua McKenzie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227152#comment-15227152
 ] 

Joshua McKenzie edited comment on CASSANDRA-8844 at 4/5/16 9:24 PM:


Already in the 
[diff|https://github.com/apache/cassandra/compare/trunk...josh-mckenzie:8844#diff-7f177c2eab93884c78255b62b8aa50d0L389].
 Working for me locally - there something more that needs to be done that I 
don't know about?


was (Author: joshuamckenzie):
Check the 
[diff|https://github.com/apache/cassandra/compare/trunk...josh-mckenzie:8844#diff-7f177c2eab93884c78255b62b8aa50d0L389].
 Working for me locally - there something more that needs to be done that I 
don't know about?

> Change Data Capture (CDC)
> -
>
> Key: CASSANDRA-8844
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8844
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Coordination, Local Write-Read Paths
>Reporter: Tupshin Harper
>Assignee: Joshua McKenzie
>Priority: Critical
> Fix For: 3.x
>
>
> "In databases, change data capture (CDC) is a set of software design patterns 
> used to determine (and track) the data that has changed so that action can be 
> taken using the changed data. Also, Change data capture (CDC) is an approach 
> to data integration that is based on the identification, capture and delivery 
> of the changes made to enterprise data sources."
> -Wikipedia
> As Cassandra is increasingly being used as the Source of Record (SoR) for 
> mission critical data in large enterprises, it is increasingly being called 
> upon to act as the central hub of traffic and data flow to other systems. In 
> order to try to address the general need, we (cc [~brianmhess]), propose 
> implementing a simple data logging mechanism to enable per-table CDC patterns.
> h2. The goals:
> # Use CQL as the primary ingestion mechanism, in order to leverage its 
> Consistency Level semantics, and in order to treat it as the single 
> reliable/durable SoR for the data.
> # To provide a mechanism for implementing good and reliable 
> (deliver-at-least-once with possible mechanisms for deliver-exactly-once ) 
> continuous semi-realtime feeds of mutations going into a Cassandra cluster.
> # To eliminate the developmental and operational burden of users so that they 
> don't have to do dual writes to other systems.
> # For users that are currently doing batch export from a Cassandra system, 
> give them the opportunity to make that realtime with a minimum of coding.
> h2. The mechanism:
> We propose a durable logging mechanism that functions similar to a commitlog, 
> with the following nuances:
> - Takes place on every node, not just the coordinator, so RF number of copies 
> are logged.
> - Separate log per table.
> - Per-table configuration. Only tables that are specified as CDC_LOG would do 
> any logging.
> - Per DC. We are trying to keep the complexity to a minimum to make this an 
> easy enhancement, but most likely use cases would prefer to only implement 
> CDC logging in one (or a subset) of the DCs that are being replicated to
> - In the critical path of ConsistencyLevel acknowledgment. Just as with the 
> commitlog, failure to write to the CDC log should fail that node's write. If 
> that means the requested consistency level was not met, then clients *should* 
> experience UnavailableExceptions.
> - Be written in a Row-centric manner such that it is easy for consumers to 
> reconstitute rows atomically.
> - Written in a simple format designed to be consumed *directly* by daemons 
> written in non JVM languages
> h2. Nice-to-haves
> I strongly suspect that the following features will be asked for, but I also 
> believe that they can be deferred for a subsequent release, and to guage 
> actual interest.
> - Multiple logs per table. This would make it easy to have multiple 
> "subscribers" to a single table's changes. A workaround would be to create a 
> forking daemon listener, but that's not a great answer.
> - Log filtering. Being able to apply filters, including UDF-based filters 
> would make Casandra a much more versatile feeder into other systems, and 
> again, reduce complexity that would otherwise need to be built into the 
> daemons.
> h2. Format and Consumption
> - Cassandra would only write to the CDC log, and never delete from it. 
> - Cleaning up consumed logfiles would be the client daemon's responibility
> - Logfile size should probably be configurable.
> - Logfiles should be named with a predictable naming schema, making it 
> triivial to process them in order.
> - Daemons should be able to checkpoint their work, and resume from where they 
> left off. This means they would have to leave some file artifact in the CDC 
> log's directory.
> - A sophisticated daemon should be able 

[jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)

2016-03-22 Thread Joshua McKenzie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15206830#comment-15206830
 ] 

Joshua McKenzie edited comment on CASSANDRA-8844 at 3/22/16 5:32 PM:
-

Something that came up during impl that got me thinking: we currently rely on 
all segments for a {{CommitLogSegmentManager}} to be located in the same 
directory. Easy enough design, reliable, rely on OS for separation etc. Good 
enough for our use-case thus far.

Adding a 2nd CommitLogSegmentManager muddies that water as we'd have some 
segments allocated by 1 allocator, some by another. Rather than go the route of 
sharing a directory for both CommitLogSegmentManagers and flagging type / 
ownership / responsibility by file name regex / filter, I'm leaning towards 
having cdc commit log segments exist in a subdirectory under commitlog, so:

{noformat}
data/commitlog
data/commitlog/cdc
{noformat}

This leads to the next observation that there's little point in having a 
cdc_overflow folder with this design as we can simply fail allocation when our 
/cdc folder reaches our configured size threshold. It's a little dicier on the 
"consumer deletes segments" front as there's no longer the differentiator of 
"any segment in this folder, we're done with it", however it's trivial to write 
the names of completed segment files to a local metadata file to indicate to 
consumers when we're done with segments.

The only other thing I can think of that's a downside: this will be a change 
for any other external tools / code that's relying on all segments to be stored 
in a single directory, hence my update here. Can anyone think of a really good 
reason why storing commit log segments in 2 separate directories for 2 separate 
managers would be a Bad Thing?

Edit: Just to clarify one thing: having flushed-to-sstable cdc files in the 
commitlog/cdc folder vs. in a cdc_overflow folder is a trivial delta w/some 
bookkeeping differences. Not a big deal nor what I was trying to get at above, 
so I'll probably end up moving those into cdc_overflow anyway just for 
separation.


was (Author: joshuamckenzie):
Something that came up during impl that got me thinking: we currently rely on 
all segments for a {{CommitLogSegmentManager}} to be located in the same 
directory. Easy enough design, reliable, rely on OS for separation etc. Good 
enough for our use-case thus far.

Adding a 2nd CommitLogSegmentManager muddies that water as we'd have some 
segments allocated by 1 allocator, some by another. Rather than go the route of 
sharing a directory for both CommitLogSegmentManagers and flagging type / 
ownership / responsibility by file name regex / filter, I'm leaning towards 
having cdc commit log segments exist in a subdirectory under commitlog, so:

{noformat}
data/commitlog
data/commitlog/cdc
{noformat}

This leads to the next observation that there's little point in having a 
cdc_overflow folder with this design as we can simply fail allocation when our 
/cdc folder reaches our configured size threshold. It's a little dicier on the 
"consumer deletes segments" front as there's no longer the differentiator of 
"any segment in this folder, we're done with it", however it's trivial to write 
the names of completed segment files to a local metadata file to indicate to 
consumers when we're done with segments.

The only other thing I can think of that's a downside: this will be a change 
for any other external tools / code that's relying on all segments to be stored 
in a single directory, hence my update here. Can anyone think of a really good 
reason why storing commit log segments in 2 separate directories for 2 separate 
managers would be a Bad Thing?

> Change Data Capture (CDC)
> -
>
> Key: CASSANDRA-8844
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8844
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Coordination, Local Write-Read Paths
>Reporter: Tupshin Harper
>Assignee: Joshua McKenzie
>Priority: Critical
> Fix For: 3.x
>
>
> "In databases, change data capture (CDC) is a set of software design patterns 
> used to determine (and track) the data that has changed so that action can be 
> taken using the changed data. Also, Change data capture (CDC) is an approach 
> to data integration that is based on the identification, capture and delivery 
> of the changes made to enterprise data sources."
> -Wikipedia
> As Cassandra is increasingly being used as the Source of Record (SoR) for 
> mission critical data in large enterprises, it is increasingly being called 
> upon to act as the central hub of traffic and data flow to other systems. In 
> order to try to address the general need, we (cc [~brianmhess]), propose 
> implementing a simple data logging mechanism to enable 

[jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)

2016-02-23 Thread Bill de hOra (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158608#comment-15158608
 ] 

Bill de hOra edited comment on CASSANDRA-8844 at 2/23/16 9:21 AM:
--

In the current design doc it says 

bq. Matches replication strategy

Does this mean if RF was say, three, that three CDC commit logs would be 
written to across the cluster (compared to say, one write at the coordinator)? 
In turn I guess that means systems consuming the capture logs will have to 
perform some kind of de-duplication as de-duplication's not in scope for the 
design.



was (Author: dehora):
In the current design doc it says 

bq. Matches replication strategy

Does this mean if RF was say, three, that three CDC commit logs being written 
to across the cluster and not just say, at the coordinator? In turn I guess 
that means systems consuming the capture logs will have to perform some kind of 
de-duplication as de-duplication's not in scope for the design.


> Change Data Capture (CDC)
> -
>
> Key: CASSANDRA-8844
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8844
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Coordination, Local Write-Read Paths
>Reporter: Tupshin Harper
>Assignee: Joshua McKenzie
>Priority: Critical
> Fix For: 3.x
>
>
> "In databases, change data capture (CDC) is a set of software design patterns 
> used to determine (and track) the data that has changed so that action can be 
> taken using the changed data. Also, Change data capture (CDC) is an approach 
> to data integration that is based on the identification, capture and delivery 
> of the changes made to enterprise data sources."
> -Wikipedia
> As Cassandra is increasingly being used as the Source of Record (SoR) for 
> mission critical data in large enterprises, it is increasingly being called 
> upon to act as the central hub of traffic and data flow to other systems. In 
> order to try to address the general need, we (cc [~brianmhess]), propose 
> implementing a simple data logging mechanism to enable per-table CDC patterns.
> h2. The goals:
> # Use CQL as the primary ingestion mechanism, in order to leverage its 
> Consistency Level semantics, and in order to treat it as the single 
> reliable/durable SoR for the data.
> # To provide a mechanism for implementing good and reliable 
> (deliver-at-least-once with possible mechanisms for deliver-exactly-once ) 
> continuous semi-realtime feeds of mutations going into a Cassandra cluster.
> # To eliminate the developmental and operational burden of users so that they 
> don't have to do dual writes to other systems.
> # For users that are currently doing batch export from a Cassandra system, 
> give them the opportunity to make that realtime with a minimum of coding.
> h2. The mechanism:
> We propose a durable logging mechanism that functions similar to a commitlog, 
> with the following nuances:
> - Takes place on every node, not just the coordinator, so RF number of copies 
> are logged.
> - Separate log per table.
> - Per-table configuration. Only tables that are specified as CDC_LOG would do 
> any logging.
> - Per DC. We are trying to keep the complexity to a minimum to make this an 
> easy enhancement, but most likely use cases would prefer to only implement 
> CDC logging in one (or a subset) of the DCs that are being replicated to
> - In the critical path of ConsistencyLevel acknowledgment. Just as with the 
> commitlog, failure to write to the CDC log should fail that node's write. If 
> that means the requested consistency level was not met, then clients *should* 
> experience UnavailableExceptions.
> - Be written in a Row-centric manner such that it is easy for consumers to 
> reconstitute rows atomically.
> - Written in a simple format designed to be consumed *directly* by daemons 
> written in non JVM languages
> h2. Nice-to-haves
> I strongly suspect that the following features will be asked for, but I also 
> believe that they can be deferred for a subsequent release, and to guage 
> actual interest.
> - Multiple logs per table. This would make it easy to have multiple 
> "subscribers" to a single table's changes. A workaround would be to create a 
> forking daemon listener, but that's not a great answer.
> - Log filtering. Being able to apply filters, including UDF-based filters 
> would make Casandra a much more versatile feeder into other systems, and 
> again, reduce complexity that would otherwise need to be built into the 
> daemons.
> h2. Format and Consumption
> - Cassandra would only write to the CDC log, and never delete from it. 
> - Cleaning up consumed logfiles would be the client daemon's responibility
> - Logfile size should probably be configurable.
> - Logfiles should be named with a 

[jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)

2016-02-23 Thread Bill de hOra (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158608#comment-15158608
 ] 

Bill de hOra edited comment on CASSANDRA-8844 at 2/23/16 9:20 AM:
--

In the current design doc it says 

bq. Matches replication strategy

Does this mean if RF was say, three, that three CDC commit logs being written 
to across the cluster and not just say, at the coordinator? In turn I guess 
that means systems consuming the capture logs will have to perform some kind of 
de-duplication as de-duplication's not in scope for the design.



was (Author: dehora):
In the current design doc it says 

bq. Matches replication strategy

Does this means if RF was say, three, that three CDC commit logs being written 
to across the cluster and not just say, at the coordinator? In turn I guess 
that means systems consuming the capture logs will have to perform some kind of 
de-duplication as de-duplication's not in scope for the design.


> Change Data Capture (CDC)
> -
>
> Key: CASSANDRA-8844
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8844
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Coordination, Local Write-Read Paths
>Reporter: Tupshin Harper
>Assignee: Joshua McKenzie
>Priority: Critical
> Fix For: 3.x
>
>
> "In databases, change data capture (CDC) is a set of software design patterns 
> used to determine (and track) the data that has changed so that action can be 
> taken using the changed data. Also, Change data capture (CDC) is an approach 
> to data integration that is based on the identification, capture and delivery 
> of the changes made to enterprise data sources."
> -Wikipedia
> As Cassandra is increasingly being used as the Source of Record (SoR) for 
> mission critical data in large enterprises, it is increasingly being called 
> upon to act as the central hub of traffic and data flow to other systems. In 
> order to try to address the general need, we (cc [~brianmhess]), propose 
> implementing a simple data logging mechanism to enable per-table CDC patterns.
> h2. The goals:
> # Use CQL as the primary ingestion mechanism, in order to leverage its 
> Consistency Level semantics, and in order to treat it as the single 
> reliable/durable SoR for the data.
> # To provide a mechanism for implementing good and reliable 
> (deliver-at-least-once with possible mechanisms for deliver-exactly-once ) 
> continuous semi-realtime feeds of mutations going into a Cassandra cluster.
> # To eliminate the developmental and operational burden of users so that they 
> don't have to do dual writes to other systems.
> # For users that are currently doing batch export from a Cassandra system, 
> give them the opportunity to make that realtime with a minimum of coding.
> h2. The mechanism:
> We propose a durable logging mechanism that functions similar to a commitlog, 
> with the following nuances:
> - Takes place on every node, not just the coordinator, so RF number of copies 
> are logged.
> - Separate log per table.
> - Per-table configuration. Only tables that are specified as CDC_LOG would do 
> any logging.
> - Per DC. We are trying to keep the complexity to a minimum to make this an 
> easy enhancement, but most likely use cases would prefer to only implement 
> CDC logging in one (or a subset) of the DCs that are being replicated to
> - In the critical path of ConsistencyLevel acknowledgment. Just as with the 
> commitlog, failure to write to the CDC log should fail that node's write. If 
> that means the requested consistency level was not met, then clients *should* 
> experience UnavailableExceptions.
> - Be written in a Row-centric manner such that it is easy for consumers to 
> reconstitute rows atomically.
> - Written in a simple format designed to be consumed *directly* by daemons 
> written in non JVM languages
> h2. Nice-to-haves
> I strongly suspect that the following features will be asked for, but I also 
> believe that they can be deferred for a subsequent release, and to guage 
> actual interest.
> - Multiple logs per table. This would make it easy to have multiple 
> "subscribers" to a single table's changes. A workaround would be to create a 
> forking daemon listener, but that's not a great answer.
> - Log filtering. Being able to apply filters, including UDF-based filters 
> would make Casandra a much more versatile feeder into other systems, and 
> again, reduce complexity that would otherwise need to be built into the 
> daemons.
> h2. Format and Consumption
> - Cassandra would only write to the CDC log, and never delete from it. 
> - Cleaning up consumed logfiles would be the client daemon's responibility
> - Logfile size should probably be configurable.
> - Logfiles should be named with a predictable naming 

[jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)

2016-01-07 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088444#comment-15088444
 ] 

DOAN DuyHai edited comment on CASSANDRA-8844 at 1/8/16 12:15 AM:
-

I've read the updated design doc and I have a concern with the following 
proposal:

- _.yaml configurable limit of on-disk space allowed to be take up by cdc 
directory. If at or above limit, throw UnavailableException on CDC-enabled 
mutations_

 I certainly understand the need to raise a warning if the on-disk space limit 
for CDC overflows, but raising an UnavailableException will basically block the 
server for any future write (until the disk space is released). This situation 
occurs when CDC client does not "consume" CDC log as fast as C* flush incoming 
data. So we have basically a sizing/throughput issue with the consumer.

 Throwing UnavailableException is rather radical, and I certainly understand 
the need to prevent any desync between base data and consumer, but raising a 
WARNING or at least, proposing different failure strategy (similar to 
**disk_failure_policy**) like EXCEPTION_ON_OVERFLOW, WARN_ON_OVERFLOW, 
DISCARD_OLD_ON_OVERFLOW would offers some flexibility. Not sure how much 
complexity it would add to the actual impl. WDYT ?


was (Author: doanduyhai):
I've read the updated design doc and I have a concern with the following 
proposal:

- _.yaml configurable limit of on-disk space allowed to be take up by cdc 
directory. If at or above limit, throw UnavailableException on CDC-enabled 
mutations_

 I certainly understand the need to raise a warning if the on-disk space limit 
for CDC overflows, but raising an UnavailableException will basically blocks 
the server for any future write (until the disk space is released). This 
situation occurs when CDC client does not "consume" CDC log as fast as C* flush 
incoming data. So we have basically a sizing/throughput issue with the consumer.

 Throwing UnavailableException is rather radical, and I certainly understand 
the need to prevent any desync between base data and consumer, but raising a 
WARNING or at least, proposing different failure strategy (similar to 
**disk_failure_policy**) like EXCEPTION_ON_OVERFLOW, WARN_ON_OVERFLOW, 
DISCARD_OLD_ON_OVERFLOW would offers some flexibility. Not sure how much 
complexity it would add to the actual impl. WDYT ?

> Change Data Capture (CDC)
> -
>
> Key: CASSANDRA-8844
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8844
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Coordination, Local Write-Read Paths
>Reporter: Tupshin Harper
>Assignee: Joshua McKenzie
>Priority: Critical
> Fix For: 3.x
>
>
> "In databases, change data capture (CDC) is a set of software design patterns 
> used to determine (and track) the data that has changed so that action can be 
> taken using the changed data. Also, Change data capture (CDC) is an approach 
> to data integration that is based on the identification, capture and delivery 
> of the changes made to enterprise data sources."
> -Wikipedia
> As Cassandra is increasingly being used as the Source of Record (SoR) for 
> mission critical data in large enterprises, it is increasingly being called 
> upon to act as the central hub of traffic and data flow to other systems. In 
> order to try to address the general need, we (cc [~brianmhess]), propose 
> implementing a simple data logging mechanism to enable per-table CDC patterns.
> h2. The goals:
> # Use CQL as the primary ingestion mechanism, in order to leverage its 
> Consistency Level semantics, and in order to treat it as the single 
> reliable/durable SoR for the data.
> # To provide a mechanism for implementing good and reliable 
> (deliver-at-least-once with possible mechanisms for deliver-exactly-once ) 
> continuous semi-realtime feeds of mutations going into a Cassandra cluster.
> # To eliminate the developmental and operational burden of users so that they 
> don't have to do dual writes to other systems.
> # For users that are currently doing batch export from a Cassandra system, 
> give them the opportunity to make that realtime with a minimum of coding.
> h2. The mechanism:
> We propose a durable logging mechanism that functions similar to a commitlog, 
> with the following nuances:
> - Takes place on every node, not just the coordinator, so RF number of copies 
> are logged.
> - Separate log per table.
> - Per-table configuration. Only tables that are specified as CDC_LOG would do 
> any logging.
> - Per DC. We are trying to keep the complexity to a minimum to make this an 
> easy enhancement, but most likely use cases would prefer to only implement 
> CDC logging in one (or a subset) of the DCs that are being replicated to
> - In the critical path of 

[jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)

2015-12-21 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049151#comment-15049151
 ] 

Ariel Weisberg edited comment on CASSANDRA-8844 at 12/21/15 6:38 PM:
-

I don't want to scope creep this ticket. I think that this is heading the right 
direction in terms of deferring most of the functionality around consumption of 
CDC data and getting a good initial implementation of buffering and writing the 
data.

I do want to splat somewhere my thoughts on the consumption side. VoltDB had a 
CDC feature that went through several iterations over the years as we learned 
what did and didn't work.

The original implementation was a wire protocol that clients could connect to. 
The protocol was a pain and the client had to be a distributed system with 
consensus in order to load balance and fail over across multiple client 
instances and the implementation we maintained for people to plug into was a 
pain because we had to connect to all the nodes to acknowledge consumed CDC 
data at replicas. And all of this was without the benefit of already being a 
cluster member with access to failure information. The clients also had to know 
way too much about cluster internals and topology to do it well.

For the rewrite I ended up hosting CDC data processors inside the server. In 
practice this is not as scary as it may sound to some. Most of the processors 
were written by us, and there wasn't a ton they could do to misbehave without 
trying really hard and if they did that it was on them. It didn't end up being 
a support or maintenance headache, and I don't think we had instances of the 
CDC processing destabilizing things.

You could make the data available over a socket as one of these processors, 
there was a JDBC processor to insert into a database via JDBC, there was a 
Kafka processor to load data into Kafka, one to load the data into another 
VoltDB instance, and a processor that wrote the data to local disk as a CSV etc.

The processor implemented by users didn't have to do anything to deal with fail 
over and load balancing of consuming data. The database hosting the processor 
would only pass data for a given range on the hash ring to one processor at a 
time. When a processor acknowledged data as committed downstream the database 
transparently sends the acknowledgement to all replicas allowing them to 
release persisted CDC data. VoltDB runs ZooKeeper on top of VoltDB internally 
so this was pretty easy to implement inside VoltDB, but outside it would have 
been a pain.

The goal was that CDC data would never hit the filesystem, and that if it hit 
the filesystem it wouldn't hit disk if possible. Heap promotion and survivor 
copying had to be non-existent to avoid having an impact on GC pause time. With 
TPC and buffering mutations before passing them to the processors we had no 
problem getting data out at disk or line rate. Reclaiming spaced ended up being 
file deletion so that was cheap as well.


was (Author: aweisberg):
I don't want to scope creep this ticket. I think that this is heading the write 
direction in terms of deferring most of the functionality around consumption of 
CDC data and getting a good initial implementation of buffering and writing the 
data.

I do want to splat somewhere my thoughts on the consumption side. VoltDB had a 
CDC feature that went through several iterations over the years as we learned 
what did and didn't work.

The original implementation was a wire protocol that clients could connect to. 
The protocol was a pain and the client had to be a distributed system with 
consensus in order to load balance and fail over across multiple client 
instances and the implementation we maintained for people to plug into was a 
pain because we had to connect to all the nodes to acknowledge consumed CDC 
data at replicas. And all of this was without the benefit of already being a 
cluster member with access to failure information. The clients also had to know 
way too much about cluster internals and topology to do it well.

For the rewrite I ended up hosting CDC data processors inside the server. In 
practice this is not as scary as it may sound to some. Most of the processors 
were written by us, and there wasn't a ton they could do to misbehave without 
trying really hard and if they did that it was on them. It didn't end up being 
a support or maintenance headache, and I don't think we had instances of the 
CDC processing destabilizing things.

You could make the data available over a socket as one of these processors, 
there was a JDBC processor to insert into a database via JDBC, there was a 
Kafka processor to load data into Kafka, one to load the data into another 
VoltDB instance, and a processor that wrote the data to local disk as a CSV etc.

The processor implemented by users didn't have to do anything to deal with fail 
over and 

[jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)

2015-12-21 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049151#comment-15049151
 ] 

Ariel Weisberg edited comment on CASSANDRA-8844 at 12/21/15 6:35 PM:
-

I don't want to scope creep this ticket. I think that this is heading the write 
direction in terms of deferring most of the functionality around consumption of 
CDC data and getting a good initial implementation of buffering and writing the 
data.

I do want to splat somewhere my thoughts on the consumption side. VoltDB had a 
CDC feature that went through several iterations over the years as we learned 
what did and didn't work.

The original implementation was a wire protocol that clients could connect to. 
The protocol was a pain and the client had to be a distributed system with 
consensus in order to load balance and fail over across multiple client 
instances and the implementation we maintained for people to plug into was a 
pain because we had to connect to all the nodes to acknowledge consumed CDC 
data at replicas. And all of this was without the benefit of already being a 
cluster member with access to failure information. The clients also had to know 
way too much about cluster internals and topology to do it well.

For the rewrite I ended up hosting CDC data processors inside the server. In 
practice this is not as scary as it may sound to some. Most of the processors 
were written by us, and there wasn't a ton they could do to misbehave without 
trying really hard and if they did that it was on them. It didn't end up being 
a support or maintenance headache, and I don't think we had instances of the 
CDC processing destabilizing things.

You could make the data available over a socket as one of these processors, 
there was a JDBC processor to insert into a database via JDBC, there was a 
Kafka processor to load data into Kafka, one to load the data into another 
VoltDB instance, and a processor that wrote the data to local disk as a CSV etc.

The processor implemented by users didn't have to do anything to deal with fail 
over and load balancing of consuming data. The database hosting the processor 
would only pass data for a given range on the hash ring to one processor at a 
time. When a processor acknowledged data as committed downstream the database 
transparently sends the acknowledgement to all replicas allowing them to 
release persisted CDC data. VoltDB runs ZooKeeper on top of VoltDB internally 
so this was pretty easy to implement inside VoltDB, but outside it would have 
been a pain.

The goal was that CDC data would never hit the filesystem, and that if it hit 
the filesystem it wouldn't hit disk if possible. Heap promotion and survivor 
copying had to be non-existent to avoid having an impact on GC pause time. With 
TPC and buffering mutations before passing them to the processors we had no 
problem getting data out at disk or line rate. Reclaiming spaced ended up being 
file deletion so that was cheap as well.


was (Author: aweisberg):
I don't want to scope creep this ticket. I think that this is heading the write 
direction in terms of deferring most of the functionality around consumption of 
CDC data and getting a good initial implementation of buffering and writing the 
data.

I do want to splat somewhere my thoughts on the consumption side. VoltDB had a 
CDC feature that went through several iterations over the years as we learned 
what did and didn't work.

The original implementation was a wire protocol that clients could connect to. 
The protocol was a pain and the client had to be a distributed system with 
consensus in order to load balance and fail over across multiple client 
instances and the implementation we maintained for people to plug into was a 
pain because we had to connect to all the nodes to acknowledge consumed CDC 
data at replicas. And all of this was without the benefit of already being a 
cluster member with access to failure information. The clients also had to know 
way too much about cluster internals and topology to do it well.

For the rewrite I ended up hosting CDC data processors inside the server. In 
practice this is not as scary as it may sound to some. Most of the processors 
were written by us, and there wasn't a ton they could do to misbehave without 
trying really hard and if they did that it was on them. It didn't end up being 
a support or maintenance headache, and I don't think we didn't have instances 
of the CDC processing destabilizing things.

You could make the data available over a socket as one of these processors, 
there was a JDBC processor to insert into a database via JDBC, there was a 
Kafka processor to load data into Kafka, one to load the data into another 
VoltDB instance, and a processor that wrote the data to local disk as a CSV etc.

The processor implemented by users didn't have to do anything to deal with fail 

[jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)

2015-12-09 Thread Joshua McKenzie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049125#comment-15049125
 ] 

Joshua McKenzie edited comment on CASSANDRA-8844 at 12/9/15 6:26 PM:
-

Read a little further down in the comments (and yeah, I should probably update 
the design doc - haven't been active on this ticket for a bit). Current 
approach is to write to CL and compact to CDC log on CL flush, meaning you get 
the same guarantees for CDC writes as you get for general writes by virtue of 
it piggybacking on the standard write process (sync window considerations, etc).

CL is source of truth, and we compact the CDC-enabled CF into a separate log so 
we don't have to munge w/coordinating marking records as clean in CL segments 
from an external source and injecting that external dep. into our CL 
management, as 1st pass of CDC is going to be CDC-log file only w/external 
user-supplied consumption daemon.

re: different place/size bound for CDC log, we don't want nodes to fall over 
because of unbounded CDC data on disk. User-configurable limit on disk-space to 
allocate for CDC and UE on CDC-enabled writes when at limit gives us that 
guarantee.

For our use-case / design, CDC data is a straight mirror of table data.


was (Author: joshuamckenzie):
Read a little further down in the comments (and yeah, I should probably update 
the design doc - haven't been active on this ticket for a bit). Current 
approach is to write / compact to CL and CDC log on CL flush, meaning you get 
the same guarantees for CDC writes as you get for general rights by virtue of 
it piggybacking on the standard write process. CL is source of truth, and we 
compact the CDC-enabled CF into a separate log so we don't have to munge 
w/coordinating marking records as clean in CL segments from an external source, 
as 1st pass of CDC is going to be CDC-log file only w/external user-supplied 
consumption daemon.

re: different place/size bound for CDC log, we don't want nodes to fall over 
because of unbounded CDC data on disk. User-configurable limit on disk-space to 
allocate for CDC and UE on CDC-enabled writes when at limit gives us that 
guarantee.

For our use-case / design, CDC data is a straight mirror of table data.

> Change Data Capture (CDC)
> -
>
> Key: CASSANDRA-8844
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8844
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Coordination, Local Write-Read Paths
>Reporter: Tupshin Harper
>Assignee: Joshua McKenzie
>Priority: Critical
> Fix For: 3.x
>
>
> "In databases, change data capture (CDC) is a set of software design patterns 
> used to determine (and track) the data that has changed so that action can be 
> taken using the changed data. Also, Change data capture (CDC) is an approach 
> to data integration that is based on the identification, capture and delivery 
> of the changes made to enterprise data sources."
> -Wikipedia
> As Cassandra is increasingly being used as the Source of Record (SoR) for 
> mission critical data in large enterprises, it is increasingly being called 
> upon to act as the central hub of traffic and data flow to other systems. In 
> order to try to address the general need, we (cc [~brianmhess]), propose 
> implementing a simple data logging mechanism to enable per-table CDC patterns.
> h2. The goals:
> # Use CQL as the primary ingestion mechanism, in order to leverage its 
> Consistency Level semantics, and in order to treat it as the single 
> reliable/durable SoR for the data.
> # To provide a mechanism for implementing good and reliable 
> (deliver-at-least-once with possible mechanisms for deliver-exactly-once ) 
> continuous semi-realtime feeds of mutations going into a Cassandra cluster.
> # To eliminate the developmental and operational burden of users so that they 
> don't have to do dual writes to other systems.
> # For users that are currently doing batch export from a Cassandra system, 
> give them the opportunity to make that realtime with a minimum of coding.
> h2. The mechanism:
> We propose a durable logging mechanism that functions similar to a commitlog, 
> with the following nuances:
> - Takes place on every node, not just the coordinator, so RF number of copies 
> are logged.
> - Separate log per table.
> - Per-table configuration. Only tables that are specified as CDC_LOG would do 
> any logging.
> - Per DC. We are trying to keep the complexity to a minimum to make this an 
> easy enhancement, but most likely use cases would prefer to only implement 
> CDC logging in one (or a subset) of the DCs that are being replicated to
> - In the critical path of ConsistencyLevel acknowledgment. Just as with the 
> commitlog, failure to write to the CDC log should fail that 

[jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)

2015-09-24 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907007#comment-14907007
 ] 

DOAN DuyHai edited comment on CASSANDRA-8844 at 9/24/15 8:58 PM:
-

Questions on the operational side:

1) What happens with repair (nodetool or read-repair). Does the proposed impl 
re-push the notifications to consumers ?
2) What happens with the corner case when a replica DID received a mutation but 
did not ack the coordiniator in timely manner so it will receive a hint later ? 
Notification pushed twice ?
3) What happens in case of  a new node joining accepting writes for the token 
range as well as the old node that is still accepting writes for this portion 
of token range ? Notification will be pushed to any consumer attached to the 
"joining" node ?
4) What happens with the write survey mode ? Do we push notifications in this 
case ?

I know that the Deliver-at-least-once semantics allow us to send notifications 
more than once but it's always good to clarify all those ops scenarios to have 
less surprise when the feature is deployed


was (Author: doanduyhai):
Questions on the operational side:

1) What happens with repair (nodetool or read-repair). Does the proposed impl 
re-push the notifications to consumers ?
2) What happens with the corner case when a replica DID received a mutation but 
did not ack the coordiniator in timely manner so it will receive a hint later ? 
Notification pushed twice ?
3) What happens in case of  a new node joining accepting writes for the token 
range as well as the old node that is still accepting writes for this portion 
of token range ? Notification will be pushed to any consumer attached to the 
"joining" node ?
4) What happens with the write survey ? Do we push notifications in this case ?

I know that the Deliver-at-least-once semantics allow us to send notifications 
more than once but it's always good to clarify all those ops scenarios to have 
less surprise when the feature is deployed

> Change Data Capture (CDC)
> -
>
> Key: CASSANDRA-8844
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8844
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Core
>Reporter: Tupshin Harper
>Assignee: Joshua McKenzie
>Priority: Critical
> Fix For: 3.x
>
>
> "In databases, change data capture (CDC) is a set of software design patterns 
> used to determine (and track) the data that has changed so that action can be 
> taken using the changed data. Also, Change data capture (CDC) is an approach 
> to data integration that is based on the identification, capture and delivery 
> of the changes made to enterprise data sources."
> -Wikipedia
> As Cassandra is increasingly being used as the Source of Record (SoR) for 
> mission critical data in large enterprises, it is increasingly being called 
> upon to act as the central hub of traffic and data flow to other systems. In 
> order to try to address the general need, we (cc [~brianmhess]), propose 
> implementing a simple data logging mechanism to enable per-table CDC patterns.
> h2. The goals:
> # Use CQL as the primary ingestion mechanism, in order to leverage its 
> Consistency Level semantics, and in order to treat it as the single 
> reliable/durable SoR for the data.
> # To provide a mechanism for implementing good and reliable 
> (deliver-at-least-once with possible mechanisms for deliver-exactly-once ) 
> continuous semi-realtime feeds of mutations going into a Cassandra cluster.
> # To eliminate the developmental and operational burden of users so that they 
> don't have to do dual writes to other systems.
> # For users that are currently doing batch export from a Cassandra system, 
> give them the opportunity to make that realtime with a minimum of coding.
> h2. The mechanism:
> We propose a durable logging mechanism that functions similar to a commitlog, 
> with the following nuances:
> - Takes place on every node, not just the coordinator, so RF number of copies 
> are logged.
> - Separate log per table.
> - Per-table configuration. Only tables that are specified as CDC_LOG would do 
> any logging.
> - Per DC. We are trying to keep the complexity to a minimum to make this an 
> easy enhancement, but most likely use cases would prefer to only implement 
> CDC logging in one (or a subset) of the DCs that are being replicated to
> - In the critical path of ConsistencyLevel acknowledgment. Just as with the 
> commitlog, failure to write to the CDC log should fail that node's write. If 
> that means the requested consistency level was not met, then clients *should* 
> experience UnavailableExceptions.
> - Be written in a Row-centric manner such that it is easy for consumers to 
> reconstitute rows atomically.
> - Written in a simple format designed to 

[jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)

2015-04-30 Thread Martin Kersten (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14522461#comment-14522461
 ] 

Martin Kersten edited comment on CASSANDRA-8844 at 4/30/15 11:29 PM:
-

I would argue not for logs but for 'listener' queries for each table. 

If a client want to listen for a certain or all changes he is free to submit 
where clauses. So every time a changed row of that table fulfills the where 
clause the client listener gets notified.

A client may issue many where clauses and would be able to query its active 
where clauses for a table. By removing all where clauses the listener will 
actually remove itself from the listener list for that table.

One could even extend that by submitting real select statements working on only 
the currently change row. Maybe the listener may even add a timing setting 
allowing a node to aggregate multiple update events and send one single 
notification for multiple changed rows (if this makes sense for a system using 
hashing for partition / sharding).

Since many clients may listen using the same where clause the performance would 
be manageable and not depending on how many clients are listening but how many 
different select statements where listened to by all clients.

By being able to name the listening where clauses one could even generated 
named events where a combination of name and where clause makes it unique. 

An additional option is being able to add an additional value for each row 
containing the names of all event names associated with the listener select 
statements (queries).

Pro:
  * Easy to understand
  * Easy to manage 
  * Fine-tuning is possible (like a single client listens to only a single user 
or a list of particular users)
  * Lot of reuse capabilities (just query over the changed row(s) not all rows, 
grammar etc.)
  * Works on shared tables
  * Avoids maintaining disk logs
  * Only three operations are necessary to implement (add listener query, 
remove listener query and get all active listener queries).
  * Performance improvements possible by combining (almost) similar where 
clauses and by adding special cases (like where clauses for certain user IDs 
will result in a huge lists.
  * All in memory operation no disk writes necessary

Contra:
   * Ensuring  being notified exactly once might be challenging
   * Only works on changes about to happen afterwards and not changes happened 
in recent time (no travel back in time like it would be possible while having 
change logs). 


was (Author: martin kersten):
I would argue not for logs but for 'listener' queries for each table. 

If a client want to listen for a certain or all changes he is free to submit 
where clauses. So every time a changed row of that table fulfills the where 
clause the client listener gets notified.

A client may issue many where clauses and would be able to query its active 
where clauses for a table. By removing all where clauses the listener will 
actually remove itself from the listener list for that table.

One could even extend that by submitting real select statements working on only 
the currently change row. Maybe the listener may even add a timing setting 
allowing a node to aggregate multiple update events and send one single 
notification for multiple changed rows (if this makes sense for a system using 
hashing for partition / sharding).

Since many clients may listen using the same where clause the performance would 
be manageable and not depending on how many clients are listening but how many 
different select statements where listened to by all clients.

By being able to name the listening where clauses one could even generated 
named events where a combination of name and where clause makes it unique. 

An additional option is being able to add an additional value for each row 
containing the names of all event names associated with the listener select 
statements (queries).

Pro:
  * Easy to understand
  * Easy to manage 
  * Fine-tuning is possible (like a single client listens to only a single user 
or a list of particular users)
  * Lot of reuse capabilities (just query over the changed row(s) not all rows, 
grammar etc.)
  * Works on shared tables
  * Avoids maintaining disk logs
  * Only three operations are necessary to implement (add listener query, 
remove listener query and get all active listener queries).
  * Performance improvements possible by combining (almost) similar where 
clauses and by adding special cases (like where clauses for certain user IDs 
will result in a huge lists.
  * All in memory operation no disk writes necessary

Contra:
   * Ensuring  being notified exactly once might be challenging
   * Only works on changes about to happen afterwards and not changes happened 
in recent time. 

 Change Data Capture (CDC)
 -

 Key: 

[jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)

2015-02-26 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14338836#comment-14338836
 ] 

Helena Edelson edited comment on CASSANDRA-8844 at 2/26/15 6:19 PM:


I think I need this for the Spark Cassandra Connector 
https://datastax-oss.atlassian.net/browse/SPARKC-40



was (Author: helena_e):
I think I need this for the Spark Cassandra Connector 
https://issues.apache.org/jira/browse/CASSANDRA-8844

 Change Data Capture (CDC)
 -

 Key: CASSANDRA-8844
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8844
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
Reporter: Tupshin Harper
 Fix For: 3.1


 In databases, change data capture (CDC) is a set of software design patterns 
 used to determine (and track) the data that has changed so that action can be 
 taken using the changed data. Also, Change data capture (CDC) is an approach 
 to data integration that is based on the identification, capture and delivery 
 of the changes made to enterprise data sources.
 -Wikipedia
 As Cassandra is increasingly being used as the Source of Record (SoR) for 
 mission critical data in large enterprises, it is increasingly being called 
 upon to act as the central hub of traffic and data flow to other systems. In 
 order to try to address the general need, we (cc [~brianmhess]), propose 
 implementing a simple data logging mechanism to enable per-table CDC patterns.
 h2. The goals:
 # Use CQL as the primary ingestion mechanism, in order to leverage its 
 Consistency Level semantics, and in order to treat it as the single 
 reliable/durable SoR for the data.
 # To provide a mechanism for implementing good and reliable 
 (deliver-at-least-once with possible mechanisms for deliver-exactly-once ) 
 continuous semi-realtime feeds of mutations going into a Cassandra cluster.
 # To eliminate the developmental and operational burden of users so that they 
 don't have to do dual writes to other systems.
 # For users that are currently doing batch export from a Cassandra system, 
 give them the opportunity to make that realtime with a minimum of coding.
 h2. The mechanism:
 We propose a durable logging mechanism that functions similar to a commitlog, 
 with the following nuances:
 - Takes place on every node, not just the coordinator, so RF number of copies 
 are logged.
 - Separate log per table.
 - Per-table configuration. Only tables that are specified as CDC_LOG would do 
 any logging.
 - Per DC. We are trying to keep the complexity to a minimum to make this an 
 easy enhancement, but most likely use cases would prefer to only implement 
 CDC logging in one (or a subset) of the DCs that are being replicated to
 - In the critical path of ConsistencyLevel acknowledgment. Just as with the 
 commitlog, failure to write to the CDC log should fail that node's write. If 
 that means the requested consistency level was not met, then clients *should* 
 experience UnavailableExceptions.
 - Be written in a Row-centric manner such that it is easy for consumers to 
 reconstitute rows atomically.
 - Written in a simple format designed to be consumed *directly* by daemons 
 written in non JVM languages
 h2. Nice-to-haves
 I strongly suspect that the following features will be asked for, but I also 
 believe that they can be deferred for a subsequent release, and to guage 
 actual interest.
 - Multiple logs per table. This would make it easy to have multiple 
 subscribers to a single table's changes. A workaround would be to create a 
 forking daemon listener, but that's not a great answer.
 - Log filtering. Being able to apply filters, including UDF-based filters 
 would make Casandra a much more versatile feeder into other systems, and 
 again, reduce complexity that would otherwise need to be built into the 
 daemons.
 h2. Format and Consumption
 - Cassandra would only write to the CDC log, and never delete from it. 
 - Cleaning up consumed logfiles would be the client daemon's responibility
 - Logfile size should probably be configurable.
 - Logfiles should be named with a predictable naming schema, making it 
 triivial to process them in order.
 - Daemons should be able to checkpoint their work, and resume from where they 
 left off. This means they would have to leave some file artifact in the CDC 
 log's directory.
 - A sophisticated daemon should be able to be written that could 
 -- Catch up, in written-order, even when it is multiple logfiles behind in 
 processing
 -- Be able to continuously tail the most recent logfile and get 
 low-latency(ms?) access to the data as it is written.
 h2. Alternate approach
 In order to make consuming a change log easy and efficient to do with low 
 latency, the following could supplement the approach outlined above
 - Instead