Tupshin Harper created CASSANDRA-8844:
-----------------------------------------

             Summary: Change Data Capture (CDC)
                 Key: CASSANDRA-8844
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8844
             Project: Cassandra
          Issue Type: New Feature
          Components: Core
            Reporter: Tupshin Harper
             Fix For: 3.1


"In databases, change data capture (CDC) is a set of software design patterns 
used to determine (and track) the data that has changed so that action can be 
taken using the changed data. Also, Change data capture (CDC) is an approach to 
data integration that is based on the identification, capture and delivery of 
the changes made to enterprise data sources."
-Wikipedia

As Cassandra is increasingly being used as the Source of Record (SoR) for 
mission critical data in large enterprises, it is increasingly being called 
upon to act as the central hub of traffic and data flow to other systems. In 
order to try to address the general need, we (cc [~brianmhess]), propose 
implementing a simple data logging mechanism to enable per-table CDC patterns.

h2. The goals:
# Use CQL as the primary ingestion mechanism, in order to leverage its 
Consistency Level semantics, and in order to treat it as the single 
reliable/durable SoR for the data.
# To provide a mechanism for implementing good and reliable 
(deliver-at-least-once with possible mechanisms for deliver-exactly-once ) 
continuous semi-realtime feeds of mutations going into a Cassandra cluster.
# To eliminate the developmental and operational burden of users so that they 
don't have to do dual writes to other systems.
# For users that are currently doing batch export from a Cassandra system, give 
them the opportunity to make that realtime with a minimum of coding.

The mechanism:
We propose a durable logging mechanism that functions similar to a commitlog, 
with the following nuances:
- Takes place on every node, not just the coordinator, so RF number of copies 
are logged.
- Separate log per table.
- Per-table configuration. Only tables that are specified as CDC_LOG would do 
any logging.
- Per DC. We are trying to keep the complexity to a minimum to make this an 
easy enhancement, but most likely use cases would prefer to only implement CDC 
logging in one (or a subset) of the DCs that are being replicated to
- In the critical path of ConsistencyLevel acknowledgment. Just as with the 
commitlog, failure to write to the CDC log should fail that node's write. If 
that means the requested consistency level was not met, then clients *should* 
experience UnavailableExceptions.
- Be written in a Row-centric manner such that it is easy for consumers to 
reconstitute rows atomically.
- Written in a simple format designed to be consumed *directly* by daemons 
written in non JVM languages

h2. Nice-to-haves
I strongly suspect that the following features will be asked for, but I also 
believe that they can be deferred for a subsequent release, and to guage actual 
interest.
- Multiple logs per table. This would make it easy to have multiple 
"subscribers" to a single table's changes. A workaround would be to create a 
forking daemon listener, but that's not a great answer.
- Log filtering. Being able to apply filters, including UDF-based filters would 
make Casandra a much more versatile feeder into other systems, and again, 
reduce complexity that would otherwise need to be built into the daemons.

h2. Format and Consumption
- Cassandra would only write to the CDC log, and never delete from it. 
- Cleaning up consumed logfiles would be the client daemon's responibility
- Logfile size should probably be configurable.
- Logfiles should be named with a predictable naming schema, making it triivial 
to process them in order.
- Daemons should be able to checkpoint their work, and resume from where they 
left off. This means they would have to leave some file artifact in the CDC 
log's directory.
- A sophisticated daemon should be able to be written that could 
-- Catch up, in written-order, even when it is multiple logfiles behind in 
processing
-- Be able to continuously "tail" the most recent logfile and get 
low-latency(ms?) access to the data as it is written.

h2. Alternate approach
In order to make consuming a change log easy and efficient to do with low 
latency, the following could supplement the approach outlined above
- Instead of writing to a logfile, by default, Cassandra could expose a socket 
for a daemon to connect to, and from which it could pull each row.
- Cassandra would have a limited buffer for storing rows, should the listener 
become backlogged, but it would immediately spill to disk in that case, never 
incurring large in-memory costs.

h2. Additional consumption possibility
With all of the above, still relevant:
- instead (or in addition to) using the other logging mechanisms, use CQL 
transport itself as a logger.
- Extend the CQL protoocol slightly so that rows of data can be return to a 
listener that didn't explicit make a query, but instead registered itself with 
Cassandra as a listener for a particular event type, and in this case, the 
event type would be anything that would otherwise go to a CDC log.
- If there is no listener for the event type associated with that log, or if 
that listener gets backlogged, the rows will again spill to the persistent 
storage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to