Re: Using CDC Feature to Stream C* to Kafka (Design Proposal)

2018-09-11 Thread Joy Gao
Re Rahul: "Although DSE Advanced Replication does one-way replication,
those are use cases with limited value to me because ultimately it’s still
a master-slave design."
Completely agree. I'm not familiar with the Calvin protocol, but that
sounds interesting (reading time...).

On Tue, Sep 11, 2018 at 8:38 PM Joy Gao  wrote:

> Thank you all for the feedback so far.
>
> The immediate use case for us is setting up a real-time streaming data
> pipeline from C* to our Data Warehouse (BigQuery), where other teams can
> access the data for reporting/analytics/ad-hoc queries. We already do this
> with MySQL
> <https://wecode.wepay.com/posts/streaming-databases-in-realtime-with-mysql-debezium-kafka>,
> where we stream the MySQL Binlog via Debezium <https://debezium.io>'s
> MySQL Connector to Kafka, and then use a BigQuery Sink Connector to stream
> data to BigQuery.
>
> Re Jon's comment about why not write to Kafka first: in some cases that
> may be ideal, but one potential concern we have with writing to Kafka
> first is not having "read-after-write" consistency. The data could be
> written to Kafka, but not yet consumed by C*. If the web service issues a
> (quorum) read immediately after the (quorum) write, the data being
> returned could still be outdated if the consumer has not caught up. Having
> the web service interact with C* directly solves this problem for us (we
> could add a cache before writing to Kafka, but that adds operational
> complexity to the architecture; alternatively, we could write to Kafka and
> C* transactionally, but distributed transactions are slow).
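>
> To make the contrast concrete, here is a minimal sketch of the C*-first
> path (assuming the DataStax Java driver 3.x and a hypothetical "users"
> table): because a QUORUM read overlaps a QUORUM write, the read observes
> the update, whereas a Kafka-first design could return stale data until
> the consumer catches up.
>
> import com.datastax.driver.core.*;
>
> public class ReadAfterWrite {
>     public static void main(String[] args) {
>         try (Cluster cluster = Cluster.builder()
>                 .addContactPoint("127.0.0.1").build();
>              Session session = cluster.connect("app")) {
>             // QUORUM write: acknowledged by a majority of replicas.
>             session.execute(new SimpleStatement(
>                     "UPDATE users SET email = ? WHERE id = ?", "a@b.com", 42)
>                     .setConsistencyLevel(ConsistencyLevel.QUORUM));
>             // QUORUM read immediately after: the read quorum overlaps
>             // the write quorum, so the update is guaranteed visible.
>             Row row = session.execute(new SimpleStatement(
>                     "SELECT email FROM users WHERE id = ?", 42)
>                     .setConsistencyLevel(ConsistencyLevel.QUORUM)).one();
>             System.out.println(row.getString("email"));
>             // A Kafka-first design would do producer.send(record) here
>             // instead, and this read could miss the write entirely.
>         }
>     }
> }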
>
> Having the ability to stream its data to other systems could make C* more
> flexible and more easily integrated into a larger data ecosystem. As
> Dinesh has mentioned, implementing this in the database layer means there
> is a standard approach to getting a change notification stream (unlike
> triggers, which are ad-hoc and customized). Aside from replication, the
> change events could be used for updating Elasticsearch, generating derived
> views (i.e. for reporting), sending to an audit service, sending to a
> notification service, and in our case, streaming to our data warehouse for
> analytics. (One article that goes over database streaming is Martin
> Kleppmann's Turning the Database Inside Out with Apache Samza
> <https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/>,
> which seems relevant here.) For reference, turning the database into a
> stream of change events is pretty common in SQL databases (e.g. the MySQL
> binlog, the Postgres WAL) and in NoSQL databases that have a
> primary-replica setup (e.g. the MongoDB oplog). Recently CockroachDB
> introduced a CDC feature as well (and they have masterless replication too).
>
> Hope that answers the question. That said, dedupe/ordering/getting the
> full row of data via C* CDC is a hard problem, but it may be worth solving
> for the reasons mentioned above. Our proposal is a user-level approach to
> solving these problems. Maybe the more sensible thing to do is to build it
> as part of C* itself, but that's a much bigger discussion. If anyone is
> building a streaming pipeline for C*, we'd be interested in hearing their
> approaches as well.
>
>
> On Tue, Sep 11, 2018 at 7:01 AM Rahul Singh wrote:
>
>> You know what they say: Go big or go home.
>>
>> Right now candidates are Cassandra itself (but embedded or on the side,
>> not on the actual data clusters), ZooKeeper (yuck), Kafka (which needs
>> ZooKeeper, yuck), S3 (outside service dependency, so no go).
>>
>> Jeff, those are great patterns, especially the second one. Have used it
>> several times. Cassandra is a great place to store data in transport.
>>
>>
>> Rahul
>> On Sep 10, 2018, 5:21 PM -0400, DuyHai Doan wrote:
>>
>> Also using Calvin means having to implement a distributed monotonic
>> sequence as a primitive, not trivial at all ...
>>
>> On Mon, Sep 10, 2018 at 3:08 PM, Rahul Singh <
>> rahul.xavier.si...@gmail.com> wrote:
>>
>>> In response to mimicking Advanced Replication in DSE: I understand the
>>> goal. Although DSE Advanced Replication does one-way replication, those
>>> are use cases with limited value to me because ultimately it’s still a
>>> master-slave design.
>>>
>>> I’m working on a prototype for this for two-way replication between
>>> clusters or databases regardless of DB tech - and every variation I can
>>> get to comes down to some implementation of the Calvin protocol, which
>>> basically verifies the change in either cluster, sequences it according
>>> to impact to the underlying data, and then schedules the mutation in the
>>> underlying technology platforms.
>> On Sep 10, 2018, 3:58 AM -0400, Dinesh Joshi wrote:
>>
>> On Sep 9, 2018, at 6:08 AM, Jonathan Haddad  wrote:
>>
>> There may be some use cases for it... but I'm not sure what they are. It
>> might help if you shared the use cases where the extra complexity is
>> required. When is writing to Cassandra, which then dedupes and writes to
>> Kafka, a preferred design to using Kafka first and simply writing to
>> Cassandra?
>>
>>
>> From my reading of the proposal, it seems to bring functionality similar
>> to MySQL's binlog-to-Kafka connector. This is useful for many applications
>> that want to be notified when certain (or any) rows change in the
>> database, primarily for an event-driven application architecture.
>>
>> Implementing this in the database layer means there is a standard
>> approach to getting a change notification stream. Downstream subscribers
>> can then decide which notifications to act on.
>>
>> LinkedIn's Databus is similar in functionality -
>> https://github.com/linkedin/databus. However, it is for heterogeneous
>> datastores.
>>
>> On Thu, Sep 6, 2018 at 1:53 PM Joy Gao  wrote:
>>
>>>
>>>
>>> We have a WIP design doc
>>> <https://wepayinc.box.com/s/fmdtw0idajyfa23hosf7x4ustdhb0ura> that
>>> goes over this idea in detail.
>>>
>>> We haven't sorted out all the edge cases yet, but would love to get some
>>> feedback from the community on the general feasibility of this approach.
>>> Any ideas/concerns/questions would be helpful to us. Thanks!
>>>
>>>
>> Interesting idea. I did go over the proposal briefly. I concur with Jon
>> about adding more use cases to clarify this feature's potential.
>>
>> Dinesh
>>
>>
>


Using CDC Feature to Stream C* to Kafka (Design Proposal)

2018-09-06 Thread Joy Gao
Hi all,

We are fairly new to Cassandra. We began looking into the CDC feature
introduced in 3.8. As we spent more time looking into it, the complexity
began to add up (e.g. mutations duplicated based on RF, out-of-order
mutations, mutations that do not contain the full row of data, etc.). These
limitations have already been mentioned in the discussion thread in
CASSANDRA-8844, so we understand the design decisions around them. However,
we do not want to push solving this complexity to every downstream
consumer, where each has to handle deduping/ordering/read-before-write to
get the full row; instead we want to solve these problems earlier in the
pipeline, so the change messages are deduped/ordered/complete by the time
they arrive in Kafka. Dedupe can be solved with a cache, and ordering can
be solved since mutations have timestamps, but the one we have the most
trouble with is not having the full row of data.
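
For the two problems we consider tractable, here is a minimal sketch (in
Java; all type names and the cache bound are hypothetical) of what we have
in mind: a bounded LRU cache of mutation digests absorbs the RF duplicate
copies of each mutation, and a last-write-wins check on the mutation's
write timestamp handles ordering per partition key.

import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class ChangeEventFilter {
    // LRU cache of recently seen mutation digests; the RF copies of a
    // mutation hash to the same digest, so duplicates are dropped.
    private final Map<String, Boolean> seen =
            new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> e) {
                    return size() > 100_000; // hypothetical bound
                }
            };
    // Highest write timestamp emitted so far, per partition key.
    private final Map<ByteBuffer, Long> lastTimestamp = new HashMap<>();

    /** Returns true if this change event should be forwarded to Kafka. */
    public synchronized boolean accept(ByteBuffer partitionKey,
                                       long writeTimestampMicros,
                                       byte[] serializedMutation) throws Exception {
        String digest = Base64.getEncoder().encodeToString(
                MessageDigest.getInstance("MD5").digest(serializedMutation));
        if (seen.put(digest, Boolean.TRUE) != null) {
            return false; // duplicate copy of an already-emitted mutation
        }
        Long last = lastTimestamp.get(partitionKey);
        if (last != null && writeTimestampMicros < last) {
            return false; // stale: a newer write for this key already went out
        }
        lastTimestamp.put(partitionKey, writeTimestampMicros);
        return true;
    }
}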

We had a couple of discussions with folks at other companies who are
working on applying the CDC feature to their real-time data pipelines. At a
high level, the common feedback we gathered is to use a stateful processing
approach that maintains a separate db to which mutations are applied, which
then allows them to construct the "before" and "after" data without having
to query the original Cassandra db on each mutation. The downside of this
is the operational overhead of having to maintain this intermediary db for
CDC.
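
As a rough illustration of that stateful approach (hypothetical scaffolding
only; a real pipeline would presumably use a persistent store such as
RocksDB rather than an in-memory map), applying each partial mutation to an
intermediary store keyed by primary key yields the "before" and "after"
images without a read back to Cassandra:

import java.util.HashMap;
import java.util.Map;

public class StatefulShredder {
    // Intermediary store: full row image per primary key.
    private final Map<String, Map<String, Object>> store = new HashMap<>();

    /** Applies a partial-row mutation; returns {before, after} images. */
    public Map<String, Object>[] apply(String pk, Map<String, Object> changedColumns) {
        Map<String, Object> before = store.get(pk);
        Map<String, Object> after = before == null
                ? new HashMap<String, Object>()
                : new HashMap<String, Object>(before);
        after.putAll(changedColumns); // merge the partial update into the row
        store.put(pk, after);
        @SuppressWarnings("unchecked")
        Map<String, Object>[] images = new Map[] { before, after };
        return images;
    }
}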

We have an unconventional idea (inspired by DSE Advanced Replication) that
eliminates some of the operational overhead, at the cost of increased code
complexity and memory pressure. The high-level idea is a stateless
processing approach where we have a process on each C* node that parses
mutations from the CDC logs and queries the local node to get the "after"
data, which avoids network hops and thus makes reading the full row of data
more efficient. We essentially treat the mutations in the CDC log as change
notifications. To solve dedupe/ordering, only the primary node for each
token range will send the data to Kafka, but data are reconciled with peer
nodes to prevent data loss. A sketch of this agent follows.
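
Below is a heavily stubbed sketch of that per-node agent. Everything in it
is hypothetical scaffolding: replicasFor() stands in for a ring-metadata
lookup and readLocalRow() for a CL.ONE read pinned to the local node;
neither is a real Cassandra API.

import java.net.InetAddress;
import java.util.List;
import java.util.Map;

public class CdcAgent {
    interface Ring {
        // Natural endpoints for a token, primary replica first.
        List<InetAddress> replicasFor(long token);
    }

    private final InetAddress self;
    private final Ring ring;

    CdcAgent(InetAddress self, Ring ring) {
        this.self = self;
        this.ring = ring;
    }

    /** Called for each mutation parsed out of this node's CDC log. */
    void onMutation(long token, String keyspace, String table, Object partitionKey) {
        // Only the primary replica for the token emits, so each change
        // reaches Kafka once even though RF nodes log the same mutation.
        if (!ring.replicasFor(token).get(0).equals(self)) {
            return; // a peer owns this range (reconciliation covers its failures)
        }
        // The CDC entry may be a partial update, so read the full "after"
        // row from the local node (no network hop).
        Map<String, Object> afterRow = readLocalRow(keyspace, table, partitionKey);
        sendToKafka(keyspace, table, partitionKey, afterRow);
    }

    private Map<String, Object> readLocalRow(String ks, String table, Object pk) {
        throw new UnsupportedOperationException("stub: CL.ONE read against localhost");
    }

    private void sendToKafka(String ks, String table, Object pk,
                             Map<String, Object> row) {
        throw new UnsupportedOperationException("stub: produce a change event");
    }
}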

We have a WIP design doc
<https://wepayinc.box.com/s/fmdtw0idajyfa23hosf7x4ustdhb0ura> that goes
over this idea in detail.

We haven't sorted out all the edge cases yet, but would love to get some
feedback from the community on the general feasibility of this approach.
Any ideas/concerns/questions would be helpful to us. Thanks!

Joy


CDC and TTL

2018-06-18 Thread Joy Gao
Hi all!

I recently started to look into the Cassandra CDC implementation. One
question that occurred to me is how/if TTL is handled for CDC. For example,
if I insert some data with a TTL expiring in 60 seconds, will CDC be aware
of these changes 60 seconds later when the TTL expires? If not, then the
CDC consumer won't know that the data has actually been deleted. A related
question is where in the lifecycle expired TTLs get detected and marked
with tombstones. I am aware that during compaction, cells marked with
tombstones will be deleted, but I am not sure whether compaction also looks
for cells with expired TTLs and marks them with tombstones.
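
For concreteness, the scenario I have in mind looks like this (hypothetical
"events" table, DataStax Java driver). If I understand the design correctly,
the INSERT below is the only mutation that ever reaches the commit log, so
it is the only thing CDC could surface; the expiry 60 seconds later would
produce no new mutation and hence no CDC event.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class TtlExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("app")) {
            // The write (with its TTL) is logged and visible to CDC ...
            session.execute(
                "INSERT INTO events (id, payload) VALUES (1, 'hello') USING TTL 60");
            // ... but when the cell expires 60 seconds from now, no new
            // mutation is written, so nothing appears in the CDC log.
        }
    }
}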

Cheers,
Joy