Re: Using CDC Feature to Stream C* to Kafka (Design Proposal)

2018-09-12 Thread Jay Zhuang
We have a similar use case: Streamific, the Ingestion Service for Hadoop Big Data at Uber Engineering. We had this data ingestion pipeline built on MySQL/schemaless before using Cassandra. For Cassandra, we used to

Re: Using CDC Feature to Stream C* to Kafka (Design Proposal)

2018-09-12 Thread DuyHai Doan
The biggest problem with getting CDC to work correctly in C* is the deduplication issue. Having a process read incoming mutations from the commitlog is not that hard; deduplicating them across N replicas is much harder. The idea is: why don't we generate the CDC event directly at the coordinator

Re: Using CDC Feature to Stream C* to Kafka (Design Proposal)

2018-09-11 Thread Joy Gao
Re Rahul: "Although DSE advanced replication does one way, those are use cases with limited value to me because ultimately it’s still a master slave design." Completely agree. I'm not familiar with the Calvin protocol, but that sounds interesting (reading time...). On Tue, Sep 11, 2018 at 8:38 PM

Re: Using CDC Feature to Stream C* to Kafka (Design Proposal)

2018-09-11 Thread Joy Gao
Thank you all for the feedback so far. The immediate use case for us is setting up a real-time streaming data pipeline from C* to our Data Warehouse (BigQuery), where other teams can access the data for reporting, analytics, and ad-hoc queries. We already do this with MySQL

Re: Using CDC Feature to Stream C* to Kafka (Design Proposal)

2018-09-11 Thread Rahul Singh
You know what they say: Go big or go home. Right now the candidates are Cassandra itself (embedded or on the side, not on the actual data clusters), ZooKeeper (yuck), Kafka (which needs ZooKeeper, yuck), and S3 (outside service dependency, so no go). Jeff, those are great patterns. Esp. Second

Re: Using CDC Feature to Stream C* to Kafka (Design Proposal)

2018-09-10 Thread Jeff Jirsa
On Sun, Sep 9, 2018 at 6:09 AM Jonathan Haddad wrote: > I'll be honest, I'm having a hard time wrapping my head around an > architecture where you use CDC to push data into Kafka. I've worked on > plenty of systems that use Kafka as a means of communication, and one of > the consumers is a

Re: Using CDC Feature to Stream C* to Kafka (Design Proposal)

2018-09-10 Thread DuyHai Doan
Also using Calvin means having to implement a distributed monotonic sequence as a primitive, not trivial at all ... On Mon, Sep 10, 2018 at 3:08 PM, Rahul Singh wrote: > In response to mimicking Advanced replication in DSE. I understand the > goal. Although DSE advanced replication does one

Re: Using CDC Feature to Stream C* to Kafka (Design Proposal)

2018-09-10 Thread Rahul Singh
In response to mimicking Advanced replication in DSE. I understand the goal. Although DSE advanced replication does one way, those are use cases with limited value to me because ultimately it’s still a master slave design. I’m working on a prototype for this for two way replication between

Re: Using CDC Feature to Stream C* to Kafka (Design Proposal)

2018-09-10 Thread Rahul Singh
Not everyone has it their way like Frank Sinatra. Due to various reasons, folks need to get the changes in Cassandra to be duplicated to a topic for further processing - especially if the new system owner doesn’t own the whole platform. There are various ways to do this but you have to deal

Re: Using CDC Feature to Stream C* to Kafka (Design Proposal)

2018-09-10 Thread Dinesh Joshi
> On Sep 9, 2018, at 6:08 AM, Jonathan Haddad > wrote: > > There may be some use cases for it.. but I'm not sure what they are. It > might help if you shared the use cases where the extra complexity is > required? When does writing to Cassandra which then dedupes

Re: Using CDC Feature to Stream C* to Kafka (Design Proposal)

2018-09-09 Thread Jonathan Haddad
I'll be honest, I'm having a hard time wrapping my head around an architecture where you use CDC to push data into Kafka. I've worked on plenty of systems that use Kafka as a means of communication, and one of the consumers is a process that stores data in Cassandra. That's pretty normal.

Using CDC Feature to Stream C* to Kafka (Design Proposal)

2018-09-06 Thread Joy Gao
Hi all, We are fairly new to Cassandra. We began looking into the CDC feature introduced in 3.0. As we spent more time looking into it, the complexity began to add up (e.g. duplicated mutations depending on RF, out-of-order mutations, mutations that do not contain the full row of data, etc.). These limitations
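
The three complications named in the original post (one copy of each mutation per replica, out-of-order arrival, and mutations that carry only some columns) can be illustrated with a minimal sketch. This is not the design proposed in the thread, and it does not use any Cassandra or Kafka API. The event shape (`key`, `ts`, `cols`) and the function names `mutation_id` and `dedup_and_merge` are hypothetical, chosen only for illustration. It assumes each mutation carries a write timestamp and resolves conflicts per column with last-write-wins; timestamp ties are resolved by arrival order here, which is a simplification.

```python
import hashlib
import json


def mutation_id(event):
    """Fingerprint a mutation so identical copies from different replicas match."""
    payload = json.dumps(
        {"key": event["key"], "ts": event["ts"], "cols": event["cols"]},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def dedup_and_merge(events):
    """Drop replica duplicates, then rebuild full rows column by column.

    Each column keeps the value with the highest write timestamp, so the
    result does not depend on the order in which mutations arrive.
    """
    seen = set()
    rows = {}  # partition key -> {column: (timestamp, value)}
    for event in events:
        mid = mutation_id(event)
        if mid in seen:
            continue  # same mutation already received from another replica
        seen.add(mid)
        row = rows.setdefault(event["key"], {})
        for col, value in event["cols"].items():
            if col not in row or event["ts"] > row[col][0]:
                row[col] = (event["ts"], value)
    # Strip timestamps to return plain rows.
    return {
        key: {col: value for col, (_ts, value) in cols.items()}
        for key, cols in rows.items()
    }


# Hypothetical input: the newer partial update arrives first and is also
# delivered twice, as it would be if two replicas both emitted it.
events = [
    {"key": "user:1", "ts": 2, "cols": {"email": "new@example.com"}},
    {"key": "user:1", "ts": 1, "cols": {"email": "old@example.com", "name": "Ann"}},
    {"key": "user:1", "ts": 2, "cols": {"email": "new@example.com"}},
]
merged = dedup_and_merge(events)
```

In a real pipeline the `seen` set would need to be bounded (for example by a time window), since it grows with every distinct mutation.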