Thanks Ryan. I was hoping there was a change data capture framework. We have late-arriving events, some of which can be very late. We would either have to batch-collect data over a large time window every so often to go back and pick those up, or accept that we are going to lose a small percentage of events. Neither option is ideal.
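
To make the trade-off concrete, here is a minimal sketch of the lookback arithmetic. It assumes day-sized time buckets and a daily batch run that re-exports the previous `lookback_days` buckets in full. The function names and the `lookback_days` setting are hypothetical and only for illustration; nothing here calls Cassandra, Spark, or HDFS.

```python
from datetime import date, timedelta
from typing import List


def buckets_to_export(run_date: date, lookback_days: int) -> List[str]:
    """Return the day-bucket keys (YYYY-MM-DD) a run on run_date re-exports.

    Hypothetical helper: each key is assumed to identify one time-based
    partition that the batch job reads in full and overwrites downstream.
    """
    if lookback_days < 1:
        raise ValueError("lookback_days must be at least 1")
    # Oldest bucket first, ending with the day before the run.
    return [
        (run_date - timedelta(days=offset)).isoformat()
        for offset in range(lookback_days, 0, -1)
    ]


def is_captured(event_day: date, arrival_day: date, lookback_days: int) -> bool:
    """True if a late event is still picked up by the first run after it arrives.

    Assumes one run per day, so the first run that can see the event
    happens on the day after arrival_day.
    """
    next_run = arrival_day + timedelta(days=1)
    return event_day.isoformat() in buckets_to_export(next_run, lookback_days)


if __name__ == "__main__":
    print(buckets_to_export(date(2016, 8, 9), 3))
    print(is_captured(date(2016, 8, 1), date(2016, 8, 3), 3))
    print(is_captured(date(2016, 8, 1), date(2016, 8, 5), 3))
```

Under these assumptions, a larger `lookback_days` means every run re-reads more partitions, while a smaller one means any event arriving later than the window is never exported.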

On Tue, Aug 9, 2016 at 10:30 AM, Ryan Svihla <r...@foundev.pro> wrote:

> The typical pattern I've seen in the field is Kafka + consumers for each
> destination (a variant of dual write, I know). This of course would not work
> for your goal of relying on C* for dedup. Triggers would also suffer the
> same problem, unfortunately, so you're really left with a batch job (most
> likely Spark) to move data from C* into HDFS on a given interval. If this
> is really a cold storage use case, that can work quite well, especially
> assuming you've modeled your data as a time series or with some sort of
> time-based bucketing, so you can quickly get full partitions of data out of
> C* in a deterministic fashion and not have to scan your entire data set.
>
> For similar needs I've also seen Spark Streaming + querying Cassandra
> for duplication checks to dedup, then output to another source (a form of
> dual write, but with dedup). This was really silly and slow. I only bring it
> up to save you the trouble in case you end up on the same path, chasing
> something more 'real time'.
>
> Regards,
> Ryan Svihla
>
> On Aug 9, 2016, 11:09 AM -0500, Ben Vogan <b...@shopkick.com>, wrote:
>
> Hi all,
>
> We are investigating using Cassandra in our data platform. We would like
> data to go into Cassandra first and to eventually be replicated into our
> data lake in HDFS for long-term cold storage. Does anyone know of a good
> way of doing this? We would rather not have parallel writes to HDFS and
> Cassandra because we were hoping that we could use Cassandra primary keys
> to de-duplicate events.
>
> Thanks,
> --
> *BENJAMIN VOGAN* | Data Platform Team Lead
> shopkick <http://www.shopkick.com/>
> The indispensable app that rewards you for shopping.