The typical pattern I've seen in the field is Kafka + consumers for each
destination (a variant of dual write, I know), but of course that would not work
for your goal of relying on C* for dedup. Triggers would unfortunately suffer
the same problem, so you're really left with a batch job (most likely Spark) to
move data from C* into HDFS on a given interval. If this is really a cold
storage use case, that can work quite well, especially if you've modeled your
data as a time series or with some sort of time-based bucketing so you can
quickly get full partitions of data out of C* in a deterministic fashion and
not have to scan your entire data set.
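
For what it's worth, here's a minimal sketch of that batch job in Spark (Scala).
The keyspace, table, bucket column, connection host and HDFS path are all
placeholders, and it assumes the DataStax spark-cassandra-connector is on the
classpath:

import org.apache.spark.sql.SparkSession

object CassandraToHdfsExport {
  def main(args: Array[String]): Unit = {
    val bucket = args(0) // e.g. "2016-08-09", the time bucket to export

    val spark = SparkSession.builder()
      .appName("cassandra-to-hdfs-export")
      .config("spark.cassandra.connection.host", "cassandra-host") // placeholder
      .getOrCreate()

    // Read only the partitions for this bucket; the connector pushes the
    // partition-key predicate down, so C* hands back full partitions
    // instead of doing a full table scan.
    val events = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "events"))
      .load()
      .filter(s"bucket = '$bucket'")

    // Land the bucket in HDFS as Parquet for cold storage.
    events.write
      .mode("overwrite")
      .parquet(s"hdfs:///data/cold/events/bucket=$bucket")

    spark.stop()
  }
}

You'd run that on whatever interval matches your bucket size (e.g. once a day
for daily buckets).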

For similar needs I've also seen Spark Streaming + querying Cassandra for
duplicate checks to dedup, then output to another source (a form of dual write,
but with dedup). This was really silly and slow. I only bring it up to save you
the trouble in case you end up going down the same path chasing something more
'real time'.
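
To make the "slow" part concrete, the shape of it was roughly this (spark-shell
style; keyspace, table and the stand-in stream source are placeholders): every
single event pays a synchronous read round-trip to C* before it can be written
anywhere else.

import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-dedup-antipattern")
  .set("spark.cassandra.connection.host", "cassandra-host") // placeholder
val ssc = new StreamingContext(conf, Seconds(10))
val connector = CassandraConnector(conf)

// Stand-in source; in practice this was a Kafka stream of (eventId, payload).
val events = ssc.socketTextStream("localhost", 9999).map(line => (line, line))

events.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    connector.withSessionDo { session =>
      val stmt = session.prepare("SELECT event_id FROM ks.events WHERE event_id = ?")
      partition.foreach { case (id, payload) =>
        // One blocking read per event -- this is what made it crawl.
        val alreadySeen = session.execute(stmt.bind(id)).one() != null
        if (!alreadySeen) {
          // write to the other destination here
        }
      }
    }
  }
}

ssc.start()
ssc.awaitTermination()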

Regards,
Ryan Svihla

On Aug 9, 2016, 11:09 AM -0500, Ben Vogan <b...@shopkick.com>, wrote:
> Hi all,
>
> We are investigating using Cassandra in our data platform. We would like data 
> to go into Cassandra first and to eventually be replicated into our data lake 
> in HDFS for long term cold storage. Does anyone know of a good way of doing 
> this? We would rather not have parallel writes to HDFS and Cassandra because 
> we were hoping that we could use Cassandra primary keys to de-duplicate 
> events.
>
> Thanks,
> --
>
> BENJAMIN VOGAN | Data Platform Team Lead
> shopkick (http://www.shopkick.com/)
> The indispensable app that rewards you for shopping.
