Re: Best approach in Cassandra (+ Spark?) for Continuous Queries?

Hugo José Pinto Sat, 03 Jan 2015 15:08:29 -0800

Thank you all for your answers.

It seems I'll have to go with some event-driven processing before/during the 
Cassandra write path.


My concern would be that I'd love to first guarantee the disk write of the 
Cassandra persistence and then do the event processing (which is mostly CRUD 
intercepts at this point), even if slightly delayed, and doing so via triggers 
would probably bog down the whole processing pipeline. 

What I'd probably do is have the trigger write the keys of all CRUDed 
elements to a separate table, and have the ESP process that table.
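
A minimal sketch of that idea in Python, using an in-memory stand-in rather 
than real Cassandra tables or a real trigger. The names Store, change_log and 
drain are hypothetical and only illustrate the ordering: apply the mutation 
first, record the touched key second, and let a separate consumer read the 
key table afterwards.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Store:
    """In-memory stand-in for the data table plus the separate key table."""
    rows: Dict[str, dict] = field(default_factory=dict)
    # Each entry is (sequence number, operation, key).
    change_log: List[Tuple[int, str, str]] = field(default_factory=list)

    def write(self, key: str, value: Optional[dict]) -> None:
        # 1) Apply the mutation first (the persisted write in a real system).
        if value is None:
            op = "delete"
            self.rows.pop(key, None)
        else:
            op = "update" if key in self.rows else "insert"
            self.rows[key] = value
        # 2) Only then record which key was touched. This is the cheap,
        #    non-blocking step that the trigger would be limited to.
        self.change_log.append((len(self.change_log), op, key))


def drain(store: Store, last_seq: int,
          handler: Callable[[str, str, Optional[dict]], None]) -> int:
    """Process entries newer than last_seq and return the new high-water mark.

    The handler receives the row as it is now, not as it was when the
    change was logged.
    """
    for seq, op, key in store.change_log[last_seq + 1:]:
        handler(op, key, store.rows.get(key))
        last_seq = seq
    return last_seq
```

The event processor would call drain periodically, starting from -1 and 
passing back the value it returned last time, so the write path never waits 
on event handling.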

Thank you for your contribution. Should anyone else have any experience in 
these scenarios, I'm obviously all ears as well. 

Best,

Hugo 

Sent from my iPhone

On 03/01/2015, at 11:09, DuyHai Doan <doanduy...@gmail.com> wrote:

> Hello Hugo
> 
>  I was facing the same kind of requirement from some users. Long story short, 
> below are the possible strategies, with the advantages and drawbacks of each:
> 
> 1) Put Spark in front of the back-end: every incoming 
> modification/update/insert goes into Spark first, then Spark forwards it 
> to Cassandra for persistence. With Spark, you can perform pre- or 
> post-processing and notify external clients of mutations.
> 
>  The drawback of this solution is that all the incoming mutations must go 
> through Spark. You may set up a Kafka queue as temporary storage to 
> distribute the load and consume mutations with Spark, but that adds to the 
> architecture's complexity with additional components and technologies.
> 
> 2) For high availability and resilience, you probably want to have all 
> mutations saved into Cassandra first, then process notifications with Spark. 
> In this case the only way to get notifications from Cassandra, as of version 
> 2.1, is to rely on manually coded triggers (which are still an experimental 
> feature).
> 
> With the triggers you can notify whatever clients you want, not only Spark.
> 
> The big drawback of this solution is that playing with triggers is dangerous 
> if you are not familiar with Cassandra internals. Indeed, the trigger is on 
> the write path and may hurt performance if you are doing complex and blocking 
> tasks.
> 
> Those are the two solutions I can see; maybe the ML members will propose 
> other innovative choices.
> 
>  Regards
> 
>> On Sat, Jan 3, 2015 at 11:46 AM, Hugo José Pinto <hugo.pi...@inovaworks.com> 
>> wrote:
>> Hello.
>> 
>> We're currently using Hazelcast (http://hazelcast.org/) as a distributed 
>> in-memory data grid. That's been working sort-of-well for us, but going 
>> solely in-memory has exhausted its path in our use case, and we're 
>> considering porting our application to a NoSQL persistent store. After the 
>> usual comparisons and evaluations, we're borderline close to picking 
>> Cassandra, plus eventually Spark for analytics.
>> 
>> Nonetheless, there is a gap in our architectural needs that we're still not 
>> grasping how to solve in Cassandra (with or without Spark): Hazelcast allows 
>> us to create a Continuous Query in that, whenever a row is 
>> added/removed/modified from the clause's resultset, Hazelcast calls us back 
>> with the corresponding notification. We use this to continuously update the 
>> clients via AJAX streaming with the new/changed rows.
>> 
>> This is probably a conceptual mismatch we're making, so - how to best 
>> address this use case in Cassandra (with or without Spark's help)? Is there 
>> something in the API that allows for Continuous Queries on key/clause 
>> changes (haven't found it)? Is there some other way to get a stream of 
>> key/clause updates? Events of some sort?
>> 
>> I'm aware that we could, eventually, periodically poll Cassandra, but in our 
>> use case, the client is potentially interested in a large number of table 
>> clause notifications (think "all changes to Ship positions on California's 
>> coastline"), and iterating out of the store would kill the streamer's 
>> scalability.
>> 
>> Hence, the magic question: what are we missing? Is Cassandra the wrong tool 
>> for the job? Are we not aware of a particular part of the API or external 
>> library in/outside the apache realm that would allow for this?
>> 
>> Many thanks for any assistance!
>> 
>> Hugo
>> 
>
