Thanks :) Duly noted - this is all uncharted territory for us, hence the value of seasoned advice.
Best -- Hugo José Pinto

On 03/01/2015, at 23:43, Peter Lin <wool...@gmail.com> wrote:
>
> Listen to Colin's advice; avoid the temptation of anti-patterns.
>
>> On Sat, Jan 3, 2015 at 6:10 PM, Colin <colpcl...@gmail.com> wrote:
>> Use a message bus with a transactional get: get the message, send it to
>> Cassandra, and upon write success submit it to the ESP and commit the get
>> on the bus. Messaging systems like RabbitMQ support this semantic.
>>
>> Using Cassandra as a queuing mechanism is an anti-pattern.
>>
>> --
>> Colin Clark
>> +1-320-221-9531
>>
>>
>>> On Jan 3, 2015, at 6:07 PM, Hugo José Pinto <hugo.pi...@inovaworks.com> wrote:
>>>
>>> Thank you all for your answers.
>>>
>>> It seems I'll have to go with some event-driven processing before/during
>>> the Cassandra write path.
>>>
>>> My concern is that I'd love to first guarantee the disk write of the
>>> Cassandra persistence and only then do the event processing (which is
>>> mostly CRUD intercepts at this point), even if slightly delayed, and doing
>>> so via triggers would probably bog down the whole processing pipeline.
>>>
>>> What I'd probably do is write, from a trigger, to a separate key table
>>> with all the CRUDed elements and have the ESP process that table.
>>>
>>> Thank you for your contribution. Should anyone else have any experience
>>> in these scenarios, I'm obviously all ears as well.
>>>
>>> Best,
>>>
>>> Hugo
>>>
>>> Sent from my iPhone
>>>
>>> On 03/01/2015, at 11:09, DuyHai Doan <doanduy...@gmail.com> wrote:
>>>
>>>> Hello Hugo
>>>>
>>>> I was facing the same kind of requirement from some users. Long story
>>>> short, below are the possible strategies, with the advantages and
>>>> drawbacks of each.
>>>>
>>>> 1) Put Spark in front of the back end: every incoming
>>>> modification/update/insert goes into Spark first, then Spark forwards it
>>>> to Cassandra for persistence. With Spark, you can perform pre- or
>>>> post-processing and notify external clients of mutations.
>>>>
>>>> The drawback of this solution is that all incoming mutations must go
>>>> through Spark. You may set up a Kafka queue as temporary storage to
>>>> distribute the load and consume mutations with Spark, but that adds to
>>>> the architectural complexity with additional components & technologies.
>>>>
>>>> 2) For high availability and resilience, you probably want to have all
>>>> mutations saved into Cassandra first and then process notifications with
>>>> Spark. In this case the only way to get notifications out of Cassandra,
>>>> as of version 2.1, is to rely on manually coded triggers (still an
>>>> experimental feature).
>>>>
>>>> With triggers you can notify whatever clients you want, not only Spark.
>>>>
>>>> The big drawback of this solution is that playing with triggers is
>>>> dangerous if you are not familiar with Cassandra internals. Indeed, the
>>>> trigger is on the write path and may hurt performance if you are doing
>>>> complex or blocking tasks.
>>>>
>>>> Those are the 2 solutions I can see; maybe the ML members will propose
>>>> other innovative choices.
>>>>
>>>> Regards
>>>>
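A bare-bones skeleton of the trigger route Hugo plans and DuyHai's second option describes, assuming the experimental ITrigger interface as shipped with Cassandra 2.1; the ChangeNotifier hand-off is hypothetical, and everything here runs on the write path, so it has to stay cheap and non-blocking:

```java
import java.nio.ByteBuffer;
import java.util.Collection;
import java.util.Collections;

import org.apache.cassandra.db.ColumnFamily;
import org.apache.cassandra.db.Mutation;
import org.apache.cassandra.triggers.ITrigger;
import org.apache.cassandra.utils.ByteBufferUtil;

public class ChangeLogTrigger implements ITrigger {

    @Override
    public Collection<Mutation> augment(ByteBuffer partitionKey, ColumnFamily update) {
        // Runs on the write path: keep it cheap and non-blocking. Here we only
        // hand the mutated partition off to an asynchronous notifier; returning
        // extra Mutations (e.g. against a separate change-log table, as Hugo
        // suggests) is also possible but requires building them from that
        // table's metadata.
        String keyspace = update.metadata().ksName;
        String table = update.metadata().cfName;
        ChangeNotifier.enqueue(keyspace, table, ByteBufferUtil.bytesToHex(partitionKey));
        return Collections.emptyList();
    }

    // Hypothetical hand-off point: a real implementation might push into a
    // bounded in-memory queue drained by a separate thread that feeds the ESP.
    static final class ChangeNotifier {
        static void enqueue(String keyspace, String table, String partitionKeyHex) {
            // no-op placeholder
        }
    }
}
```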
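Colin's message-bus flow could look roughly like the following sketch, assuming RabbitMQ's Java client and the DataStax Java driver; the queue, exchange, keyspace and table names are illustrative only:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Date;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;

public class MutationConsumer {

    public static void main(String[] args) throws Exception {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("tracking");

        Connection conn = new ConnectionFactory().newConnection();
        Channel channel = conn.createChannel();
        channel.queueDeclare("mutations", true, false, false, null);
        channel.exchangeDeclare("esp-events", "fanout", true);

        // autoAck=false gives the "transactional get": the message only leaves
        // the queue once the Cassandra write and the ESP hand-off have succeeded.
        channel.basicConsume("mutations", false, new DefaultConsumer(channel) {
            @Override
            public void handleDelivery(String tag, Envelope env,
                                       AMQP.BasicProperties props, byte[] body) throws IOException {
                String payload = new String(body, StandardCharsets.UTF_8);

                // 1) Persist to Cassandra first.
                session.execute(
                        "INSERT INTO ship_positions (ship_id, ts, payload) VALUES (?, ?, ?)",
                        extractShipId(payload), new Date(), payload);

                // 2) On write success, forward to the ESP.
                channel.basicPublish("esp-events", "", null, body);

                // 3) Commit the get by acking the original message.
                channel.basicAck(env.getDeliveryTag(), false);
            }
        });
    }

    // Placeholder parsing; a real payload would carry a proper schema.
    private static String extractShipId(String payload) {
        return payload.split(",")[0];
    }
}
```

A crash between the Cassandra write and the ack simply redelivers the message, so this is at-least-once delivery and the consumer should be idempotent.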
>>>>> On Sat, Jan 3, 2015 at 11:46 AM, Hugo José Pinto <hugo.pi...@inovaworks.com> wrote:
>>>>> Hello.
>>>>>
>>>>> We're currently using Hazelcast (http://hazelcast.org/) as a distributed
>>>>> in-memory data grid. That's been working sort-of-well for us, but going
>>>>> solely in-memory has exhausted its path in our use case, and we're
>>>>> considering porting our application to a NoSQL persistent store. After
>>>>> the usual comparisons and evaluations, we're borderline close to picking
>>>>> Cassandra, plus eventually Spark for analytics.
>>>>>
>>>>> Nonetheless, there is a gap in our architectural needs that we're still
>>>>> not grasping how to solve in Cassandra (with or without Spark): Hazelcast
>>>>> allows us to create a Continuous Query such that, whenever a row is
>>>>> added/removed/modified from the clause's result set, Hazelcast calls us
>>>>> back with the corresponding notification. We use this to continuously
>>>>> update the clients via AJAX streaming with the new/changed rows.
>>>>>
>>>>> This is probably a conceptual mismatch we're making, so - how do we best
>>>>> address this use case in Cassandra (with or without Spark's help)? Is
>>>>> there something in the API that allows for Continuous Queries on
>>>>> key/clause changes (we haven't found it)? Is there some other way to get
>>>>> a stream of key/clause updates? Events of some sort?
>>>>>
>>>>> I'm aware that we could, eventually, periodically poll Cassandra, but in
>>>>> our use case the client is potentially interested in a large number of
>>>>> table clause notifications (think "all changes to ship positions on
>>>>> California's coastline"), and iterating out of the store would kill the
>>>>> streamer's scalability.
>>>>>
>>>>> Hence, the magic question: what are we missing? Is Cassandra the wrong
>>>>> tool for the job? Are we not aware of a particular part of the API or an
>>>>> external library in/outside the Apache realm that would allow for this?
>>>>>
>>>>> Many thanks for any assistance!
>>>>>
>>>>> Hugo
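For completeness, the Hazelcast continuous-query behaviour described above boils down to registering a predicate-filtered entry listener, roughly as in this sketch (map name, ShipPosition type and predicate are illustrative); this callback-on-matching-change semantic is what the thread is trying to reproduce on top of Cassandra:

```java
import com.hazelcast.core.EntryAdapter;
import com.hazelcast.core.EntryEvent;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.query.SqlPredicate;

public class ShipPositionStream {

    // Illustrative value type; the real model lives in the application.
    public static class ShipPosition implements java.io.Serializable {
        public String shipId;
        public String region;
        public double lat;
        public double lon;
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, ShipPosition> ships = hz.getMap("ship-positions");

        // Continuous-query style registration: only entries matching the
        // predicate trigger callbacks, which then feed the AJAX streamer.
        ships.addEntryListener(new EntryAdapter<String, ShipPosition>() {
            @Override
            public void entryAdded(EntryEvent<String, ShipPosition> e)   { push(e.getValue()); }
            @Override
            public void entryUpdated(EntryEvent<String, ShipPosition> e) { push(e.getValue()); }
            @Override
            public void entryRemoved(EntryEvent<String, ShipPosition> e) { push(e.getOldValue()); }
        }, new SqlPredicate("region = 'california-coast'"), true);
    }

    private static void push(ShipPosition p) {
        // Placeholder for the AJAX/streaming push to connected clients.
        System.out.println("changed: " + p.shipId);
    }
}
```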