RE: Strom research suggestions

Michael Oczkowski Thu, 09 Jan 2014 08:56:40 -0800

+1 for this idea.  I heard DataStax was investigating Storm integration (like 
they do with Hadoop), but as far as I know this isn't going to happen.  
Push-down analytics is a widely felt need and a very general problem, so any 
nice solution would help many people!

Also, to Brian's point, it would be great to use Storm in lieu of Hadoop if 
it's performant.

From: [email protected] [mailto:[email protected]] On Behalf Of Adam Lewis
Sent: Thursday, January 9, 2014 9:11 AM
To: user
Subject: Re: Strom research suggestions

I love it; even if it is a premature optimization, the beauty of academic work 
is that this should be measurable and is still an interesting finding either 
way.  I don't have the large-scale production experience with Storm that others 
here have (yet), but it sounds like it would really help performance since 
you're going after network transfer.  And as you say, Svend, all the 
ingredients are already built into Trident.

Adam

On Thu, Jan 9, 2014 at 10:56 AM, Brian O'Neill 
<[email protected]<mailto:[email protected]>> wrote:

+1, love the idea.  I've wanted to play with partitioning alignment myself 
(with C*), but I've been too busy with the day job. =)

Tobias, if you need some support - don't hesitate to reach out.

If you are able to align the partitioning, and we can add "in-place" 
computation within Storm, it would be great to see a speed comparison between 
Hadoop and Storm.  (If comparable, it may drive people to abandon their Hadoop 
infrastructure for batch processing and run everything on Storm.)

-brian

---
Brian O'Neill
Chief Architect
Health Market Science
The Science of Better Results
2700 Horizon Drive * King of Prussia, PA * 19406
M: 215.588.6024<tel:215.588.6024> * 
@boneill42<http://www.twitter.com/boneill42>  *
healthmarketscience.com

From: Svend Vanderveken 
<[email protected]<mailto:[email protected]>>
Reply-To: 
<[email protected]<mailto:[email protected]>>
Date: Thursday, January 9, 2014 at 10:46 AM
To: <[email protected]<mailto:[email protected]>>
Subject: Re: Strom research suggestions

Hey Tobias,

Nice project, I would have loved to play with something like Storm back in my 
university days :)

Here's a topic that's been on my mind for a while (Trident API of Storm):

* one core idea of distributed map reduce à la Hadoop was to perform as much 
processing as possible close to the data: you execute the "map" locally on each 
node where the data sits, you do a first reduce there, then you let the result 
travel through the network, you do one last reduce centrally and you have a 
result without having all your DB travel the network every time

* Storm groupBy + persistentAggregate + reducer/combiner give us similar 
semantics, where we map incoming tuples, reduce them with other tuples in the 
same group + with the previously reduced value stored in DB at regular intervals

* for each group, the operation above always happens on the same Storm Task 
(i.e. the same "place" in the cluster) and stores its ongoing state in the 
"same place" in DB, using the group value as primary key
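
A minimal sketch of the three points above, in plain Python rather than the 
Trident API: each node pre-aggregates its own tuples, and only the small 
partial result is merged into the persisted state, keyed by group value. The 
function names, the sample words and the dict standing in for the DB are all 
assumptions made for illustration.

```python
from collections import defaultdict


def combine(a, b):
    """Associative, commutative merge of two partial counts (a combiner)."""
    return a + b


def local_reduce(tuples):
    """Map each tuple to (key, 1) and pre-aggregate where the data sits."""
    partial = defaultdict(int)
    for key in tuples:
        partial[key] = combine(partial[key], 1)
    return dict(partial)


def merge_into_store(store, partial):
    """Fold one partial result into the persisted state, keyed by group."""
    for key, value in partial.items():
        store[key] = combine(store.get(key, 0), value)
    return store


# Hypothetical data: words observed on two different nodes.
node_a = ["storm", "hadoop", "storm"]
node_b = ["storm", "cassandra"]

store = {}  # stands in for the DB holding previously reduced values
for batch in (node_a, node_b):
    merge_into_store(store, local_reduce(batch))

print(store)  # {'storm': 3, 'hadoop': 1, 'cassandra': 1}
```

Only the partial counts cross the "network" here (the call to 
merge_into_store), which is the saving the first point describes.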

I believe it might be worth investigating if the following pattern would make 
sense:

* install a distributed state store (e.g. Cassandra) on the same nodes as the 
Storm workers

* try to align the Storm partitioning triggered by the groupBy with Cassandra 
partitioning, so that under usual happy circumstances (no crash), the Storm 
reduction is happening on the node where Cassandra is storing that particular 
primary key, avoiding the network travel for the persistence.
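
One way to make the alignment idea measurable is a toy model like the one 
below. It is only a sketch under loud assumptions: the node names are 
hypothetical, both placement functions are simple hash-modulo stand-ins, and 
neither is the real Storm grouping or the real Cassandra partitioner. In an 
actual cluster, placement also depends on Storm's scheduler and on Cassandra's 
token ring and replication, which this model ignores.

```python
import hashlib

# Hypothetical hosts, each running a Storm worker and a store node.
NODES = ["node-0", "node-1", "node-2"]


def stable_hash(key):
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def storm_node(key):
    """Toy grouping: hash of the group key modulo the number of nodes."""
    return NODES[stable_hash(key) % len(NODES)]


def store_node_unaligned(key):
    """Toy store placement using a different, unrelated hash of the key."""
    return NODES[stable_hash("salt:" + key) % len(NODES)]


def store_node_aligned(key):
    """Aligned placement: the store uses the same function as the grouping."""
    return storm_node(key)


def local_fraction(keys, store_node):
    """Share of groups whose reduction and persisted state share a node."""
    hits = sum(1 for k in keys if storm_node(k) == store_node(k))
    return hits / len(keys)


keys = [f"group-{i}" for i in range(1000)]
print("aligned:  ", local_fraction(keys, store_node_aligned))
print("unaligned:", local_fraction(keys, store_node_unaligned))
```

With aligned placement every write stays local by construction; with two 
independent hashes over three nodes, roughly one write in three would be 
expected to be local. The gap between the two numbers is the network travel 
the pattern tries to remove.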

What do you think? Premature optimization? Does not make sense? Great idea? Let 
me know :)

S

On Thu, Jan 9, 2014 at 3:00 PM, Tobias Pazer 
<[email protected]<mailto:[email protected]>> wrote:

Hi all,

I have recently started writing my master's thesis with a focus on Storm, as we 
are planning to implement the lambda architecture at our university.

As it's still not very clear to me where exactly it's worth diving in, I 
was hoping one of you might have some suggestions.

I was thinking about a benchmark or something else to systematically evaluate 
and improve the configuration of Storm, but I'm not sure if this is even worth 
the time.

I'm sure the more experienced among you have further ideas!

Thanks and regards
Tobias
