This is exactly what I was looking for, as I am reading a lot about Hadoop at the same time. I haven't got any experience with partitioning alignment so far, so I would appreciate any suggestions on how to approach this topic efficiently. But this shouldn't be a problem, as I still have until October...
Now I just have to convince my academic advisor. Thanks so far! I think this topic is definitely worth looking into.

2014/1/9 Michael Oczkowski <[email protected]>

> +1 for this idea. I heard DataStax was investigating Storm integration
> (like they do with Hadoop), but as far as I know this isn't going to
> happen. The need for push-down analytics is great and a very general
> problem, and any nice solution would help many people!
>
> Also, to Brian's point, it would be great to use Storm in lieu of Hadoop
> if it's performant.
>
> From: [email protected] [mailto:[email protected]] On Behalf Of Adam Lewis
> Sent: Thursday, January 9, 2014 9:11 AM
> To: user
> Subject: Re: Storm research suggestions
>
> I love it; even if it is a premature optimization, the beauty of academic
> work is that this should be measurable and is still an interesting finding
> either way. I don't have the large-scale production experience with Storm
> that others here have (yet), but it sounds like it would really help
> performance, since you're going after network transfer. And as you say,
> Svend, all the ingredients are already built into Trident.
>
> Adam
>
> On Thu, Jan 9, 2014 at 10:56 AM, Brian O'Neill <[email protected]> wrote:
>
> +1, love the idea. I've wanted to play with partitioning alignment myself
> (with C*), but I've been too busy with the day job. =)
>
> Tobias, if you need some support, don't hesitate to reach out.
>
> If you are able to align the partitioning, and we can add "in-place"
> computation within Storm, it would be great to see a speed comparison
> between Hadoop and Storm. (If comparable, it may drive people to abandon
> their Hadoop infrastructure for batch processing and run everything on
> Storm.)
>
> -brian
>
> ---
> Brian O'Neill
> Chief Architect
> Health Market Science
> The Science of Better Results
> 2700 Horizon Drive • King of Prussia, PA • 19406
> M: 215.588.6024 • @boneill42 <http://www.twitter.com/boneill42> •
> healthmarketscience.com
> From: Svend Vanderveken <[email protected]>
> Reply-To: <[email protected]>
> Date: Thursday, January 9, 2014 at 10:46 AM
> To: <[email protected]>
> Subject: Re: Storm research suggestions
>
> Hey Tobias,
>
> Nice project, I would have loved to play with something like Storm back in
> my university days :)
>
> Here's a topic that's been on my mind for a while (Trident API of Storm):
>
> * One core idea of distributed map-reduce à la Hadoop was to perform as
> much processing as possible close to the data: you execute the "map"
> locally on each node where the data sits, you do a first reduce there,
> then you let the result travel through the network, and you do one last
> reduce centrally, so you get a result without having your whole dataset
> travel the network every time.
>
> * Storm's groupBy + persistentAggregate + reducer/combiner gives us
> similar semantics: we map incoming tuples, then reduce them with the other
> tuples in the same group and with the previously reduced value stored in
> the DB at regular intervals.
>
> * For each group, the operation above always happens on the same Storm
> task (i.e. the same "place" in the cluster) and stores its ongoing state
> in the same "place" in the DB, using the group value as primary key.
>
> I believe it might be worth investigating whether the following pattern
> would make sense:
>
> * Install a distributed state store (e.g. Cassandra) on the same nodes as
> the Storm workers.
>
> * Try to align the Storm partitioning triggered by the groupBy with the
> Cassandra partitioning, so that under usual happy circumstances (no crash)
> the Storm reduction happens on the node where Cassandra is storing that
> particular primary key, avoiding the network round trip for persistence.
>
> What do you think? Premature optimization? Does not make sense? Great
> idea? Let me know :)
>
> S
>
> On Thu, Jan 9, 2014 at 3:00 PM, Tobias Pazer <[email protected]> wrote:
>
> Hi all,
>
> I have recently started writing my master thesis with a focus on Storm,
> as we are planning to implement the lambda architecture at our university.
>
> As it's still not very clear to me where exactly it's worth diving in, I
> was hoping some of you might have suggestions.
>
> I was thinking about a benchmark or something else to systematically
> evaluate and improve the configuration of Storm, but I'm not sure if this
> is even worth the time.
>
> I think the more experienced of you definitely have further ideas!
>
> Thanks and regards
> Tobias
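For reference, the Trident pattern Svend describes boils down to the shape of the canonical word count below. This is only a minimal sketch: the spout, field names and the in-memory MemoryMapState are illustrative placeholders, with MemoryMapState standing in for whatever Cassandra-backed StateFactory one would actually plug in.

```java
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.FixedBatchSpout;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;

public class GroupedPersistentAggregate {

    /** "map" step: split each sentence into words. */
    public static class Split extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            for (String word : tuple.getString(0).split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

    public static TridentTopology build() {
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                new Values("the cow jumped over the moon"),
                new Values("the man went to the store"));
        spout.setCycle(true);

        TridentTopology topology = new TridentTopology();
        topology.newStream("sentences", spout)
                // "map" close to the source
                .each(new Fields("sentence"), new Split(), new Fields("word"))
                // groupBy repartitions the stream: all tuples for a given word
                // end up on the same task...
                .groupBy(new Fields("word"))
                // ...where they are combined with the previously persisted
                // count for that key. Swapping MemoryMapState.Factory for a
                // Cassandra-backed StateFactory is where partition alignment
                // would pay off.
                .persistentAggregate(new MemoryMapState.Factory(),
                        new Count(), new Fields("count"));
        return topology;
    }
}
```

The point of the proposal is that today the hash partitioning done by groupBy and the partitioning of the backing store are independent, so the reads and writes behind persistentAggregate usually cross the network even though the per-key aggregation itself already stays on one task.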
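The alignment itself is the open question. One possible first experiment (outside Trident, since groupBy imposes its own hash partitioning) would be a plain-Storm CustomStreamGrouping that asks the Cassandra driver which nodes own a key and routes the tuple to a downstream task on one of those nodes. The sketch below uses the DataStax Java driver's replica lookup; the taskToHost map is purely hypothetical, since obtaining and refreshing a task-to-host mapping is exactly the hard part Storm does not hand to a grouping, and the key serialization assumes a single text partition key.

```java
import backtype.storm.generated.GlobalStreamId;
import backtype.storm.grouping.CustomStreamGrouping;
import backtype.storm.task.WorkerTopologyContext;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import com.datastax.driver.core.Metadata;

import java.net.InetAddress;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/**
 * Sketch of a grouping that sends each tuple to a Storm task running on a
 * Cassandra replica owning the tuple's key. Hypothetical pieces: the
 * taskToHost map and the assumption that the first grouping field is the
 * partition key.
 */
public class ReplicaAwareGrouping implements CustomStreamGrouping {

    private final String keyspace;
    private final String contactPoint;
    private final Map<Integer, InetAddress> taskToHost; // assumed to be provided
    private transient Metadata metadata;
    private List<Integer> targetTasks;

    public ReplicaAwareGrouping(String keyspace, String contactPoint,
                                Map<Integer, InetAddress> taskToHost) {
        this.keyspace = keyspace;
        this.contactPoint = contactPoint;
        this.taskToHost = taskToHost;
    }

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream,
                        List<Integer> targetTasks) {
        this.targetTasks = targetTasks;
        // Connect once per worker to learn the cluster's token/replica layout.
        this.metadata = Cluster.builder()
                .addContactPoint(contactPoint)
                .build()
                .getMetadata();
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        ByteBuffer key = ByteBuffer.wrap(
                values.get(0).toString().getBytes(StandardCharsets.UTF_8));

        // Ask the driver which nodes own this partition key.
        for (Host replica : metadata.getReplicas(keyspace, key)) {
            for (Integer task : targetTasks) {
                if (replica.getAddress().equals(taskToHost.get(task))) {
                    // Happy path: a downstream task is co-located with a replica.
                    return Collections.singletonList(task);
                }
            }
        }
        // Fallback (rebalance, crash, no co-located task): plain hashing.
        int idx = Math.abs(values.get(0).hashCode()) % targetTasks.size();
        return Collections.singletonList(targetTasks.get(idx));
    }
}
```

Wiring it in would go through the usual TopologyBuilder call, e.g. declarer.customGrouping("map", new ReplicaAwareGrouping(...)). Measuring how often the happy path is actually hit, versus the fallback after rebalances or node failures, would itself be an interesting result for the thesis.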
