I'm currently using Redis, but I'm by no means tied to it. Are there any examples of using either of those to do that?
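
For reference, here is roughly the shape I have in mind, based on the custom map-state pattern from the Trident docs: an IBackingMap over Redis wrapped in a TransactionalMap, so persistentAggregate can let the store do the counting with exactly-once semantics, as you describe. This is only a sketch; the Jedis calls, the key scheme, and the RedisCounterState name are my own guesses, not working code:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import redis.clients.jedis.Jedis;

import backtype.storm.task.IMetricsContext;
import storm.trident.state.JSONTransactionalSerializer;
import storm.trident.state.Serializer;
import storm.trident.state.State;
import storm.trident.state.StateFactory;
import storm.trident.state.TransactionalValue;
import storm.trident.state.map.IBackingMap;
import storm.trident.state.map.TransactionalMap;

// Sketch of a Redis-backed Trident map state. Wrapping it in TransactionalMap
// stores (txid, count) pairs, so replayed batches don't double-count.
public class RedisCounterState implements IBackingMap<TransactionalValue> {

    private final Jedis jedis;
    private final Serializer<TransactionalValue> ser = new JSONTransactionalSerializer();

    public RedisCounterState(String host, int port) {
        this.jedis = new Jedis(host, port);
    }

    @Override
    public List<TransactionalValue> multiGet(List<List<Object>> keys) {
        // One GET per grouped key; a pipelined MGET would be better, this is just illustrative.
        List<TransactionalValue> result = new ArrayList<TransactionalValue>(keys.size());
        for (List<Object> key : keys) {
            byte[] raw = jedis.get(redisKey(key).getBytes());
            result.add(raw == null ? null : ser.deserialize(raw));
        }
        return result;
    }

    @Override
    public void multiPut(List<List<Object>> keys, List<TransactionalValue> vals) {
        for (int i = 0; i < keys.size(); i++) {
            jedis.set(redisKey(keys.get(i)).getBytes(), ser.serialize(vals.get(i)));
        }
    }

    // Hypothetical key scheme: one Redis key per grouped value, e.g. "count:someuser".
    private String redisKey(List<Object> key) {
        return "count:" + key.get(0);
    }

    // Factory handed to persistentAggregate(); Trident calls makeState once per partition.
    public static class Factory implements StateFactory {
        @Override
        public State makeState(Map conf, IMetricsContext metrics, int partitionIndex, int numPartitions) {
            return TransactionalMap.build(new RedisCounterState("localhost", 6379));
        }
    }
}

If something like this (or an existing Redis/HBase state implementation) is what you meant, a pointer to a worked example would be great.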
Andrew

On Tue, Jun 17, 2014 at 3:51 PM, P. Taylor Goetz <[email protected]> wrote:

> Andrew/Adam,
>
> Partitioning operations like groupBy() form the bolt boundaries in Trident topologies, so the more you have, the more bolts you will have and thus, potentially, more network transfer.
>
> What backing store are you using for persistence? If you are using something with counter support, like HBase or Cassandra, you could leverage that in combination with Trident's exactly-once semantics to let it handle the counting, and potentially greatly reduce the complexity of your topology.
>
> -Taylor
>
> On Jun 17, 2014, at 5:15 PM, Adam Lewis <[email protected]> wrote:
>
> I, too, am eagerly awaiting a reply from the list on this topic. I hit up against max topology size limits doing something similar with Trident. There are definitely "linear" changes to a Trident topology that result in quadratic growth of the "compiled" Storm topology size, such as adding DRPC spouts. Sadly, the compilation process from Trident to plain Storm remains somewhat opaque to me and I haven't had time to dig deeper. My workaround has been to limit myself to one DRPC spout per topology and programmatically build multiple topologies for the variations (which results in a lot of structural and functional duplication of deployed topologies, but at least no code duplication).
>
> Trident presents a seemingly nice abstraction, but from my point of view it is a leaky one if I need to understand the compilation process to know why adding a single DRPC spout doubles the topology size.
>
>
> On Tue, Jun 17, 2014 at 4:33 PM, Andrew Serff <[email protected]> wrote:
>
>> Is there no one out there who can help with this? If I use this paradigm, my topology ends up having something like 170 bolts. Then I add DRPC streams and I have something like 50 spouts. All of this adds up to a topology that I can't even submit because it's too large (and I've bumped the Trident max to 50 MB already...). It seems like I'm thinking about this wrong, but I haven't been able to come up with another way to do it. I don't really see how using vanilla Storm would help; maybe someone can offer some guidance?
>>
>> Thanks
>> Andrew
>>
>>
>> On Mon, Jun 9, 2014 at 11:45 AM, Andrew Serff <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I'm new to using Trident and had a few questions about the best way to do things in this framework. I'm trying to build a real-time streaming aggregation system, and Trident seems to have a very easy framework to let me do that. I have a basic setup working, but as I add more counters, performance becomes very slow and eventually I start seeing many failures. At a basic level, here is what I want to do:
>>>
>>> Have an incoming stream that uses a KafkaSpout to read data.
>>> Take the Kafka stream, parse it, and output multiple fields.
>>> Maintain many different counters for those fields in the data.
>>>
>>> For example, say it was the Twitter stream. I may want to count:
>>> - A counter for each username I come across, i.e. how many times I have received a tweet from each user
>>> - A counter for each hashtag, so you know how many tweets mention that hashtag
>>> - Binned counters based on date for each tweet (i.e. how many tweets in 2014, June 2014, June 08 2014, etc.)
>>>
>>> The list could continue, but this can add up to hundreds of counters running in real time.
>>> Right now I have something like the following:
>>>
>>> TridentTopology topology = new TridentTopology();
>>> KafkaSpout spout = new KafkaSpout(kafkaConfig);
>>> Stream stream = topology.newStream("messages", spout).shuffle()
>>>         .each(new Fields("str"), new FieldEmitter(), new Fields("username", "hashtag"));
>>>
>>> stream.groupBy(new Fields("username"))
>>>       .persistentAggregate(stateFactory, new Fields("username"), new Count(), new Fields("count"))
>>>       .parallelismHint(6);
>>> stream.groupBy(new Fields("hashtag"))
>>>       .persistentAggregate(stateFactory, new Fields("hashtag"), new Count(), new Fields("count"))
>>>       .parallelismHint(6);
>>>
>>> Repeat something similar for everything you want a unique count for.
>>>
>>> I end up having hundreds of groupBy calls, each with its own aggregator. I have so far only run this on my local machine and not on a cluster yet, but I'm wondering if this is the correct design for something like this, or if there is a better way to distribute this within Trident to make it more efficient.
>>>
>>> Any suggestions would be appreciated!
>>> Thanks!
>>> Andrew
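
To make the question concrete, this is how I would expect a factory like the sketch at the top of this message to slot into the topology quoted above, plus a single DRPC stream for reads. Again, just a sketch: RedisCounterState.Factory is the hypothetical class from my note, and topology/stream are the variables from the quoted code.

// Continuing from the topology quoted above; persistentAggregate returns a
// TridentState handle, and the backing store does the actual counting.
TridentState usernameCounts =
        stream.groupBy(new Fields("username"))
              .persistentAggregate(new RedisCounterState.Factory(), new Count(), new Fields("count"))
              .parallelismHint(6);

// A DRPC stream to read counts back out; "username-count" is just an example name.
topology.newDRPCStream("username-count")
        .groupBy(new Fields("args"))
        .stateQuery(usernameCounts, new Fields("args"), new MapGet(), new Fields("count"))
        .each(new Fields("count"), new FilterNull());

If HBase or Cassandra makes this easier (for example by using native counter increments instead of read-modify-write), I'm happy to switch; Redis is just what we have deployed today.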
