Let me rephrase that: I want to know the id of each component at runtime. Can I do that? For example, when I set a bolt with an id of "b1", I want to be able to retrieve that string at runtime from within that bolt, so I can do something like what you are doing with the Kafka partition offsets. But again, I cannot find a call in the API that retrieves it...

Andrew.

On Thu, Jul 17, 2014 at 4:21 AM, Tomas Mazukna <[email protected]> wrote:

> The Kafka client handles that; it is stored in ZooKeeper with the offset.
> I wrote a Kafka spout based on the Kafka groups consumer API. Kafka allows
> only one consumer per partition per group.
>
> On Wed, Jul 16, 2014 at 8:41 PM, Andrew Xor <[email protected]> wrote:
>
>> OK, but at runtime how do you set in the spout which Kafka partition to
>> subscribe to?
>>
>> Kindly yours,
>>
>> Andrew Grammenos
>>
>> -- PGP PKey --
>> <https://www.dropbox.com/s/2kcxe59zsi9nrdt/pgpsig.txt>
>> https://www.dropbox.com/s/ei2nqsen641daei/pgpsig.txt
>>
>> On Thu, Jul 17, 2014 at 3:30 AM, Tomas Mazukna <[email protected]> wrote:
>>
>>> So you want to define only one instance of the spout that reads the
>>> file. The number of bolts will depend only on how fast you need to
>>> process the data.
>>> I have a topology that has a spout with a parallelism of 40, connected
>>> to 40 partitions of a Kafka topic. It sends traffic to the first bolt,
>>> which has a parallelism of 320. The whole topology is split across 4
>>> workers; that makes 10 spout instances in each JVM, feeding 80 bolts.
>>> In my case I have grouping so tuples get routed to different physical
>>> machines.
>>>
>>> Tomas
>>>
>>> On Wed, Jul 16, 2014 at 8:10 PM, Andrew Xor <[email protected]> wrote:
>>>
>>>> Michael,
>>>>
>>>> Thanks for the response, but I think another problem arises: as I
>>>> just cooked up a small example, the increased number of workers only
>>>> spawns mirrors of the topology. This poses a problem for me because my
>>>> spout reads from a very big file, converts each line into a tuple and
>>>> feeds that into the topology. What I wanted to do in the first place is
>>>> to send each tuple produced to a different subscribed bolt each time
>>>> (using round robin or something similar) so that each one of them gets
>>>> 1/n (where n is the number of bolts) of the input stream. If I spawn 2
>>>> workers, both will read the same file and emit the same tuples, so both
>>>> topology workers will produce the same results.
>>>>
>>>> I wanted to avoid creating a spout that takes a file offset as an
>>>> input and wiring up a lot more stuff than I have to, so I was trying to
>>>> find a way to do what I described in an elegant and scalable fashion...
>>>> so far I have found nil.
>>>>
>>>> On Thu, Jul 17, 2014 at 2:57 AM, Michael Rose <[email protected]> wrote:
>>>>
>>>>> It doesn't say so, but if you have 4 workers, the 4 executors will be
>>>>> shared evenly over the 4 workers. Likewise, 16 will partition 4 each.
>>>>> The only case where a worker will not get a specific executor is when
>>>>> there are fewer executors than workers (e.g. 8 workers, 4 executors):
>>>>> 4 of the workers will receive an executor but the others will not.
>>>>>
>>>>> It sounds like for your case, shuffle+parallelism is more than
>>>>> sufficient.
>>>>>
>>>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>>>> [email protected]
>>>>>
>>>>> On Wed, Jul 16, 2014 at 5:53 PM, Andrew Xor <[email protected]> wrote:
>>>>>
>>>>>> Hey Stephen, Michael,
>>>>>>
>>>>>> Yeah, I feared as much... searching the docs and the API did not
>>>>>> surface any reliable and elegant way of doing that unless you had a
>>>>>> "RouterBolt". If setting the parallelism of a component is enough for
>>>>>> load balancing the processes across the different machines that are
>>>>>> part of the Storm cluster, then this would suffice for my use case.
>>>>>> However, here
>>>>>> <https://storm.incubator.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html>
>>>>>> the documentation says executors are threads, and it does not
>>>>>> explicitly say anywhere that threads are spawned across different
>>>>>> nodes of the cluster... I want to avoid the possibility of these
>>>>>> threads only spawning locally and not in a distributed fashion among
>>>>>> the cluster nodes.
>>>>>>
>>>>>> Andrew.
>>>>>>
>>>>>> On Thu, Jul 17, 2014 at 2:46 AM, Michael Rose <[email protected]> wrote:
>>>>>>
>>>>>>> Maybe we can help with your topology design if you let us know what
>>>>>>> you're doing that requires you to shuffle half of the whole stream
>>>>>>> output to each of the two different types of bolts.
>>>>>>>
>>>>>>> If bolt b1 and bolt b2 are both instances of ExampleBolt (and not
>>>>>>> two different types) as above, there's no point in doing this.
>>>>>>> Setting the parallelism will make sure that data is partitioned
>>>>>>> across machines (by default, setting parallelism sets tasks =
>>>>>>> executors = parallelism).
>>>>>>>
>>>>>>> Unfortunately, I don't know of any way to do this other than
>>>>>>> shuffling the output to a new bolt, e.g. bolt "b0", a 'RouterBolt',
>>>>>>> then having bolt b0 round-robin the received tuples between two
>>>>>>> streams, then having b1 and b2 shuffle over those streams instead.
>>>>>>>
>>>>>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>>>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>>>>>> [email protected]
>>>>>>>
>>>>>>> On Wed, Jul 16, 2014 at 5:40 PM, Andrew Xor <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Tomas,
>>>>>>>>
>>>>>>>> As I said in my previous mail, the grouping is for a bolt *task*,
>>>>>>>> not for the actual number of spawned bolts. For example, let's say
>>>>>>>> you have two bolts that have a parallelism hint of 2 and these two
>>>>>>>> bolts are wired to the same spout. If you set the bolts as such:
>>>>>>>>
>>>>>>>> tb.setBolt("b1", new ExampleBolt(), 2 /* p-hint */).shuffleGrouping("spout1");
>>>>>>>> tb.setBolt("b2", new ExampleBolt(), 2 /* p-hint */).shuffleGrouping("spout1");
>>>>>>>>
>>>>>>>> Then each of the tasks will receive half of the spout tuples, but
>>>>>>>> each actual spawned bolt will receive all of the tuples emitted
>>>>>>>> from the spout. This is more evident if you set up a counter in the
>>>>>>>> bolt counting how many tuples it has received and test this with no
>>>>>>>> parallelism hint, as such:
>>>>>>>>
>>>>>>>> tb.setBolt("b1", new ExampleBolt()).shuffleGrouping("spout1");
>>>>>>>> tb.setBolt("b2", new ExampleBolt()).shuffleGrouping("spout1");
>>>>>>>>
>>>>>>>> Now you will see that both bolts will receive all tuples emitted by
>>>>>>>> spout1.
>>>>>>>>
>>>>>>>> Hope this helps.
>>>>>>>>
>>>>>>>> Andrew.
>>>>>>>>
>>>>>>>> On Thu, Jul 17, 2014 at 2:33 AM, Tomas Mazukna <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Andrew,
>>>>>>>>>
>>>>>>>>> When you connect your bolt to your spout you specify the grouping.
>>>>>>>>> If you use shuffle grouping then any free bolt gets the tuple - in
>>>>>>>>> my experience even in lightly loaded topologies the distribution
>>>>>>>>> amongst bolts is pretty even. If you use all grouping then all
>>>>>>>>> bolts receive a copy of the tuple.
>>>>>>>>> Use shuffle grouping and each of your bolts will get about 1/3 of
>>>>>>>>> the workload.
>>>>>>>>>
>>>>>>>>> Tomas
>>>>>>>>>
>>>>>>>>> On Wed, Jul 16, 2014 at 7:05 PM, Andrew Xor <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am trying to distribute the spout output to its subscribed
>>>>>>>>>> bolts evenly; let's say that I have a spout that emits tuples and
>>>>>>>>>> three bolts that are subscribed to it. I want each of the three
>>>>>>>>>> bolts to receive 1/3 of the output (or emit a tuple to each one
>>>>>>>>>> of these bolts in turn). Unfortunately, as far as I understand,
>>>>>>>>>> all bolts will receive all of the emitted tuples of that
>>>>>>>>>> particular spout regardless of the grouping defined (as grouping,
>>>>>>>>>> from my understanding, is for bolt *tasks*, not actual bolts).
>>>>>>>>>>
>>>>>>>>>> I've searched a bit and I can't seem to find a way to accomplish
>>>>>>>>>> that... is there a way to do that or am I searching in vain?
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Tomas Mazukna
>>>>>>>>> 678-557-3834
>>>
>>> --
>>> Tomas Mazukna
>>> 678-557-3834
>
> --
> Tomas Mazukna
> 678-557-3834
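
The distinction debated in this thread (one bolt with a parallelism hint versus several separately declared bolts on the same spout) can be illustrated with a minimal plain-Java sketch. It is only a simulation under stated assumptions and does not use the Storm API: the class `GroupingSketch` and the method `deliver` are hypothetical names, and plain round robin stands in for shuffle grouping's random but even distribution across a component's tasks.

```java
import java.util.Arrays;

// Simulation only: GroupingSketch and deliver are hypothetical names,
// not part of Storm.
final class GroupingSketch {

    /**
     * Delivers `tuples` tuples to every subscribed component.
     * parallelism[c] is the number of tasks of component c.
     * Each component sees the full stream; within a component each
     * tuple goes to exactly one task (round robin stands in for shuffle).
     * Returns received[c][k] = tuples seen by task k of component c.
     */
    static int[][] deliver(int tuples, int[] parallelism) {
        int[][] received = new int[parallelism.length][];
        for (int c = 0; c < parallelism.length; c++) {
            received[c] = new int[parallelism[c]];
        }
        for (int t = 0; t < tuples; t++) {
            for (int c = 0; c < parallelism.length; c++) {
                received[c][t % parallelism[c]]++;
            }
        }
        return received;
    }

    public static void main(String[] args) {
        // One bolt declared with a parallelism hint of 3.
        System.out.println(Arrays.deepToString(deliver(300, new int[] {3})));
        // Two bolts ("b1", "b2"), each declared with a parallelism hint of 2.
        System.out.println(Arrays.deepToString(deliver(300, new int[] {2, 2})));
    }
}
```

Running `main` prints `[[100, 100, 100]]` and then `[[150, 150], [150, 150]]`: a single component with three tasks splits the stream three ways, while two separately declared components each receive all 300 tuples, split across their own tasks.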
