I think I found the reason why I had this problem. As John suggested, I
mistakenly passed every input tuple to all bolts, which is incorrect:
when you increase the number of different bolts, the spouts have to do
more work and send tuples to more bolts. It also had something to do
with what Enno said about generating more messages (to be more precise,
I duplicated them). Now, when I increase the number of spouts/bolts, the
system scales and the test finishes faster than with fewer workers.
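To make the mistake concrete, here is a plain-Java counting sketch (no Storm dependency; the class and method names are invented for illustration). Fanning each tuple out to all bolts multiplies the spout's transfer work by the bolt count, while a shuffle-style wiring keeps it constant:

```java
public class FanOutDemo {
    // With an all-bolts fan-out, every emitted tuple is transferred
    // once per downstream bolt, so spout transfer work grows
    // linearly with the number of bolts.
    static long transferredAllGrouping(long emitted, int numBolts) {
        return emitted * numBolts;
    }

    // With shuffle grouping, each tuple goes to exactly one bolt
    // instance, so transferred == emitted regardless of bolt count.
    static long transferredShuffle(long emitted, int numBolts) {
        return emitted;
    }

    public static void main(String[] args) {
        long emitted = 1_000_000L;
        System.out.println(transferredAllGrouping(emitted, 4)); // 4000000
        System.out.println(transferredShuffle(emitted, 4));     // 1000000
    }
}
```

This matches the symptom John spotted in the UI: transferred being 4x emitted with 4 bolts.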
Thank you all for your help.
Dimitris
On 26/07/2015 10:44 PM, John Reilly wrote:
Could you share the code you are using to create the StormTopology?
From the Storm UI screenshots above, it looks like every tuple emitted
from the spout tasks is sent to 4 different bolt instances (for the
spouts, the transferred count is 4x the emitted count). I'm guessing
that each spout instance is emitting the tuple to each of worker1
through worker4. Is this what you expect the code to do? Also, from the
UI screenshot, worker1 and worker3 are > 90% capacity.
I'm not sure I understand your goal. If it is just to push as many
tuples as possible through the topology without any particular field
grouping, etc., you could increase throughput by making sure components
are co-located on the same worker. By ensuring the parallelism of the
spout and of each bolt is the same as the number of workers (assuming
EvenScheduler), you would end up with one instance of each component on
each worker. If you use localOrShuffle grouping, you will get rid of a
lot of overhead and should be able to increase throughput. It really
depends on what you are trying to achieve with your topology, though.
Obviously this will not work if you need to use a field grouping.
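The idea behind localOrShuffle can be sketched in plain Java (no Storm dependency; the task-id lists are invented for illustration). The grouping prefers a downstream task in the same worker process, which avoids serialization and network transfer, and only falls back to a random pick across all tasks when no local one exists:

```java
import java.util.List;
import java.util.Random;

public class LocalOrShuffleSketch {
    private final Random rnd = new Random();

    // Pick a target task id: prefer tasks co-located in this worker
    // process; otherwise shuffle across all downstream tasks.
    public int chooseTask(List<Integer> localTasks, List<Integer> allTasks) {
        List<Integer> pool = localTasks.isEmpty() ? allTasks : localTasks;
        return pool.get(rnd.nextInt(pool.size()));
    }

    public static void main(String[] args) {
        LocalOrShuffleSketch s = new LocalOrShuffleSketch();
        // Task 7 lives in this worker, so it always wins.
        System.out.println(s.chooseTask(List.of(7), List.of(5, 6, 7))); // 7
    }
}
```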
On Sun, Jul 26, 2015 at 12:04 PM Dimitris Sarlis <[email protected]
<mailto:[email protected]>> wrote:
No, I have a config parameter which changes how many random numbers are
generated by the bolt's execute method, to simulate a heavier task. The
total number of messages is controlled by another parameter, which I
keep at the same value across my experiments.
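A minimal plain-Java sketch of that setup (method and parameter names are invented for illustration): the per-message cost and the total message count are two independent knobs, so total work is their product and the message count stays fixed across experiments:

```java
import java.util.Random;

public class WorkloadSim {
    // Models one execute() call: generate `numbersPerMessage`
    // random numbers to simulate a heavier or lighter task.
    public static void simulateWork(int numbersPerMessage, Random rnd) {
        for (int i = 0; i < numbersPerMessage; i++) {
            rnd.nextDouble();
        }
    }

    // Total work = fixed message count x per-message cost knob.
    public static long totalOperations(long totalMessages, int numbersPerMessage) {
        return totalMessages * numbersPerMessage;
    }

    public static void main(String[] args) {
        System.out.println(totalOperations(1000, 50)); // 50000
    }
}
```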
On 26/07/2015 07:09 PM, Enno Shioji wrote:
I mean, could it be that by mistake you are generating more messages as
you change the config, so the total test time just makes it appear as
if there is no improvement?
On Sun, Jul 26, 2015 at 5:08 PM, Enno Shioji <[email protected]> wrote:
This may be a silly guess, but you are not simply generating
proportionally more messages as you change the config, right?
On Sun, Jul 26, 2015 at 4:53 PM, Dimitris Sarlis <[email protected]> wrote:
Kashyap,
I put loggers before and after emit in each bolt. In the spouts it's
not so easy, because I'm using the predefined class KafkaSpout. See the
attached images from a test execution. I used 1 spout with parallelism
8 and 4 bolts with parallelism 2. I also include a screenshot from a
bolt's log where you can see messages like "Sending record mpla mpla"
and "After emit". These messages are written before and after each emit
in a bolt.
Dimitris
On 26/07/2015 06:24 PM, Kashyap Mhaisekar wrote:
Can you put loggers before and after emit() in each bolt/spout?
Can you share Storm UI screenshots?
Thanks
Kashyap
On Sun, Jul 26, 2015, 10:08 Dimitris Sarlis <[email protected]> wrote:
Hi Harsha,
1. The number of topic partitions is set each time to the total number
of spouts I'm using.
2. I have checked that data from the Kafka producer are distributed
across all of these partitions.
3. I've tried from 4 to 20.
4. 1000.
5. This topology is just for some testing. Spouts get data from Kafka
and then dispatch them to bolts. There, if the record has not been
processed before, each bolt generates some random numbers and then
selects another bolt to send the record to, appended with a "!". If the
record has been processed before (it has a "!" at the end), it just
generates some random numbers.
6. No.
7. No.
Dimitris
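The bolt behavior in point 5 can be sketched as plain Java (no Storm types; forwarding is modeled as the return value, which is an invention for illustration, and the class name is made up):

```java
import java.util.Random;

public class RecordBolt {
    private final Random rnd = new Random();

    // Models one execute() call. An unprocessed record is marked with
    // "!" and returned for forwarding to another bolt; an
    // already-processed record (ending in "!") is consumed here after
    // the simulated work, so null is returned.
    public String process(String record, int numbersToGenerate) {
        for (int i = 0; i < numbersToGenerate; i++) {
            rnd.nextDouble(); // simulated work
        }
        if (record.endsWith("!")) {
            return null; // already processed: stop here
        }
        return record + "!"; // forward the marked record
    }

    public static void main(String[] args) {
        RecordBolt bolt = new RecordBolt();
        System.out.println(bolt.process("record", 10)); // record!
        System.out.println(bolt.process("record!", 10)); // null
    }
}
```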
On 26/07/2015 05:52 PM, Harsha wrote:
> Hi Dimitris,
>
> 1. How many topic partitions do you have?
> 2. Make sure you are distributing data from the Kafka producer side
> into all of these partitions.
> 3. What's your KafkaSpout parallelism set to?
> 4. What's your topology.max.spout.pending set to?
> 5. If you can, briefly describe what the topology is doing.
> 6. Are you seeing anything under the failed column in Storm UI?
> 7. Any errors in the Storm topology logs?
>
> Thanks,
> Harsha
>
>
> On Sat, Jul 25, 2015, at 05:29 AM, Dimitris Sarlis wrote:
>> Hi all,
>>
>> I'm trying to run a topology in Storm and I am facing some
>> scalability issues. Specifically, I have a topology where KafkaSpouts
>> read from a Kafka queue and emit messages to bolts which are
>> connected with each other through directGrouping (each bolt is
>> connected with itself as well as with each one of the other bolts).
>> Spouts subscribe to bolts with shuffleGrouping. I observe that when I
>> increase the number of spouts and bolts proportionally, I don't get
>> the speedup I'm expecting. In fact, my topology seems to run slower,
>> and for the same amount of data it takes more time to complete. For
>> example, when I increase spouts from 4->8 and bolts from 4->8, it
>> takes longer to process the same amount of Kafka messages.
>>
>> Any ideas why this is happening? Thanks in advance.
>>
>> Best,
>> Dimitris Sarlis