Hi Guys,
I am new to Storm and am looking at using it on a project for processing
NetFlow data. During my initial experimentation with topologies I have
encountered some behaviour that I am unsure of.
My topology is as follows:

          Spout
            |
       SplitterBolt
        |        |
      BoltA    BoltB
        |        |
        |      BoltC
        |        |
        |      BoltD
        |        |
        OutputBolt
Each tuple carries two kinds of Java object: a parent object and a
collection of child objects, which the parent holds in a List attribute
called Scores.
The SplitterBolt sends each tuple down both paths. Each of the bolts A-D
in the topology tests some attribute in the parent object and then adds
a relevant entry to the Scores list to reflect the outcome of the test.
This is not my final design, but it does reflect the route I will be
following, and I will be adding more paths as I proceed.
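
To make the per-bolt step concrete, here is a minimal sketch in plain
Java with no Storm classes involved. Every name in it (ParentRecord,
Score, ScoreStep, the dstPort attribute and the port test itself) is a
hypothetical stand-in for illustration, not the real classes from my
project.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical child object: one entry per test performed by a bolt.
class Score {
    final String source;   // name of the bolt that produced the entry
    final boolean passed;  // outcome of that bolt's test

    Score(String source, boolean passed) {
        this.source = source;
        this.passed = passed;
    }
}

// Hypothetical parent object carried in the tuple.
class ParentRecord {
    final int dstPort;                             // assumed attribute under test
    final List<Score> scores = new ArrayList<>();  // the "Scores" list

    ParentRecord(int dstPort) {
        this.dstPort = dstPort;
    }
}

class ScoreStep {
    // What a bolt such as BoltA might do: test one attribute of the
    // parent and append the outcome to the parent's list in place.
    static void scorePort(String boltName, ParentRecord record) {
        boolean wellKnownPort = record.dstPort < 1024;
        record.scores.add(new Score(boltName, wellKnownPort));
    }

    public static void main(String[] args) {
        ParentRecord record = new ParentRecord(443);
        scorePort("BoltA", record);
        System.out.println("entries in Scores: " + record.scores.size());
    }
}
```

The point to note is that the step mutates the parent object it was
handed rather than building a new one.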
When I run the above topology with a single worker, I see two instances
of each tuple arrive at the OutputBolt a few milliseconds apart. In each
case the Scores list holds exactly the same values, containing scores
from both paths.
If I change the number of workers in the topology to 2, I see a
different outcome. Two instances of each tuple still arrive at the
OutputBolt, but this time the entries in the Scores list differ: a list
contains either scores from only one of the paths or scores from both.
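
The two outcomes can be reproduced outside Storm in a minimal plain-Java
sketch. It rests on assumptions: SharedParent and SharedVsCopied are
hypothetical names, everything runs on one thread, and a round trip
through standard Java serialization stands in for whatever copying
happens when a tuple crosses from one worker to another. It only covers
the two extremes (everything shared, or copied right at the split), not
the mixed results.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the parent object carried in each tuple.
class SharedParent implements Serializable {
    private static final long serialVersionUID = 1L;
    final List<String> scores = new ArrayList<>();
}

class SharedVsCopied {
    // Round trip through Java serialization, yielding an independent copy.
    static SharedParent copy(SharedParent original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SharedParent) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("copy failed", e);
        }
    }

    // Path A adds one entry; path B adds entries for BoltB, BoltC and BoltD.
    private static List<List<String>> runPaths(SharedParent pathA, SharedParent pathB) {
        pathA.scores.add("A");
        pathB.scores.addAll(Arrays.asList("B", "C", "D"));
        return Arrays.asList(pathA.scores, pathB.scores);
    }

    // Both paths hold the same reference: each sees the other's entries.
    static List<List<String>> sameReference() {
        SharedParent parent = new SharedParent();
        return runPaths(parent, parent);
    }

    // Each path holds its own copy: entries stay separate.
    static List<List<String>> separateCopies() {
        SharedParent parent = new SharedParent();
        return runPaths(copy(parent), copy(parent));
    }

    public static void main(String[] args) {
        System.out.println("same reference:  " + sameReference());
        System.out.println("separate copies: " + separateCopies());
    }
}
```

With a shared reference, both lists end up with the entries from both
paths, which matches what I see with one worker. With separate copies,
each list holds only the entries of its own path.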
My questions are:
1. In the first case (single worker) it appears that the tuples sent
down the two different paths are the same object in memory, even though
two versions of the tuple move through the topology. Is this correct?
2. In the second case (two workers) I am guessing that when tuples move
between workers and are updated, they are indeed different objects
(they are, after all, in different JVMs as I understand it). Is this
also correct?
Any insight would be appreciated.
Regards
M