Roshan Naik created STORM-1949:
----------------------------------

             Summary: Storm backpressure can cause spout to stop emitting and 
stall topology
                 Key: STORM-1949
                 URL: https://issues.apache.org/jira/browse/STORM-1949
             Project: Apache Storm
          Issue Type: Bug
            Reporter: Roshan Naik


The problem can be reproduced with this [Word count 
topology|https://github.com/hortonworks/storm/blob/perftopos1.x/examples/storm-starter/src/jvm/org/apache/storm/starter/perf/FileReadWordCountTopo.java]
within an IDE.
I ran it with 1 spout instance, 2 splitter bolt instances, and 2 counter bolt 
instances.

The problem is more easily reproduced with the WC topology because splitting 
each sentence tuple into word tuples causes an explosion of tuples. As the 
bolts have to process more tuples than the spout produces, the spout needs to 
operate more slowly.

The amount of time it takes for the topology to stall varies, but it is 
typically under 10 minutes.

*My theory:* I suspect there is a race condition in the way ZK is being 
used to enable/disable back pressure. When congested (i.e., pressure exceeds 
the high water mark), the bolt's worker records this congested state in ZK by 
creating a node. Once the congestion drops below the low water mark, it 
deletes this node. 
The spout's worker has set up a watch on the parent node, expecting a callback 
whenever there is a change in the child nodes. On receiving the callback, the 
spout's worker lists the parent node to check whether it has zero or more than 
zero child nodes; it is essentially trying to figure out the nature of the 
state change in ZK to determine whether to throttle or not. Subsequently it 
sets up another watch in ZK to keep an eye on future changes.

When there are multiple bolts, these ZK nodes can be created and deleted in 
rapid succession. Between the time the worker receives a callback and the time 
it sets up the next watch, many changes may have occurred in ZK that go 
unnoticed by the spout. 

As a result, the spout may not notice that the bolts are no longer 
congested.
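
To make the suspected sequence concrete, here is a minimal single-threaded 
sketch of it. It is a toy in-memory model only: it does not use the ZooKeeper 
client API and is not Storm's code, and every name in it (OneShotWatchModel, 
armWatch, list, stalls, "bolt-worker-1") is hypothetical. It assumes, as in the 
theory above, that listing the children and setting up the next watch are two 
separate steps, and it places the bolt's delete between them. For contrast it 
also runs the same steps with the watch set up before the list.

```java
import java.util.HashSet;
import java.util.Set;

// Toy in-memory model of the hypothesized race. This is NOT the ZooKeeper
// client API and NOT Storm's code; all names here are hypothetical.
class OneShotWatchModel {
    private final Set<String> children = new HashSet<>(); // one node per congested worker
    private boolean watchArmed = false;       // one-shot watch on the parent node
    private boolean pendingCallback = false;  // a fired watch not yet handled
    private boolean throttled = false;        // the spout worker's view of congestion

    // A child change fires the watch only if one is armed at that moment.
    private void childrenChanged() {
        if (watchArmed) {
            watchArmed = false;
            pendingCallback = true;
        }
    }

    void create(String node) { if (children.add(node)) childrenChanged(); }
    void delete(String node) { if (children.remove(node)) childrenChanged(); }
    void armWatch() { watchArmed = true; }

    // Spout side: throttle if any child node exists.
    void list() { throttled = !children.isEmpty(); }

    // Returns whether a callback was pending, and consumes it.
    boolean takeCallback() {
        boolean pending = pendingCallback;
        pendingCallback = false;
        return pending;
    }

    // Returns true if the spout ends up throttled although no bolt is congested.
    static boolean stalls(boolean rearmBeforeListing) {
        OneShotWatchModel zk = new OneShotWatchModel();
        zk.armWatch();
        zk.create("bolt-worker-1");   // bolt exceeds the high water mark; watch fires
        zk.takeCallback();            // spout starts handling the callback

        if (rearmBeforeListing) zk.armWatch();
        zk.list();                    // sees one child, so the spout throttles
        zk.delete("bolt-worker-1");   // bolt drops below the low water mark
        if (!rearmBeforeListing) zk.armWatch(); // armed only after the delete

        // Handle the callback for the delete, if the watch caught it.
        if (zk.takeCallback()) {
            zk.armWatch();
            zk.list();
        }
        return zk.throttled && zk.children.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println("list then re-arm: stalled=" + stalls(false));
        System.out.println("re-arm then list: stalled=" + stalls(true));
    }
}
```

In this model the first ordering leaves the spout throttled with no child 
nodes left and no further callback due, which matches the stall described 
above. Whether the real worker code has the same gap is exactly the open 
question of this issue.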



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)