Hi,

I am developing a benchmark on Storm 0.9.4. The benchmark consists of three
bolts which can be parallelized (i.e., by increasing the number of executors)
without affecting the semantics of the application. Although the benchmark
filters out part of its input data, its behavior is deterministic: with the
data set I use during the experiments, I should get the same number of output
events across multiple runs. The bolts were developed following the
reliability guidelines provided by the Storm developers [1] and extend
BaseBasicBolt.
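
As a rough illustration, each bolt looks something like the following sketch
(the field name and the filter predicate are placeholders, not my actual
benchmark code):

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Minimal sketch of one of the filter bolts. BaseBasicBolt acks each
// input tuple automatically after execute() returns, which is why I
// expected no event loss regardless of parallelism.
public class FilterBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String event = input.getStringByField("event"); // placeholder field name
        if (isRelevant(event)) {
            // Tuples emitted through BasicOutputCollector are anchored
            // to the input tuple, per the reliability guidelines in [1].
            collector.emit(new Values(event));
        }
        // Filtered-out tuples are still acked automatically.
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("event"));
    }

    private boolean isRelevant(String event) {
        return event != null && !event.isEmpty(); // placeholder predicate
    }
}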
While a parallelism level of one worked as expected, with higher levels of
parallelism (2, 4, and 8) I experienced event loss at the output. The event
count was taken at the last bolt of the topology, based on a first punctuation
and a last punctuation sent through the benchmark application. One of the
reasons for event loss was that the last punctuation arrived at the output
bolt before some of the other events had arrived, giving a smaller number of
events than I should get. This behavior is typical of any stream processing
application. The solution was to delay sending the last punctuation event
until 20 seconds after sending all the other events.
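
In code, the fix on the sender side looks roughly like this (a sketch, not my
actual code; emitEvent and emitPunctuation stand in for the benchmark's real
transport into the topology):

// After all data events are sent, wait 20 seconds before sending the
// last punctuation so in-flight tuples can drain to the output bolt.
public class WorkloadSender {
    void sendWorkload(Iterable<String> inputEvents) throws InterruptedException {
        emitPunctuation("first");          // marks the start of measurement
        for (String event : inputEvents) {
            emitEvent(event);
        }
        Thread.sleep(20000);               // delay before the last punctuation
        emitPunctuation("last");           // marks the end of measurement
    }

    // Placeholders for the benchmark's actual event-sending logic.
    void emitEvent(String event) { }
    void emitPunctuation(String marker) { }
}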
Furthermore, there was a CPU-intensive text compression step in the final bolt
of the topology. This heavy bolt gave rise to additional event loss. Once the
text compression step was removed, the benchmark's Storm topology ran as
expected without any event loss, irrespective of the level of parallelism used
(even for a parallelism of 32).
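
For reference, the parallelism is varied via the standard parallelism hint
when wiring the topology, along these lines (a sketch simplified to a single
intermediate bolt; component names, InputSpout, and OutputCountBolt are
placeholders):

import backtype.storm.topology.TopologyBuilder;

public class BenchmarkTopology {
    public static void main(String[] args) {
        // The third argument is the parallelism hint (number of executors);
        // this is what I varied across runs (1, 2, 4, 8, and 32).
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("input-spout", new InputSpout(), 1);      // placeholder spout
        builder.setBolt("filter-bolt", new FilterBolt(), 4)        // as sketched above
               .shuffleGrouping("input-spout");
        builder.setBolt("output-bolt", new OutputCountBolt(), 1)   // counts between punctuations
               .shuffleGrouping("filter-bolt");
        // builder.createTopology() is then submitted via StormSubmitter
        // or run in a LocalCluster.
    }
}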

My question is: does Storm have some way of warning the user/developer of a
streaming application that such event loss has occurred?
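
To clarify what I mean: the closest mechanism I am aware of is the spout-side
fail() callback, which Storm invokes when a tuple tree times out
(topology.message.timeout.secs) or is failed by a bolt, provided the tuples
were emitted with a message ID. A minimal sketch of that, with a hypothetical
event source:

import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class MonitoredSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private long failedCount = 0;
    private long sequence = 0;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String event = nextEventOrNull();             // hypothetical event source
        if (event != null) {
            // Emitting with a message ID enables ack/fail tracking.
            collector.emit(new Values(event), sequence++);
        }
    }

    @Override
    public void fail(Object msgId) {
        // Called on tuple-tree timeout or explicit failure downstream.
        failedCount++;
        System.err.println("Tuple " + msgId + " failed or timed out; total failed = " + failedCount);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("event"));
    }

    private String nextEventOrNull() {
        return null; // placeholder: would read from the benchmark's input
    }
}

But that only surfaces timeouts and explicit failures at the spout; I am
wondering whether there is anything more direct for cases like the ones
described above.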

[1] https://storm.apache.org/documentation/Guaranteeing-message-processing.html

Thanks,
Miyuru
