I am currently using Storm in an unorthodox way. While I am happy with the
initial prototype, I would like to know your opinion on potential problems I am
overlooking.

In our company we need to execute a cascade of image processing steps (bolts)
on relatively large input data. A single input file (and thus tuple) can range
between 100 MB and 100 GB. With such input sizes, the data itself is of course
not placed in the stream, only a reference to it (e.g., the file name). This is
the first non-conformity with the intent of Storm. The second is the long
execution time of a single bolt (20-30 min), due mainly to the large size of an
individual compute unit (the file).
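To make that concrete, a single step roughly looks like the sketch below
(simplified; package names assume a Storm release under org.apache.storm, and
the class, field, and helper names are placeholders rather than our actual
code):

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

import java.util.Map;

// One image-processing step: the tuple carries only a file path; the bolt
// reads and writes the large file on local disk and may run for 20-30 minutes.
public class ImageStepBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String inputPath = input.getStringByField("filePath");
        try {
            String outputPath = runStep(inputPath);          // placeholder for the real processing
            collector.emit(input, new Values(outputPath));   // anchor to the input tuple for reliability
            collector.ack(input);
        } catch (Exception e) {
            collector.fail(input);                           // let the spout replay this file
        }
    }

    // Placeholder: the actual algorithm works on the local file and returns
    // the path of the file it produced for the next step in the cascade.
    private String runStep(String inputPath) throws Exception {
        return inputPath + ".out";
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("filePath"));
    }
}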
By adjusting the bolt heartbeat timeout and the expected ACK timeout in the
spout, I can convince Storm to get the files processed. Of course, I also need
to take care of data locality, of which the spout must be aware: on each worker
(one per machine) the spout emits only tuples whose input files are local. So
in a sense, I mimic the functionality of HDFS.
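The timeout side of it amounts to something like the following (a minimal
sketch; the exact config keys and sensible values depend on the Storm version,
so take the numbers as illustrative):

import org.apache.storm.Config;

public class LongTaskTopologyConfig {
    public static Config build() {
        Config conf = new Config();
        // give a tuple up to ~45 min between emit and ack before it is replayed
        conf.setMessageTimeoutSecs(45 * 60);
        // keep only a handful of file references in flight per spout task, so a
        // replay cannot queue up dozens of multi-gigabyte jobs at once
        conf.setMaxSpoutPending(2);
        // raise worker/task liveness timeouts so a busy worker is not declared
        // dead mid-file (keys and safe values are version-dependent)
        conf.put(Config.SUPERVISOR_WORKER_TIMEOUT_SECS, 120);
        conf.put(Config.NIMBUS_TASK_TIMEOUT_SECS, 120);
        return conf;
    }
}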
There will come a time when the input files decrease in size and increase in
number, but until the algorithm developers write their code with parallel
distribution in mind, this Storm solution with large input files will have to
do.
Or does it? The advantage of the solution is that it scales down nicely to a
single-machine environment and scales up just as well (as long as I keep an eye
on data locality). The disadvantage is that I am currently using Storm "just"
as a realtime job pipeline scheduler with a relatively small number of input
items (50-1000).
Are there better solutions for this specific setup, or should I just be happy
with what works? Will I run into unexpected problems with this "misuse"?
Thanks for reading this far.
- Daniel
