Hi Everyone, I need your advice and have some questions!
I'm looking into using Apache Storm (Trident) for a scientific application that processes sensor data in real time. Essentially, I have hundreds of sensors emitting data over time, and I would like to compute correlations across all of the sensors. To do that, I need to batch the sensor data by time: group together all of the data from every sensor over a 1-second interval, let's say, and process it as a unit with my correlation algorithm.

The way I see the Storm topology: each sensor is connected and sends data to a spout. The spouts then send batches of tuples to a combiner bolt, which builds a larger batch of tuples that all fall within a specific time partition. That larger batch of tuples is then sent to another bolt that runs my correlation algorithm and outputs the data I need to save. I assume this type of topology can be highly parallel, processing multiple time-partitioned batches at once, continuously.

1.) Is this type of topology possible?
2.) How does the combiner bolt know that all of the data from each spout has been received before it batches everything together and sends it on to the next bolt?
3.) After processing a batch of time-partitioned data, does Storm automatically kill the thread and restart a fresh instance, or do I need to write memory-clearing code myself?
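To make the combiner step concrete, here is a minimal, framework-free sketch of the time partitioning I have in mind (plain Python, not Storm API; the function name and tuple layout are just illustrative):

```python
from collections import defaultdict

def partition_by_second(readings):
    """Group (sensor_id, timestamp, value) readings into 1-second windows.

    Stand-in for the combiner bolt's job: key each reading by its
    whole-second timestamp so one window holds every sensor's data
    for that second. Assumes non-negative timestamps, so int() is a
    valid floor.
    """
    windows = defaultdict(list)
    for sensor_id, ts, value in readings:
        windows[int(ts)].append((sensor_id, value))
    return dict(windows)

# Two sensors, two seconds of data.
readings = [
    ("s1", 0.1, 1.0), ("s2", 0.7, 2.0),
    ("s1", 1.2, 3.0), ("s2", 1.9, 4.0),
]
batches = partition_by_second(readings)
# batches[0] now holds both sensors' readings for second 0,
# ready to be handed to the correlation bolt as one unit.
```

The open question for me is the one in 2.) above: in a streaming setting this grouping only works if the combiner can tell when a window is complete, since late tuples from a slow spout would otherwise be dropped or mis-assigned.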
