I recently had to solve a similar use case.  I decided to track which files I 
had processed instead of tracking records within each file.  If a file is still 
open for writing, you can ignore it and come back for it later, or insert it 
more than once if your process is idempotent.
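To make the idea concrete, here is a minimal sketch of that file-tracking approach. The details are assumptions, not from the thread: it assumes finished files can be distinguished by dropping an ".inprogress" suffix on rename, and the `processed` set stands in for a persistent table of already-loaded file names.

```python
def files_to_load(all_files, processed):
    """Return files that are complete and not yet loaded into the database.

    all_files: current HDFS directory listing (list of file names)
    processed: names already loaded (would be a DB table in practice)
    """
    return [
        f for f in all_files
        if not f.endswith(".inprogress")   # still open for writing: skip for now
        and f not in processed             # already loaded: skip (idempotence)
    ]

processed = {"part-0001"}
listing = ["part-0001", "part-0002", "part-0003.inprogress"]
print(files_to_load(listing, processed))  # only part-0002 is ready and unloaded
```

Each run of the copy job picks up whatever is complete, loads it, and records the file name; files still being written simply show up on a later run.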

From: Gaurav Agarwal <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, November 30, 2015 at 1:01 PM
To: "[email protected]" <[email protected]>
Subject: Writing file to storm hdfs


Hello

In our Storm topology we receive millions of tuples from Kafka and perform 
some calculations in a bolt. In parallel, another bolt writes to HDFS; the 
parallelism hint for the writer is 8, so there will be 8 files.
The problem is that once the snapshot data is enriched and written out to the 
multiple files, we have to trigger another job that copies the records from 
the files into a database.
With multiple files being created and bolts writing to them in parallel, how 
can we find the last record written so that we can trigger the next job? Any 
ideas?
