No, I have a separate process that runs periodically and determines which files haven't been processed before. Hooking directly into the rotation wasn't an option for me for unrelated reasons.
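A minimal sketch of that kind of periodic scanner, assuming a local staging directory, an mtime-based grace window to skip files likely still open for writing, and an in-memory `processed` set — all illustrative details, not the poster's actual implementation (an HDFS version would use the FileSystem API instead of `os`):

```python
import os
import time

def find_unprocessed(directory, processed, grace_seconds=60, now=None):
    """Return files safe to hand to the next job: not yet processed and
    not modified within the grace window (possibly still being written)."""
    now = time.time() if now is None else now
    ready = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or name in processed:
            continue
        if now - os.path.getmtime(path) < grace_seconds:
            continue  # recently modified; revisit on the next cycle
        ready.append(path)
    return ready
```

A periodic driver would call this in a loop, hand the returned paths to the copy job, and add their names to `processed` only after the hand-off succeeds.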
From: Gaurav Agarwal <[email protected]>
Reply-To: [email protected]
Date: Monday, November 30, 2015 at 2:06 PM
To: [email protected]
Subject: Re: Writing file to storm hdfs

Hello Aaron,

Please correct me if I am wrong: you start processing files as soon as they are written and rotated by the HDFS bolt.

On Dec 1, 2015 12:41 AM, "Aaron.Dossett" <[email protected]> wrote:

I recently had to solve a use case like that. I decided to track which files I had processed instead of records within each file. If a file is still open for writing you could ignore it and come back to it later, or insert it more than once if your process is idempotent.

From: Gaurav Agarwal <[email protected]>
Reply-To: [email protected]
Date: Monday, November 30, 2015 at 1:01 PM
To: [email protected]
Subject: Writing file to storm hdfs

Hello,

In our Storm topology we are receiving millions of tuples from Kafka and have to perform some calculations in a bolt. In parallel, another bolt writes to HDFS; the parallelism hint for the writer is 8, so there will be 8 files.

The problem is that once the snapshot data is enriched, written to the multiple files, and completed, we have to trigger another job that copies the records from the files into a database. With multiple files being created and bolts writing to them in parallel, how can we find which is the last record written, so that we can trigger the next job? Any ideas?
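Aaron's "insert it more than once if your process is idempotent" suggestion can be sketched with an upsert keyed on a record id, so re-running the copy job over the same file leaves exactly one row per record. This is only an illustration using sqlite3; the table name, key column, and schema are assumptions, not anything from the thread:

```python
import sqlite3

def load_records(conn, records):
    """Idempotent load: re-inserting the same record_id leaves one row."""
    conn.executemany(
        "INSERT OR REPLACE INTO enriched (record_id, payload) VALUES (?, ?)",
        records,
    )
    conn.commit()

# Hypothetical target table keyed on record_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enriched (record_id TEXT PRIMARY KEY, payload TEXT)")
```

With a load like this, a file that is accidentally picked up twice does no harm, which loosens the requirement to detect the exact "last record written".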
