Hi, You could have a topology with spout scanning for files in the directory and bolts reading and processing the data in the file.
Create a Spout(with parallelism 1) that scans for files(of a particular extension) in the directory. Once files are available in the directory, the spout can rename them to a different extension and then emit the path of the new file to the bolt. By renaming the files to a different extension, you can make sure that the same files are not picked up in the next scan cycle(next invocation of nextTuple()). You can emit multiple filepaths by calling multiple emit() on collector. Create bolts(eg: reader bolt and data validator bolt with desired parallelism) that read contents from the files and process them according to your need. Since the spout has delegated the task of processing the files to the bolts, your system can scale by increasing the parallelism of the bolt. Parallelism of spout should remain 1 in this topology because if there are more than one spout, the same file may be picked by two spout instances, a situation which could lead to errors and duplicate processing of the files. In this system you also need to ensure that spout and bolts have access to the directory location. Thanks, Richards Peter.
