Hi,

You could have a topology with a spout that scans for files in the
directory and bolts that read and process the data in each file.

Create a spout (with parallelism 1) that scans the directory for files with
a particular extension. When files are available, the spout can rename each
one to a different extension and then emit the path of the renamed file to
the bolt. Renaming ensures that the same files are not picked up again in
the next scan cycle (the next invocation of nextTuple()). You can emit
multiple file paths by calling emit() on the collector once per file.
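
As a rough sketch of the scan-and-rename step, the helper below uses only
plain Java (no Storm dependency), so it can be tried on its own. The names
FileClaimer and claimFiles, and whatever extensions you pass in, are
hypothetical and only for illustration. A spout could call it from
nextTuple() and emit each returned path.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that a spout's nextTuple() could call once per scan cycle. */
final class FileClaimer {
    private FileClaimer() {}

    /**
     * Renames every regular file in dir whose name ends with fromExt so that
     * it ends with toExt instead, and returns the renamed paths.
     * Files that were already renamed no longer match and are skipped.
     */
    static List<Path> claimFiles(Path dir, String fromExt, String toExt) {
        // Collect the matches first, so the directory is not modified mid-scan.
        List<Path> candidates = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*" + fromExt)) {
            for (Path file : stream) {
                if (Files.isRegularFile(file)) {
                    candidates.add(file);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        List<Path> claimed = new ArrayList<>();
        for (Path file : candidates) {
            String name = file.getFileName().toString();
            String base = name.substring(0, name.length() - fromExt.length());
            Path target = file.resolveSibling(base + toExt);
            try {
                // Fails if the target name already exists; handle as you see fit.
                Files.move(file, target);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            claimed.add(target);
        }
        return claimed;
    }
}
```

Calling claimFiles(dir, ".csv", ".processing") twice in a row returns the
renamed files the first time and an empty list the second time, which is
the behaviour the spout relies on between scan cycles.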

Create bolts (e.g. a reader bolt and a data validator bolt, each with the
desired parallelism) that read the contents of the files and process them
according to your needs.
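
The work done inside the reader and validator bolts might look roughly like
the following. It is again plain Java rather than Storm code. The class
name FileRecordReader and the validation rule (a record is valid if it is
not blank) are assumptions made up for this sketch, so substitute your own
checks.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical logic for a reader bolt followed by a validator bolt. */
final class FileRecordReader {
    private FileRecordReader() {}

    /** Reader step: loads all lines of the file whose path the spout emitted. */
    static List<String> readLines(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Validator step: placeholder rule, a record is valid if it is not blank. */
    static boolean isValid(String record) {
        return !record.trim().isEmpty();
    }

    /** Runs both steps and returns only the records that pass validation. */
    static List<String> readValidRecords(Path file) {
        List<String> valid = new ArrayList<>();
        for (String line : readLines(file)) {
            if (isValid(line)) {
                valid.add(line);
            }
        }
        return valid;
    }
}
```

In a real topology the two steps would live in separate bolts, so that each
can be given its own parallelism.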

Since the spout delegates the processing of the files to the bolts, your
system can scale by increasing the parallelism of the bolts. The
parallelism of the spout should remain 1 in this topology: with more than
one spout instance, the same file may be picked up by two instances, which
could lead to errors and duplicate processing of the files. You also need
to ensure that the spout and the bolts all have access to the directory
location (for example, through a shared filesystem if the workers run on
different machines).

Thanks,
Richards Peter.
