Hi folks, I am using Nutch 1.7 on Hadoop YARN. Right now there seems to be no way of controlling the segment size, and a single segment is being created that is very large for the capacity of my Hadoop cluster. I have about 3 TB of available storage, but Hadoop generates spill*.out files for this large segment, which takes days to fetch, so I am running out of disk space.
I figured that if the segment size were controlled, the spill files for each segment would be deleted once the job for that segment completed, giving me more efficient use of the disk space. I would like to know how I can generate multiple segments of a certain size (or just a fixed number of them) at each depth iteration. Right now it looks like Generator.java needs to be modified, since it does not take the number of segments into account. Is that the right approach? If so, could you please give me a few pointers on what logic I should change? If it is not, I would be happy to learn whether there is any way to control the number as well as the size of the generated segments through configuration/job submission parameters. Thanks for your help!
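
P.S. To make the question concrete, here is roughly what my generate/fetch cycle looks like today and the kind of knob I am hoping for. The paths and -topN values are just placeholders for my setup, and the -maxNumSegments flag in the second generate call is only what I imagine such an option might look like; I have not found anything like it for 1.7, which is why I am asking.

    # what I run today: one generate per depth iteration, producing a single huge segment
    bin/nutch generate crawl/crawldb crawl/segments -topN 5000000
    bin/nutch fetch crawl/segments/<segment-timestamp>

    # what I am hoping for: split each iteration into several smaller segments,
    # so each fetch job stays small and its spill files get cleaned up sooner
    # (the -maxNumSegments flag name is purely illustrative)
    bin/nutch generate crawl/crawldb crawl/segments -topN 500000 -maxNumSegments 10
    for seg in crawl/segments/*; do bin/nutch fetch "$seg"; done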