Hi Folks,

I am using Nutch 1.7 on Hadoop YARN. Right now there seems to be no way of
controlling the segment size, and a single segment is being created that is
far too large for the capacity of my Hadoop cluster. I have ~3TB of
available storage, but because Hadoop keeps the spill*.out files around for
this large segment, which takes days to fetch, I am running out of disk
space.

I figure that if the segment size could be controlled, the spill files for
each segment would be deleted once the job for that segment completed,
giving me more efficient use of the disk space.

I would like to know how I can generate multiple segments of a certain size
(or just a fixed number of them) at each depth iteration.
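
For concreteness, here is roughly what I was hoping for. I am going by the
Generator usage string as I read it, so the -maxNumSegments flag may not do
what I assume (or may not be honoured in 1.7 at all):

    # hoped-for invocation: select at most 500k URLs (-topN) and split
    # them across at most 5 segments (-maxNumSegments), so each segment
    # stays small enough to fetch and clean up in reasonable time
    bin/nutch generate crawl/crawldb crawl/segments -topN 500000 -maxNumSegments 5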

Right now it looks like Generator.java needs to be modified, since it does
not seem to take a target number of segments into account. Is that the right
approach? If so, can you please give me a few pointers on what logic I
should be changing? If it is not, I would be happy to know whether there is
any way to control the number as well as the size of the generated segments
using the configuration/job submission parameters.
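
For reference, the closest knobs I have found so far are the
generate.max.count / generate.count.mode properties from nutch-default.xml,
but if I read their descriptions correctly they cap URLs per host (or per
domain) within a fetchlist, not the overall size or number of segments.
Something like this, assuming the generate job picks up -D overrides (which
I have not verified):

    # caps each host to 50k URLs per fetchlist, which only indirectly
    # bounds the segment size
    bin/nutch generate -D generate.max.count=50000 -D generate.count.mode=host \
        crawl/crawldb crawl/segments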

Thanks for your help!
