Hi - you have been using Nutch for some time already, so aren't you already familiar with the generate.max.count configuration directive, possibly combined with the -topN parameter for the Generator job? With generate.max.count the resulting segment size depends on the number of distinct hosts or domains, so it is not a reliable bound; the -topN parameter, on the other hand, is a strict limit.
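For example (a minimal sketch; the property names are the standard ones from nutch-default.xml, and the values here are only illustrative - tune them to your crawl and cluster):

    <!-- in conf/nutch-site.xml: cap the number of URLs generated per host
         (or per domain) in a single segment -->
    <property>
      <name>generate.max.count</name>
      <value>1000</value>
    </property>
    <property>
      <name>generate.count.mode</name>
      <value>host</value> <!-- or "domain" -->
    </property>

and a hard upper bound on the segment size via -topN when invoking the generator (paths are placeholders):

    bin/nutch generate crawl/crawldb crawl/segments -topN 50000
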
Markus

-----Original message-----
> From: Meraj A. Khan <[email protected]>
> Sent: Tuesday 7th October 2014 5:54
> To: [email protected]
> Subject: Generated Segment Too Large
>
> Hi Folks,
>
> I am using Nutch 1.7 on Hadoop YARN. Right now there seems to be no way of
> controlling the segment size, and since a single segment is being created
> which is very large for the capacity of my Hadoop cluster (I have
> available storage of ~3TB), and since Hadoop generates the spill*.out files
> for this large segment, which gets fetched for days, I am running out of
> disk space.
>
> I figured that if the segment size were controlled, then for each segment
> the spill files would be deleted after the job for that segment completed,
> giving me more efficient use of the disk space.
>
> I would like to know how I can generate multiple segments of a certain size
> (or just a fixed number) at each depth iteration.
>
> Right now it looks like Generator.java needs to be modified, as it does not
> consider the number of segments. Is that the right approach? If so, can you
> please give me a few pointers on what logic I should be changing? If this is
> not the right approach, I would be happy to know if there is any way to
> control the number as well as the size of the generated segments using the
> configuration/job submission parameters.
>
> Thanks for your help!

