Markus, I have been using Nutch for a while, but I wasn't clear on this issue; thank you for reminding me that this is Nutch 101 :)
I will go ahead and use topN as the segment size control mechanism, although I have one question regarding topN: if I have a topN value of 1000 and there are more than topN unfetched URLs at that point in time, say 2000, will the remaining 1000 be addressed in a subsequent fetch phase, meaning nothing is discarded or left unfetched?

On Tue, Oct 7, 2014 at 3:46 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hi - you have been using Nutch for some time already, so aren't you already
> familiar with the generate.max.count configuration directive, possibly
> combined with the -topN parameter for the Generator job? With
> generate.max.count the segment size depends on the number of distinct hosts
> or domains, so it is not really trustworthy; the topN parameter is really
> strict.
>
> Markus
>
> -----Original message-----
> > From: Meraj A. Khan <mera...@gmail.com>
> > Sent: Tuesday 7th October 2014 5:54
> > To: user@nutch.apache.org
> > Subject: Generated Segment Too Large
> >
> > Hi Folks,
> >
> > I am using Nutch 1.7 on Hadoop YARN. Right now there seems to be no way
> > of controlling the segment size, and a single segment is being created
> > which is very large for the capacity of my Hadoop cluster. I have
> > available storage of ~3TB, but since Hadoop generates the spill*.out
> > files for this large segment, which is fetched for days, I am running
> > out of disk space.
> >
> > I figured that if the segment size were controlled, then for each
> > segment the spill files would be deleted after the job for that segment
> > completed, giving me efficient use of the disk space.
> >
> > I would like to know how I can generate multiple segments of a certain
> > size (or just a fixed number) at each depth iteration.
> >
> > Right now it looks like Generator.java needs to be modified, as it does
> > not consider the number of segments. Is that the right approach? If so,
> > can you please give me a few pointers on what logic I should be
> > changing? If this is not the right approach, I would be happy to know
> > whether there is any way to control the number as well as the size of
> > the generated segments using configuration/job submission parameters.
> >
> > Thanks for your help!
> >
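For what it's worth, the topN selection semantics can be illustrated with a toy sketch (plain Python, not Nutch code; the URL list, scores, and function name here are made up for illustration): each generate cycle takes at most topN of the highest-scoring unfetched URLs into the segment, and everything beyond topN simply stays unfetched for a later cycle rather than being discarded.

```python
# Toy model of the Generator's -topN behavior (NOT actual Nutch code).
# Each cycle selects at most top_n URLs by descending score; the rest
# remain in the pool of unfetched URLs for subsequent generate cycles.
def generate_segment(unfetched, top_n):
    ordered = sorted(unfetched, key=lambda u: u[1], reverse=True)
    segment = ordered[:top_n]     # goes into this segment
    remaining = ordered[top_n:]   # stays unfetched; nothing is dropped
    return segment, remaining

# 2000 unfetched URLs with made-up scores, topN = 1000:
urls = [("http://example.com/page%d" % i, float(i)) for i in range(2000)]
seg1, rest = generate_segment(urls, 1000)
seg2, rest2 = generate_segment(rest, 1000)

# First cycle takes 1000 URLs, second cycle picks up the remaining 1000.
assert len(seg1) == 1000
assert len(seg2) == 1000
assert rest2 == []
```

Under this model, two generate/fetch cycles with topN=1000 cover all 2000 URLs, which matches the "nothing is left unfetched" reading of the question.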