Markus,

I have been using Nutch for a while, but I wasn't clear about this issue.
Thank you for reminding me that this is Nutch 101 :)

I will go ahead and use topN as the segment size control mechanism. I do
have one question about topN, though: if I set topN to 1000 and there are
more than topN URLs unfetched at that point in time, say 2000, will the
remaining 1000 be addressed in the subsequent fetch phase, meaning nothing
is discarded or left unfetched?
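
For concreteness, this is the invocation I have in mind (a sketch only --
the crawl/crawldb and crawl/segments paths are just my local layout, so
please correct me if I have it wrong):

    bin/nutch generate crawl/crawldb crawl/segments -topN 1000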
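
And for completeness, this is how I read the generate.max.count approach
you describe below -- again just a sketch, with host-based counting and a
placeholder limit of 100 URLs per host in nutch-site.xml:

    <property>
      <name>generate.count.mode</name>
      <value>host</value>
    </property>
    <property>
      <name>generate.max.count</name>
      <value>100</value>
    </property>

I can see why that only caps URLs per host (or domain), so the total
fetchlist size still depends on how many distinct hosts are selected --
hence sticking with -topN for a hard limit.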





On Tue, Oct 7, 2014 at 3:46 AM, Markus Jelsma <markus.jel...@openindex.io>
wrote:

> Hi - you have been using Nutch for some time already, so aren't you already
> familiar with the generate.max.count configuration directive, possibly
> combined with the -topN parameter for the Generator job? With
> generate.max.count the segment size depends on the number of distinct hosts
> or domains, so it is not a reliable limit; the -topN parameter is strictly
> enforced.
>
> Markus
>
>
>
> -----Original message-----
> > From:Meraj A. Khan <mera...@gmail.com>
> > Sent: Tuesday 7th October 2014 5:54
> > To: user@nutch.apache.org
> > Subject: Generated Segment Too Large
> >
> > Hi Folks,
> >
> > I am using Nutch 1.7 on Hadoop YARN. Right now there seems to be no way
> > of controlling the segment size, and a single segment is being created
> > that is very large for the capacity of my Hadoop cluster. I have about
> > 3TB of available storage, but since Hadoop generates the spill*.out files
> > for this large segment, which takes days to fetch, I am running out of
> > disk space.
> >
> > I figured that if the segment size were controlled, then the spill files
> > for each segment would be deleted after the job for that segment
> > completed, giving me more efficient use of the disk space.
> >
> > I would like to know how I can generate multiple segments of a certain
> > size (or just a fixed number) at each depth iteration.
> >
> > Right now it looks like Generator.java needs to be modified, as it does
> > not consider the number of segments. Is that the right approach? If so,
> > can you please give me a few pointers on what logic I should be changing?
> > If this is not the right approach, I would be happy to know whether there
> > is any way to control the number as well as the size of the generated
> > segments using configuration/job submission parameters.
> >
> > Thanks for your help!
> >
>
