RE: Generated Segment Too Large

2014-10-07 Thread Markus Jelsma
Hi - you have been using Nutch for some time already, so aren't you already 
familiar with the generate.max.count configuration directive, possibly combined 
with the -topN parameter for the Generator job? With generate.max.count the 
segment size depends on the number of distinct hosts or domains, so it is not a 
reliable limit; the -topN parameter, however, is strictly enforced.
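
For reference, a minimal sketch of the two knobs (the property names are the 
standard Nutch ones; the values and paths below are only illustrative):

    <!-- nutch-site.xml: cap the number of URLs taken per host (or per
         domain, depending on generate.count.mode) into each segment -->
    <property>
      <name>generate.max.count</name>
      <value>100</value>
    </property>
    <property>
      <name>generate.count.mode</name>
      <value>host</value>  <!-- or "domain" -->
    </property>

    # -topN puts a hard upper bound on the total number of URLs per segment
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000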

Markus

 
 
-Original message-
 From:Meraj A. Khan mera...@gmail.com
 Sent: Tuesday 7th October 2014 5:54
 To: user@nutch.apache.org
 Subject: Generated Segment Too Large
 
 Hi Folks,
 
 I am using Nutch 1.7 on Hadoop YARN. Right now there seems to be no way of
 controlling the segment size, and since the single segment that is being
 created is very large for the capacity of my Hadoop cluster (I have about
 3TB of available storage), Hadoop generates the spill*.out files for this
 large segment, which gets fetched for days, and I am running out of disk
 space.
 
 I figured that if the segment size were controlled, then for each segment
 the spill files would be deleted after the job for that segment completed,
 giving me more efficient use of the disk space.
 
 I would like to know how I can generate multiple segments of a certain size
 (or just a fixed number) at each depth iteration.
 
 Right now it looks like Generator.java needs to be modified, as it does not
 consider the number of segments. Is that the right approach? If so, can you
 please give me a few pointers on what logic I should be changing? If this is
 not the right approach, I would be happy to know whether there is any way to
 control the number as well as the size of the generated segments using
 configuration/job submission parameters.
 
 Thanks for your help!
 


Re: Generated Segment Too Large

2014-10-07 Thread Meraj A. Khan
Markus,

I have been using Nutch for a while, but I wasn't clear about this issue -
thank you for reminding me that this is Nutch 101 :)

I will go ahead and use topN as the segment size control mechanism, although
I have one question regarding topN: if I have a topN value of 1000 and there
are more than topN URLs unfetched at that point in time, let's say 2000,
would the remaining 1000 be addressed in a subsequent fetch phase, meaning
nothing is discarded or left unfetched?
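
My assumption (which I would like to confirm) is that URLs that are due for
fetching but fall outside the topN selection simply stay in the CrawlDB and
become eligible again in the next generate cycle, so nothing is lost - roughly
the per-iteration loop below, where crawl/crawldb and crawl/segments are just
placeholder paths:

    # one depth iteration, capped at 1000 URLs in the generated segment
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    SEGMENT=`ls -d crawl/segments/* | tail -1`   # pick the newly created segment
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    # updatedb records the fetch status; URLs not selected this round stay pending
    bin/nutch updatedb crawl/crawldb $SEGMENT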





On Tue, Oct 7, 2014 at 3:46 AM, Markus Jelsma markus.jel...@openindex.io
wrote:

 Hi - you have been using Nutch for some time already, so aren't you already
 familiar with the generate.max.count configuration directive, possibly combined
 with the -topN parameter for the Generator job? With generate.max.count the
 segment size depends on the number of distinct hosts or domains, so it is
 not a reliable limit; the -topN parameter, however, is strictly enforced.

 Markus



 -Original message-
  From:Meraj A. Khan mera...@gmail.com
  Sent: Tuesday 7th October 2014 5:54
  To: user@nutch.apache.org
  Subject: Generated Segment Too Large
 
  Hi Folks,
 
  I am using Nutch 1.7 on Hadoop YARN. Right now there seems to be no way of
  controlling the segment size, and since the single segment that is being
  created is very large for the capacity of my Hadoop cluster (I have about
  3TB of available storage), Hadoop generates the spill*.out files for this
  large segment, which gets fetched for days, and I am running out of disk
  space.
 
  I figured that if the segment size were controlled, then for each segment
  the spill files would be deleted after the job for that segment completed,
  giving me more efficient use of the disk space.
 
  I would like to know how I can generate multiple segments of a certain size
  (or just a fixed number) at each depth iteration.
 
  Right now it looks like Generator.java needs to be modified, as it does not
  consider the number of segments. Is that the right approach? If so, can you
  please give me a few pointers on what logic I should be changing? If this is
  not the right approach, I would be happy to know whether there is any way to
  control the number as well as the size of the generated segments using
  configuration/job submission parameters.
 
  Thanks for your help!
 



Generated Segment Too Large

2014-10-06 Thread Meraj A. Khan
Hi Folks,

I am using Nutch 1.7 on Hadoop YARN. Right now there seems to be no way of
controlling the segment size, and since the single segment that is being
created is very large for the capacity of my Hadoop cluster (I have about
3TB of available storage), Hadoop generates the spill*.out files for this
large segment, which gets fetched for days, and I am running out of disk
space.

I figured that if the segment size were controlled, then for each segment
the spill files would be deleted after the job for that segment completed,
giving me more efficient use of the disk space.

I would like to know how I can generate multiple segments of a certain size
(or just a fixed number) at each depth iteration.

Right now it looks like Generator.java needs to be modified, as it does not
consider the number of segments. Is that the right approach? If so, can you
please give me a few pointers on what logic I should be changing? If this is
not the right approach, I would be happy to know whether there is any way to
control the number as well as the size of the generated segments using
configuration/job submission parameters.

Thanks for your help!
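
If the goal is specifically several smaller segments per depth iteration, one
more thing worth checking against the Generator usage string in 1.7: the 1.x
Generator also accepts a -maxNumSegments option, which lets a single generate
pass split its fetch list across several segments instead of writing one large
one. A possible invocation (paths and numbers purely illustrative, and the
option itself should be verified for 1.7):

    # assumption: -maxNumSegments is available in the Nutch 1.7 Generator
    bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -maxNumSegments 10

Each resulting segment can then be fetched, parsed and updated on its own, so
the spill files of one segment are cleaned up before the next one is processed.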