RE: Generated Segment Too Large
Hi - you have been using Nutch for some time already, so you may already be familiar with the generate.max.count configuration directive, possibly combined with the -topN parameter for the Generator job? With generate.max.count the segment size depends on the number of distinct hosts or domains, so it is not a reliable limit; the -topN parameter, on the other hand, is strictly enforced.

Markus
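For reference, the relevant properties go in conf/nutch-site.xml. A minimal sketch - the property names are as in nutch-default.xml, the values are only illustrative:

    <!-- Count queued URLs per host ("domain" is the other mode)
         when applying generate.max.count. -->
    <property>
      <name>generate.count.mode</name>
      <value>host</value>
    </property>

    <!-- Select at most this many URLs per host into a single fetchlist;
         -1 (the default) disables the limit. -->
    <property>
      <name>generate.max.count</name>
      <value>100</value>
    </property>

As Markus notes, the resulting segment size then scales with how many distinct hosts happen to be due for fetching, which is why it is not a hard cap on the segment as a whole.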
Re: Generated Segment Too Large
Markus, I have been using Nutch for a while, but I wasn't clear about this issue - thank you for reminding me that this is Nutch 101 :) I will go ahead and use -topN as the segment size control mechanism, although I have one question regarding -topN: if I set topN to 1000 and more than topN URLs - say 2000 - are unfetched at that point in time, will the remaining 1000 be addressed in a subsequent fetch cycle, i.e. nothing is discarded or left unfetched?
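For what it's worth, URLs not selected in one round are not discarded: they stay in the crawldb and remain eligible, subject to scoring and fetch scheduling, in later rounds. A minimal sketch of such a generate/fetch loop, assuming a local crawl/ directory and Nutch 1.7 command names; the depth of 3 and the topN value are illustrative:

    # Each round selects at most 1000 of the top-scoring due URLs;
    # the rest stay in the crawldb for subsequent rounds.
    for i in 1 2 3; do
      bin/nutch generate crawl/crawldb crawl/segments -topN 1000
      SEGMENT=$(ls -d crawl/segments/* | tail -1)   # newest segment (timestamp-named)
      bin/nutch fetch "$SEGMENT"
      bin/nutch parse "$SEGMENT"
      bin/nutch updatedb crawl/crawldb "$SEGMENT"   # merge fetch results back
    done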
Generated Segment Too Large
Hi Folks,

I am using Nutch 1.7 on Hadoop YARN. Right now there seems to be no way of controlling the segment size: a single segment is being created that is very large for the capacity of my Hadoop cluster. I have ~3 TB of available storage, but Hadoop generates spill*.out files for this large segment, which takes days to fetch, and I am running out of disk space. I figure that if the segment size were controlled, the spill files for each segment would be deleted once that segment's job completed, giving me efficient use of the disk space.

I would like to know how I can generate multiple segments of a certain size (or just a fixed number) at each depth iteration. Right now it looks like Generator.java needs to be modified, as it does not consider the number of segments - is that the right approach? If so, can you please give me a few pointers on what logic I should change? If it is not, I would be happy to know whether there is any way to control the number as well as the size of the generated segments using configuration or job submission parameters.

Thanks for your help!
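Worth noting: the Generator in Nutch 1.x already accepts a -maxNumSegments option (added under NUTCH-612) that splits the fetchlist into several segments in one generate pass, so modifying Generator.java should not be necessary. A sketch of the invocation, with illustrative paths and values; if I read Generator.java correctly, -topN then acts roughly as a per-segment cap:

    # Produce up to 5 segments of at most ~50000 URLs each in one pass,
    # then fetch them one at a time so spill files are cleaned up per job.
    bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -maxNumSegments 5

Each segment can then be fetched, parsed, and merged back with updatedb individually, which keeps the intermediate spill*.out files bounded by the size of a single segment rather than the whole fetchlist.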