Re: Nutch generating fewer URLs for the fetcher to fetch (running in Hadoop mode)
Hi,

Amazing, that behavior has been there since forever (commit 5943b9f1 by Doug Cutting). Yes, it's related to politeness, and it's correct, as far as I can see.

Generator.Selector implements three methods:
- map(): selects unfetched entries and sorts them by decreasing score (more relevant ones first)
- partition(): ensures that all entries of one host end up in the same partition
- reduce(): applies the limits (topN and "generate.max.count")

Because reducers cannot communicate, the limits have to be adjusted in relation to the number of reducers. For the per-host/IP limit this is not necessary, because all URLs of one host/IP are in the same partition and hence handled by the same reducer. But the "global" limit topN must be adapted. If the URLs are evenly distributed over multiple hosts/IPs that's ok, but in your case (I assume all URLs belong to a single host) there will be exactly one non-empty partition, which then contains fewer URLs than topN.

I don't know whether this can be fixed. One would need to adjust the limit if there are fewer partitions than expected. But the logging could certainly be improved.

Cheers,
Sebastian
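For illustration, here is a minimal, self-contained Java sketch of the mechanism described above (class and method names are assumptions made up for the example, not the actual Generator.java code):

public class SelectorSketch {

    // Each reducer enforces its own share of the global topN limit,
    // because reducers cannot communicate with each other.
    static long perReducerLimit(long topN, int numReduceTasks) {
        return topN / numReduceTasks; // integer division: 1000 / 16 = 62
    }

    // All URLs of one host hash into the same partition,
    // i.e. they are handled by a single reducer.
    static int partitionForHost(String host, int numReduceTasks) {
        return (host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println("per-reducer limit = " + perReducerLimit(1000, 16)); // prints 62
        // With a single host, every URL lands in one partition, so the
        // whole fetch list is capped at 62 instead of 1000.
        System.out.println("partition for example.com = " + partitionForHost("example.com", 16));
    }
}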
Re: Nutch generating fewer URLs for the fetcher to fetch (running in Hadoop mode)
Thanks, Sebastian.

This is solved now. I looked through the code and found that Nutch places a limit on the number of URLs per host, defined by *topN / number of reducer tasks*. Please refer here [0].

So, I was running 16 reduce tasks with topN 1000 and hence got 62 URLs (1000 / 16).

I am interested to know the reason for this. Is it due to politeness?

[0]: https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L141

Regards,
Karanjeet Singh
USC
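For reference, the computation at [0] boils down to something like the following (paraphrased sketch; the property name "generate.topN" and the surrounding code are assumptions, not a verbatim copy of the source):

// in Generator.Selector's setup (sketch, not the actual source):
long limit = job.getLong("generate.topN", Long.MAX_VALUE) / job.getNumReduceTasks();
// integer division: 1000 / 16 = 62, hence the 62 generated URLs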
Re: Nutch generating fewer URLs for the fetcher to fetch (running in Hadoop mode)
Hi,

I didn't see anything wrong. Did you check whether CrawlDb entries are marked as "generated" by "_ngt_="? With generate.update.crawldb=true it may happen that, after having run generate multiple times, only 62 unfetched and not-yet-generated entries remain.

Sebastian
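One way to check for such markers is to dump the CrawlDb and search it for the generate marker, for example (the output directory name is arbitrary; adjust paths to your setup):

nutch readdb crawl/crawldb -dump crawldb-dump
grep '_ngt_' crawldb-dump/part-*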
Nutch generating fewer URLs for the fetcher to fetch (running in Hadoop mode)
Hello,

I am trying to crawl a website using Nutch on a Hadoop cluster. I have modified the crawl script to restrict the sizeFetchList to 1000 (which is the topN value for the nutch generate command).

However, Nutch is only generating 62 URLs, while the unfetched URL count is approximately 5,000. I am using the command below:

nutch generate -D mapreduce.job.reduces=16 -D mapreduce.job.maps=8 -D mapred.child.java.opts=-Xmx8192m -D mapreduce.map.memory.mb=8192 -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 1000 -numFetchers 1 -noFilter

Can anyone please look into this and let me know if I am missing something? Please find the crawl configuration here [0].

[0]: https://github.com/karanjeets/crawl-evaluation/tree/master/nutch/conf

Thanks & Regards,
Karanjeet Singh
USC