mapred.map.tasks
<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job. Typically set to a prime several times greater than number of available hosts. Ignored when mapred.job.tracker is local.</description>
</property>

We have a question about this property. Is it really preferable to set this parameter to several times the number of available hosts? We do not understand why that should be so.

Our spider is distributed across 3 machines. What value is best for this parameter in our case? Which other factors may affect the best value of this parameter?
[jira] Commented: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )
[ http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12375300 ]

Christophe Noel commented on NUTCH-173:
---

We are TENS of Nutch users using this precious patch. Most Nutch users are not building a whole-web search engine (too much hardware needed) but want to develop dedicated search engines. We sometimes crawl 1000, sometimes 25000 web servers, and having 25000 entries in prefix-urlfilter really slows down the crawl. This patch is NEEDED!

Christophe Noël
CETIC
Belgium

PerHost Crawling Policy ( crawl.ignore.external.links )
---
Key: NUTCH-173
URL: http://issues.apache.org/jira/browse/NUTCH-173
Project: Nutch
Type: New Feature
Components: fetcher
Versions: 0.7.1, 0.7, 0.8-dev
Reporter: Philippe EUGENE
Priority: Minor
Attachments: patch.txt, patch08.txt

There are two major ways to crawl in Nutch:
  Intranet crawl: forbid everything, allow a few hosts
  Whole-web crawl: allow everything, forbid a few things
I propose a third type of crawl:
  Directory crawl: the purpose of this crawl is to manage a few thousand hosts without managing rule patterns in UrlFilterRegexp.
I made two patches for: 0.7, 0.7.1 and 0.8-dev.
I propose a new boolean property in nutch-site.xml: crawl.ignore.external.links, with a default value of false. By default this new feature does not modify the behavior of the Nutch crawler. When you set this property to true, the crawler does not fetch external links of the host, so the crawl is limited to the hosts that you inject at the beginning of the crawl.
I know there are some proposals for new crawl policies using the CrawlDatum in the 0.8-dev branch. This feature could be an easy way to quickly add a new crawl feature to Nutch while waiting for a better way to improve crawl policy.
I post two patches.
Sorry for my very poor English
--
Philippe

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
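The policy proposed in NUTCH-173 can be illustrated with a minimal Python sketch. This is not the code in patch.txt or patch08.txt: the function name `filter_outlinks` and the example URLs are hypothetical, and the sketch assumes that "external" means the outlink's host differs from the host of the page it was found on.

```python
from urllib.parse import urlparse


def filter_outlinks(page_url, outlinks, ignore_external_links=False):
    """Return the outlinks to keep for a fetched page.

    ignore_external_links stands in for the proposed
    crawl.ignore.external.links property: when False (the default) every
    outlink is kept; when True only outlinks on the same host as page_url
    are kept. (Hypothetical helper, not Nutch's API.)
    """
    if not ignore_external_links:
        return list(outlinks)
    # urlparse(...).hostname is already lower-cased, so the comparison
    # is case-insensitive with respect to the host name.
    page_host = urlparse(page_url).hostname
    return [link for link in outlinks if urlparse(link).hostname == page_host]
```

With the flag set, a page on one injected host only contributes links back to that same host, which is why no per-host entries are needed in the URL filter.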
[jira] Resolved: (NUTCH-250) Generate to log truncation caused by generate.max.per.host
[ http://issues.apache.org/jira/browse/NUTCH-250?page=all ]

Doug Cutting resolved NUTCH-250:

Fix Version: 0.8-dev
Resolution: Fixed
Assign To: Doug Cutting

I just committed this. Thanks, Rod.

Generate to log truncation caused by generate.max.per.host
--
Key: NUTCH-250
URL: http://issues.apache.org/jira/browse/NUTCH-250
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Rod Taylor
Assignee: Doug Cutting
Fix For: 0.8-dev
Attachments: nutch-generate-truncatelog.patch

LOG.info() hosts which have had their generate lists truncated. This can inform admins about potential abusers or excessively large sites that they may wish to block with rules.
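The behavior described in NUTCH-250 can be sketched in Python as a per-host cap that logs every host whose list was cut short. This is an illustration only, not the contents of nutch-generate-truncatelog.patch: the function `generate`, the logger name, and the log message are hypothetical, and treating a negative limit as "no limit" is an assumption.

```python
import logging
from collections import defaultdict
from urllib.parse import urlparse

LOG = logging.getLogger("generate_sketch")  # hypothetical logger name


def generate(urls, max_per_host):
    """Select at most max_per_host URLs per host, in input order.

    Returns (selected, truncated), where truncated maps each host that
    hit the cap to the number of URLs dropped for it. A negative
    max_per_host is treated as "no limit" (an assumption of this sketch).
    """
    kept = defaultdict(int)
    dropped = defaultdict(int)
    selected = []
    for url in urls:
        host = urlparse(url).hostname or ""
        if 0 <= max_per_host <= kept[host]:
            dropped[host] += 1  # over the cap: skip and remember the host
            continue
        kept[host] += 1
        selected.append(url)
    # Report truncated hosts so an admin can decide whether to block them.
    for host, count in dropped.items():
        LOG.info("Host %s truncated: %d URLs over the limit of %d",
                 host, count, max_per_host)
    return selected, dict(dropped)
```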
Re: mapred.map.tasks
Anton Potehin wrote:
> We have a question on this property. Is it really preferred to set this
> parameter several times greater than number of available hosts? We do
> not understand why it should be so?

It should be at least numHosts*mapred.tasktracker.tasks.maximum, so that all of the task slots are used. More tasks make recovery faster when a task fails, since less needs to be redone.

> Our spider is distributed among 3 machines. What value is most preferred
> for this parameter in our case? Which other factors may have effect on
> most preferred value of this parameter?

When fetching, the total number of hosts you're fetching can also be a factor, since fetch tasks are hostwise-disjoint. If you're only fetching a few hosts, then a large value for mapred.map.tasks will cause there to be a few big fetch tasks and a bunch of empty ones. This could be a problem if the big ones are not allocated evenly among your nodes.

I generally use 5*numHosts*mapred.tasktracker.tasks.maximum.

Doug
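The arithmetic in this reply can be written out as a short Python sketch. The function names are hypothetical, `num_nodes` assumes that numHosts in the formulas means the number of machines in the cluster, and the CRC32-based partitioner is only a stand-in for whatever host-based partitioning the fetcher really uses.

```python
import zlib
from urllib.parse import urlparse


def min_map_tasks(num_nodes, tasks_maximum):
    """Lower bound: one map task per task slot in the cluster."""
    return num_nodes * tasks_maximum


def rule_of_thumb_map_tasks(num_nodes, tasks_maximum, factor=5):
    """The 5*numHosts*mapred.tasktracker.tasks.maximum rule from the reply."""
    return factor * num_nodes * tasks_maximum


def non_empty_fetch_tasks(urls, num_tasks):
    """Count tasks that receive at least one URL when URLs are split by host.

    All URLs of one host land in the same task (hostwise-disjoint), so the
    result can never exceed the number of distinct hosts.
    """
    used = {
        zlib.crc32((urlparse(url).hostname or "").encode("utf-8")) % num_tasks
        for url in urls
    }
    return len(used)
```

For the 3-machine cluster in the question, and assuming mapred.tasktracker.tasks.maximum is 2 (a value not stated in the thread), `min_map_tasks(3, 2)` is 6 and `rule_of_thumb_map_tasks(3, 2)` is 30. A crawl that only covers 3 hosts would still fill at most 3 of those 30 fetch tasks, which is the imbalance the reply warns about.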