How about NUTCH-2368's variable generate.max.count based on HostDB data [1]?

Regards,
Markus
[1] https://issues.apache.org/jira/browse/NUTCH-2368

-----Original message-----
> From: Semyon Semyonov <[email protected]>
> Sent: Monday 23rd October 2017 15:51
> To: [email protected]
> Subject: Ways of limit pages per host. generate.max.count, hostdb,
> scoring-depth
>
> Hi,
>
> I'm looking for the best way to restrict the number of pages crawled per
> host. I have a list of hosts to crawl, say M hosts, and I would like to
> cap crawling on each host at MaxPages pages.
> External links are turned off for the crawling process.
>
> My own proposal can be found at 3).
>
> 1) Using https://www.mail-archive.com/[email protected]/msg10245.html
> We know the size of the cluster (number of nodes) and the size of the
> list (M).
> If we divide M by (number of nodes in the cluster * number of fetches per
> node), we get the total number of rounds needed for first-level crawling (K).
> We then multiply this by the number of levels we want per website
> (N = 2, 3, 4, ...), depending on how deep we want to go into a specific
> website.
> Say crawling the whole list takes K = 500 rounds and we want to crawl each
> website down to the 4th level (N = 4); the total number of rounds is then
> K * N = 2000.
> Combined with generate.max.count = MaxPages, this caps each host at
> MaxPages * N pages.
> Problem: the process has to run smoothly enough to guarantee that the full
> list is crawled within K rounds; any trouble in the crawling process and/or
> the Hadoop cluster breaks the estimate.
>
> 2) The second approach is to use the hostdb:
> https://www.mail-archive.com/[email protected]/msg14330.html
> Problem: this requires additional computation for the hostdb plus a
> workaround with a blacklist.
>
> 3) My own solution; it is a bit tricky.
> It uses the scoring-depth plugin extension and the generate.min.score
> setting.
>
> The plugin sets the weight of each linked page to
> ParentWeight / (number of linked pages). The initial weight is 1 by default.
>
> My idea is that we can estimate the maximum number of pages for a host from
> this weight.
> To illustrate, there are several ways to reach pages of weight 1/4 on a host
> (5 pages, 5 pages and 7 pages respectively):
>
>           1
>     /  /    \   \
>  1/4  1/4  1/4  1/4
>
>         1
>       /   \
>    1/2     1/2
>    /  \
>  1/4  1/4
>
>         1
>       /   \
>    1/2     1/2
>    /  \    /  \
>  1/4 1/4 1/4 1/4
>
> The last tree gives the maximum number of pages with weight 1/4 (3 levels,
> each summing to 1). Total sum = 7.
> The idea behind it is that the maximum number of links is obtained with the
> deepest tree, and the deepest tree follows from the prime factorization of
> the final weight's denominator.
>
> For example, for 1/4 we take the prime factors of 4 = 2 * 2, so the total
> number of pages is 1 + 1*2 + 1*2*2 = 7.
> For a weight of 1/9: 1 + 1*3 + 1*3*3 = 13.
> For a weight of 1/48 (48 = 2*2*2*2*3):
> 1 + 1*2 + 1*2*2 + 1*2*2*2 + 1*2*2*2*2 + 1*2*2*2*2*3 = 79.
>
> A factoring calculator:
> http://www.calculator.net/factoring-calculator.html?cvar=18&x=77&y=22
>
> Problem: the score can be affected by other plugins.
>
> Thanks.
>
> Semyon.
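As a rough numeric sketch of the first approach above, in plain Java. Only K = 500 and N = 4 come from the message; the host count, node count, fetches per node and generate.max.count value are made-up figures for illustration:

    public class CrawlRoundEstimate {
        public static void main(String[] args) {
            // Made-up cluster figures; only the depth (N) and the resulting
            // K = 500 match the numbers used in the message.
            long hosts = 1_000_000;      // M: hosts in the seed list
            int nodes = 20;              // nodes in the Hadoop cluster
            int fetchesPerNode = 100;    // hosts fetched per node per round

            long perRound = (long) nodes * fetchesPerNode;
            long k = (hosts + perRound - 1) / perRound;  // ceil(M / perRound) -> 500

            int depth = 4;               // N: levels to crawl per website
            long totalRounds = k * depth;                // K * N -> 2000

            int maxCount = 100;          // generate.max.count = MaxPages (made up)
            long maxPagesPerHost = (long) maxCount * depth;  // MaxPages * N

            System.out.println("K=" + k + " rounds, total=" + totalRounds
                    + " rounds, cap per host=" + maxPagesPerHost + " pages");
        }
    }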

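And a small standalone sketch of the estimate in the third approach: factor the denominator of the target weight (i.e. of the generate.min.score threshold) into primes, build the deepest tree from those factors, and sum the level sizes. Plain Java, nothing Nutch-specific; the class and method names are made up for illustration:

    import java.util.ArrayList;
    import java.util.List;

    public class MaxPagesFromMinScore {

        // Prime factors of n in ascending order, e.g. 48 -> [2, 2, 2, 2, 3].
        static List<Integer> primeFactors(int n) {
            List<Integer> factors = new ArrayList<>();
            for (int p = 2; (long) p * p <= n; p++) {
                while (n % p == 0) {
                    factors.add(p);
                    n /= p;
                }
            }
            if (n > 1) factors.add(n);
            return factors;
        }

        // Maximum number of pages with weight >= 1/denominator when each page
        // splits its weight evenly over its outlinks: use the prime factors as
        // branching factors of the deepest tree and sum the level sizes
        // 1 + f1 + f1*f2 + ...
        static long maxPages(int denominator) {
            long levelSize = 1;   // the root page, weight 1
            long total = 1;
            for (int f : primeFactors(denominator)) {
                levelSize *= f;
                total += levelSize;
            }
            return total;
        }

        public static void main(String[] args) {
            System.out.println(maxPages(4));   // 1 + 2 + 4 = 7
            System.out.println(maxPages(9));   // 1 + 3 + 9 = 13
            System.out.println(maxPages(48));  // 1 + 2 + 4 + 8 + 16 + 48 = 79
        }
    }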
