Thanks for the suggestion.
Could you explain how I can use it in the crawling process?

Should I call generate with a specific parameter? It is not really clear from 
the issue.

I use Nutch 1.13.

Sent: Monday, October 23, 2017 at 3:57 PM
From: "Markus Jelsma" <[email protected]>
To: "[email protected]" <[email protected]>
Subject: RE: Ways of limit pages per host. generate.max.count, hostdb, 
scoring-depth
How about NUTCH-2368's variable generate.max.count based on HostDB data?

Regards,
Markus

[1] https://issues.apache.org/jira/browse/NUTCH-2368

-----Original message-----
> From:Semyon Semyonov <[email protected]>
> Sent: Monday 23rd October 2017 15:51
> To: [email protected]
> Subject: Ways of limit pages per host. generate.max.count, hostdb, 
> scoring-depth
>
> Hi,
>
> I'm looking for the best way to restrict the number of pages crawled per
> host. I have a list of hosts to crawl, let's say M hosts, and I would like
> to limit the crawl of each host to MaxPages pages.
> External links are turned off for the crawling process.
>
> My own proposal can be found under 3)
>  
> 1) Using
> https://www.mail-archive.com/[email protected]/msg10245.html
> We know the size of the cluster (number of nodes) and we know the size of
> the list (M).
> If we divide M by (number of nodes in the cluster * number of fetches per
> node), we get the total number of rounds needed for the first-level
> crawl (K).
> Then we multiply this by the number of levels we want per website
> (N = 2, 3, 4, ...), depending on how deep we want to go into each specific
> website.
> Let's say crawling the whole list takes K = 500 rounds and we want to crawl
> each website up to the 4th level (N = 4); the total number of rounds is
> then K * N = 2000.
> Combining this with generate.max.count = MaxPages, we get at most
> MaxPages * N pages per host.
> Problem: the process has to run smoothly enough to guarantee that the full
> list is crawled within the K rounds; problems with the crawling process
> and/or the Hadoop cluster break this estimate.
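>
> To make the arithmetic concrete, here is a minimal sketch (all cluster
> figures are made-up placeholders, not measurements):
>
>     // Hypothetical round estimate for approach 1; numbers are placeholders.
>     public class RoundEstimate {
>         public static void main(String[] args) {
>             int m = 1_000_000;          // M: hosts in the seed list
>             int nodes = 20;             // nodes in the Hadoop cluster
>             int fetchesPerNode = 100;   // hosts fetched per node per round
>             int k = m / (nodes * fetchesPerNode);  // K = 500 rounds per level
>             int n = 4;                  // N: desired crawl depth per website
>             System.out.println("Total rounds: " + k * n);  // K * N = 2000
>             // With generate.max.count = MaxPages, the per-host upper bound
>             // is MaxPages * N pages.
>         }
>     }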
>  
> 2) The second approach is to use the HostDB:
> https://www.mail-archive.com/[email protected]/msg14330.html
> Problem: this requires additional computation to build the HostDB, plus a
> workaround with the blacklist.
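>
> (For reference, the HostDB is built from the CrawlDB with the updatehostdb
> job; the exact flags below are my assumption, so please check the
> bin/nutch usage output on 1.13:)
>
>     # Assumed invocation, not verified against 1.13:
>     bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb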
>  
> 3) My own solution; it is a bit tricky.
> It uses the scoring-depth plugin extension and the generate.min.score
> config.
>  
> That plugin sets the weight of each linked page to ParentWeight / (number
> of linked pages). The initial weight equals 1 by default.
>  
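> As an illustration of that rule (this is not the plugin's source code,
> just the arithmetic described above):
>
>     // Illustration only: the weight rule as described above, where each
>     // outlink receives an equal share of its parent's weight.
>     public class DepthWeight {
>         static double childWeight(double parentWeight, int outlinks) {
>             return parentWeight / outlinks;
>         }
>         public static void main(String[] args) {
>             double root = 1.0;                      // initial weight
>             double level1 = childWeight(root, 2);   // 0.5
>             double level2 = childWeight(level1, 2); // 0.25, the 1/4 below
>             System.out.println(level2);
>         }
>     }
>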
> My idea is that we can estimate the maximum number of pages for a host.
> To illustrate, there are several ways to end up with 1/4 weights for a
> host (5 pages, 5 pages and 7 pages):
>  
>         1
>    /   / \     \
>   /   /   \     \ 
>  /   /     \     \
> 1/4   1/4     1/4  1/4
>         1
>        / \
>       /   \
>      /     \
>     1/2     1/2
>             / \
>           1/4 1/4
>     
>         1
>        / \
>       /   \
>      /     \
>     1/2     1/2
>    / \     / \
>   1/4 1/4 1/4 1/4
>
> The last tree gives the maximum number of pages for a leaf weight of 1/4
> (3 levels, the weights on each level summing to 1). Total sum = 7.
> The idea behind it is that the deepest tree yields the maximum number of
> links, and the deepest tree is obtained from the prime factorization of
> the final weight's denominator.
>  
> For example, for 1/4 we take the prime factors of 4 = 2 * 2; the total
> number of pages equals 1 + 1*2 + 1*2*2 = 7.
> For a weight of 1/9 (9 = 3 * 3): 1 + 1*3 + 1*3*3 = 13.
> For a weight of 1/48 (48 = 2*2*2*2*3):
> 1 + 1*2 + 1*2*2 + 1*2*2*2 + 1*2*2*2*2 + 1*2*2*2*2*3 = 79.
>
> The calculator:
> http://www.calculator.net/factoring-calculator.html?cvar=18&x=77&y=22
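>
> A small sketch of this estimate (it assumes the scoring-depth rule above
> is the only thing influencing the score, and it keeps the factors in the
> smallest-first order used in the examples; a different ordering changes
> the sizes of the inner levels):
>
>     import java.util.ArrayList;
>     import java.util.List;
>
>     // Estimates the page count of the deepest tree whose leaves have
>     // weight 1/denominator, following the calculation above.
>     public class MaxPagesEstimate {
>         // Prime factors in ascending order, e.g. 48 -> [2, 2, 2, 2, 3].
>         static List<Integer> primeFactors(int n) {
>             List<Integer> factors = new ArrayList<>();
>             for (int f = 2; (long) f * f <= n; f++) {
>                 while (n % f == 0) { factors.add(f); n /= f; }
>             }
>             if (n > 1) factors.add(n);
>             return factors;
>         }
>
>         // 1 + f1 + f1*f2 + ... : one term per level of the tree.
>         static long maxPages(int denominator) {
>             long pages = 1, levelSize = 1;
>             for (int f : primeFactors(denominator)) {
>                 levelSize *= f;
>                 pages += levelSize;
>             }
>             return pages;
>         }
>
>         public static void main(String[] args) {
>             System.out.println(maxPages(4));   // 7
>             System.out.println(maxPages(9));   // 13
>             System.out.println(maxPages(48));  // 79
>         }
>     }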
>  
> Problem: the score can be affected by other plugins.
>  
> Thanks.
>
> Semyon.
>
