I have created an issue for this functionality:
https://issues.apache.org/jira/browse/NUTCH-2481
 
 

Sent: Thursday, December 14, 2017 at 2:07 PM
From: "Semyon Semyonov" <[email protected]>
To: "usernutch.apache.org" <[email protected]>
Subject: Usage previous stage HostDb data for generate(fetched deltas)
Dear all,

I plan to extend the HostDb functionality so that a DB_FETCHED delta is 
available to the generate stage.

Let's say that for each website we keep generating while the number of 
fetched pages is < 150.
The problem is that for some websites this limit will (almost) never be 
reached, because of the site's structure.

For example:
1) Round 1: 1 page
2) Round 2: 10 pages
3) Round 3: 80 pages
4) Round 4: 1 page
5) Round 5: 1 page
...etc.

I would like to add a delta condition on the fetched count that describes the 
speed of the process. Let's say: generate while number of fetched < 150 && 
delta_fetched > 1.
In this case the process should stop at round 5, with the total number of 
fetched pages equal to 92 (1 + 10 + 80 + 1).
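
To make the example concrete, here is a small Python sketch of the proposed 
stop condition. It is only a simulation, not Nutch code: the names 
should_generate and simulate are hypothetical, and it assumes that the delta 
is unknown after the very first round (there is no earlier HostDb value to 
compare against), so the delta check only applies from round 2 onwards. 
Without that assumption, the 1-page first round would stop the process 
immediately.

```python
from typing import List, Optional, Tuple


def should_generate(total_fetched: int,
                    delta_fetched: Optional[int],
                    max_fetched: int = 150,
                    min_delta: int = 1) -> bool:
    """Hypothetical condition: fetched < max_fetched && delta_fetched > min_delta."""
    if total_fetched >= max_fetched:
        return False
    # Assumption: no delta is known yet after the first round, so keep going.
    if delta_fetched is None:
        return True
    return delta_fetched > min_delta


def simulate(pages_per_round: List[int]) -> Tuple[int, int]:
    """Return (rounds actually run, total fetched) for one host."""
    total = 0
    rounds_run = 0
    for pages in pages_per_round:
        total += pages
        rounds_run += 1
        # The delta is the number of pages fetched in this round; it is
        # treated as unknown for the first round (see assumption above).
        delta = pages if rounds_run > 1 else None
        if not should_generate(total, delta):
            break
    return rounds_run, total


rounds_run, total = simulate([1, 10, 80, 1, 1])
print(rounds_run, total)
```

With the rounds listed above, four rounds are run and round 5 is never 
generated, which gives the total of 92.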

To do this, I plan to modify the updatehostdb function and add a delta 
variable for fetched to hostdatum.
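
As a rough illustration of that idea (in Python rather than Nutch's Java, and 
not the real HostDatum class), the record could remember the previous fetched 
count and derive the delta on each update. HostRecord, its fields and its 
update method are hypothetical names, and leaving the delta unset on the first 
update is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostRecord:
    """Hypothetical stand-in for a per-host record with a fetched delta."""
    fetched: int = 0
    delta_fetched: Optional[int] = None  # unknown until a second update
    updates: int = 0

    def update(self, new_fetched: int) -> None:
        """Store the new fetched total and the change since the last update."""
        if self.updates > 0:
            self.delta_fetched = new_fetched - self.fetched
        self.fetched = new_fetched
        self.updates += 1


record = HostRecord()
for total_after_round in [1, 11, 91, 92]:  # cumulative totals for rounds 1-4
    record.update(total_after_round)
print(record.fetched, record.delta_fetched)
```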

Do you think this is a good way to do it?

Semyon.
