Hi Semyon,

Maybe I'm missing the point, but I don't see why you would want to do this.
On one hand, if there is only 1 URL per cycle, why not fetch it? The cost is 
negligible.
On the other hand, imagine this scenario: You find the first link to some host 
from another host, and you crawl it. But it happens to be some "leaf" document 
that has no links (or maybe only a homepage link), so your delta 
condition is not satisfied. Later you find another link to this host from 
another host, this time to the homepage, where you can find all the "good" 
links, but you will not crawl it, because your delta condition is still not 
satisfied.
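
To make the scenario concrete, here is a minimal Python sketch of the proposed condition (fetched < 150 && delta_fetched > 1). It is an illustration only, not Nutch code: the names should_generate and simulate are hypothetical, and it assumes the first round always runs and that the delta is simply the number of pages fetched in the most recent round, which the proposal does not spell out.

```python
from typing import List

MAX_FETCHED = 150  # threshold from the proposal
MIN_DELTA = 1      # proposal generates only while delta_fetched > 1


def should_generate(fetched: int, delta_fetched: int) -> bool:
    """Proposed per-host condition: fetched < 150 && delta_fetched > 1."""
    return fetched < MAX_FETCHED and delta_fetched > MIN_DELTA


def simulate(pages_per_round: List[int]) -> int:
    """Return the total fetched for one host before the condition stops it.

    pages_per_round lists how many pages each round would fetch if the
    host kept being generated. Assumption: the first round always runs,
    because no previous delta exists yet.
    """
    fetched = 0
    for pages in pages_per_round:
        fetched += pages
        # The delta is the number of pages fetched in the round just done.
        if not should_generate(fetched, pages):
            break
    return fetched


# Leaf document first: one page is fetched, the delta is 1, and the host
# is never generated again, even though later rounds would find more.
print(simulate([1, 40, 60]))     # 1

# Homepage first: the host keeps going until the delta drops to 1.
print(simulate([10, 80, 1, 1]))  # 91
```

Under this reading, a host whose first crawled page is a leaf stops at 1 fetched page, regardless of what a later link to its homepage would have uncovered.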
What am I missing?

        Yossi.

> -----Original Message-----
> From: Semyon Semyonov [mailto:[email protected]]
> Sent: 14 December 2017 15:08
> To: usernutch.apache.org <[email protected]>
> Subject: Usage previous stage HostDb data for generate(fetched deltas)
> 
> Dear all,
> 
> I plan to improve the hostdb functionality to have a DB_FETCHED delta for the
> generate stage.
> 
> Let's say that for each website we have a generate condition of: while number
> of fetched < 150.
> The problem is that for some websites that condition will (almost) never be
> reached, because of their structure.
> 
> For example
> 1) Round1. 1 page
> 2) Round2. 10 pages
> 3) Round3. 80 pages
> 4) Round 4. 1 page
> 5) Round 5. 1 page
> ...etc.
> 
> I would like to add a delta condition for fetched that describes the speed of
> the process. Let's say generate while number of fetched < 150 && delta_fetched > 1.
> In this case the process should stop at round 5 with a total number of
> fetched equal to 92.
> 
> To do this, I plan to modify the updatehostdb function and add a delta variable
> for fetched to hostdatum.
> 
> Do you think it is a good idea to do it this way?
> 
> Semyon.
