I proceeded like this:

1. inject the URLs
2. run generate
3. run fetch
4. run parse
5. run generate with topN 1000
   ... repeat steps 3 and 4 ...
6. run generate with topN 1000
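In Nutch 1.x command-line terms the sequence is roughly the following; the paths crawl/crawldb, crawl/segments, and urls/ are just examples from my setup:

    bin/nutch inject crawl/crawldb urls
    bin/nutch generate crawl/crawldb crawl/segments
    s1=`ls -d crawl/segments/2* | tail -1`   # pick the newest segment
    bin/nutch fetch $s1
    bin/nutch parse $s1
    # an updatedb is needed before the next generate can see new outlinks,
    # see the discussion below:
    # bin/nutch updatedb crawl/crawldb $s1
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000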
This seems to be fetching the inner pages. However, how is topN determined? If I am crawling inside a domain, there will be links from almost every ...

> However, how is topN determined?
It's just the top N unfetched pages, sorted by decreasing score. Pages will be re-fetched only after some larger amount of time, 30 days by default; see the property db.fetch.interval.default.
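To change that re-fetch interval, override the property in conf/nutch-site.xml; the value is in seconds (2592000 = 30 days):

    <property>
      <name>db.fetch.interval.default</name>
      <value>2592000</value>
    </property>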
Do I need to carry out this iteration several times to crawl all the domains satisfactorily?
These domains may not have links among themselves; this is just to group related websites together. So, if I assume each domain has on average (at most) 100 links per page, and I have 5 domains, I need to set ...
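To make that concrete, here is the rough arithmetic under those same assumptions (5 seed domains, at most 100 outlinks per page); the real numbers will be lower because duplicate and already-fetched links are filtered out:

    round 1:  5 seed pages fetched
    round 2:  up to 5 * 100   =    500 candidates  -> needs topN >= 500
    round 3:  up to 500 * 100 = 50,000 candidates  -> a fixed topN of 1000
              becomes the limiting factor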
On 08/12/2012 07:14 PM, Sourajit Basak wrote:
> Do I need to carry out this iteration several times to crawl all the
> domains satisfactorily?
Yes, you have to loop over generate-fetch-update cycles. In trunk there is a script, src/bin/crawl, which does this.
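The loop is also simple to script by hand; a minimal sketch (the paths and the number of rounds are illustrative, not what src/bin/crawl actually does):

    for i in 1 2 3 4 5; do
      bin/nutch generate crawl/crawldb crawl/segments -topN 1000
      segment=`ls -d crawl/segments/2* | tail -1`
      bin/nutch fetch $segment
      bin/nutch parse $segment      # not needed if fetcher.parse is true
      bin/nutch updatedb crawl/crawldb $segment
    done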
> These domains may not have links among ...
I think you mean generate-fetch-*parse*-update cycles. Per my understanding, the 'parse' phase is what extracts the outlinks at each step of the iteration.
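One way to see how many new URLs each iteration adds (assuming the crawldb path used above) is the crawldb statistics dump:

    bin/nutch readdb crawl/crawldb -stats

which reports, among other counts, how many URLs are still unfetched and therefore eligible for the next generate.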
I will try increasing the topN value at each step of the iteration. However ...

Let's say the domains being crawled are updated ...