Re: limit nutch to all pages within a certain domain

2012-08-12 Thread Sourajit Basak
I proceeded like this:
1. inject the URLs
2. run generate
3. run fetch
4. run parse
5. run generate with topN 1000
... repeat 3, 4 ...
6. run generate with topN 1000
This seems to be fetching the inner pages. However, how is topN determined? If I am crawling inside a domain, there will be
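The cycle described above can be sketched as a small shell script. This is a dry-run sketch (commands are echoed rather than executed); the `bin/nutch` path, the `crawl/` directory layout, and the segment-selection placeholder are assumptions to adapt to your installation. Note that an `updatedb` step, not listed in the message, is also needed so that newly parsed outlinks reach the crawldb before the next generate.

```shell
#!/bin/sh
# Dry-run sketch of the manual Nutch 1.x cycle described above.
# The "echo" prefix prints each command instead of running it.
NUTCH="echo bin/nutch"

CRAWLDB=crawl/crawldb     # assumed crawldb location
SEGMENTS=crawl/segments   # assumed segments directory
TOPN=1000
ROUNDS=3

# 1. inject the seed URLs once
$NUTCH inject $CRAWLDB urls

# 2-6. loop: generate -> fetch -> parse -> updatedb
round=1
while [ $round -le $ROUNDS ]; do
  $NUTCH generate $CRAWLDB $SEGMENTS -topN $TOPN
  # In a real run, pick up the segment that generate just created:
  SEGMENT=$SEGMENTS/latest   # placeholder path, not a real Nutch convention
  $NUTCH fetch $SEGMENT
  $NUTCH parse $SEGMENT
  $NUTCH updatedb $CRAWLDB $SEGMENT
  round=$((round + 1))
done
```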

Re: limit nutch to all pages within a certain domain

2012-08-12 Thread Sebastian Nagel
However, how is topN determined? It's just the top N unfetched pages, sorted by decreasing score. Pages will be re-fetched only after a longer period of time, 30 days by default; see the property db.fetch.interval.default. If I am crawling inside a domain, there will be links from almost every
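The property Sebastian mentions can be overridden in conf/nutch-site.xml. A minimal sketch; the shipped default is 2592000 seconds (30 days), and the 7-day value below is only an example:

```xml
<!-- conf/nutch-site.xml -->
<property>
  <name>db.fetch.interval.default</name>
  <!-- in seconds; the shipped default is 2592000 (30 days) -->
  <value>604800</value> <!-- example: re-fetch after 7 days -->
</property>
```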

Re: limit nutch to all pages within a certain domain

2012-08-12 Thread Sourajit Basak
Do I need to carry out this iteration several times to crawl all the domains satisfactorily? These domains may not have links among themselves. This is just to group related websites together. So, if I assume each domain has, on average, at most 100 links per page, and I have 5 domains, I need to set
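Under the assumptions stated in this message (5 domains, at most roughly 100 outlinks per page), a back-of-envelope bound on frontier growth can be worked out; this is illustrative arithmetic only, using the numbers from the message:

```shell
#!/bin/sh
# Illustrative arithmetic only; figures come from the message above.
DOMAINS=5
MAX_LINKS_PER_PAGE=100
# Upper bound on new URLs discovered when one page is fetched per domain:
GROWTH=$((DOMAINS * MAX_LINKS_PER_PAGE))
echo "up to $GROWTH new URLs per round of one page per domain"
```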

Re: limit nutch to all pages within a certain domain

2012-08-12 Thread Sebastian Nagel
On 08/12/2012 07:14 PM, Sourajit Basak wrote: Do I need to carry out this iteration several times to crawl all the domains satisfactorily? Yes, you have to loop over generate-fetch-update cycles. In trunk there is a script, src/bin/crawl, which does this. These domains may not have links among
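The trunk script Sebastian mentions wraps the whole loop in one command. A hedged sketch of invoking it follows; the argument list shown (seed dir, crawl dir, Solr URL, number of rounds) is an assumption that varies between versions, so check the usage header at the top of the script before relying on it:

```shell
#!/bin/sh
# Dry-run: echo the invocation instead of running it.
SEEDDIR=urls                       # directory of seed URL lists (assumption)
CRAWLDIR=crawl                     # output dir for crawldb/segments (assumption)
SOLR=http://localhost:8983/solr/   # indexing endpoint (assumption)
ROUNDS=5                           # number of generate-fetch-parse-update rounds
echo bin/crawl "$SEEDDIR" "$CRAWLDIR" "$SOLR" "$ROUNDS"
```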

Re: limit nutch to all pages within a certain domain

2012-08-12 Thread Sourajit Basak
I think you mean generate-fetch-*parse*-update cycles. Per my understanding, the 'parse' phase extracts the outlinks at each step of the iteration. I will try increasing the topN value at each step of the iteration. However, let's say the domains being crawled are updated
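The "increase topN at each step" idea from this message can be sketched as follows. This is a dry-run (commands are echoed, not executed), the crawl paths are assumptions, and the doubling policy is just one illustrative choice of growth schedule:

```shell
#!/bin/sh
# Grow the generate budget each round; commands are echoed, not run.
NUTCH="echo bin/nutch"
CRAWLDB=crawl/crawldb     # assumed path
SEGMENTS=crawl/segments   # assumed path

TOPN=1000
round=1
while [ $round -le 3 ]; do
  $NUTCH generate $CRAWLDB $SEGMENTS -topN $TOPN
  # ... fetch, parse, and updatedb on the generated segment ...
  TOPN=$((TOPN * 2))      # illustrative policy: double the budget each round
  round=$((round + 1))
done
```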