Hi Markus,
What is the default value of topN when it is not passed on the command
line? I mean passing only the depth param and no topN, e.g.
(bin/nutch crawl urls -dir crawl -depth 3).
Also, if depth is the number of crawl cycles, can you please explain the
logic by which all 5 URLs get crawled when the depth param passed is 3
(-depth 3)?
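To make the question concrete, here is a rough sketch of how I currently picture the cycles. It is a toy model in Python, not Nutch code. The link graph is my assumption (I am guessing the home page links to the other four pages), and top_n=None stands for "no limit", which is only my guess at what happens when -topN is omitted. It ignores scoring, fetch intervals and filters.

```python
def crawl(crawldb, links, depth, top_n=None):
    """Run up to `depth` generate/fetch/update cycles over a toy crawldb.

    crawldb maps url -> "unfetched" or "fetched"; links maps url -> outlinks.
    Returns the number of URLs fetched in each cycle.
    """
    fetched_per_cycle = []
    for _ in range(depth):
        # generate: select up to top_n unfetched URLs (None = no limit)
        due = [url for url, status in crawldb.items() if status == "unfetched"]
        batch = due if top_n is None else due[:top_n]
        if not batch:
            break  # nothing left to fetch, stop early
        # fetch + update: mark as fetched and add newly discovered outlinks
        for url in batch:
            crawldb[url] = "fetched"
            for out in links.get(url, []):
                crawldb.setdefault(out, "unfetched")
        fetched_per_cycle.append(len(batch))
    return fetched_per_cycle


# Hypothetical link graph: the home page links to the four other pages.
HOME = "http://liveforyou.blogspot.in/"
LINKS = {
    HOME: [
        HOME + "2012/12/blogging.html",
        HOME + "2011/09/test.html",
        HOME + "2012_12_01_archive.html",
        HOME + "2011_09_01_archive.html",
    ]
}

# One run with depth 3 and no topN.
print(crawl({HOME: "unfetched"}, LINKS, depth=3))               # [1, 4]

# Two runs with depth 1, topN 5, against the same crawldb.
db = {HOME: "unfetched"}
print([crawl(db, LINKS, depth=1, top_n=5) for _ in range(2)])   # [[1], [4]]

# Three runs with depth 1, topN 3, against the same crawldb.
db = {HOME: "unfetched"}
print([crawl(db, LINKS, depth=1, top_n=3) for _ in range(3)])   # [[1], [3], [1]]
```

If this picture is right, it would match what I saw in all three cases: only the seed URL is known in the first cycle, and the other pages become fetchable only after the crawldb has been updated with its outlinks. Please correct me if the real behaviour differs.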
Thanks
David.
On Fri, Dec 21, 2012 at 6:25 PM, Markus Jelsma
<[email protected]> wrote:
> Hi - Depth means how many crawl cycles are executed and topN means how
> many URLs per cycle are selected.
>
> -----Original message-----
> > From: David Philip <[email protected]>
> > Sent: Fri 21-Dec-2012 13:50
> > To: [email protected]
> > Subject: Difference in params - depth and topN
> >
> > Hello All,
> >
> > There is a site that has a total of 5 URLs.
> >
> >
> > - When this site is crawled with input param depth 3, all 5 URLs
> > are crawled in one shot.
> >
> > - And when it is crawled with params depth 1 topN 5 TWO times, the
> > first time only one URL is crawled and the second time the remaining 4
> > are crawled.
> >
> > - And with params depth 1 topN 3, after 3 runs it had crawled all 5
> > URLs.
> >
> > I didn't understand what these 2 parameters mean. Can anyone explain or
> > point me to a URL that explains this? Below are the list of URLs and the
> > readdb stats.
> >
> > *stats:*
> > Statistics for CrawlDb: crawl/crawldb
> > TOTAL urls: 5
> > status 2 (db_fetched): 5
> > CrawlDb statistics: done
> >
> > *URLS : *
> > http://liveforyou.blogspot.in/
> > http://liveforyou.blogspot.in/2012/12/blogging.html
> > http://liveforyou.blogspot.in/2011/09/test.html
> > http://liveforyou.blogspot.in/2012_12_01_archive.html
> > http://liveforyou.blogspot.in/2011_09_01_archive.html
> >
>