Re: crawl time for depth param 50 and topN not passed

Tejas Patil Sat, 06 Apr 2013 04:23:53 -0700

On Sat, Apr 6, 2013 at 3:31 AM, David Philip <[email protected]> wrote:


> Hi Sebastian,
>
> Yes, it's taking 2-3 days. OK, I will consider giving an incremental depth
> and checking the stats at every step. Thanks.
> Yes, I have given it like this: +^http://([a-z0-9]*\.)*spicemobiles.co.in/ and
> have removed the "+." line.
>
> What should the depth be for the next recrawl? What I mean is: say
> I had a crawldb crawled with depth param 5 only and topN 10. Now I find that
> 3-4 urls were deleted and 4 were modified. I don’t know which urls those
> are, so I am re-initiating the crawl. What depth param should I
> give this time?
>
Once those urls enter the crawldb, the crawler won't need to reach them from
their parent page again. The crawler has stored those urls in its crawldb /
webtable. With each url, a re-crawl interval is maintained (which is by
default set to 30 days). The crawler won't pick a url for crawling if its
fetch interval hasn't elapsed since the last time the url was fetched. The
crawl interval can be configured using the db.fetch.interval.default
property in nutch-site.xml.
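
As a rough sketch of that rule (this is not Nutch's actual scheduling code; the class and method names below are made up for illustration), a url becomes eligible again only once the interval has fully elapsed since its last fetch. The constant assumes the stock default of db.fetch.interval.default, which is given in seconds (2592000 = 30 days):

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative only: a hypothetical helper, not part of Nutch. */
final class RecrawlCheck {
    // Assumed to mirror the 30-day default of db.fetch.interval.default (seconds).
    static final Duration DEFAULT_INTERVAL = Duration.ofSeconds(2_592_000L);

    /** Returns true once the interval has fully elapsed since the last fetch. */
    static boolean isDue(Instant lastFetch, Duration interval, Instant now) {
        return !now.isBefore(lastFetch.plus(interval));
    }

    public static void main(String[] args) {
        Instant lastFetch = Instant.parse("2013-03-01T00:00:00Z");
        Instant tooEarly = Instant.parse("2013-03-15T00:00:00Z");
        Instant lateEnough = Instant.parse("2013-04-06T00:00:00Z");
        System.out.println(isDue(lastFetch, DEFAULT_INTERVAL, tooEarly));   // false
        System.out.println(isDue(lastFetch, DEFAULT_INTERVAL, lateEnough)); // true
    }
}
```

So a re-crawl started before the interval is up would simply skip urls that were fetched recently, whatever depth is passed.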

>
> Thanks - David
>
>
>
> On Sat, Apr 6, 2013 at 12:54 AM, Sebastian Nagel <[email protected]> wrote:
>
> > Hi David,
> >
> > > What would the crawl time be for a very big site, given a depth param
> > > of 50, the default topN (not passed) and a default fetch interval of
> > > 2 mins?
> > afaik, the default of topN is Long.MAX_VALUE which is very large.
> > So, the size of the crawl is mainly limited by the number of links you get.
> > Anyway, a depth of 50 is a high value, and with a delay of 2 min. (which is
> > very defensive) your crawl will take a long time.
> >
> > Try to start with small values for depth and topN, e.g. 3 and 50.
> > Then look at your crawlDb statistics (bin/nutch readdb ... -stats)
> > and check how the numbers of fetched/unfetched/gone/etc. URLs increase
> > to get a feeling for which values make sense for your crawl.
> >
> > > Case: Crawling website spicemobilephones.co.in, and in the
> > > regexurlfilter.txt – added +^ http://(a-z 0-9)spicemobilephones.co.in.
> > This doesn't look like a valid Java regex.
> > Did you remove these lines:
> >   # accept anything else
> >   +.
> >
> > Sebastian
> >
>
