Hi Tejas, thank you. So what I understand is: when we initiate a *re-crawl* with depth 1, it will check for all the urls due for fetch in the first loop itself and fetch them. Correct?
What I basically wanted to know is: for the first fresh crawl the depth was 20 and it took 1 day, and now (after a week) I want to re-initiate the crawl (same crawldb and same Solr host, Nutch 1.6). What should the depth be?

Thanks
- David

On Sat, Apr 6, 2013 at 4:53 PM, Tejas Patil <[email protected]> wrote:

> On Sat, Apr 6, 2013 at 3:31 AM, David Philip <[email protected]> wrote:
>
> > Hi Sebastian,
> >
> > Yes, it's taking 2-3 days. OK, I will consider giving an incremental
> > depth and checking the stats at every step. Thanks.
> > Yes, I have given it like this: +^http://([a-z0-9]*\.)*spicemobiles.co.in/
> > and have removed +.
> >
> > What should the depth be for the next re-crawl case? I mean this
> > question: say I had a crawldb crawled with a depth param of 5 only and
> > topN 10. Now I find that 3-4 urls were deleted and 4 were modified. I
> > don't know which those urls are, so what I am doing is re-initiating the
> > crawl. At this time, what depth param should I give?
>
> Once those urls enter the crawldb, the crawler won't need to reach them
> from their parent page again. The crawler has stored those urls in its
> crawldb / webtable. With each url, a re-crawl interval is maintained
> (which is by default set to 30 days). The crawler won't pick a url for
> crawling if its fetch interval hasn't elapsed since the last time the url
> was fetched. The crawl interval can be configured using the
> db.fetch.interval.default property in nutch-site.xml.
>
> > Thanks - David
> >
> > On Sat, Apr 6, 2013 at 12:54 AM, Sebastian Nagel <
> > [email protected]> wrote:
> >
> > > Hi David,
> > >
> > > > What can be the crawl time for a very big site, given a depth param
> > > > of 50, topN default (not passed) and default fetch interval of 2 mins?
> > > afaik, the default of topN is Long.MAX_VALUE, which is very large.
> > > So, the size of the crawl is mainly limited by the number of links you
> > > get. Anyway, a depth of 50 is a high value, and with a delay of 2 min.
> > > (which is very defensive) your crawl will take a long time.
> > >
> > > Try to start with small values for depth and topN, e.g. 3 and 50.
> > > Then look at your crawlDb statistics (bin/nutch readdb ... -stats)
> > > and check how the numbers of fetched/unfetched/gone/etc. URLs increase
> > > to get a feeling for which values make sense for your crawl.
> > >
> > > > Case: Crawling website spicemobilephones.co.in, and in
> > > > regex-urlfilter.txt added +^ http://(a-z 0-9) spicemobilephones.co.in.
> > > This doesn't look like a valid Java regex.
> > > Did you remove these lines:
> > > # accept anything else
> > > +.
> > >
> > > Sebastian
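For the re-crawl question above, a rough sketch of what the command could look like in Nutch 1.6, re-using the existing crawldb (the urls/ seed directory, crawl/ output directory, topN value and Solr URL are placeholders, not taken from the thread):

    # re-use the existing crawl dir (crawldb, linkdb, segments) and index to the same Solr host;
    # a small depth is often enough for a re-crawl, since known urls are picked
    # straight from the crawldb once their fetch interval has elapsed
    bin/nutch crawl urls -dir crawl -depth 1 -topN 1000 \
        -solr http://localhost:8983/solr/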
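The db.fetch.interval.default property Tejas mentions goes into conf/nutch-site.xml; the value is in seconds, and the stock default of 2592000 seconds is the 30 days he refers to. A minimal entry, assuming (for illustration) a one-week interval is wanted:

    <!-- conf/nutch-site.xml -->
    <property>
      <name>db.fetch.interval.default</name>
      <!-- re-fetch interval in seconds; 604800 = 7 days (default 2592000 = 30 days) -->
      <value>604800</value>
    </property>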
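For the URL filter Sebastian comments on, the rule David quotes belongs in conf/regex-urlfilter.txt, with the default "accept anything else" rule removed so the crawl stays on the one host. A sketch, using the host as given in the thread:

    # conf/regex-urlfilter.txt
    # accept only the target host (first matching rule wins)
    +^http://([a-z0-9]*\.)*spicemobiles.co.in/

    # explicit reject-all instead of the default "+." accept-all rule
    -.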
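Sebastian's suggestion to check the crawldb statistics after each round can be run as below (the crawl/crawldb path assumes the default -dir crawl layout):

    # per-status url counts (db_fetched, db_unfetched, db_gone, ...)
    bin/nutch readdb crawl/crawldb -stats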

