On Sun, Apr 7, 2013 at 10:43 PM, David Philip <[email protected]> wrote:

> Hi Tejas,
>
>    Thank you. So what I understand is: when we initiate a *re-crawl* with
> depth 1, it will check for all the URLs that are due for fetch in the first
> loop itself and fetch all of them. Correct?
>
Yes.
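
For reference, a single "loop" here is one generate/fetch/parse/updatedb
cycle. Assuming the crawl data lives under ./crawl (adjust the paths to your
setup), one such cycle would look roughly like:

  bin/nutch generate crawl/crawldb crawl/segments
  s=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $s
  bin/nutch parse $s
  bin/nutch updatedb crawl/crawldb $s

The generate step only selects URLs that are currently due for fetching, so
with depth 1 this single cycle picks up everything that is due.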

>
> What I basically wanted to know is: the first fresh crawl was done with a
> depth of 20 and it took 1 day, and now (after a week) I want to re-initiate
> the crawl (same crawldb and same Solr host, Nutch 1.6). What should the
> depth be?
>
All the URLs in the crawldb (irrespective of the depth at which they were
discovered by the crawler) are scanned and, based on various factors,
considered for fetching. One of those factors is whether the re-fetch time
has been reached. In your case, if you had NOT changed the default re-fetch
interval setting initially, no re-fetch will happen, because that time (30
days) hasn't elapsed. However, the crawl will continue from the point where
it left off and will consider all the un-fetched URLs in the crawldb for
fetching.
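
If you want to verify whether a particular URL is due again, you can look up
its entry in the crawldb (the crawldb path and URL below are only examples):

  bin/nutch readdb crawl/crawldb -url http://spicemobiles.co.in/

This prints the stored CrawlDatum for that URL, including its status and
scheduled fetch time, so you can see whether it is due for a re-fetch.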

>
> Thanks -David
>
>
>
>
> On Sat, Apr 6, 2013 at 4:53 PM, Tejas Patil <[email protected]> wrote:
>
> > On Sat, Apr 6, 2013 at 3:31 AM, David Philip <[email protected]> wrote:
> >
> > > Hi Sebastian,
> > >
> > >    yes, it's taking 2-3 days. OK, I will consider increasing the depth
> > > incrementally and checking the stats at every step. Thanks.
> > > Yes, I have given it like this: +^http://([a-z0-9]*\.)*spicemobiles.co.in/
> > > and have removed +.
> > >
> > > What should the depth be for the next re-crawl case? I mean this
> > > question: say I had a crawldb crawled with a depth of 5 only and topN 10.
> > > Now I find that 3-4 URLs were deleted and 4 were modified, but I don't
> > > know which URLs those are. So what I am doing is re-initiating the crawl.
> > > What depth should I give at this time?
> > >
> > Once those URLs enter the crawldb, the crawler won't need to reach them
> > from their parent page again: it has stored those URLs in its crawldb /
> > webtable. For each URL, a re-crawl interval is maintained (which is by
> > default set to 30 days). The crawler won't pick a URL for crawling if its
> > fetch interval hasn't elapsed since the last time the URL was fetched. The
> > crawl interval can be configured using the db.fetch.interval.default
> > property in nutch-site.xml.
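> >
> > For example, to have pages re-fetched weekly instead of every 30 days, a
> > minimal nutch-site.xml entry would look something like this (the value is
> > in seconds; 604800 = 7 days is only an illustration):
> >
> >   <property>
> >     <name>db.fetch.interval.default</name>
> >     <value>604800</value>
> >   </property>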
> >
> > >
> > > Thanks - David
> > >
> > >
> > >
> > > On Sat, Apr 6, 2013 at 12:54 AM, Sebastian Nagel <[email protected]> wrote:
> > >
> > > > Hi David,
> > > >
> > > > > What can the crawl time be for a very big site, given a depth param of
> > > > > 50, topN default (not passed) and a default fetch interval of 2 mins?
> > > > afaik, the default of topN is Long.MAX_VALUE, which is very large.
> > > > So, the size of the crawl is mainly limited by the number of links you
> > > > get. Anyway, a depth of 50 is a high value, and with a delay of 2 min.
> > > > (which is very defensive) your crawl will take a long time.
> > > >
> > > > Try to start with small values for depth and topN, e.g. 3 and 50.
> > > > Then look at your crawlDb statistics (bin/nutch readdb ... -stats)
> > > > and check how the numbers of fetched/unfetched/gone/etc. URLs increase
> > > > to get a feeling for which values make sense for your crawl.
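> > > >
> > > > For example (the directory names and Solr URL are only illustrations),
> > > > a small first round followed by a stats check could look like:
> > > >
> > > >   bin/nutch crawl urls -dir crawl -depth 3 -topN 50 -solr http://localhost:8983/solr/
> > > >   bin/nutch readdb crawl/crawldb -stats
> > > >
> > > > The -stats output lists db_fetched, db_unfetched, db_gone etc., which
> > > > you can compare between rounds.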
> > > >
> > > > > Case: Crawling website spicemobilephones.co.in, and in the
> > > > > regex-urlfilter.txt added: +^ http://(a-z 0-9) spicemobilephones.co.in.
> > > > This doesn't look like a valid Java regex.
> > > > Did you remove these lines:
> > > >   # accept anything else
> > > >   +.
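> > > >
> > > > A valid whitelist rule for that host would look something like the
> > > > following (escaped dot, proper character class), with the catch-all
> > > > "+." line removed:
> > > >
> > > >   +^http://([a-z0-9]*\.)*spicemobilephones.co.in/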
> > > >
> > > > Sebastian
> > > >
> > >
> >
>
