When a new url is discovered as a child while crawling some url, this new url will be added to the crawldb. From the next round onward it will be considered a candidate for the fetch list along with the old urls.
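In command terms, one round of the crawl cycle looks roughly like the sketch below (paths such as crawl/crawldb, crawl/segments and the urls/ seed directory are just placeholders, and the topN value is only an example). The updatedb step is what merges the newly discovered child urls into the crawldb, so they become candidates for the next generate:

  # seed the crawldb once with the initial urls
  bin/nutch inject crawl/crawldb urls/

  # one round: select urls due for fetching, fetch and parse them,
  # then fold the results (including newly discovered links) back into the crawldb
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=$(ls -d crawl/segments/* | tail -1)
  bin/nutch fetch $segment
  bin/nutch parse $segment
  bin/nutch updatedb crawl/crawldb $segment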
Here is a simple example. Say you have the Nutch home page as one of the initial urls, along with some 100 other urls. In the first depth the crawler fetches all of these urls, the Nutch home page being one of them, and discovers other pages of the Apache Nutch site such as the About and Downloads pages. These child pages are added to the crawldb. After 6 months you want to pick up from where you left off, so you run a crawl again with the same crawldb. Both the old urls and the child pages will be eligible for fetching: the old pages get re-fetched while the child pages get fetched for the first time.

hth

On Mon, Apr 8, 2013 at 11:06 PM, David Philip <[email protected]> wrote:

> Thanks Tejas. In that case, if a new url was added to one of the urls
> that are already in the crawldb, will it be crawled/fetched during the
> recrawl process? Ex: there were 10 urls in the crawldb, and a new child
> url was added under the 4th url after the first crawl. So I re-initiate
> a crawl (depth 2) to get this new url added to the crawldb and fetched.
> How will this case work? Will it add this new url to the crawldb and
> fetch it?
>
> Thanks - David
>
> On Mon, Apr 8, 2013 at 11:41 AM, Tejas Patil <[email protected]> wrote:
>
> > On Sun, Apr 7, 2013 at 10:43 PM, David Philip <[email protected]> wrote:
> >
> > > Hi Tejas,
> > >
> > > Thank you. So what I understand is, when we initiate a *re-crawl* with
> > > depth 1, it will check for all the urls due for fetch in the first
> > > loop itself and fetch all the urls due for fetch. Correct?
> >
> > Yes.
> >
> > > What I basically wanted to know is: for the first fresh crawl the
> > > depth was 20 and it took 1 day, and now (after a week) I want to
> > > re-initiate the crawl (same crawldb and same Solr host, Nutch 1.6).
> > > What should the depth be?
> >
> > All the urls in the crawldb (irrespective of the depth at which those
> > urls were discovered by the crawler) are scanned and, based on various
> > factors, are considered for fetching. One of the factors is whether the
> > re-fetch time has been reached. In your case, if you had NOT changed
> > the default re-fetch interval setting initially, no re-fetch will
> > happen as the time (30 days) hasn't elapsed. However, the crawl will
> > continue from the point where it left off and will consider all the
> > un-fetched urls in the crawldb for fetching.
> >
> > > Thanks - David
> > >
> > > On Sat, Apr 6, 2013 at 4:53 PM, Tejas Patil <[email protected]> wrote:
> > >
> > > > On Sat, Apr 6, 2013 at 3:31 AM, David Philip <[email protected]> wrote:
> > > >
> > > > > Hi Sebastian,
> > > > >
> > > > > Yes, it's taking 2-3 days. OK, I will consider giving the depth
> > > > > incrementally and checking the stats at every step. Thanks.
> > > > > Yes, I have given it like this: +^http://([a-z0-9]*\.)*spicemobiles.co.in/
> > > > > and have removed +.
> > > > >
> > > > > What should the depth be for the next recrawl case? I mean this
> > > > > question: say I had a crawldb crawled with a depth param of only 5
> > > > > and topN 10. Now I find that 3-4 urls were deleted and 4 were
> > > > > modified, and I don't know which urls those are. So what I am
> > > > > doing is re-initiating the crawl. What depth param should I give
> > > > > at this point?
> > > >
> > > > Once those urls enter the crawldb, the crawler won't need to reach
> > > > them from their parent page again. The crawler has stored those urls
> > > > in its crawldb / webtable.
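If you want to see what the crawler has stored, you can inspect the crawldb directly. A quick sketch (the crawl/crawldb path and the url are just examples): -stats prints the overall counts per status, and -url prints the stored entry for a single url, including its fetch time and fetch interval.

  bin/nutch readdb crawl/crawldb -stats
  bin/nutch readdb crawl/crawldb -url http://nutch.apache.org/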
> > > > With each url, a re-crawl interval is maintained (which is by
> > > > default set to 30 days). The crawler won't pick a url for crawling
> > > > if its fetch interval hasn't elapsed since the last time the url
> > > > was fetched. The crawl interval can be configured using the
> > > > db.fetch.interval.default property in nutch-site.xml.
> > > >
> > > > > Thanks - David
> > > > >
> > > > > On Sat, Apr 6, 2013 at 12:54 AM, Sebastian Nagel <[email protected]> wrote:
> > > > >
> > > > > > Hi David,
> > > > > >
> > > > > > > What can be the crawl time for a very big site, given a depth
> > > > > > > param of 50, the default topN (not passed) and a default fetch
> > > > > > > interval of 2 mins?
> > > > > >
> > > > > > afaik, the default of topN is Long.MAX_VALUE, which is very
> > > > > > large. So the size of the crawl is mainly limited by the number
> > > > > > of links you get. Anyway, a depth of 50 is a high value; with a
> > > > > > delay of 2 min. (which is very defensive) your crawl will take
> > > > > > a long time.
> > > > > >
> > > > > > Try to start with small values for depth and topN, e.g. 3 and 50.
> > > > > > Then look at your crawlDb statistics (bin/nutch readdb ... -stats)
> > > > > > and check how the numbers of fetched/unfetched/gone/etc. URLs
> > > > > > increase to get a feeling for which values make sense for your
> > > > > > crawl.
> > > > > >
> > > > > > > Case: Crawling the website spicemobilephones.co.in, and in
> > > > > > > regex-urlfilter.txt added +^ http://(a-z 0-9) spicemobilephones.co.in.
> > > > > >
> > > > > > This doesn't look like a valid Java regex.
> > > > > > Did you remove these lines:
> > > > > > # accept anything else
> > > > > > +.
> > > > > >
> > > > > > Sebastian
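For reference, the re-fetch interval mentioned above is set via db.fetch.interval.default in conf/nutch-site.xml. A minimal sketch with an illustrative value of 7 days (the value is in seconds; the shipped default is 2592000, i.e. 30 days):

  <property>
    <name>db.fetch.interval.default</name>
    <value>604800</value>
    <description>Example only: re-fetch pages every 7 days instead of the default 30.</description>
  </property>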

