Re: crawl time for depth param 50 and topN not passed

2013-04-09 Thread David Philip
Thanks Tejas. In that case, if a new url was added to one of the urls that are
already in the crawldb, will it be crawled/fetched during the recrawl process?

Ex: there were 10 urls in the crawldb, and a new child url was added under the
4th url after the first crawl. So I re-initiate the crawl (depth 2) to get this
new url added to the crawldb and fetched. How will this case work? Will it add
this new url to the crawldb and fetch it?

Thanks - David
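For reference, a depth-2 recrawl of this kind can also be run step by step with
the standard Nutch 1.x commands. A rough sketch, with purely illustrative
directory names (crawl/crawldb, crawl/segments), not taken from this thread:

  # one crawl round: generate a segment of urls due for (re)fetch,
  # fetch and parse it, then merge the results back into the crawldb.
  # updatedb is the step that adds newly discovered outlinks
  # (e.g. the new child url) to the crawldb.
  bin/nutch generate crawl/crawldb crawl/segments
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch parse $s
  bin/nutch updatedb crawl/crawldb $s

Running this cycle twice corresponds to depth 2: the first round can re-fetch
the parent page (if it is due for fetch) and record the new outlink, and the
second round can then fetch the new url itself.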



On Mon, Apr 8, 2013 at 11:41 AM, Tejas Patil tejas.patil...@gmail.com wrote:

 On Sun, Apr 7, 2013 at 10:43 PM, David Philip
 davidphilipshe...@gmail.com wrote:

  Hi Tejas,
 
 Thank you. So what I understand is: when we initiate a *re-crawl* with
  depth 1, it will check for all the urls due for fetch in the first loop
  itself and fetch all of them. Correct?
 
 Yes.

 
  What I basically wanted to know is: the first fresh crawl was done with
  depth 20 and took 1 day, and now (after a week) I want to re-initiate the
  crawl (same crawldb and same Solr host, Nutch 1.6). What should the depth be?
 
 All the urls in the crawldb (irrespective of the depth at which those urls
 were discovered by the crawler) are scanned and, based on various factors,
 considered for fetching. One of the factors is whether the re-fetch time
 has been reached. In your case, if you had NOT changed the default re-fetch
 interval setting initially, no refetch will happen as the time (30 days)
 hasn't elapsed. However, the crawl will continue from the point where it
 left off and will consider all the un-fetched urls in the crawldb for
 fetching.

 
  Thanks -David
 
 
 
 
  On Sat, Apr 6, 2013 at 4:53 PM, Tejas Patil tejas.patil...@gmail.com
  wrote:
 
   On Sat, Apr 6, 2013 at 3:31 AM, David Philip 
  davidphilipshe...@gmail.com
   wrote:
  
Hi Sebastian,
   
    yes, it's taking 2-3 days. Ok, I will consider giving the depth
    incrementally and checking the stats at every step. Thanks.
    Yes, I have given it like this: +^http://([a-z0-9]*\.)*spicemobiles.co.in/
    and have removed +.
   
    what should the depth be for the next recrawl case? I mean this question:
    say I had a crawldb crawled with depth param 5 and topN 10. Now I find
    that 3-4 urls were deleted and 4 were modified. I don’t know which urls
    those are. So what I am doing is re-initiating the crawl. At this time,
    what depth param should I give?
   
   Once those urls enter the crawldb, the crawler won't need to reach them
   from their parent page again. The crawler has stored those urls in its
   crawldb / webtable. With each url, a re-crawl interval is maintained (which
   is by default set to 30 days). The crawler won't pick a url for crawling if
   its fetch interval hasn't elapsed since the last time the url was fetched.
   The crawl interval can be configured using the db.fetch.interval.default
   property in nutch-site.xml.
  
   
Thanks - David
   
   
   
On Sat, Apr 6, 2013 at 12:54 AM, Sebastian Nagel 
wastl.na...@googlemail.com
 wrote:
   
 Hi David,

    What can the crawl time be for a very big site, given depth param as 50,
    topN default (not passed) and the default fetch interval as 2 mins?
  afaik, the default of topN is Long.MAX_VALUE which is very large.
  So, the size of the crawl is mainly limited by the number of links you get.
  Anyway, a depth of 50 is a high value; with a delay of 2 min (which is
  very defensive) your crawl will take a long time.

 Try to start with small values for depth and topN, e.g. 3 and 50.
 Then look at your crawlDb statistics (bin/nutch readdb ... -stats)
  and check how the numbers of fetch/unfetched/gone/etc. URLs increase
  to get a feeling which values make sense for your crawl.

  Case: Crawling website spicemobilephones.co.in, and in the
  regexurlfilter.txt – added +^ http://(a-z 0-9)
   spicemobilephones.co.in.
 This doesn't look like a valid Java regex.
 Did you remove these lines:
   # accept anything else
   +.

 Sebastian

   
  
 



Re: crawl time for depth param 50 and topN not passed

2013-04-08 Thread Tejas Patil
On Sun, Apr 7, 2013 at 10:43 PM, David Philip
davidphilipshe...@gmail.com wrote:

 Hi Tejas,

Thank you. So what I understand is: when we initiate a *re-crawl* with
 depth 1, it will check for all the urls due for fetch in the first loop
 itself and fetch all of them. Correct?

Yes.


 What I basically wanted to know is: the first fresh crawl was done with
 depth 20 and took 1 day, and now (after a week) I want to re-initiate the
 crawl (same crawldb and same Solr host, Nutch 1.6). What should the depth be?

All the urls in the crawldb (irrespective of the depth at which those urls
were discovered by the crawler) are scanned and, based on various factors,
considered for fetching. One of the factors is whether the re-fetch time
has been reached. In your case, if you had NOT changed the default re-fetch
interval setting initially, no refetch will happen as the time (30 days)
hasn't elapsed. However, the crawl will continue from the point where it
left off and will consider all the un-fetched urls in the crawldb for
fetching.
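A quick way to see what the crawler will consider is the readdb tool mentioned
further down in this thread; the crawldb path crawl/crawldb below is only a
placeholder:

  # overall counts per status (db_unfetched, db_fetched, db_gone, ...)
  bin/nutch readdb crawl/crawldb -stats

  # fetch status, fetch time and fetch interval of a single url
  bin/nutch readdb crawl/crawldb -url http://www.example.com/some/page

The -url output should show the next fetch time, so you can verify whether a
given page is due for re-fetch before starting another round.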


 Thanks -David




 On Sat, Apr 6, 2013 at 4:53 PM, Tejas Patil tejas.patil...@gmail.com
 wrote:

  On Sat, Apr 6, 2013 at 3:31 AM, David Philip 
 davidphilipshe...@gmail.com
  wrote:
 
   Hi Sebastian,
  
   yes, it's taking 2-3 days. Ok, I will consider giving the depth
   incrementally and checking the stats at every step. Thanks.
   Yes, I have given it like this: +^http://([a-z0-9]*\.)*spicemobiles.co.in/
   and have removed +.
  
   what should the depth be for the next recrawl case? I mean this question:
   say I had a crawldb crawled with depth param 5 and topN 10. Now I find
   that 3-4 urls were deleted and 4 were modified. I don’t know which urls
   those are. So what I am doing is re-initiating the crawl. At this time,
   what depth param should I give?
  
  Once those urls enter the crawldb, the crawler won't need to reach them from
  their parent page again. The crawler has stored those urls in its crawldb /
  webtable. With each url, a re-crawl interval is maintained (which is by
  default set to 30 days). The crawler won't pick a url for crawling if its
  fetch interval hasn't elapsed since the last time the url was fetched. The
  crawl interval can be configured using the db.fetch.interval.default property
  in nutch-site.xml.
 
  
   Thanks - David
  
  
  
   On Sat, Apr 6, 2013 at 12:54 AM, Sebastian Nagel 
   wastl.na...@googlemail.com
wrote:
  
Hi David,
   
   What can the crawl time be for a very big site, given depth param as 50,
   topN default (not passed) and the default fetch interval as 2 mins?
 afaik, the default of topN is Long.MAX_VALUE which is very large.
 So, the size of the crawl is mainly limited by the number of links you get.
 Anyway, a depth of 50 is a high value; with a delay of 2 min (which is
 very defensive) your crawl will take a long time.
   
Try to start with small values for depth and topN, e.g. 3 and 50.
Then look at your crawlDb statistics (bin/nutch readdb ... -stats)
and check how the numbers of fetch/unfetched/gone/etc. URLs increase
to get a feeling which values make sense for your crawl.
   
 Case: Crawling website spicemobilephones.co.in, and in the
 regexurlfilter.txt – added +^ http://(a-z 0-9)
  spicemobilephones.co.in.
This doesn't look like a valid Java regex.
Did you remove these lines:
  # accept anything else
  +.
   
Sebastian
   
  
 



Re: crawl time for depth param 50 and topN not passed

2013-04-07 Thread David Philip
Hi Tejas,

   Thank you. So what I understand is: when we initiate a *re-crawl* with
depth 1, it will check for all the urls due for fetch in the first loop
itself and fetch all of them. Correct?

What I basically wanted to know is: the first fresh crawl was done with
depth 20 and took 1 day, and now (after a week) I want to re-initiate the
crawl (same crawldb and same Solr host, Nutch 1.6). What should the depth be?

Thanks -David




On Sat, Apr 6, 2013 at 4:53 PM, Tejas Patil tejas.patil...@gmail.com wrote:

 On Sat, Apr 6, 2013 at 3:31 AM, David Philip davidphilipshe...@gmail.com
 wrote:

  Hi Sebastian,
 
 yes, it's taking 2-3 days. Ok, I will consider giving the depth
 incrementally and checking the stats at every step. Thanks.
  Yes, I have given it like this: +^http://([a-z0-9]*\.)*spicemobiles.co.in/
  and have removed +.
 
  what should the depth be for the next recrawl case? I mean this question:
  say I had a crawldb crawled with depth param 5 and topN 10. Now I find
  that 3-4 urls were deleted and 4 were modified. I don’t know which urls
  those are. So what I am doing is re-initiating the crawl. At this time,
  what depth param should I give?
 
 Once those urls enter the crawldb, the crawler won't need to reach them from
 their parent page again. The crawler has stored those urls in its crawldb /
 webtable. With each url, a re-crawl interval is maintained (which is by
 default set to 30 days). The crawler won't pick a url for crawling if its
 fetch interval hasn't elapsed since the last time the url was fetched. The
 crawl interval can be configured using the db.fetch.interval.default property
 in nutch-site.xml.

 
  Thanks - David
 
 
 
  On Sat, Apr 6, 2013 at 12:54 AM, Sebastian Nagel 
  wastl.na...@googlemail.com
   wrote:
 
   Hi David,
  
  What can the crawl time be for a very big site, given depth param as 50,
  topN default (not passed) and the default fetch interval as 2 mins?
   afaik, the default of topN is Long.MAX_VALUE which is very large.
   So, the size of the crawl is mainly limited by the number of links you get.
   Anyway, a depth of 50 is a high value; with a delay of 2 min (which is
   very defensive) your crawl will take a long time.
  
   Try to start with small values for depth and topN, e.g. 3 and 50.
   Then look at your crawlDb statistics (bin/nutch readdb ... -stats)
   and check how the numbers of fetch/unfetched/gone/etc. URLs increase
   to get a feeling which values make sense for your crawl.
  
Case: Crawling website spicemobilephones.co.in, and in the
regexurlfilter.txt – added +^ http://(a-z 0-9)
 spicemobilephones.co.in.
   This doesn't look like a valid Java regex.
   Did you remove these lines:
 # accept anything else
 +.
  
   Sebastian
  
 



Re: crawl time for depth param 50 and topN not passed

2013-04-06 Thread David Philip
Hi Sebastian,

   yes, it's taking 2-3 days. Ok, I will consider giving the depth
incrementally and checking the stats at every step. Thanks.
Yes, I have given it like this: +^http://([a-z0-9]*\.)*spicemobiles.co.in/ and
have removed +.
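So the relevant part of conf/regex-urlfilter.txt would look roughly like this
(the host is the one given in this thread; the other default reject rules in
that file are omitted here):

  # accept urls on this host (and its subdomains) only
  +^http://([a-z0-9]*\.)*spicemobiles.co.in/

  # the default catch-all rule has been removed:
  # +.

With the catch-all gone, any url that matches no rule is rejected, which is
what keeps the crawl restricted to that host.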

what should the depth be for the next recrawl case? I mean this question: say
I had a crawldb crawled with depth param 5 and topN 10. Now I find that
3-4 urls were deleted and 4 were modified. I don’t know which urls those
are. So what I am doing is re-initiating the crawl. At this time, what
depth param should I give?

Thanks - David



On Sat, Apr 6, 2013 at 12:54 AM, Sebastian Nagel wastl.na...@googlemail.com
 wrote:

 Hi David,

   What can the crawl time be for a very big site, given depth param as 50,
  topN default (not passed) and the default fetch interval as 2 mins?
 afaik, the default of topN is Long.MAX_VALUE which is very large.
 So, the size of the crawl is mainly limited by the number of links you get.
 Anyway, a depth of 50 is a high value; with a delay of 2 min (which is
 very defensive) your crawl will take a long time.

 Try to start with small values for depth and topN, e.g. 3 and 50.
 Then look at your crawlDb statistics (bin/nutch readdb ... -stats)
 and check how the numbers of fetch/unfetched/gone/etc. URLs increase
 to get a feeling which values make sense for your crawl.

  Case: Crawling website spicemobilephones.co.in, and in the
  regexurlfilter.txt – added +^ http://(a-z 0-9)spicemobilephones.co.in.
 This doesn't look like a valid Java regex.
 Did you remove these lines:
   # accept anything else
   +.

 Sebastian



Re: crawl time for depth param 50 and topN not passed

2013-04-06 Thread Tejas Patil
On Sat, Apr 6, 2013 at 3:31 AM, David Philip davidphilipshe...@gmail.com wrote:

 Hi Sebastian,

 yes, it's taking 2-3 days. Ok, I will consider giving the depth
 incrementally and checking the stats at every step. Thanks.
 Yes, I have given it like this: +^http://([a-z0-9]*\.)*spicemobiles.co.in/
 and have removed +.

 what should the depth be for the next recrawl case? I mean this question: say
 I had a crawldb crawled with depth param 5 and topN 10. Now I find that
 3-4 urls were deleted and 4 were modified. I don’t know which urls those
 are. So what I am doing is re-initiating the crawl. At this time, what
 depth param should I give?

Once those urls enter the crawldb, the crawler won't need to reach them from
their parent page again. The crawler has stored those urls in its crawldb /
webtable. With each url, a re-crawl interval is maintained (which is by
default set to 30 days). The crawler won't pick a url for crawling if its fetch
interval hasn't elapsed since the last time the url was fetched. The crawl
interval can be configured using the db.fetch.interval.default property in
nutch-site.xml.
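For example, to re-fetch pages weekly instead of the 30-day default, an entry
along these lines in nutch-site.xml should work (the value is in seconds, and
604800 is just an illustrative choice):

  <property>
    <name>db.fetch.interval.default</name>
    <!-- 7 days in seconds; Nutch's default is 2592000 (30 days) -->
    <value>604800</value>
  </property>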


 Thanks - David



 On Sat, Apr 6, 2013 at 12:54 AM, Sebastian Nagel 
 wastl.na...@googlemail.com
  wrote:

  Hi David,
 
  What can the crawl time be for a very big site, given depth param as 50,
  topN default (not passed) and the default fetch interval as 2 mins?
   afaik, the default of topN is Long.MAX_VALUE which is very large.
   So, the size of the crawl is mainly limited by the number of links you get.
   Anyway, a depth of 50 is a high value; with a delay of 2 min (which is
   very defensive) your crawl will take a long time.
 
  Try to start with small values for depth and topN, e.g. 3 and 50.
  Then look at your crawlDb statistics (bin/nutch readdb ... -stats)
  and check how the numbers of fetch/unfetched/gone/etc. URLs increase
  to get a feeling which values make sense for your crawl.
 
   Case: Crawling website spicemobilephones.co.in, and in the
   regexurlfilter.txt – added +^ http://(a-z 0-9)spicemobilephones.co.in.
  This doesn't look like a valid Java regex.
  Did you remove these lines:
# accept anything else
+.
 
  Sebastian
 



Re: crawl time for depth param 50 and topN not passed

2013-04-05 Thread Sebastian Nagel
Hi David,

  What can the crawl time be for a very big site, given depth param as 50, topN
 default (not passed) and the default fetch interval as 2 mins?
afaik, the default of topN is Long.MAX_VALUE which is very large.
So, the size of the crawl is mainly limited by the number of links you get.
Anyway, a depth of 50 is a high value; with a delay of 2 min (which is
very defensive) your crawl will take a long time.

Try to start with small values for depth and topN, e.g. 3 and 50.
Then look at your crawlDb statistics (bin/nutch readdb ... -stats)
and check how the numbers of fetch/unfetched/gone/etc. URLs increase
to get a feeling which values make sense for your crawl.
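As a concrete sketch of that advice (the seed directory urls/ and the output
directory crawl/ are placeholders):

  # small test crawl with the Nutch 1.x all-in-one crawl command
  bin/nutch crawl urls -dir crawl -depth 3 -topN 50

  # then inspect the crawldb
  bin/nutch readdb crawl/crawldb -stats

Compare the db_fetched / db_unfetched / db_gone counts after each run before
raising depth or topN.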

 Case: Crawling website spicemobilephones.co.in, and in the
 regexurlfilter.txt – added +^ http://(a-z 0-9)spicemobilephones.co.in.
This doesn't look like a valid Java regex.
Did you remove these lines:
  # accept anything else
  +.

Sebastian