Re: crawl time for depth param 50 and topN not passed
Thanks Tejas. In that case, if a new URL was added to one of the pages already in the crawldb, will it be crawled/fetched during the re-crawl process? For example: there were 10 URLs in the crawldb, and after the first crawl a new child URL was added under the 4th one. I then re-initiate the crawl (depth 2) so that this new URL is also added to the crawldb and fetched. How will this case work? Will it add the new URL to the crawldb and fetch it?

Thanks,
David

On Mon, Apr 8, 2013 at 11:41 AM, Tejas Patil <tejas.patil...@gmail.com> wrote:
> [...]
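For reference, a newly discovered child link is added to the crawldb by the `updatedb` step after its parent page has been re-fetched and re-parsed. One round of the Nutch 1.x crawl cycle might look like this sketch (paths and the topN value are illustrative, not from the thread):

```
# One round of a Nutch 1.x re-crawl (illustrative paths).
# Once the parent page is due for re-fetch, it is fetched and re-parsed;
# updatedb then adds any newly discovered child URLs to the crawldb.
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
SEGMENT=crawl/segments/$(ls -t crawl/segments | head -1)
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT
```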
Re: crawl time for depth param 50 and topN not passed
On Sun, Apr 7, 2013 at 10:43 PM, David Philip <davidphilipshe...@gmail.com> wrote:
> Hi Tejas,
> Thank you. So what I understand is: when we initiate a *re-crawl* with depth 1, it will check for all the URLs due for fetch in the first loop itself and fetch all of them. Correct?

Yes.

> What I basically wanted to know is: the first fresh crawl had depth 20 and took 1 day, and now (after a week) I want to re-initiate the crawl (same crawldb and same Solr host, Nutch 1.6). What should the depth be?

All the URLs in the crawldb (irrespective of the depth at which the crawler discovered them) are scanned and, based on various factors, considered for fetching. One of those factors is whether the re-fetch time has been reached. In your case, if you had NOT changed the default re-fetch interval setting initially, no re-fetch will happen because that time (30 days) hasn't elapsed. However, the crawl will continue from the point where it left off and will consider all the un-fetched URLs in the crawldb for fetching.

> [...]
Re: crawl time for depth param 50 and topN not passed
Hi Tejas,

Thank you. So what I understand is: when we initiate a *re-crawl* with depth 1, it will check for all the URLs due for fetch in the first loop itself and fetch all of them. Correct?

What I basically wanted to know is: the first fresh crawl had depth 20 and took 1 day, and now (after a week) I want to re-initiate the crawl (same crawldb and same Solr host, Nutch 1.6). What should the depth be?

Thanks,
David

On Sat, Apr 6, 2013 at 4:53 PM, Tejas Patil <tejas.patil...@gmail.com> wrote:
> [...]
Re: crawl time for depth param 50 and topN not passed
Hi Sebastian,

Yes, it's taking 2-3 days. OK, I will consider increasing the depth incrementally and checking the stats at every step. Thanks.

Yes, I have given it like this: +^http://([a-z0-9]*\.)*spicemobiles.co.in/ and have removed +.

What should the depth be for the next re-crawl? I mean this question: say I had a crawldb crawled with depth 5 only and topN 10. Now I find that 3-4 URLs were deleted and 4 were modified, but I don't know which URLs those are. So what I am doing is re-initiating the crawl. What depth should I give at this time?

Thanks,
David

On Sat, Apr 6, 2013 at 12:54 AM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> [...]
Re: crawl time for depth param 50 and topN not passed
On Sat, Apr 6, 2013 at 3:31 AM, David Philip <davidphilipshe...@gmail.com> wrote:
> Hi Sebastian,
> Yes, it's taking 2-3 days. OK, I will consider increasing the depth incrementally and checking the stats at every step. Thanks.
> Yes, I have given it like this: +^http://([a-z0-9]*\.)*spicemobiles.co.in/ and have removed +.
> What should the depth be for the next re-crawl? I mean this question: say I had a crawldb crawled with depth 5 only and topN 10. Now I find that 3-4 URLs were deleted and 4 were modified, but I don't know which URLs those are. So what I am doing is re-initiating the crawl. What depth should I give at this time?

Once those URLs enter the crawldb, the crawler won't need to reach them from their parent page again: it has stored them in its crawldb / webtable. With each URL, a re-crawl interval is maintained (by default set to 30 days). The crawler won't pick a URL for crawling if its fetch interval hasn't elapsed since the last time the URL was fetched. The interval can be configured using the db.fetch.interval.default property in nutch-site.xml.

> [...]
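To make Tejas's point concrete, the interval can be overridden in conf/nutch-site.xml. A sketch (the 7-day value below is just an example; the property takes seconds, and the shipped default is 2592000, i.e. 30 days):

```xml
<!-- conf/nutch-site.xml: override the default re-fetch interval.
     Value is in seconds; the shipped default is 2592000 (30 days). -->
<property>
  <name>db.fetch.interval.default</name>
  <value>604800</value> <!-- example: 7 days -->
</property>
```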
Re: crawl time for depth param 50 and topN not passed
Hi David,

> What can be crawl time for very big site, given depth param as 50, topN default (not passed) and default fetch interval as 2 mins..

AFAIK, the default of topN is Long.MAX_VALUE, which is very large. So the size of the crawl is mainly limited by the number of links you get. Anyway, a depth of 50 is a high value, and with a delay of 2 min. (which is very defensive) your crawl will take a long time.

Try to start with small values for depth and topN, e.g. 3 and 50. Then look at your crawlDb statistics (bin/nutch readdb ... -stats) and check how the numbers of fetched/unfetched/gone/etc. URLs increase, to get a feeling for which values make sense for your crawl.

> Case: Crawling website spicemobilephones.co.in, and in regex-urlfilter.txt added +^http://(a-z 0-9)spicemobilephones.co.in

This doesn't look like a valid Java regex. Did you remove these lines?

# accept anything else
+.

Sebastian
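Since regex-urlfilter.txt patterns are Java regexes, the corrected pattern David settled on later in the thread can be sanity-checked directly with java.util.regex. A minimal check (the test URLs are made up for illustration; in the filter file the pattern is prefixed with '+' to mean "accept"):

```java
import java.util.regex.Pattern;

// Quick sanity check of the corrected filter pattern from the thread.
public class UrlFilterCheck {
    public static void main(String[] args) {
        // Pattern as it appears in regex-urlfilter.txt, without the leading '+'.
        Pattern p = Pattern.compile("^http://([a-z0-9]*\\.)*spicemobiles.co.in/");
        // Matches the site, with or without a subdomain:
        System.out.println(p.matcher("http://www.spicemobiles.co.in/index.html").find()); // true
        System.out.println(p.matcher("http://spicemobiles.co.in/").find());               // true
        // Rejects other hosts:
        System.out.println(p.matcher("http://example.com/").find());                      // false
    }
}
```

Note the original `+^http://(a-z 0-9)spicemobilephones.co.in` would have compiled but matched only the literal characters "a-z 0-9" in parentheses, which is why Sebastian flagged it.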