Re: crawl time for depth param 50 and topN not passed
Hi Sebastian,

Yes, it's taking 2-3 days. OK, I will consider increasing the depth incrementally and checking the stats at every step. Thanks. Yes, I have given it like this: +^http://([a-z0-9]*\.)*spicemobiles.co.in/ and have removed "+.".

What should the depth be for the next recrawl? I mean this question: say I had a crawldb crawled with depth 5 and topN 10. Now I find that 3-4 URLs were deleted and 4 were modified, but I don't know which URLs those are. So what I am doing is re-initiating the crawl. What depth should I give this time?

Thanks - David

On Sat, Apr 6, 2013 at 12:54 AM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:

> Hi David,
>
> > What can be crawl time for very big site, given depth param as 50,
> > topN default (not passed) and default fetch interval as 2 mins..
>
> afaik, the default of topN is Long.MAX_VALUE, which is very large. So the
> size of the crawl is mainly limited by the number of links you get.
> Anyway, a depth of 50 is a high value; with a delay of 2 min. (which is
> very defensive) your crawl will take a long time. Try to start with small
> values for depth and topN, e.g. 3 and 50. Then look at your crawlDb
> statistics (bin/nutch readdb ... -stats) and check how the numbers of
> fetched/unfetched/gone/etc. URLs increase to get a feeling for which
> values make sense for your crawl.
>
> > Case: Crawling website spicemobilephones.co.in, and in the
> > regex-urlfilter.txt added +^http://(a-z 0-9)spicemobilephones.co.in.
>
> This doesn't look like a valid Java regex. Did you remove these lines:
>
> # accept anything else
> +.
>
> Sebastian
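[Editor's note: Sebastian's "start small and check stats each round" suggestion can be sketched as the usual Nutch 1.x step-by-step crawl loop. This is a sketch only, assuming a local Nutch 1.x install with seed URLs in urls/; all paths, the round count, and the topN value are illustrative.]

```shell
# Sketch: grow the crawl round by round, checking crawlDb stats each time.
bin/nutch inject crawl/crawldb urls              # seed the crawlDb once

for round in 1 2 3; do                           # "depth" = number of rounds
    bin/nutch generate crawl/crawldb crawl/segments -topN 50
    segment=$(ls -d crawl/segments/* | tail -1)  # newest segment
    bin/nutch fetch "$segment"
    bin/nutch parse "$segment"
    bin/nutch updatedb crawl/crawldb "$segment"
    # How do the fetched/unfetched/gone counts evolve? Decide here
    # whether another round (more depth) is worthwhile.
    bin/nutch readdb crawl/crawldb -stats
done
```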
Re: crawl time for depth param 50 and topN not passed
On Sat, Apr 6, 2013 at 3:31 AM, David Philip <davidphilipshe...@gmail.com> wrote:

> Hi Sebastian,
>
> Yes, it's taking 2-3 days. OK, I will consider increasing the depth
> incrementally and checking the stats at every step. Thanks. Yes, I have
> given it like this: +^http://([a-z0-9]*\.)*spicemobiles.co.in/ and have
> removed "+.".
>
> What should the depth be for the next recrawl? I mean this question: say
> I had a crawldb crawled with depth 5 and topN 10. Now I find that 3-4
> URLs were deleted and 4 were modified, but I don't know which URLs those
> are. So what I am doing is re-initiating the crawl. What depth should I
> give this time?

Once those URLs enter the crawldb, the crawler won't need to reach them from their parent page again: it has stored them in its crawldb/webtable. With each URL, a re-crawl interval is maintained (by default set to 30 days). The crawler won't pick a URL for crawling if its fetch interval hasn't elapsed since the last time the URL was fetched. The crawl interval can be configured using the db.fetch.interval.default property in nutch-site.xml.
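[Editor's note: the db.fetch.interval.default setting mentioned above is a plain property in conf/nutch-site.xml. A minimal fragment might look like this; the value is in seconds, and 2592000 s = 30 days is the shipped default.]

```xml
<property>
  <name>db.fetch.interval.default</name>
  <!-- Seconds to wait before re-fetching a URL; 2592000 s = 30 days -->
  <value>2592000</value>
</property>
```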
Setting up nutch 1.6 with Solr 4.2
Hi all, I have nutch 1.6 set up and running with Solr 3.6.2, and I'm trying to upgrade to Solr 4.2, but I'm missing something... I re-built nutch with schema-solr4.xml as schema.xml and copied schema-solr4.xml to Solr's example/collection1/conf/schema.xml. The index phase keeps failing, throwing errors about unknown fields host and metatag.description (metatags worked just fine with 3.6.2). What else am I missing? Thanks.
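[Editor's note: the copy step described above amounts to declaring, in the Solr core's schema.xml, every field Nutch sends at index time; "unknown field" errors mean a declaration is absent from the copy Solr is actually using. A hedged sketch of what the two failing fields' declarations could look like; the types and attributes here are assumptions, so match them to the field types already defined in your schema-solr4.xml.]

```xml
<!-- Illustrative declarations for the fields the indexer reports as
     unknown; add alongside the other <field> entries in schema.xml. -->
<field name="host" type="string" stored="false" indexed="true"/>
<field name="metatag.description" type="string" stored="true" indexed="true"/>
```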
Nutch
Hi, Is there any way to perform a urlfilter from levels 1-5 and a different one from 5 onwards? I need to extract PDF files, which will appear only after a given level (just to experiment).

After that, I believe the PDF files will be stored in a compressed binary format in the crawl/segments folder. I would like to extract these PDF files and store them all in one folder. (I guess, since Nutch uses MapReduce to segment the data, I will need to use the Hadoop API present by default in the lib folder. I cannot find more tutorials on this except allenday: http://www-scf.usc.edu/~csci572/2013Spring/homework/nutch/allenday20080829.html ). PJ
Re: Nutch
On Sat, Apr 6, 2013 at 9:58 AM, Parin Jogani <ppjog...@usc.edu> wrote:

> Hi, Is there any way to perform a urlfilter from levels 1-5 and a
> different one from 5 onwards? I need to extract pdf files which will be
> only after a given level (just to experiment).

You can run 2 crawls over the same crawldb using different urlfilter files. The first one would reject PDF files and be executed to a depth just before you discover PDF files. For the later crawl, modify the regex rule to accept PDF files.

> After that I believe the pdf files will be stored in a compressed binary
> format in the crawl/segments folder. I would like to extract these pdf
> files and store all in 1 folder. (I guess since Nutch uses MapReduce to
> segment the data, I will need to use the hadoop api present by default in
> the lib folder. I can not find more tutorials on the same except allenday:
> http://www-scf.usc.edu/~csci572/2013Spring/homework/nutch/allenday20080829.html )
>
> PJ

I had a peek at the link that you gave, and it seems like that code snippet should work. It's an old article, so it might happen that some classes have been replaced with new ones. If you face any issues, please feel free to shoot an email to us!
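[Editor's note: before writing custom Hadoop code, the segment content can be dumped from the command line with Nutch's SegmentReader. A sketch, assuming Nutch 1.x; the segment path and output folder are illustrative, and the result is a readable text dump of each record rather than the original binary PDF files.]

```shell
# Dump only the fetched content of one segment to a local folder,
# suppressing the other segment parts (generate/fetch/parse data).
bin/nutch readseg -dump crawl/segments/20130406123456 content-dump \
    -nogenerate -nofetch -noparse -noparsedata -noparsetext
```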
encode special characters in url
Hi all, I'm using nutch 1.6 to crawl a web site which has lots of special characters in its URLs, like ?, =, @, etc. For each character, I can add a regex in regex-normalize.xml to change it into percent encoding. My question is: is there an easier way to do this, like a url-encode method that encodes all the special characters, rather than adding regexes one by one? Thanks!
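[Editor's note: for reference, the per-character approach described above means one rule per character in conf/regex-normalize.xml, roughly as below. The <regex>/<pattern>/<substitution> structure is the file's standard format; the specific rules shown are illustrative, and encoding characters like "?" will of course also rewrite legitimate query strings.]

```xml
<!-- One substitution rule per special character (illustrative). -->
<regex-normalize>
  <regex>
    <pattern>\?</pattern>
    <substitution>%3F</substitution>
  </regex>
  <regex>
    <pattern>@</pattern>
    <substitution>%40</substitution>
  </regex>
</regex-normalize>
```

An alternative to maintaining many such rules would be a small custom plugin implementing Nutch's URLNormalizer extension point, which could percent-encode the offending characters in one place in Java; whether that is appropriate depends on which characters must survive unencoded.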