This is the default behaviour: redirects are treated like normal links, i.e. they are fetched in subsequent rounds. This can be changed using the param:

<property>
  <name>http.redirect.max</name>
  <value>0</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, the fetcher won't
  immediately follow redirected URLs; instead it will record them for later
  fetching.</description>
</property>

Julien

On 1 March 2011 16:46, <[email protected]> wrote:
> For some reason nutch starts to crawl inner links at depth 4 for domains
> with redirects.
>
> -----Original Message-----
> From: hemantverma09 <[email protected]>
> To: nutch-user <[email protected]>
> Sent: Tue, Mar 1, 2011 6:17 am
> Subject: Can't Crawl Through Home Page, but crawling through inner page
>
> I am using nutch 1.1 for crawling.
> I am able to crawl many sites without any issue, but when I am crawling
> www.magicbricks.com it is stopping at depth=1.
> I am using "bin/nutch crawl urls/magicbricks/url.txt -dir crawl/magicbricks
> -threads 10 -depth 3 -topN 10"
> But if I put links like "http://www.magicbricks.com/bricks/cityIndex.html"
> or "http://www.magicbricks.com/bricks/propertySearch.html" in
> urls/magicbricks/url.txt, it crawls without any issue.
>
> In robots.txt I have allowed my crawler named Propertybot all access to
> crawl, which can be seen by using http://magicbricks.com/robots.txt
>
> Please suggest what the reasons can be, and why this is happening.
>
> Thanks in advance
> Hemant Verma
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Can-t-Crawl-Through-Home-Page-but-crawling-through-inner-page-tp2601843p2601843.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
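P.S. For anyone wanting to try this: per the description above, a positive value makes the fetcher follow that many redirects immediately within the same round. A minimal sketch of overriding it in conf/nutch-site.xml (the <configuration> wrapper is the standard Nutch override file layout; the value 3 is just an example):

<?xml version="1.0"?>
<!-- conf/nutch-site.xml: values here override conf/nutch-default.xml -->
<configuration>
  <!-- Follow up to 3 redirects immediately during the fetch;
       set to 0 (the default shown above) to queue them for a
       later round instead. -->
  <property>
    <name>http.redirect.max</name>
    <value>3</value>
  </property>
</configuration>

Restart the crawl after editing so the fetcher picks up the new value.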

