This is the default behaviour: redirects are treated like normal links,
i.e. they are fetched in subsequent rounds.
This can be changed with the parameter

*<property>
 <name>http.redirect.max</name>
 <value>0</value>
 <description>The maximum number of redirects the fetcher will follow when
 trying to fetch a page. If set to negative or 0, fetcher won't immediately
 follow redirected URLs, instead it will record them for later fetching.
 </description>
</property>*
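
For example, to have the fetcher follow redirects immediately rather than defer them, the property can be overridden in conf/nutch-site.xml (the value 3 below is just an illustrative choice, not a recommendation):

```xml
<!-- conf/nutch-site.xml: overrides the default from nutch-default.xml.
     A positive value makes the fetcher follow up to that many redirects
     within the same fetch round; 3 here is only an example. -->
<property>
 <name>http.redirect.max</name>
 <value>3</value>
</property>
```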

Julien

On 1 March 2011 16:46,  <[email protected]> wrote:
> For some reason nutch starts to crawl inner links at depth 4 for domains with redirects.
>
> -----Original Message-----
> From: hemantverma09 <[email protected]>
> To: nutch-user <[email protected]>
> Sent: Tue, Mar 1, 2011 6:17 am
> Subject: Can't Crawl Through Home Page, but crawling through inner page
>
>
> I am using nutch 1.1 for crawling.
>
> I am able to crawl many sites without any issue, but when I am crawling
> www.magicbricks.com it is stopping at depth=1.
>
> I am using "bin/nutch crawl urls/magicbricks/url.txt -dir crawl/magicbricks -threads 10 -depth 3 -topN 10"
>
> But if I put links like "http://www.magicbricks.com/bricks/cityIndex.html"
> or "http://www.magicbricks.com/bricks/propertySearch.html" in
> urls/magicbricks/url.txt it crawls without any issue.
>
>
>
> In robots.txt I have allowed my crawler named Propertybot all access to
> crawl, which can be seen at http://magicbricks.com/robots.txt
>
>
>
> Please suggest what the reasons for this could be.
>
>
>
> Thanks in advance
>
> Hemant Verma
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Can-t-Crawl-Through-Home-Page-but-crawling-through-inner-page-tp2601843p2601843.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
