For some reason, Nutch starts to crawl inner links at depth 4 for domains with
redirects.
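
If that is what is happening here, the -depth 3 in the quoted command would end before the inner links are reached. A minimal sketch of how one might check this, assuming curl is available: the helper name count_redirects and the value DEPTH=5 are illustrative assumptions, not taken from the thread. The script only prints the adjusted crawl command so it can be reviewed before running.

```shell
#!/bin/sh
# Sketch only: DEPTH=5 and the helper name count_redirects are assumptions,
# not values or names taken from the thread.

# Report the final URL and the number of redirects curl followed to reach it.
count_redirects() {
    curl -s -I -L -o /dev/null -w '%{url_effective} %{num_redirects}\n' "$1"
}

# Crawl command from the quoted message, with -depth raised above 4.
SEEDS=urls/magicbricks/url.txt
CRAWL_DIR=crawl/magicbricks
DEPTH=5
CRAWL_CMD="bin/nutch crawl $SEEDS -dir $CRAWL_DIR -threads 10 -depth $DEPTH -topN 10"

# Print the command for review rather than running it.
echo "$CRAWL_CMD"
```

Calling count_redirects http://www.magicbricks.com/ needs network access and shows whether the home page redirects at all. The printed command is meant to be run from the Nutch install directory. Nutch's http.redirect.max property may also be worth checking: as far as I know it defaults to 0, in which case redirect targets are recorded for a later fetch round instead of being followed immediately.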

-----Original Message-----
From: hemantverma09 <[email protected]>
To: nutch-user <[email protected]>
Sent: Tue, Mar 1, 2011 6:17 am
Subject: Can't Crawl Through Home Page, but crawling through inner page


I am using Nutch 1.1 for crawling.

I am able to crawl many sites without any issue, but when I crawl
www.magicbricks.com, the crawl stops at depth=1.

I am using "bin/nutch crawl urls/magicbricks/url.txt -dir crawl/magicbricks
-threads 10 -depth 3 -topN 10"

But if I put links like "http://www.magicbricks.com/bricks/cityIndex.html"
or "http://www.magicbricks.com/bricks/propertySearch.html" in
urls/magicbricks/url.txt, it crawls without any issue.



In robots.txt I have allowed my crawler, named Propertybot, full access to
crawl, which can be seen at http://magicbricks.com/robots.txt

Please suggest what the reasons for this could be.



Thanks in advance

Hemant Verma



-- 

View this message in context: 
http://lucene.472066.n3.nabble.com/Can-t-Crawl-Through-Home-Page-but-crawling-through-inner-page-tp2601843p2601843.html

Sent from the Nutch - User mailing list archive at Nabble.com.




 
