Hi,

there are plenty of reasons why a document can be missing from a crawl. See http://wiki.apache.org/nutch/DebugTool for a list of possible reasons (sorry, explanations are missing).
About the example from jabong: I got 680 outlinks for http://www.jabong.com/men/shoes/mens-sports-shoes/ by calling

  % nutch parsechecker http://www.jabong.com/men/shoes/men-sports-shoes/

but http://www.jabong.com/Sports-White-Tennis-Shoes-2773.html isn't among them. Many other products are. For example,

  % nutch parsechecker -dumpText http://www.jabong.com/Grey-Running-Shoes-13010.html

succeeds, and I got the content.

So maybe the product has simply sold out? Even in Firefox I can't see this pair of shoes. Also, there are many reasons why the content delivered to the crawler can differ from what you see in the browser: cookies, dynamic Ajax content, browser switches, ...

Sebastian

On 03/26/2012 10:32 AM, blunderboy wrote:
> Hi,
> I am using apache-nutch 1.4 and it is crawling perfectly. But i have got
> some issues in crawling some sites.
> For testing my crawling, I took http://www.jabong.com
> I found out it is able to crawl categories but could not crawl pages.
>
> For example look at this:-
> http://www.jabong.com/men/shoes/mens-sports-shoes/ ----->
> (Page1)
>
> Now nutch does not crawl the pages present inside this page..
> URL of one of the product is:-
> http://www.jabong.com/Sports-White-Tennis-Shoes-2773.html ------->
> (Prod1)
>
>
> After some research, I got to know the structure of this site is:
> 1. Home dir contains all the product pages.
> If you see the source of page(Page1), it contains link to Prod1 which is
> actually in the home directory.
> So may be this is the reason it is not crawling product pages.
>
> Can some body please tell me how to solve this and make nutch to crawl such
> pages too.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-not-crawling-jabong-tp3857630p3857630.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
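
On the directory structure: a link from a category page to a product page in the site root is resolved like any other link, so that alone should not be what stops the crawl (the Grey-Running-Shoes page above also lives in the root and was found). As a rough illustration outside of Nutch, here is a minimal sketch in plain Python (standard library only) of how root-relative links resolve and how to test whether a given URL is among a page's links. The HTML snippet and the names LinkCollector and extract_links are made up for this sketch; this is not Nutch's parser and not the real page source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative hrefs against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return all anchor targets in `html` as absolute URLs."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up markup for illustration only, not copied from the real page.
SAMPLE_HTML = """
<a href="/Grey-Running-Shoes-13010.html">Grey Running Shoes</a>
<a href="http://www.jabong.com/men/shoes/">Shoes</a>
"""

page = "http://www.jabong.com/men/shoes/mens-sports-shoes/"
links = extract_links(page, SAMPLE_HTML)

wanted = "http://www.jabong.com/Sports-White-Tennis-Shoes-2773.html"
print(len(links), "links found;", "target present" if wanted in links else "target missing")
```

If the product URL is missing from the list for the HTML the server actually delivers to the crawler, the link is simply not in that HTML, whatever directory it points to.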

