Hi,

There are plenty of reasons why a document can be missing from a crawl.
See http://wiki.apache.org/nutch/DebugTool for a list
of possible reasons (sorry, the explanations are still missing there).

About the jabong example: I got 680 outlinks for
  http://www.jabong.com/men/shoes/mens-sports-shoes/
by calling
 % nutch parsechecker http://www.jabong.com/men/shoes/men-sports-shoes/
but
  http://www.jabong.com/Sports-White-Tennis-Shoes-2773.html
isn't among them. Many other products are. For example,
 % nutch parsechecker -dumpText http://www.jabong.com/Grey-Running-Shoes-13010.html
succeeds and I got the content. So maybe the product has
just sold out? Even in Firefox I can't see this pair of shoes.
Also, there are many reasons why the content delivered to the
crawler can differ from what you see in the browser: cookies,
dynamic Ajax content, browser switches, ...
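
To check quickly whether one particular link shows up in the
parsechecker output, you can filter the output with grep. A minimal
sketch, assuming the nutch script is on your PATH; has_outlink is
just a helper name I made up for illustration, and it does a plain
fixed-string search, so it does not rely on the exact output format:

```shell
# Hypothetical helper (not part of Nutch): reads text on stdin and
# reports whether the given fixed string occurs anywhere in it.
has_outlink() {
  if grep -qF -- "$1"; then
    echo "found: $1"
  else
    echo "missing: $1"
  fi
}

# Usage (assumes the nutch script is on PATH):
#   nutch parsechecker http://www.jabong.com/men/shoes/mens-sports-shoes/ \
#     | has_outlink Sports-White-Tennis-Shoes-2773.html
```

If it reports "missing", the link was not in the page as delivered to
the crawler, whatever the browser shows.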

Sebastian

On 03/26/2012 10:32 AM, blunderboy wrote:
> Hi,
> I am using apache-nutch 1.4 and it is crawling perfectly. But I have
> some issues crawling certain sites.
> To test my crawl, I took http://www.jabong.com
> I found that it is able to crawl the categories but not the product pages.
> 
> For example, look at this:
> http://www.jabong.com/men/shoes/mens-sports-shoes/    -----> (Page1)
> 
> Nutch does not crawl the pages linked from this page.
> The URL of one of the products is:
> http://www.jabong.com/Sports-White-Tennis-Shoes-2773.html    -----> (Prod1)
> 
> 
> After some research, I found that the structure of this site is:
> 1. The home directory contains all the product pages.
> If you look at the source of Page1, it contains a link to Prod1, which is
> actually in the home directory.
> So maybe this is the reason it is not crawling the product pages.
> 
> Can somebody please tell me how to solve this and make Nutch crawl such
> pages too?
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Nutch-not-crawling-jabong-tp3857630p3857630.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
