Hello,

Somehow Nutch is unable to fetch the contents of the website below. It fetches only the text "Analytical Cytometry"; all other text is skipped. I am not sure why this is happening. Nutch is able to crawl and fetch all other websites. I am using Nutch 1.4.
http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/pages/index.aspx

Also, all the links within this page are relative URLs. For example, I want to fetch this URL, which is linked from the page above:

http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx

However, the page contains only a relative URL like this:

/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx

Will Nutch crawl and fetch pages linked through relative URLs by default, i.e. with no additional configuration?

Also, I am not sure how to set the regular expression so that these pages will be fetched. I want to fetch all the pages whose URLs start with http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/ .

Thank you.

Regards,
Sandeep
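
P.S. To make the expected behaviour concrete, here is a small sketch in plain Java. It is not Nutch code and does not show how Nutch resolves or filters links internally. It only illustrates the resolution of the relative link against the page URL and the prefix match I am hoping for. The class name `RelativeUrlSketch`, the methods `resolve` and `wanted`, and the pattern itself are made up for illustration.

```java
import java.net.URI;
import java.util.regex.Pattern;

public class RelativeUrlSketch {
    // Hypothetical prefix pattern: accept only URLs under the "ac" section.
    static final Pattern WANTED = Pattern.compile(
        "^http://cancer\\.osu\\.edu/research/cancerresearch/sharedresources/ac/.*");

    // Resolve a link found on a page against the URL of that page.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    // True if the absolute URL falls under the prefix I want to crawl.
    static boolean wanted(String url) {
        return WANTED.matcher(url).matches();
    }

    public static void main(String[] args) {
        String page =
            "http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/pages/index.aspx";
        String href =
            "/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx";

        String absolute = resolve(page, href);
        System.out.println(absolute);          // the absolute form of the relative link
        System.out.println(wanted(absolute));  // whether it matches the prefix pattern
    }
}
```

In other words, I would expect the relative link to be treated as the absolute URL under the same host, and then accepted because it starts with the prefix above.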