Hello,

Somehow Nutch is unable to fetch the content of the website below. It only
fetches the text "Analytical Cytometry"; all other text is skipped. I am not
sure why this is happening, since Nutch is able to crawl and fetch all other
websites. I am using Nutch 1.4.

http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/pages/index.aspx

Also, all the links within this page are relative URLs.

For example, I want to fetch this URL, which is linked from the page above:
http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx

However, the page only contains a relative URL like this:
/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx
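
To show what I expect to happen, here is a quick Python check of how that
relative link should resolve against the page URL under standard URL
resolution. This is not Nutch code, and I am only assuming that Nutch resolves
links the same way; the variable names are just for illustration.

```python
from urllib.parse import urljoin

# The page Nutch fetched (from this message).
base_url = "http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/pages/index.aspx"

# The relative link as it appears in the page's HTML.
relative_link = "/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx"

# A link starting with "/" replaces the whole path of the base URL,
# keeping only its scheme and host.
resolved = urljoin(base_url, relative_link)
print(resolved)
```

The result is the absolute URL I listed above, so that is the URL I would
expect to show up as an outlink.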

Will Nutch crawl and fetch pages that are linked through relative URLs by
default, i.e. with no additional configuration? Also, I am not sure how to set
the regular expression so that these pages will be fetched. I want to fetch
all pages whose URLs start with
http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/ .
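
The prefix pattern I have in mind looks roughly like the one below. This is
only a Python sketch to test the regular expression itself. The rule style (a
"+" or "-" sign followed by a pattern, with the first matching rule deciding)
is my understanding of conf/regex-urlfilter.txt and may not be exactly how
Nutch evaluates it; RULES and accepts are hypothetical names.

```python
import re

# Hypothetical rules imitating the "+pattern" / "-pattern" style.
# The dots in the host name are escaped so they match literally.
RULES = [
    ("+", r"^http://cancer\.osu\.edu/research/cancerresearch/sharedresources/ac/"),
    ("-", r"."),  # reject everything else
]

def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected

print(accepts("http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx"))
print(accepts("http://cancer.osu.edu/research/"))
```

With these rules, only URLs under the /ac/ path are accepted and everything
else is rejected, which is the behaviour I am trying to get from Nutch.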
Thank you.

Regards,
Sandeep
