Hi Sebastian,

Worked perfectly. Thank you again.

Regards,
Sandeep

On Tue, Jun 19, 2012 at 5:20 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:

> Hi Sandeep,
>
> >>> However, there is just relative url like this
> >>> /research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx
> You don't have to care about relative URLs. They are converted by Nutch
> to absolute URLs and URL filters operate exclusively on absolute URLs.
>
> >>> all the pages which start with
> >>> http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/
> You can use either
> - urlfilter-prefix by adding this prefix to conf/prefix-urlfilter.txt
>   (don't forget to enable the plugin via property plugin.includes)
> - urlfilter-regex by replacing the last two lines of
>   conf/regex-urlfilter.txt
>     # accept anything else
>     +.
>   by
>     +^http://cancer\.osu\.edu/research/cancerresearch/sharedresources/ac/
> see
> http://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website
>
> On 06/19/2012 10:51 PM, Sandeep C R wrote:
> > Hi Sebastian,
> >
> > You are right. After setting it to -1 it worked. I am able to get all the
> > text. Thank you.
> >
> > It will be really helpful if you/others can guide me with the relative URLs
> > and regular expression problem which I have mentioned in the main post.
> >
> > Regards,
> > Sandeep
> >
> > On Tue, Jun 19, 2012 at 4:28 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> >
> >> Hi Sandeep,
> >>
> >>> It just fetches text "Analytical Cytometry".
> >> It looks like the property http.content.limit
> >> is still on its default (64kB) which causes the
> >> document to be truncated right after "Analytical
> >> Cytometry".
> >> Unfortunately, truncated content is not logged,
> >> which would make it easier to locate the reason, see
> >> http://wiki.apache.org/nutch/DebugTool
> >> https://issues.apache.org/jira/browse/NUTCH-1389
> >>
> >> You should increase the value in your nutch-site.xml
> >> and use parsechecker for a quick trial:
> >>   % nutch parsechecker -dumpText \
> >>       http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/pages/index.aspx
> >>
> >> Sebastian
> >>
> >> On 06/19/2012 09:37 PM, Sandeep C R wrote:
> >>> Hello,
> >>>
> >>> Somehow Nutch is unable to fetch contents from the below website. It just
> >>> fetches text "Analytical Cytometry". All other text is skipped. I am not
> >>> sure why this is happening. Nutch is able to crawl and fetch all other
> >>> websites. I am using Nutch 1.4 version.
> >>>
> >>> http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/pages/index.aspx
> >>>
> >>> And also, all the links within this page are relative URLs.
> >>>
> >>> Ex: I want to fetch this url which is within the above url.
> >>>
> >>> http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx
> >>>
> >>> However, there is just relative url like this
> >>> /research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx
> >>> <http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx>
> >>>
> >>> Will nutch crawl/fetch websites with relative URLs by default, i.e. with no
> >>> additional configurations? Also I am not sure how to set regular expression
> >>> so these pages will be fetched. I want to fetch all the pages which start
> >>> with http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/.
> >>> Thank you.
> >>>
> >>> Regards,
> >>> Sandeep
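
For anyone finding this thread later: the two points from Sebastian's reply
(relative links are resolved against the page URL, and URL filters only ever
see the resulting absolute URL) can be tried out away from Nutch. The sketch
below is plain Python, not Nutch code. The function names resolve() and
accepts() are made up for illustration. The rule handling is only my
understanding of how a regex-urlfilter.txt rule list behaves: each rule starts
with + (accept) or - (reject), the first rule whose pattern is found in the
URL decides, and a URL that matches no rule is dropped. Python's regex and URL
handling may differ from Java's in edge cases.

```python
import re
from urllib.parse import urljoin

# URLs taken from the thread above.
BASE = "http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/pages/index.aspx"
RELATIVE = "/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx"

# One (sign, pattern) pair per rule line; "+" accepts, "-" rejects.
# This mirrors the single accept rule suggested in the thread.
RULES = [
    ("+", re.compile(r"^http://cancer\.osu\.edu/research/cancerresearch/sharedresources/ac/")),
]


def resolve(base, link):
    """Turn a link found on a page into an absolute URL (hypothetical helper)."""
    return urljoin(base, link)


def accepts(url, rules=RULES):
    """Assumed filter semantics: first matching rule decides, no match rejects."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False


if __name__ == "__main__":
    absolute = resolve(BASE, RELATIVE)
    print(absolute)
    print(accepts(absolute))                                  # inside the /ac/ subtree
    print(accepts("http://cancer.osu.edu/about/index.aspx"))  # outside the subtree
    print(accepts(RELATIVE))                                  # unresolved relative path
```

The last call shows why the rule is written against the absolute form: the
unresolved relative path does not match a pattern anchored at ^http://, so the
filter has to be applied after resolution.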