Hi Sebastian,

Worked perfectly. Thank you again.

Regards,
Sandeep

On Tue, Jun 19, 2012 at 5:20 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:

> Hi Sandeep,
>
> >>> However, there is just relative url like this
> >>> /research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx
> You don't have to care about relative URLs. They are converted by Nutch
> to absolute URLs and URL filters operate exclusively on absolute URLs.
>
> >>> all the pages which starts with
> >>> http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/
> You can use either
> - urlfilter-prefix, by adding this prefix to conf/prefix-urlfilter.txt
>   (don't forget to enable the plugin via property plugin.includes)
> - urlfilter-regex, by replacing the last two lines of
>   conf/regex-urlfilter.txt
>  # accept anything else
>  +.
>  by
>  +^http://cancer\.osu\.edu/research/cancerresearch/sharedresources/ac/
>  see
> http://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website
>
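
For illustration, the two points above (relative links are resolved to absolute URLs before filtering, and a single accept rule restricts the crawl to one prefix) can be sketched in Python. This is not Nutch's Java code, only a rough model of it: `accepts` and `RULES` are hypothetical names, the rule is assumed to carry a leading `+` as rules in regex-urlfilter.txt do, and the first-match-wins loop that rejects a URL when no rule matches is an assumption about how the regex filter behaves.

```python
import re
from urllib.parse import urljoin

# Page from the thread and the relative link found on it.
BASE = "http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/pages/index.aspx"
RELATIVE = "/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx"

# Hypothetical in-memory stand-in for conf/regex-urlfilter.txt:
# (sign, pattern) pairs, checked from top to bottom.
RULES = [
    ("+", r"^http://cancer\.osu\.edu/research/cancerresearch/sharedresources/ac/"),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumed behavior: a URL matching no rule is dropped


# Resolve the relative link against the page it was found on.
absolute = urljoin(BASE, RELATIVE)

print(absolute)
print(accepts(absolute))
# Hypothetical URL outside the wanted prefix, used only for contrast.
print(accepts("http://cancer.osu.edu/some/other/section/"))
```

The filter only ever sees the resolved form, which is why the rule is written against the full `http://cancer.osu.edu/...` prefix and not against the relative path.
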
> On 06/19/2012 10:51 PM, Sandeep C R wrote:
> > Hi Sebastian,
> >
> > You are right. After setting it to -1 it worked. I am able to get all the
> > text. Thank you.
> >
> > It would be really helpful if you or others could guide me on the relative
> > URLs and the regular expression problem that I mentioned in the main post.
> >
> > Regards,
> > Sandeep
> >
> > On Tue, Jun 19, 2012 at 4:28 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> >
> >> Hi Sandeep,
> >>
> >>> It just fetches text "Analytical Cytometry".
> >> It looks like the property http.content.limit
> >> is still on its default (64kB) which causes the
> >> document to be truncated right after "Analytical
> >> Cytometry".
> >> Unfortunately, truncation is not logged, which
> >> makes the cause hard to locate; see
> >>  http://wiki.apache.org/nutch/DebugTool
> >>  https://issues.apache.org/jira/browse/NUTCH-1389
> >>
> >> You should increase the value in your nutch-site.xml
> >> and use parsechecker for a quick trial:
> >> % nutch parsechecker -dumpText http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/pages/index.aspx
> >>
> >> Sebastian
> >>
> >> On 06/19/2012 09:37 PM, Sandeep C R wrote:
> >>> Hello,
> >>>
> >>> Somehow Nutch is unable to fetch the contents of the website below. It just
> >>> fetches the text "Analytical Cytometry". All other text is skipped. I am not
> >>> sure why this is happening. Nutch is able to crawl and fetch all other
> >>> websites. I am using Nutch 1.4.
> >>>
> >>> http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/pages/index.aspx
> >>>
> >>> Also, all the links within this page are relative URLs.
> >>>
> >>> Ex: I want to fetch this URL, which is linked from the page above:
> >>>
> >>> http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx
> >>>
> >>> However, there is just a relative URL like this:
> >>> /research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx
> >>>
> >>> Will Nutch crawl/fetch pages linked via relative URLs by default, i.e. with no
> >>> additional configuration? Also, I am not sure how to set the regular expression
> >>> so that these pages will be fetched. I want to fetch all the pages that start
> >>> with http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/.
> >>> Thank you.
> >>>
> >>> Regards,
> >>> Sandeep
> >>>
> >>
> >>
> >
>
>
