You need to give more information: what does hadoop.log say? Try running with the debug log setting. One possible cause is your settings in crawl-urlfilter.txt. Do all of those unique links point to subdomains of cnn.com, or are they links to other websites? If they are outside of cnn.com, they may not be traversed, depending on the entries in crawl-urlfilter.txt. Also, even for pages on the cnn.com domain, the particular path has to match the regex rules in crawl-urlfilter.txt.
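As a rough sketch, a crawl-urlfilter.txt restricted to the cnn.com domain might look like the following (the exact default rules vary by Nutch version, so treat this as illustrative, not as your actual file):

```
# skip image, CSS, and JS URLs by file extension
-\.(gif|GIF|jpg|JPG|png|PNG|css|CSS|js|JS)$

# skip URLs containing characters commonly used in session IDs and query strings
-[?*!@=]

# accept any host within the cnn.com domain
+^http://([a-z0-9]*\.)*cnn.com/

# skip everything else
-.
```

Rules are applied top to bottom, and the first matching `+` (accept) or `-` (reject) pattern wins; a URL such as http://www.cnn.com/world?page=2 would be rejected by the `-[?*!@=]` rule before the `+` rule for cnn.com is ever reached, which is one common reason fewer pages are fetched than expected.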
On Thu, May 20, 2010 at 2:42 AM, Artyom Shvedchikov <[email protected]> wrote:
> Hi Nutch community.
>
> We are trying to solve the following task with the help of Nutch:
> a user gives us a site URL and a number of pages to grab, for example
> http://www.cnn.com/ and 100 pages.
> We start Nutch with settings depth=2 topN=100.
> As a result we receive only 16 pages.
> When we start Nutch with settings depth=2 topN=1000 we still receive only
> 17 pages.
>
> But on the home page of cnn.com there are about 50 unique links.
>
> If anyone can explain how we can make Nutch fetch a determined number of
> pages from a site, we would be very appreciative.
>
> Thanks in advance.
> -------------------------------------------------
> Best wishes, Artyom Shvedchikov

