Hi Artyom,

In that case, I am assuming you checked regex-urlfilter.txt. If I am not
mistaken, for a whole-web crawl Nutch uses that file instead of
crawl-urlfilter.txt.
Other things you may want to consider:
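For what it's worth, a minimal regex-urlfilter.txt that keeps a crawl on
cnn.com and its subdomains might look like the sketch below. These rules are
hypothetical and need adapting to your setup; Nutch applies them top to
bottom and uses the first matching +/- prefix:

```
# skip common non-page file types
-\.(gif|jpg|png|css|js|zip|gz)$
# accept anything on cnn.com or one of its subdomains
+^https?://([a-z0-9-]+\.)*cnn\.com/
# reject everything else
-.
```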

1) db.max.outlinks.per.page in nutch-default.xml. It limits the number of
outlinks Nutch follows per page. Try setting it to -1.
2) Make sure the outlinks you mention are not prohibited by robots.txt
(check www.cnn.com/robots.txt).
3) Check http.content.limit in nutch-default.xml. It limits the amount of
content downloaded from a page, which in turn limits the number of
outlinks found. Try setting it to -1.
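Rather than editing nutch-default.xml directly, the usual approach is to
override properties 1) and 3) in conf/nutch-site.xml. A sketch, using the
property names from nutch-default.xml:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Follow every outlink found on a page (-1 = no limit) -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
  <!-- Download full page content so outlinks are not truncated away (-1 = no limit) -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```

Values set in nutch-site.xml take precedence over nutch-default.xml, so
your changes survive upgrades.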

If all else fails, debug through the getOutlinks method in
DOMContentUtils.java :-)

Harry



On Thu, May 20, 2010 at 7:06 PM, Artyom Shvedchikov <[email protected]> wrote:

> Hello, thanks for the fast reply.
> We do not use the crawl tool; we use the runbot script from the Nutch wiki
> for whole-web crawling (it runs the generate/fetch/update cycle, using the
> depth parameter as the cycle count), so crawl-urlfilter.txt does not apply
> in our case.
> We also do not use any other plug-in for URL filtering, but we do
> set db.ignore.external.links to true to skip external links.
> Our goal is to grab a fixed number of pages from a single site, for
> example 1000 pages from cnn.com or its subdomains only.
>
> -------------------------------------------------
> Best wishes, Artyom Shvedchikov
>
>
>
> On Thu, May 20, 2010 at 8:10 AM, Harry Nutch <[email protected]> wrote:
>
>> You need to give more information: what does hadoop.log say? Try running
>> with the debug log setting.
>> One reason could be your settings in crawl-urlfilter.txt. Do all those
>> unique links point to subdomains of cnn.com, or are they links to other
>> websites? If they are outside cnn.com, they might not be traversed,
>> depending on the entries in crawl-urlfilter.txt. Also, even for web pages
>> on the cnn.com domain, each URL path needs to match the regex rules in
>> crawl-urlfilter.txt.
>>
>>
>> On Thu, May 20, 2010 at 2:42 AM, Artyom Shvedchikov <[email protected]
>> >wrote:
>>
>> > Hi Nutch community.
>> >
>> > We are trying to solve the following task with Nutch: a user gives us a
>> > path on a site and the number of pages to grab, for example
>> > http://www.cnn.com/ and 100 pages.
>> >  We start Nutch with depth = 2 and topN = 100.
>> >  As a result we receive only 16 pages.
>> >  When we start Nutch with depth = 2 and topN = 1000, we still receive
>> > only 17 pages.
>> >
>> >  But the home page of cnn.com has about 50 unique links.
>> >
>> >  If anyone can explain how to make Nutch fetch a fixed number of pages
>> > from a site, we would be very grateful.
>> >
>> > Thanks in advance.
>> > -------------------------------------------------
>> > Best wishes, Artyom Shvedchikov
>> >
>>
>
>
