Hello, Harry

Thanks a lot for your answer. Yes, we checked the regex-urlfilter.txt file and it contains only the default restrictions. We changed "db.max.outlinks.per.page" to -1 and the number of fetched pages increased. We also set "http.content.limit" to -1, which solved another problem with our custom plug-in :))

-------------------------------------------------
Best wishes, Artyom Shvedchikov
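[For anyone finding this thread in the archives: the two overrides Artyom describes would normally go into conf/nutch-site.xml rather than editing nutch-default.xml in place. A minimal sketch, assuming the standard Hadoop-style configuration format Nutch uses; property names and values are taken from this thread:]

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Site-specific overrides; values here take precedence over nutch-default.xml -->
<configuration>
  <!-- -1 removes the cap on outlinks collected per page -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
  <!-- -1 removes the limit on content downloaded per page,
       so large pages are parsed in full and all outlinks are found -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```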
On Mon, May 24, 2010 at 4:20 AM, Harry Nutch <[email protected]> wrote:

> Hi Artyom,
>
> In that case, I am assuming you checked regex-urlfilter.txt. If I am not
> mistaken, for a complete web crawl, Nutch uses that file instead of
> crawl-urlfilter. Other things you may want to consider:
>
> 1) db.max.outlinks.per.page in nutch-default.xml. It limits the number of
> outlinks Nutch traverses. Try it with the value -1.
> 2) Make sure the outlinks that you mention are not prohibited by
> robots.txt (check http://www.cnn.com/robots.txt).
> 3) Check http.content.limit in nutch-default.xml. It limits the content
> downloaded from a page, which in turn limits the number of outlinks found.
> Try it with the value -1.
>
> If all else fails, debug through the method getOutlinks in
> DOMContentUtils.java :-)
>
> Harry
>
> On Thu, May 20, 2010 at 7:06 PM, Artyom Shvedchikov <[email protected]> wrote:
>
> > Hello, thanks for the fast reply.
> > We do not use the crawl tool; we use the runbot script from the Nutch
> > wiki for whole-web crawling (it runs a generate/fetch/update cycle using
> > the depth parameter as the cycle count), so crawl-urlfilter.txt does not
> > apply in that case. We also do not use any other plug-in for URL
> > filtering, but we set db.ignore.external.links to true to skip external
> > links.
> > Our goal is to grab a given number of pages from a single site. For
> > example, 1000 pages from only cnn.com or its subdomains.
> >
> > -------------------------------------------------
> > Best wishes, Artyom Shvedchikov
> >
> > On Thu, May 20, 2010 at 8:10 AM, Harry Nutch <[email protected]> wrote:
> >
> >> You need to give more information. What does hadoop.log say? Try
> >> running with the debug log setting.
> >> One reason could be your settings in crawl-urlfilter. Do all those
> >> unique links point to subdomains on cnn.com, or are they links to some
> >> other websites?
> >> If they are outside of cnn.com, they might not be traversed, depending
> >> on the entries in crawl-urlfilter.txt. Also, even for web pages on the
> >> cnn.com domain, each particular path needs to match the regex rules
> >> present in crawl-urlfilter.txt.
> >>
> >> On Thu, May 20, 2010 at 2:42 AM, Artyom Shvedchikov <[email protected]> wrote:
> >>
> >> > Hi Nutch community.
> >> >
> >> > We are trying to solve the following task with the help of Nutch:
> >> > a user gives us a path on a site and a number of pages to grab. For
> >> > example, http://www.cnn.com/ and 100 pages.
> >> > We start Nutch with the settings depth=2, topN=100.
> >> > As a result we receive only 16 pages.
> >> > When we start Nutch with the settings depth=2, topN=1000, we still
> >> > receive only 17 pages.
> >> >
> >> > But on the home page of cnn.com there are nearly 50 unique links.
> >> >
> >> > If anyone can explain how we can make Nutch fetch a given number of
> >> > pages from a site, we would be very grateful.
> >> >
> >> > Thanks in advance.
> >> > -------------------------------------------------
> >> > Best wishes, Artyom Shvedchikov
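[Archive note: Harry's point (2), checking robots.txt, can be tested without running a crawl. A small sketch using Python's standard urllib.robotparser; the rules below are made-up examples, not cnn.com's actual robots.txt:]

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content -- substitute the real file fetched
# from http://www.cnn.com/robots.txt when checking an actual crawl.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Outlinks that Nutch appears to be skipping can be tested one by one.
print(rp.can_fetch("*", "http://www.cnn.com/world/index.html"))   # True
print(rp.can_fetch("*", "http://www.cnn.com/private/page.html"))  # False
```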
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip https: telnet: file: ftp: and mailto: urls
-^(https|telnet|file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[...@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
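[Archive note: the filter file above is applied with first-match-wins semantics, which is easy to forget when debugging missing outlinks. A rough simulation of that evaluation order in Python, using a small subset of the rules; this is an illustration of the idea, not Nutch's actual RegexURLFilter code:]

```python
import re

# (sign, pattern) pairs in file order -- a subset of the rules above.
RULES = [
    ("-", r"^(https|telnet|file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|css|CSS|js|JS|pdf|PDF)$"),
    ("+", r"."),
]

def accepts(url: str) -> bool:
    """Return True if the first rule matching the URL is a '+' rule.

    As in Nutch's regex-urlfilter, a URL that matches no rule is ignored.
    """
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False

print(accepts("http://www.cnn.com/world/"))    # True  (falls through to +.)
print(accepts("http://www.cnn.com/logo.png"))  # False (suffix rule)
print(accepts("https://www.cnn.com/"))         # False (scheme rule)
```

Because the first match decides the outcome, a URL rejected early (e.g. by the scheme rule) never reaches the catch-all `+.` at the end.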

