Hello, Harry

Thanks a lot for your answer. Yes, we checked the regex-urlfilter.txt file and it contains only the default restrictions. We changed "db.max.outlinks.per.page" to -1 and the number of fetched pages increased. We also set "http.content.limit" to -1, which solved another problem with our custom plug-in :))

-------------------------------------------------
Best wishes, Artyom Shvedchikov
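[For anyone finding this thread in the archives: the two overrides Artyom describes would normally go into conf/nutch-site.xml rather than editing nutch-default.xml in place. A minimal sketch, assuming the standard Hadoop-style configuration format Nutch uses; property names and values are taken from this thread:]

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Site-specific overrides; values here take precedence over nutch-default.xml -->
<configuration>
  <!-- -1 removes the cap on outlinks collected per page -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
  <!-- -1 removes the limit on content downloaded per page,
       so large pages are parsed in full and all outlinks are found -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```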
On Mon, May 24, 2010 at 4:20 AM, Harry Nutch <[email protected]> wrote:

> Hi Artyom,
>
> In that case, I am assuming you checked regex-urlfilter.txt. If I am not
> mistaken, for a complete web crawl, Nutch uses that file instead of
> crawl-urlfilter. Other things you may want to consider:
>
> 1) db.max.outlinks.per.page in nutch-default.xml. It limits the number of
> outlinks Nutch traverses. Try it with the value -1.
> 2) Make sure the outlinks that you mention are not prohibited by
> robots.txt (check http://www.cnn.com/robots.txt).
> 3) Check http.content.limit in nutch-default.xml. It limits the content
> downloaded from a page, which in turn limits the number of outlinks found.
> Try it with the value -1.
>
> If all else fails, debug through the method getOutlinks in
> DOMContentUtils.java :-)
>
> Harry
>
> On Thu, May 20, 2010 at 7:06 PM, Artyom Shvedchikov <[email protected]> wrote:
>
> > Hello, thanks for the fast reply.
> > We do not use the crawl tool; we use the runbot script from the Nutch
> > wiki for whole-web crawling (it runs a generate/fetch/update cycle using
> > the depth parameter as the cycle count), so crawl-urlfilter.txt does not
> > apply in that case. We also do not use any other plug-in for URL
> > filtering, but we set db.ignore.external.links to true to skip external
> > links.
> > Our goal is to grab a given number of pages from a single site. For
> > example, 1000 pages from only cnn.com or its subdomains.
> >
> > -------------------------------------------------
> > Best wishes, Artyom Shvedchikov
> >
> > On Thu, May 20, 2010 at 8:10 AM, Harry Nutch <[email protected]> wrote:
> >
> >> You need to give more information. What does hadoop.log say? Try
> >> running with the debug log setting.
> >> One reason could be your settings in crawl-urlfilter. Do all those
> >> unique links point to subdomains on cnn.com, or are they links to some
> >> other websites?
> >> If they are outside of cnn.com, they might not be traversed, depending
> >> on the entries in crawl-urlfilter.txt. Also, even for web pages on the
> >> cnn.com domain, each particular path needs to match the regex rules
> >> present in crawl-urlfilter.txt.
> >>
> >> On Thu, May 20, 2010 at 2:42 AM, Artyom Shvedchikov <[email protected]> wrote:
> >>
> >> > Hi Nutch community.
> >> >
> >> > We are trying to solve the following task with the help of Nutch:
> >> > a user gives us a path on a site and a number of pages to grab. For
> >> > example, http://www.cnn.com/ and 100 pages.
> >> > We start Nutch with the settings depth=2, topN=100.
> >> > As a result we receive only 16 pages.
> >> > When we start Nutch with the settings depth=2, topN=1000, we still
> >> > receive only 17 pages.
> >> >
> >> > But on the home page of cnn.com there are nearly 50 unique links.
> >> >
> >> > If anyone can explain how we can make Nutch fetch a given number of
> >> > pages from a site, we would be very grateful.
> >> >
> >> > Thanks in advance.
> >> > -------------------------------------------------
> >> > Best wishes, Artyom Shvedchikov
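[Archive note: Harry's point (2), checking robots.txt, can be tested without running a crawl. A small sketch using Python's standard urllib.robotparser; the rules below are made-up examples, not cnn.com's actual robots.txt:]

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content -- substitute the real file fetched
# from http://www.cnn.com/robots.txt when checking an actual crawl.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Outlinks that Nutch appears to be skipping can be tested one by one.
print(rp.can_fetch("*", "http://www.cnn.com/world/index.html"))   # True
print(rp.can_fetch("*", "http://www.cnn.com/private/page.html"))  # False
```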
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip https: telnet: file: ftp: and mailto: urls
-^(https|telnet|file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[...@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
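[Archive note: the filter file above is applied with first-match-wins semantics, which is easy to forget when debugging missing outlinks. A rough simulation of that evaluation order in Python, using a small subset of the rules; this is an illustration of the idea, not Nutch's actual RegexURLFilter code:]

```python
import re

# (sign, pattern) pairs in file order -- a subset of the rules above.
RULES = [
    ("-", r"^(https|telnet|file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|css|CSS|js|JS|pdf|PDF)$"),
    ("+", r"."),
]

def accepts(url: str) -> bool:
    """Return True if the first rule matching the URL is a '+' rule.

    As in Nutch's regex-urlfilter, a URL that matches no rule is ignored.
    """
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False

print(accepts("http://www.cnn.com/world/"))    # True  (falls through to +.)
print(accepts("http://www.cnn.com/logo.png"))  # False (suffix rule)
print(accepts("https://www.cnn.com/"))         # False (scheme rule)
```

Because the first match decides the outcome, a URL rejected early (e.g. by the scheme rule) never reaches the catch-all `+.` at the end.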

