Hello, thanks for the fast reply.
We do not use the crawl tool; we use the runbot script from the Nutch wiki for
whole-web crawling (it runs a generate/fetch/update cycle, using the depth
parameter as the cycle count), so crawl-urlfilter.txt does not apply in our
case. We do not use any other plugin for URL filtering, but we do set
db.ignore.external.links to true to skip external links.
Our goal is to grab a fixed number of pages from one specific site, for
example 1000 pages from cnn.com and its subdomains only.
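In outline, the cycle runbot performs looks roughly like this (a simplified dry-run sketch: the commands are only echoed, not executed, and the crawl/ paths and segment names are placeholders, not our actual layout):

```shell
#!/bin/sh
# Dry-run sketch of runbot's generate/fetch/update loop.
# The real script invokes bin/nutch; here each command is echoed instead.
depth=2    # number of generate/fetch/update rounds
topN=100   # max URLs selected per generate round

i=1
while [ "$i" -le "$depth" ]; do
  echo "bin/nutch generate crawl/crawldb crawl/segments -topN $topN"
  echo "bin/nutch fetch crawl/segments/SEGMENT_$i"
  echo "bin/nutch updatedb crawl/crawldb crawl/segments/SEGMENT_$i"
  i=$((i + 1))
done
```

So the fetched page count can never exceed depth x topN, but it can be far smaller when few URLs survive filtering in each round.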

-------------------------------------------------
Best wishes, Artyom Shvedchikov


On Thu, May 20, 2010 at 8:10 AM, Harry Nutch <[email protected]> wrote:

> You need to give more information: what does hadoop.log say? Try running
> with the debug log setting.
> One reason could be your settings in crawl-urlfilter.txt. Do all those unique
> links point to subdomains of cnn.com, or are they links to other websites?
> If they are outside of cnn.com, they might not be traversed, depending on
> the entries in crawl-urlfilter.txt. Also, even for web pages on the cnn.com
> domain, each particular path needs to match the regex rules present in
> crawl-urlfilter.txt.
>
>
> On Thu, May 20, 2010 at 2:42 AM, Artyom Shvedchikov <[email protected]
> >wrote:
>
> > Hi Nutch community.
> >
> > We are trying to solve the following task with the help of Nutch:
> >  The user gives us a path on a site and a number of pages to grab, for
> > example http://www.cnn.com/ and 100 pages.
> >  We start Nutch with the settings depth=2, topN=100.
> >  As a result, we receive only 16 pages.
> >  When we start Nutch with the settings depth=2, topN=1000, we still
> > receive only 17 pages.
> >
> >  But the home page of cnn.com has around 50 unique links.
> >
> >  If anyone can explain how we can make Nutch grab a set number of pages
> > from a site, we would greatly appreciate it.
> >
> > Thanks in advance.
> > -------------------------------------------------
> > Best wishes, Artyom Shvedchikov
> >
>

Attachment: runbot.sh
Description: Bourne shell script

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>http.agent.name</name>
  <value>NTNG-test</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

  http.robots.agents
  http.agent.description
  http.agent.url
  http.agent.email
  http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Testing purposes</value>
  <description>Further description of our bot - this text is used in
  the User-Agent header.  It appears in parentheses after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://www.notagsnoglory.com/</value>
  <description>A URL to advertise in the User-Agent header.  This will 
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>NTNG-test</value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>NTNG-test,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

<property>
  <name>fetcher.verbose</name> 
  <value>false</value>
  <description>If true, fetcher will log more verbosely.</description>
</property>
            
<property>
  <name>http.verbose</name>
  <value>false</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>   

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality links.
  </description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor|wa|more)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

</configuration>
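A note on the plugin.includes above: with urlfilter-regex enabled, the scripted whole-web tools consult conf/regex-urlfilter.txt (crawl-urlfilter.txt is, as far as we understand, only read by the one-shot crawl command). An illustrative regex-urlfilter.txt fragment that would restrict the crawl to cnn.com and its subdomains - hypothetical, not taken from our actual setup - might look like:

```
# accept http pages on cnn.com or any of its subdomains
+^http://([a-z0-9-]+\.)*cnn\.com/

# skip everything else
-.
```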
