I set up nutch-site.xml in the following way:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>

<property>
  <name>file.crawl.parent</name>
  <value>false</value>
  <description>By default the crawler is not restricted to the directories
    specified in the URLs file; it also follows links up into the parent
    directories. Setting this to false changes that behavior so that only
    directories beneath the ones you specify get crawled.</description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>My Nutch Spider, *</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>http.max.delays</name>
  <value>10</value>
  <description>The number of times a thread will delay when trying to
  fetch a page. Each time it finds that a host is busy, it will wait
  fetcher.server.delay. After http.max.delays attempts, it will give
  up on the page for now.</description>
</property>

<property>
  <name>http.accept.language</name>
  <value>it, en;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field.
  This allows selecting a non-English language as the default one to
  retrieve. It is a useful setting for search engines built for a certain
  national group.
  </description>
</property>

<property>
  <name>http.verbose</name>
  <value>true</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>

<property>
  <name>http.redirect.max</name>
  <value>3</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, the fetcher won't
  immediately follow redirected URLs; instead it will record them for later
  fetching.
  </description>
</property>

<property>
  <name>db.max.outlinks.per.page</name>
  <value>1000</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
  outlinks will be processed for a page; otherwise, all outlinks will be
  processed.
  </description>
</property>

<property>
  <name>fetcher.server.delay</name>
  <value>2.0</value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server.</description>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>8</value>
  <description>The number of FetcherThreads the fetcher should use.
    This also determines the maximum number of requests that are
    made at once (each FetcherThread handles one connection).</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>5</value>
  <description>This number is the maximum number of threads that
    should be allowed to access a host at one time.</description>
</property>

<property>
  <name>fetcher.verbose</name>
  <value>true</value>
  <description>If true, fetcher will log more verbosely.</description>
</property>

<property>
  <name>fetcher.parse</name>
  <value>true</value>
  <description>If true, fetcher will parse content. Default is false, which
  means that a separate parsing step is required after fetching is
  finished.</description>
</property>

<property>
  <name>plugin.includes</name>

<value>protocol-http|urlfilter-regex|parse-(html|tika|pdf|doc)|index-(basic|anchor)|urlmeta|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need to include at least the nutch-extensionpoints plugin.
  By default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with
  the underlying commons-httpclient library.
  </description>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

<property>
  <name>urlmeta.tags</name>
  <value>idadmin</value>
  <description>
    To be used in conjunction with features introduced in NUTCH-655, which
    allows custom metatags to be injected alongside your crawl URLs.
    Specifying those custom tags here will allow for their propagation into
    a page's outlinks, as well as allow for them to be included as part of
    an index.
    Values should be comma-delimited ("tag1,tag2,tag3"). Do not pad the tags
    with white-space at their boundaries if you are using anything earlier
    than Hadoop-0.21.
  </description>
</property>
</configuration>
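
As far as I can tell, the part that actually keeps the crawl inside my
domain is not in nutch-site.xml at all but in regex-urlfilter.txt. Following
Lewis's suggestion below, a minimal sketch of the rules I'm experimenting
with looks like this (the host names are just the placeholders from this
thread, and the "albi"/"albo" pattern is only my guess at how those URLs
look, so treat it as untested):

# leave the seed domains only for the document registers
# ("albi" or "albo" somewhere in the URL)
+^http://.*(albi|albo)

# accept the seed domains and their subdomains
+^http://([a-z0-9-]*\.)*aaa\.it/
+^http://([a-z0-9-]*\.)*bbb\.gov\.it/
+^http://([a-z0-9-]*\.)*ccc\.it/

# reject everything else
-.

Since the regex filter applies the first rule that matches, the
"albi"/"albo" line has to come before the final "-." catch-all; I still
need to verify all of this with a small test crawl.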


2011/12/1 Lewis John Mcgibbney <[email protected]>

> If you also provide the settings from nutch-site.xml which restrict your
> Nutchbot from crawling outside some specified domain, that would be helpful.
>
> At this stage I think that if your restrictions completely deny Nutch from
> following outlinks to other domains, then the use of reg-ex filters is
> pointless. This is not what you wish to be configuring. Instead you want to
> be allowing Nutch to crawl outlinks to other domains but limit which
> domains you wish to crawl. I think it should be possible to add the filters
> in your reg-ex file like
>
> # accept the following but block everything else
>
> +^http://([a-z0-9]*\.)*somesite.it/
> +^http://([a-z0-9]*\.)*aaa.it/
> +^http://([a-z0-9]*\.)*bbb.it/
> etc
>
> I don't think you will need to explicitly deny everything else. However,
> you'll only find out by doing a number of small test crawls to check
> whether your reg-ex filters are working.
>
> HTH
>
> On Thu, Dec 1, 2011 at 8:57 AM, Adriana Farina
> <[email protected]> wrote:
>
> > Hi!
> >
> > Thank you for your answer. You're right, maybe an example would explain
> > better what I need to do.
> >
> > I have to perform the following task. I have to explore a specific
> > domain (.gov.it) and I have an initial set of seeds, for example
> > www.aaa.it, www.bbb.gov.it, www.ccc.it. I configured Nutch so that it
> > doesn't fetch pages outside that domain. However, some resources I need
> > to download (documents) are stored on web sites that are not inside the
> > domain I'm interested in.
> > For example: www.aaa.it/subfolder/albi redirects to www.somesite.it
> > (where www.somesite.it is not inside "my" domain). Nutch will not fetch
> > that page since I told it to behave that way, but I need to download
> > documents stored on www.somesite.it. So I need Nutch to go outside the
> > domain I specified only when it sees the words "albi" or "albo" inside
> > the URL, since those words identify the documents I need. How can I do
> > this?
> >
> > I hope I've been clear. :)
> >
> >
> >
> > 2011/11/30 Lewis John Mcgibbney <[email protected]>
> >
> > > Hi Adriana,
> > >
> > > This should be achievable through fine-grained URL filters. It is kind
> > > of hard to substantiate on this without you providing some examples of
> > > the type of stuff you're trying to do!
> > >
> > > Lewis
> > >
> > > On Mon, Nov 28, 2011 at 11:14 AM, Adriana Farina
> > > <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I've been using Nutch 1.3 for just a month, so I'm not an expert. I
> > > > configured it so that it doesn't fetch pages outside a specific
> > > > domain. However, now I need to let it fetch pages outside the domain
> > > > I chose, but only for some URLs (not for all the URLs I have to
> > > > crawl). How can I do this? Do I have to write a new plugin?
> > > >
> > > > Thanks.
> > > >
> > >
> > >
> > >
> > > --
> > > *Lewis*
> > >
> >
>
>
>
> --
> *Lewis*
>
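
One more thought, mostly as a note to self: if I remember right,
nutch-default.xml also has a db.ignore.external.links property, and setting
it to true makes Nutch discard every outlink that points to another host.
That would be too blunt here (it would also drop the off-domain
"albi"/"albo" documents I need), but for a strictly in-domain crawl it
would replace the reg-ex rules entirely:

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>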
