That will work, but use nutch.fetchInterval.fixed if you use an adaptive fetch scheduler.
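
For illustration (the URL is a placeholder), a seed entry with a fixed per-URL interval is the URL, then a tab, then the metadata:

  http://www.example.com/    nutch.fetchInterval.fixed=86400

Unlike nutch.fetchInterval, which an adaptive scheduler may adjust over time, nutch.fetchInterval.fixed pins the interval at the given value (86400 seconds, i.e. 24 hours).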

-----Original message-----
> From: Julien Nioche <[email protected]>
> Sent: Friday 23rd May 2014 12:09
> To: [email protected]
> Subject: Re: Re-crawl every 24 hours
> 
> Hi
> 
> This will work with 1.8 indeed. What procedure do you mean? Just add
> nutch.fetchInterval to the seeds, that's all.
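> 
> For example (URL and value are placeholders), a seed line with a per-URL
> interval is just the URL, a tab, and the metadata:
> 
>   http://www.example.com/    nutch.fetchInterval=86400
> 
> Injecting it as usual (e.g. bin/nutch inject crawl/crawldb urls/) stores
> that interval with the record.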
> 
> J.
> 
> 
> On 23 May 2014 10:13, Ali Nazemian <[email protected]> wrote:
> 
> > Dear Julien,
> > Hi,
> > Do you know any step-by-step guide for this procedure? Is this the same
> > for Nutch 1.8?
> > Best regards.
> >
> >
> > On Wed, May 21, 2014 at 6:43 PM, Julien Nioche <
> > [email protected]> wrote:
> >
> > > <property>
> > >   <name>db.fetch.interval.default</name>
> > >   <value>1800</value>
> > >   <description>The default number of seconds between re-fetches of a page
> > > (30 days).
> > >   </description>
> > > </property>
> > >
> > > means that a page which has already been fetched will be re-fetched
> > > after 30 minutes (1800 seconds, despite the description still saying
> > > 30 days). This is what you want for the seeds, but it also applies to
> > > the subpages you've already discovered in previous rounds.
> > >
> > > What you could do is set a custom fetch interval for the seeds only
> > > (see http://wiki.apache.org/nutch/bin/nutch%20inject for the use of
> > > nutch.fetchInterval) and use a larger value for
> > > db.fetch.interval.default. This way the seeds would be revisited
> > > frequently but not the subpages. Note that this would work only if the
> > > links to the pages you want to discover are directly in the seed files.
> > > If they are at a deeper level, they'd be discovered only when the page
> > > that mentions them is re-fetched (i.e. after nutch.fetchInterval).
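> > >
> > > As a rough sketch (the value is illustrative), the nutch-site.xml side
> > > of this would raise the default so subpages are revisited less often:
> > >
> > > <property>
> > >   <name>db.fetch.interval.default</name>
> > >   <value>2592000</value>
> > >   <description>Default number of seconds between re-fetches of a page
> > >   (30 days).
> > >   </description>
> > > </property>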
> > >
> > > HTH
> > >
> > > Julien
> > >
> > >
> > > On 21 May 2014 11:22, Ali rahmani <[email protected]> wrote:
> > >
> > > > Dear Sir,
> > > > I am customizing Nutch 2.2 to crawl my seed list, which contains
> > > > about 30 URLs. I need to crawl the mentioned URLs every 24 minutes
> > > > and fetch ONLY newly added links. I added the following configuration
> > > > to the nutch-site.xml file and used the following command:
> > > >
> > > > <property>
> > > >   <name>db.fetch.interval.default</name>
> > > >   <value>1800</value>
> > > >   <description>The default number of seconds between re-fetches of a
> > > >   page (30 days).
> > > >   </description>
> > > > </property>
> > > >
> > > > <property>
> > > >   <name>db.update.purge.404</name>
> > > >   <value>true</value>
> > > >   <description>If true, updatedb will purge records with status
> > > >   DB_GONE from the CrawlDB.
> > > >   </description>
> > > > </property>
> > > >
> > > >
> > > > ./crawl urls/ testdb http://localhost:8983/solr 2
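> > > >
> > > > (For reference: the arguments are the seed directory, the crawl
> > > > directory/ID, the Solr URL, and the number of rounds, so the trailing
> > > > "2" runs two generate/fetch/parse/updatedb rounds; each round can
> > > > follow links one level deeper.)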
> > > >
> > > >
> > > > but whenever I run the above command, Nutch goes deeper and deeper.
> > > > Would you please tell me where the problem is?
> > > > Regards,
> > >
> > >
> > >
> > >
> > > --
> > >
> > > Open Source Solutions for Text Engineering
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com
> > > http://twitter.com/digitalpebble
> > >
> >
> >
> >
> > --
> > A.Nazemian
> >
> 
> 
> 
> -- 
> 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
> 
