Hi Julien,
Would you please guide me on how a re-crawling script should look? Even after
adding the fetch interval parameter, the crawler goes deeper and deeper when I
run the following steps:
1) ./nutch inject urls/
2) Loop {
     ./nutch generate -topN 2000
     ./nutch fetch [CrawlID]
     ./nutch parse [CrawlID]
     ./nutch updatedb
   }
It is worth mentioning that I run these steps again after 24 hours.
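In script form, this is roughly what I run (a sketch only; "myCrawl" is a
placeholder for my actual crawl id, and exact flags can differ between 2.x
releases):

#!/bin/bash
# one-off: inject the seed URLs into the web table for this crawl id
./nutch inject urls/ -crawlId myCrawl

# re-crawl loop, one round every 24 hours
while true; do
  BATCH=$(date +%s)    # timestamp used as the batch id for this round
  ./nutch generate -topN 2000 -crawlId myCrawl -batchId $BATCH
  ./nutch fetch $BATCH -crawlId myCrawl
  ./nutch parse $BATCH -crawlId myCrawl
  ./nutch updatedb -crawlId myCrawl
  sleep 86400          # wait 24 hours before the next round
done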
Regards,
A.R
On Friday, May 23, 2014 2:39:13 PM, Julien Nioche
<[email protected]> wrote:
Hi
This will work with 1.8 indeed. What procedure do you mean? Just add
nutch.fetchInterval to the seeds, that's all.
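For example, a seed line with a per-URL interval looks like this (example URL;
the metadata is appended to the URL as tab-separated key=value pairs, with the
interval in seconds, so 86400 = 24 hours):

http://www.example.com/	nutch.fetchInterval=86400

The injector then schedules that seed for re-fetch every 24 hours, regardless
of db.fetch.interval.default.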
J.
On 23 May 2014 10:13, Ali Nazemian <[email protected]> wrote:
> Dear Julien,
> Hi,
> Do you know of any step-by-step guide for this procedure? Is it the same for
> Nutch 1.8?
> Best regards.
>
>
> On Wed, May 21, 2014 at 6:43 PM, Julien Nioche <[email protected]> wrote:
>
> > <property>
> > <name>db.fetch.interval.default</name>
> >
> > <value>1800</value>
> > <description>The default number of seconds between re-fetches of a page
> > (30 days).
> > </description>
> > </property>
> >
> > means that a page which has already been fetched will be refetched after 30
> > minutes (1800 seconds; the "(30 days)" in the description is left over from
> > the default value). This is what you want for the seeds, but it is also
> > applied to the subpages you've already discovered in previous rounds.
> >
> > What you could do is set a custom fetch interval for the seeds only (see
> > http://wiki.apache.org/nutch/bin/nutch%20inject for the use of
> > nutch.fetchInterval) and have a larger value for db.fetch.interval.default.
> > This way the seeds would be revisited frequently but not the subpages. Note
> > that this would work only if the links to the pages you want to discover
> > are directly in the seed files. If they are at a deeper level, they'd be
> > discovered only when the page that mentions them is re-fetched (i.e. at
> > nutch.fetchInterval).
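> >
> > As a concrete sketch (example URL and values):
> >
> > # urls/seed.txt - seed revisited daily, metadata tab-separated
> > http://www.example.com/	nutch.fetchInterval=86400
> >
> > # nutch-site.xml - everything else stays on the 30-day default
> > <property>
> > <name>db.fetch.interval.default</name>
> > <value>2592000</value>
> > </property>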
> >
> > HTH
> >
> > Julien
> >
> >
> > On 21 May 2014 11:22, Ali rahmani <[email protected]> wrote:
> >
> > > Dear Sir,
> > > I am customizing Nutch 2.2 to crawl my seed list, which contains about 30
> > > URLs. I need to crawl these URLs every 24 minutes and JUST fetch newly
> > > added links. I added the following configuration to the nutch-site.xml
> > > file and used the following command:
> > >
> > > <property>
> > > <name>db.fetch.interval.default</name>
> > > <value>1800</value>
> > > <description>The default number of seconds between re-fetches of a page
> > > (30 days).
> > > </description>
> > > </property>
> > >
> > >
> > > <property>
> > > <name>db.update.purge.404</name>
> > > <value>true</value>
> > > <description>If true, updatedb will add purge records with status DB_GONE
> > > from the CrawlDB.
> > > </description>
> > > </property>
> > >
> > >
> > > ./crawl urls/ testdb http://localhost:8983/solr 2
> > >
> > >
> > > but whenever I run the above command, Nutch goes deeper and deeper.
> > > Would you please tell me where the problem is?
> > >
> > > Regards,
> >
> >
> >
> >
> > --
> >
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >
>
>
>
> --
> A.Nazemian
>
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble