Hi,

This will work with 1.8 indeed. What procedure do you mean? Just add nutch.fetchInterval to the seeds, that's all.
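For example, a seed file entry with a per-URL interval could look like this (a minimal sketch; the URL is made up, and the metadata is a tab-separated name=value pair, as described on the inject wiki page quoted below):

http://www.example.com/	nutch.fetchInterval=1800

At inject time this becomes that page's custom fetch interval, so the seed is re-fetched every 30 minutes regardless of db.fetch.interval.default.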
J.

On 23 May 2014 10:13, Ali Nazemian <[email protected]> wrote:

> Dear Julien,
> Hi,
> Do you know any step by step guide for this procedure? Is it the same for
> Nutch 1.8?
> Best regards.
>
>
> On Wed, May 21, 2014 at 6:43 PM, Julien Nioche <
> [email protected]> wrote:
>
> > <property>
> >   <name>db.fetch.interval.default</name>
> >   <value>1800</value>
> >   <description>The default number of seconds between re-fetches of a page
> >   (30 days).
> >   </description>
> > </property>
> >
> > means that a page which has already been fetched will be re-fetched
> > after 30 minutes. This is what you want for the seeds, but it is also
> > applied to the subpages you have already discovered in previous rounds.
> >
> > What you could do is set a custom fetch interval for the seeds only (see
> > http://wiki.apache.org/nutch/bin/nutch%20inject for the use of
> > nutch.fetchInterval) and use a larger value for db.fetch.interval.default.
> > This way the seeds would be revisited frequently but not the subpages.
> > Note that this works only if the links to the pages you want to discover
> > are directly in the seed files. If they are at a deeper level, they will
> > be discovered only when the page that mentions them is re-fetched (i.e.
> > every nutch.fetchInterval).
> >
> > HTH
> >
> > Julien
> >
> >
> > On 21 May 2014 11:22, Ali rahmani <[email protected]> wrote:
> >
> > > Dear Sir,
> > > I am customizing Nutch 2.2 to crawl my seed list, which contains about
> > > 30 URLs. I need to crawl these URLs every 24 minutes and fetch ONLY
> > > newly added links. I added the following configuration to the
> > > nutch-site.xml file and used the following command:
> > >
> > > <property>
> > >   <name>db.fetch.interval.default</name>
> > >   <value>1800</value>
> > >   <description>The default number of seconds between re-fetches of a
> > >   page (30 days).
> > >   </description>
> > > </property>
> > >
> > > <property>
> > >   <name>db.update.purge.404</name>
> > >   <value>true</value>
> > >   <description>If true, updatedb will add purge records with status
> > >   DB_GONE from the CrawlDB.
> > >   </description>
> > > </property>
> > >
> > > ./crawl urls/ testdb http://localhost:8983/solr 2
> > >
> > > But whenever I run the above command, Nutch goes deeper and deeper.
> > > Would you please tell me where the problem is?
> > > Regards,
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
>
>
> --
> A.Nazemian

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
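Putting the advice above together, a sketch of a setup that revisits only the seeds every 30 minutes might look as follows; the paths, the seed URL, and the cron schedule are illustrative assumptions, not taken from the thread.

In conf/nutch-site.xml, keep the global interval large:

<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>The default number of seconds between re-fetches of a page
  (30 days), so previously fetched subpages are left alone.
  </description>
</property>

In the seed file, give each seed its own short interval (tab-separated):

http://www.example.com/	nutch.fetchInterval=1800

Then run a single crawl round per invocation, for example from cron:

*/30 * * * * /opt/nutch/bin/crawl /opt/nutch/urls /opt/nutch/testdb http://localhost:8983/solr 1

Each run then fetches the seeds that are due plus any links newly discovered from them, while pages already fetched wait out the 30-day default instead of dragging the crawl one level deeper on every invocation.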

