Hi,

Your fetch interval is very short. The fetchInterval unit is ms, and 2592000 ms is approximately 43 min. When do you start your second depth? If it starts more than 43 min after the first one, this is normal.
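
The arithmetic behind the 43 min figure can be checked with a small sketch. It takes the two fetchInterval values quoted in this thread and converts them under the premise stated above, that the unit is milliseconds. For comparison it also shows the same numbers read as seconds, because nutch-default.xml describes db.fetch.interval.default as 2592000 seconds (30 days), so the unit is worth verifying against your own setup. The helper names are illustrative only and are not part of Nutch.

```python
# Convert the fetchInterval values quoted in the thread into readable durations.
# Assumption: the helper names are hypothetical, chosen for this example only.

MS_PER_MINUTE = 60 * 1000
SECONDS_PER_DAY = 24 * 60 * 60


def ms_to_minutes(value):
    """Interpret value as milliseconds and return minutes."""
    return value / MS_PER_MINUTE


def seconds_to_days(value):
    """Interpret value as seconds and return days."""
    return value / SECONDS_PER_DAY


# The two fetchInterval values reported for the rows in the database.
for interval in (2592000, 3888000):
    print(
        f"{interval}: {ms_to_minutes(interval):.1f} min if ms, "
        f"{seconds_to_days(interval):.1f} days if seconds"
    )
```

Read as milliseconds, the two values come to 43.2 min and 64.8 min; read as seconds, they come to 30 and 45 days.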
Talat

2014-05-11 19:20 GMT+03:00 Vangelis karv <[email protected]>:

> XML:
> <name>http.agent.name</name>
> <value>RiSpider</value>
>
> <name>http.robots.agents</name>
> <value>RiSpider,*</value>
>
> <name>http.content.limit</name>
> <value>-1</value>
>
> <name>plugin.includes</name>
> <value>protocol-http|urlfilter-(domain|regex)|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|urlnormalizer-(pass|regex|basic)|scoring-opic|microformats-reltag</value>
>
> <name>fetcher.queue.mode</name>
> <value>byDomain</value>
>
> <name>http.accept.language</name>
> <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
>
> <name>db.update.max.inlinks</name>
> <value>20000</value>
>
> <name>parser.character.encoding.default</name>
> <value>utf-8</value>
>
> <name>storage.data.store.class</name>
> <value>org.apache.gora.sql.store.SqlStore</value>
>
> <name>moreIndexingFilter.indexMimeTypeParts</name>
> <value>false</value>
>
> <name>fetcher.server.delay</name>
> <value>0.0</value>
>
> <name>parser.timeout</name>
> <value>-1</value>
>
> <name>gora.buffer.read.limit</name>
> <value>5000</value>
>
> <name>gora.buffer.write.limit</name>
> <value>5000</value>
>
> <name>index.parse.md</name>
> <value>*</value>
>
> <name>metatags.names</name>
> <value>*</value>
>
>
> MySQL fields at depth=10, topN=500:
>
> Seed Url: uk.co.dailymail.www:http/home/index.html
> uk.co.dailymail.www:http/home/index.html, ..., Home | Mail Online ,status: 2,
> ..., ..., , , score: 1.0198, typ: application/xhtml+xml, batchID:
> 1399744393-1426553032, http://www.dailymail.co.uk/home/index.html , ..., Home
> | Mail Online, , fetchInterval:2592000, prevfetchTime: 1402353412236, ...,
> ..., ..., fetchTime: 1402360203750, , ..., ..., ...
>
>
> uk.co.dailymail.www:http/terms, ..., , 3, ..., , , , 0.0197911, text/html,
> 1399744408-1706414367, http://www.dailymail.co.uk/terms, , ,
> http://www.dailymail.co.uk/terms, 3888000, 1402353098818, ..., , ...,
> 1403656233583, , ..., , ...
>
> All the other urls in the database have either fetchInterval 2592000 or
> 3888000.
>
> Any ideas?
>
>
>> Date: Sun, 11 May 2014 13:46:27 +0300
>> Subject: Re: Fetcher-Parser Nutch 2.2.1
>> From: [email protected]
>> To: [email protected]
>>
>> Hi Vangelis,
>>
>> Maybe your fetch interval is too short. That would cause fetching at every
>> depth. Can you share your nutch-site.xml and the fields and values of the
>> url's f column?
>>
>> Talat
>> On 11 May 2014 at 02:30, "Vangelis karv" <[email protected]> wrote:
>>
>> > Hi everyone!
>> >
>> > Let's say we start a crawl with depth 5 and topN 500 and www.something.com,
>> > with domain(www.something.com) and regex urlfilters.
>> > I have noticed that the url: www.something.com is fetched, parsed and
>> > updated in every depth. Why is that happening?
>> > In my opinion the particular url should be fetched and parsed only in the
>> > 1st depth and updated in every depth.
>> >
>> > Thank you in advance,
>> > Vangelis
>> >

-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304

