Thank you in advance, Talat, for helping me so much! How can I get rid of that refetching? I have written a loop in Eclipse that starts the second depth immediately after the first one finishes (the third after the second, and so on). Do I need to change something in nutch-site.xml? Keep in mind that I am only interested in small crawls of roughly 50k pages from a single domain; after each crawl I truncate my MySQL database and restart the crawler with another seed domain.
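
To double-check the units, here is a minimal sketch that converts the two fetchInterval values from my database under both readings. Read as milliseconds, 2592000 is about 43 minutes, as in Talat's reply below; read as seconds, it is 30 days. My understanding is that Nutch's db.fetch.interval.default is specified in seconds, but that is an assumption on my part and not something confirmed in this thread. The helper names are only for illustration.

```python
from datetime import timedelta

# fetchInterval values copied from the MySQL rows quoted in this thread.
INTERVALS = [2592000, 3888000]


def as_milliseconds(value: int) -> timedelta:
    """Interpret the stored number as milliseconds (the reading in the reply)."""
    return timedelta(milliseconds=value)


def as_seconds(value: int) -> timedelta:
    """Interpret the stored number as seconds (assumed unit, not confirmed here)."""
    return timedelta(seconds=value)


if __name__ == "__main__":
    for value in INTERVALS:
        print(f"{value}: as ms = {as_milliseconds(value)}, as s = {as_seconds(value)}")
```

If the seconds reading is the right one, the interval alone would not explain a refetch a few minutes later, so the cause may lie elsewhere.
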
> Date: Sun, 11 May 2014 23:09:43 +0300
> Subject: Re: Fetcher-Parser Nutch 2.2.1
> From: [email protected]
> To: [email protected]
>
> Hi,
>
> Your fetch interval is very small. The fetchInterval unit is ms; 2592000
> ms is approximately 43 min. When do you start your second depth?
> If it is after 43 min, this is normal.
>
> Talat
>
> 2014-05-11 19:20 GMT+03:00 Vangelis karv <[email protected]>:
> > XML:
> > <name>http.agent.name</name>
> > <value>RiSpider</value>
> >
> > <name>http.robots.agents</name>
> > <value>RiSpider,*</value>
> >
> > <name>http.content.limit</name>
> > <value>-1</value>
> >
> > <name>plugin.includes</name>
> > <value>protocol-http|urlfilter-(domain|regex)|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|urlnormalizer-(pass|regex|basic)|scoring-opic|microformats-reltag</value>
> >
> > <name>fetcher.queue.mode</name>
> > <value>byDomain</value>
> >
> > <name>http.accept.language</name>
> > <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
> >
> > <name>db.update.max.inlinks</name>
> > <value>20000</value>
> >
> > <name>parser.character.encoding.default</name>
> > <value>utf-8</value>
> >
> > <name>storage.data.store.class</name>
> > <value>org.apache.gora.sql.store.SqlStore</value>
> >
> > <name>moreIndexingFilter.indexMimeTypeParts</name>
> > <value>false</value>
> >
> > <name>fetcher.server.delay</name>
> > <value>0.0</value>
> >
> > <name>parser.timeout</name>
> > <value>-1</value>
> >
> > <name>gora.buffer.read.limit</name>
> > <value>5000</value>
> >
> > <name>gora.buffer.write.limit</name>
> > <value>5000</value>
> >
> > <name>index.parse.md</name>
> > <value>*</value>
> >
> > <name>metatags.names</name>
> > <value>*</value>
> >
> >
> > MySQL fields at depth=10, topN=500:
> >
> > Seed Url: uk.co.dailymail.www:http/home/index.html
> > uk.co.dailymail.www:http/home/index.html, ..., Home | Mail Online, status:
> > 2, ..., ..., , , score: 1.0198, typ: application/xhtml+xml, batchID:
> > 1399744393-1426553032, http://www.dailymail.co.uk/home/index.html, ...,
> > Home | Mail Online, , fetchInterval: 2592000, prevfetchTime: 1402353412236,
> > ..., ..., ..., fetchTime: 1402360203750, , ..., ..., ...
> >
> >
> > uk.co.dailymail.www:http/terms, ..., , 3, ..., , , , 0.0197911, text/html,
> > 1399744408-1706414367, http://www.dailymail.co.uk/terms, , ,
> > http://www.dailymail.co.uk/terms, 3888000, 1402353098818, ..., , ...,
> > 1403656233583, , ..., , ...
> >
> > All the other URLs in the database have a fetchInterval of either 2592000 or
> > 3888000.
> >
> > Any ideas?
> >
> >
> >> Date: Sun, 11 May 2014 13:46:27 +0300
> >> Subject: Re: Fetcher-Parser Nutch 2.2.1
> >> From: [email protected]
> >> To: [email protected]
> >>
> >> Hi Vangelis,
> >>
> >> Maybe your interval time is very small. That would cause fetching at every
> >> depth. Can you share your nutch-site.xml and the URL's f column fields and values?
> >>
> >> Talat
> >> On 11 May 2014 at 02:30, "Vangelis karv" <[email protected]>
> >> wrote:
> >>
> >> > Hi everyone!
> >> >
> >> > Let's say we start a crawl of www.something.com with depth 5 and topN 500,
> >> > with domain (www.something.com) and regex urlfilters.
> >> > I have noticed that the URL www.something.com is fetched, parsed and
> >> > updated at every depth. Why is that happening?
> >> > In my opinion, that particular URL should be fetched and parsed only at the
> >> > first depth and updated at every depth.
> >> >
> >> > Thank you in advance,
> >> > Vangelis
> >> >
> >
>
> --
> Talat UYARER
> Website: http://talat.uyarer.com
> Twitter: http://twitter.com/talatuyarer
> Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304

