RE: Fetcher-Parser Nutch 2.2.1

Vangelis karv Fri, 16 May 2014 15:56:36 -0700

I think the patch from NUTCH-1651 
(https://issues.apache.org/jira/browse/NUTCH-1651) solved my problem.
 
From: [email protected]
To: [email protected]
Subject: RE: Fetcher-Parser Nutch 2.2.1
Date: Mon, 12 May 2014 12:20:52 +0300





Thank you, Talat, for helping me so much!

How can I get rid of that refetching? I have created a loop in Eclipse that 
starts the second depth immediately after the first finishes (the 3rd after the 
2nd, etc.). Do I need to change something in nutch-site.xml? Keep in mind that I 
am only interested in small crawls of roughly 50k pages from a single domain; 
afterwards I truncate my MySQL database and restart the crawler with another 
seed domain.
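
As a rough cross-check of the units in the quoted reply below, the stored 
values can be read both ways. This is only a sketch: the function names are 
made up for illustration, and it assumes that the leading part of the batch ID 
(1399744393) is an epoch timestamp in seconds and that fetchTime is in epoch 
milliseconds. Nutch's nutch-default.xml describes db.fetch.interval.default as 
a number of seconds, with 2592000 as the 30-day default.

```python
FETCH_INTERVAL = 2592000        # fetchInterval stored for the seed row
FETCH_TIME_MS = 1402360203750   # fetchTime of the seed row (assumed epoch milliseconds)
BATCH_START_S = 1399744393      # leading part of the batch ID (assumed epoch seconds)


def interval_as_minutes_if_ms(value):
    """Read the stored interval as milliseconds and return minutes."""
    return value / 1000 / 60


def interval_as_days_if_seconds(value):
    """Read the stored interval as seconds and return days."""
    return value / 86400


def days_until_next_fetch(fetch_time_ms, batch_start_s):
    """Days between the batch start and the scheduled next fetch."""
    return (fetch_time_ms / 1000 - batch_start_s) / 86400


if __name__ == "__main__":
    print("as ms, in minutes:", interval_as_minutes_if_ms(FETCH_INTERVAL))
    print("as s, in days:", interval_as_days_if_seconds(FETCH_INTERVAL))
    print("days from batch start to fetchTime:",
          round(days_until_next_fetch(FETCH_TIME_MS, BATCH_START_S), 2))
```

If the assumptions hold, the gap between the batch start and the stored 
fetchTime is about 30 days, which matches the seconds reading and not the 
43-minute one. In that case the seed page would not be due again for weeks, 
and the fetch interval alone would not explain a refetch at every depth.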

> Date: Sun, 11 May 2014 23:09:43 +0300
> Subject: Re: Fetcher-Parser Nutch 2.2.1
> From: [email protected]
> To: [email protected]
> 
> Hi,
> 
> Your fetch interval is very short. The fetchInterval unit is ms, and 2592000
> ms is approximately 43 min. When do you start your second depth?
> If it is after 43 min, this is normal.
> 
> Talat
> 
> 2014-05-11 19:20 GMT+03:00 Vangelis karv <[email protected]>:
> > XML:
> > <name>http.agent.name</name>
> > <value>RiSpider</value>
> >
> > <name>http.robots.agents</name>
> > <value>RiSpider,*</value>
> >
> > <name>http.content.limit</name>
> > <value>-1</value>
> >
> > <name>plugin.includes</name>
> > <value>protocol-http|urlfilter-(domain|regex)|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|urlnormalizer-(pass|regex|basic)|scoring-opic|microformats-reltag</value>
> >
> > <name>fetcher.queue.mode</name>
> > <value>byDomain</value>
> >
> > <name>http.accept.language</name>
> > <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
> >
> > <name>db.update.max.inlinks</name>
> > <value>20000</value>
> >
> > <name>parser.character.encoding.default</name>
> > <value>utf-8</value>
> >
> > <name>storage.data.store.class</name>
> > <value>org.apache.gora.sql.store.SqlStore</value>
> >
> > <name>moreIndexingFilter.indexMimeTypeParts</name>
> > <value>false</value>
> >
> > <name>fetcher.server.delay</name>
> > <value>0.0</value>
> >
> > <name>parser.timeout</name>
> > <value>-1</value>
> >
> > <name>gora.buffer.read.limit</name>
> > <value>5000</value>
> >
> > <name>gora.buffer.write.limit</name>
> > <value>5000</value>
> >
> > <name>index.parse.md</name>
> > <value>*</value>
> >
> > <name>metatags.names</name>
> > <value>*</value>
> >
> >
> > MySQL fields at depth=10, topN=500:
> >
> > Seed Url:   uk.co.dailymail.www:http/home/index.html
> > uk.co.dailymail.www:http/home/index.html, ..., Home | Mail Online ,status: 
> > 2, ..., ..., , , score: 1.0198, typ: application/xhtml+xml, batchID: 
> > 1399744393-1426553032, http://www.dailymail.co.uk/home/index.html , ..., 
> > Home | Mail Online, , fetchInterval:2592000, prevfetchTime: 1402353412236, 
> > ..., ..., ..., fetchTime: 1402360203750, , ..., ..., ...
> >
> >
> > uk.co.dailymail.www:http/terms, ..., , 3, ..., , , , 0.0197911, text/html, 
> > 1399744408-1706414367, http://www.dailymail.co.uk/terms, , , 
> > http://www.dailymail.co.uk/terms, 3888000, 1402353098818, ..., , ..., 
> > 1403656233583, , ..., , ...
> >
> > All the other urls in the database have either fetchInterval 2592000 or 
> > 3888000.
> >
> > Any ideas?
> >
> >
> >
> >
> >
> >> Date: Sun, 11 May 2014 13:46:27 +0300
> >> Subject: Re: Fetcher-Parser Nutch 2.2.1
> >> From: [email protected]
> >> To: [email protected]
> >>
> >> Hi Vangelis,
> >>
> >> Maybe your fetch interval is very short. That would cause fetching at every
> >> depth. Can you share your nutch-site.xml and the URL's f column fields and
> >> values?
> >>
> >> Talat
> >> On 11 May 2014 at 02:30, "Vangelis karv" <[email protected]> wrote:
> >>
> >> > Hi everyone!
> >> >
> >> > Let's say we start a crawl with depth 5 and topN 500, seeded with
> >> > www.something.com, using the domain (www.something.com) and regex
> >> > urlfilters.
> >> > I have noticed that the URL www.something.com is fetched, parsed and
> >> > updated at every depth. Why is that happening?
> >> > In my opinion, that URL should be fetched and parsed only at the
> >> > 1st depth and updated at every depth.
> >> >
> >> > Thank you in advance,
> >> > Vangelis
> >> >
> >
> >
> >
> >
> 
> 
> 
> -- 
> Talat UYARER
> Website: http://talat.uyarer.com
> Twitter: http://twitter.com/talatuyarer
> Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
