RE: Fetcher-Parser Nutch 2.2.1

Vangelis karv Sun, 11 May 2014 09:48:12 -0700

XML:
<property>
  <name>http.agent.name</name>
  <value>RiSpider</value>
</property>

<property>
  <name>http.robots.agents</name>
  <value>RiSpider,*</value>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(domain|regex)|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|urlnormalizer-(pass|regex|basic)|scoring-opic|microformats-reltag</value>
</property>

<property>
  <name>fetcher.queue.mode</name>
  <value>byDomain</value>
</property>

<property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
</property>

<property>
  <name>db.update.max.inlinks</name>
  <value>20000</value>
</property>

<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
</property>

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
</property>

<property>
  <name>moreIndexingFilter.indexMimeTypeParts</name>
  <value>false</value>
</property>

<property>
  <name>fetcher.server.delay</name>
  <value>0.0</value>
</property>

<property>
  <name>parser.timeout</name>
  <value>-1</value>
</property>

<property>
  <name>gora.buffer.read.limit</name>
  <value>5000</value>
</property>

<property>
  <name>gora.buffer.write.limit</name>
  <value>5000</value>
</property>

<property>
  <name>index.parse.md</name>
  <value>*</value>
</property>

<property>
  <name>metatags.names</name>
  <value>*</value>
</property>
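
For anyone who wants to check which settings a file like this actually overrides, here is a minimal Python sketch. It assumes the pairs above live in a standard Hadoop-style nutch-site.xml, where each name/value pair sits inside a <property> element under <configuration>. The SAMPLE string is a hypothetical excerpt containing three of the properties from the listing, and load_properties is an illustrative helper, not part of Nutch.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt: three properties from the listing, wrapped the way a
# Hadoop-style nutch-site.xml normally wraps them (an assumption).
SAMPLE = """<configuration>
  <property>
    <name>http.agent.name</name>
    <value>RiSpider</value>
  </property>
  <property>
    <name>fetcher.queue.mode</name>
    <value>byDomain</value>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>0.0</value>
  </property>
</configuration>"""


def load_properties(xml_text):
    """Return {name: value} for every <property> element in the document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            # A missing or empty <value> is stored as an empty string.
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


props = load_properties(SAMPLE)
for key in sorted(props):
    print(f"{key} = {props[key]}")

# The listing does not set db.fetch.interval.default, so the lookup falls
# through to the placeholder text.
print(props.get("db.fetch.interval.default", "<not set in this file>"))
```

Since the listing sets no fetch interval of its own, the interval in effect would presumably come from nutch-default.xml rather than from this file.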
 

MySQL fields at depth=10, topN=500:

Seed Url:   uk.co.dailymail.www:http/home/index.html 
uk.co.dailymail.www:http/home/index.html, ..., Home | Mail Online ,status: 2, 
..., ..., , , score: 1.0198, typ: application/xhtml+xml, batchID: 
1399744393-1426553032, http://www.dailymail.co.uk/home/index.html , ..., Home | 
Mail Online, , fetchInterval:2592000, prevfetchTime: 1402353412236, ..., ..., 
..., fetchTime: 1402360203750, , ..., ..., ...


uk.co.dailymail.www:http/terms, ..., , 3, ..., , , , 0.0197911, text/html, 
1399744408-1706414367, http://www.dailymail.co.uk/terms, , , 
http://www.dailymail.co.uk/terms, 3888000, 1402353098818, ..., , ..., 
1403656233583, , ..., , ...

All the other URLs in the database have a fetchInterval of either 2592000 
(30 days) or 3888000 (45 days). 
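
To make the raw numbers easier to read, here is a small Python sketch that converts them. It assumes fetchInterval is in seconds and the fetchTime/prevfetchTime values are epoch milliseconds, and that the unlabeled numbers in the second row sit in the same column order as the labeled ones in the first row. The helper names and the rows dictionary are only for illustration.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def interval_days(seconds):
    """Convert a fetch interval in seconds to days."""
    return seconds / SECONDS_PER_DAY


def from_millis(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Values copied from the two rows above. The column order of the second
# row is an assumption based on the labels in the first row.
rows = {
    "home/index.html": {
        "fetchInterval": 2592000,
        "prevFetchTime": 1402353412236,
        "fetchTime": 1402360203750,
    },
    "terms": {
        "fetchInterval": 3888000,
        "prevFetchTime": 1402353098818,
        "fetchTime": 1403656233583,
    },
}

for url, row in rows.items():
    gap = from_millis(row["fetchTime"]) - from_millis(row["prevFetchTime"])
    print(url)
    print("  interval (days):", interval_days(row["fetchInterval"]))
    print("  prevFetchTime  :", from_millis(row["prevFetchTime"]).isoformat())
    print("  fetchTime      :", from_millis(row["fetchTime"]).isoformat())
    print("  gap between them:", gap)
```

The second interval is exactly 1.5 times the first, which may help narrow down where it was set.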

Any ideas?

  



> Date: Sun, 11 May 2014 13:46:27 +0300
> Subject: Re: Fetcher-Parser Nutch 2.2.1
> From: [email protected]
> To: [email protected]
> 
> Hi Vangelis,
> 
> Maybe your fetch interval is too short. That would cause fetching at every
> depth. Can you share your nutch-site.xml and the fields and values in the
> URL's f column?
> 
> Talat
> On 11 May 2014 at 02:30, "Vangelis karv" <[email protected]> wrote:
> 
> > Hi everyone!
> >
> > Let's say we start a crawl of www.something.com with depth 5 and topN 500,
> > using the domain (www.something.com) and regex URL filters.
> > I have noticed that the URL www.something.com is fetched, parsed and
> > updated at every depth. Why is that happening?
> > In my opinion, that URL should be fetched and parsed only at the
> > 1st depth and updated at every depth.
> >
> > Thank you in advance,
> > Vangelis
> >
