Re: error in topN

Markus Jelsma Tue, 20 Dec 2011 11:09:03 -0800

Hi,

This is most likely a URL filter problem. You can use the parsechecker tool 
(bin/nutch parsechecker <url>) to see how Nutch parses a page and which links 
it finds. The outlinks it reports should more or less match the links on the page. 
Also check your conf/regex-urlfilter.txt; the missing URLs are probably being 
filtered out by that plugin.
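
To make the filtering idea concrete, here is a rough Python sketch of how a 
regex URL filter of this kind decides whether a link survives. It is not Nutch 
code. It assumes the usual rule format (one rule per line, a leading + to 
accept or - to reject, first matching rule wins, no match means the URL is 
dropped). The names load_rules, accepts and SAMPLE_RULES are made up for this 
illustration, and the sample rules are only examples, so compare them with 
your own file. Python's re module also differs from Java's regex engine in 
some details.

```python
import re


def load_rules(text):
    """Parse filter rules: '+regex' accepts, '-regex' rejects, '#' is a comment."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return True if the first rule that matches the URL is an accept rule."""
    for accept, pattern in rules:
        # Assumption: a rule matches if the regex is found anywhere in the URL.
        if pattern.search(url):
            return accept
    # Assumption: a URL that matches no rule at all is dropped.
    return False


# Hypothetical sample rules, written in the style of a regex URL filter file.
SAMPLE_RULES = r"""
# skip file:, ftp: and mailto: URLs
-^(file|ftp|mailto):
# skip URLs containing characters that are typical of query strings
-[?*!@=]
# accept anything else
+.
"""

if __name__ == "__main__":
    rules = load_rules(SAMPLE_RULES)
    for url in ("http://example.com/page.html",
                "http://example.com/list?page=2"):
        print(url, "->", "kept" if accepts(rules, url) else "filtered out")
```

With rules like these, every link that carries a query string is silently 
dropped, which is one way a page with 100 links can end up with far fewer 
fetched URLs.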


You may also want to upgrade to the new 1.4 release, which comes with some 
important fixes and improvements.

Cheers,

> Hi, I am crawling a site that has 100 links at depth 1 and 100 links at
> depth 2, but Nutch only crawls 23 links from depth 1 and 30 from depth 2.
> How can I force Nutch to crawl all links at depths 1 and 2? I use Nutch 1.3 with:
> topN=10000
> depth=2
> and in my nutch-site.xml:
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description></description>
> </property>
> <property>
>   <name>http.agent.name</name>
>   <value>My Nutch Spider</value>
>   <description></description>
> </property>
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/error-in-topN-tp3601000p3601000.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
