Ok thanks for the help!
I have another problem. I'm trying to crawl this example site:
http://dreamdj.altervista.org/
with the following command:

nutch crawl -dir crawl  -depth 5 -topN 3

Why do I get only the first page? The other links do not appear in the results!
This is my nutch-site.xml file:
<property>
  <name>http.agent.name</name>
  <value>NLP</value>
</property>

<property>
  <name>http.robots.agents</name>
  <value>NLP,*</value>
</property>

<property>
  <name>plugin.folders</name>
  <value>/home/enzo/Scrivania/nutch/apache-nutch-1.7/src/plugin</value>
</property>

<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
</property>


I left the file regex-urlfilter.txt at its defaults.
Why are the other links not being captured?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/help-me-with-nutch-tp4095914p4096998.html
Sent from the Nutch - User mailing list archive at Nabble.com.
