OK, thanks for the help! I have another problem: I'm trying to crawl this example site, http://dreamdj.altervista.org/, with the following command:
nutch crawl -dir crawl -depth 5 -topN 3

Why do I get only the first page? The other links do not appear in the results!

This is my nutch-site.xml file:

<property>
  <name>http.agent.name</name>
  <value>NLP</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>NLP,*</value>
</property>
<property>
  <name>plugin.folders</name>
  <value>/home/enzo/Scrivania/nutch/apache-nutch-1.7/src/plugin</value>
</property>
<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
</property>

I left the regex-urlfilter.txt file at its defaults. Why are the other links not being captured?

--
View this message in context: http://lucene.472066.n3.nabble.com/help-me-with-nutch-tp4095914p4096998.html
Sent from the Nutch - User mailing list archive at Nabble.com.

