Hi Lewis, I tried stepping through the individual stages of the crawl cycle in Nutch 2 with Cassandra. Below are the commands I ran and their respective logs:
1] bin/nutch inject urls
   inject_logs => http://pastebin.com/d8LnBemt
2] bin/nutch generate -topN 20
   generate_logs => http://pastebin.com/YJCjvXHC
   readdb_dump_content_after_generate => http://pastebin.com/DBSY8y9t
3] bin/nutch fetch -all
   fetch_logs => http://pastebin.com/4QVwmD9w
   readdb_dump_content_after_fetch => http://pastebin.com/4Nfa1fxC
4] bin/nutch parse -all
   parse_logs => http://pastebin.com/faL6ZpRc
   readdb_dump_content_after_parse => http://pastebin.com/EnLv5DvX
5] bin/nutch updatedb

I'm afraid nothing changes even when I execute these commands individually instead of just running the bin/crawl command. If you look at the dumps, the readdb command gives the same output after fetch as after parse.

I also tried running the same commands after installing Tinyproxy on my machine. Once I added the proxy-related configuration to nutch-site.xml, I got the exception: "api.RobotRulesParser - Couldn't get robots.txt for http://nutch.apache.org/: java.net.ConnectException: Connection refused".

Do you think this is an issue with the fetch and parse jobs? Or am I making a mistake while configuring Nutch 2 that is preventing it from crawling web pages? (Please find my nutch-site.xml at http://pastebin.com/bD3yw1jT.)

If I am not asking too much, could you please explain to all of us who are facing this issue how to configure Nutch 2 with Cassandra? There is not much good information available on properties such as http.agent.name, http.content.limit, etc.

Thanks for your help.

-Sumant

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-2-with-Cassandra-as-a-storage-is-not-crawling-data-properly-tp4188115p4189776.html
Sent from the Nutch - User mailing list archive at Nabble.com.
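P.S. For comparison, here is a minimal nutch-site.xml sketch of the kind of setup I mean. The property names are the standard ones from Nutch 2's nutch-default.xml; the values (agent name, proxy host and port) are placeholders, not tested settings:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Required: Nutch will not fetch anything if no agent name is set. -->
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value> <!-- placeholder agent name -->
  </property>
  <!-- Nutch 2 storage backend: point Gora at Cassandra. -->
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.cassandra.store.CassandraStore</value>
  </property>
  <!-- Optional: only needed when fetching through a local proxy
       such as Tinyproxy. Host and port below are placeholders
       (8888 is Tinyproxy's default port). -->
  <property>
    <name>http.proxy.host</name>
    <value>127.0.0.1</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>8888</value>
  </property>
</configuration>
```

If I understand the docs correctly, the Cassandra connection details themselves (e.g. gora.cassandrastore.servers) go in conf/gora.properties rather than in nutch-site.xml.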

