Hi Lewis,

I tried stepping through the individual stages of the crawl cycle in
Nutch 2 with Cassandra.
Below are the commands I ran and their respective logs:

1]  bin/nutch inject urls
inject_logs => http://pastebin.com/d8LnBemt

2] bin/nutch generate -topN 20
generate_logs => http://pastebin.com/YJCjvXHC
readdb_dump_content_after_generate => http://pastebin.com/DBSY8y9t

3] bin/nutch fetch -all
fetch_logs => http://pastebin.com/4QVwmD9w
readdb_dump_content_after_fetch => http://pastebin.com/4Nfa1fxC

4] bin/nutch parse -all
parse_logs => http://pastebin.com/faL6ZpRc
readdb_dump_content_after_parse => http://pastebin.com/EnLv5DvX

5] bin/nutch updatedb

I'm afraid there is no change even when I execute these commands
individually instead of just the bin/crawl command.
As you can see, the readdb command gives the same output after fetch and
after parse.
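For reference, this is the sequence I understood from the docs; my understanding (please correct me if wrong) is that in Nutch 2.x the generate step prints a batch id, and fetch/parse can be pointed at that specific batch instead of -all:

```shell
# Sketch only -- <batchId> is a placeholder for whatever id
# GeneratorJob actually prints on your run.
bin/nutch inject urls
bin/nutch generate -topN 20     # prints "generated batch id: <batchId>"
bin/nutch fetch <batchId>       # or: bin/nutch fetch -all
bin/nutch parse <batchId>       # or: bin/nutch parse -all
bin/nutch updatedb
```

I tried the -all variants as shown above, with the results linked.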

I also tried running the same commands after installing Tinyproxy on my
machine. Once I added the proxy configuration to nutch-site.xml, I got the
exception "api.RobotRulesParser - Couldn't get robots.txt for
http://nutch.apache.org/: java.net.ConnectException: Connection refused".
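In case it helps, this is roughly the proxy fragment I used in nutch-site.xml (property names are the standard ones from nutch-default.xml; the host/port values here are just examples for a local Tinyproxy, whose default port is 8888):

```xml
<!-- Example only: point Nutch's HTTP fetcher at a local proxy. -->
<property>
  <name>http.proxy.host</name>
  <value>127.0.0.1</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8888</value>
</property>
```

I suppose "Connection refused" could simply mean nothing is listening on the configured host:port, so I am double-checking that Tinyproxy is actually running on that port.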

Do you think it's an issue with the fetch job and the parser job?
Or am I making a mistake while configuring Nutch 2 that is preventing
Nutch from crawling the web pages? (Please find the nutch-site.xml file at
http://pastebin.com/bD3yw1jT )

If possible, and if I am not asking too much, could you please explain to
all of us (who are facing this issue) how to configure Nutch 2 with
Cassandra? There is not much good information available on agent.name,
http.content, etc.
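For example, on agent.name the only thing I am reasonably sure of is that the fetcher refuses to run unless http.agent.name is set in nutch-site.xml; the value below is just a made-up example:

```xml
<!-- Example only: a non-empty agent name is required by the fetcher. -->
<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
</property>
```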

Thanks for your help.
-Sumant



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-2-with-Cassandra-as-a-storage-is-not-crawling-data-properly-tp4188115p4189776.html
Sent from the Nutch - User mailing list archive at Nabble.com.
