Re: Nutch 2 with Cassandra as a storage is not crawling data properly

sumant Tue, 24 Feb 2015 13:06:42 -0800

Hi, please find my replies below:

1) Which version of Nutch are you using? Are you using the 2.X source code 
from here [0] e.g. 2.4-SNAPSHOT?


=> I am using apache-nutch-2.3-src.tar.gz, downloaded from
http://apache.cs.utah.edu/nutch/2.3/

===========================================================

2) Which version of Cassandra are you using? The recommended version for 
this Nutch codebase is currently 2.0.2, with the Gora 0.5 dependencies. 

=> I am using dsc-cassandra-2.1.2 to load data. Should I use version 2.0.2 instead?

===========================================================

3) The way you are invoking the crawl script is pretty strange. Please read 
the input parameters.

=> I am running the crawl command from the directory:
~/Documents/Softwares/apache-nutch-2.3/runtime/local

I tried the following options to run the command:

1] bin/nutch crawl urls/ 10
Output : Command crawl is deprecated, please use bin/crawl instead

2] Then I tried bin/nutch bin/crawl urls/ 10
Output : Error: Could not find or load main class bin.crawl

3] So I tried running the command -> bin/crawl urls/ crawlDir/ http://localhost:8983/solr/ 10
Output : It ran without errors and data was loaded into Cassandra, but the crawl
did not go beyond the initial seed I provided in seed.txt
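
For reference, the argument order used in option 3 can be sketched as below. This is only an illustration: the variable names are hypothetical labels and not Nutch options, the values are the ones quoted above, and reading the final argument as a number of rounds is an assumption based on the crawl script's usage message. The snippet only prints the command line; it does not start a crawl.

```shell
# Sketch only: variable names are hypothetical labels, not Nutch options.
SEED_DIR="urls/"                        # first argument (assumed to hold seed.txt)
CRAWL_ID="crawlDir/"                    # second argument, exactly as passed above
SOLR_URL="http://localhost:8983/solr/"  # optional, per point 4 of this thread
ROUNDS=10                               # last argument (assumed: number of rounds)

# Assemble the command line in the same order as option 3.
CRAWL_CMD="bin/crawl $SEED_DIR $CRAWL_ID $SOLR_URL $ROUNDS"

# Print it for inspection; to actually crawl, run it from runtime/local.
echo "$CRAWL_CMD"
```

Printing the command first makes it easy to check that each value lands in the intended position before a long crawl is started.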

===========================================================

4) The solr_url parameter is optional, meaning that if you enter it and 
it is incorrect, it will undoubtedly throw an error/exception. 

=> I think I did not paste the correct link in my first message.
I gave solr_url as -> http://localhost:8983/solr/
I hope this URL is correct. Please correct me if I am wrong.

===========================================================

5) Please provide a paste of your logs for the crawl task somewhere once 
you've addressed the above. 

=> I will provide the logs shortly.
===========================================================

Please let me know if I am doing something wrong. Would you like me to send
you my nutch-site.xml?
I suspect that certain parameters are preventing Nutch from crawling
beyond the initial seed.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-2-with-Cassandra-as-a-storage-is-not-crawling-data-properly-tp4188115p4188632.html
Sent from the Nutch - User mailing list archive at Nabble.com.
