I'm trying to set up a simple crawler on my local machine. I've been following
this tutorial: https://wiki.apache.org/nutch/NutchTutorial and this
one: http://www.mind-it.info/integrating-nutch-1-7-elasticsearch/
When I run the following command:
bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/2015050*
I get the following output:
Indexer: starting at 2015-05-09 07:59:11
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:113)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:177)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:187)
I set my cluster name, host, and port in nutch-site.xml, but those settings
don't seem to be recognized. Am I missing any additional steps?
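For reference, here is roughly what the elastic.* properties look like in my conf/nutch-site.xml (the values below are placeholders, not my actual host, port, cluster, or index names):

```xml
<!-- Elasticsearch index writer settings for the Nutch indexer
     (property names match those printed by "Active IndexWriters" above;
     values shown here are placeholders) -->
<property>
  <name>elastic.host</name>
  <value>localhost</value>
</property>
<property>
  <name>elastic.port</name>
  <value>9300</value>
</property>
<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value>
</property>
<property>
  <name>elastic.index</name>
  <value>nutch</value>
</property>
```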