I am using Nutch 1.12. I downloaded the binary distribution and set it up as
instructed in the wiki. I have set the following properties in my nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<property>
  <name>elastic.host</name>
  <value>localhost</value>
  <description>The hostname to send documents to using TransportClient.
  Either host and port must be defined or cluster.</description>
</property>

<property>
  <name>elastic.port</name>
  <value>9300</value>
  <description>The port to connect to using TransportClient.</description>
</property>

<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value>
  <description>The cluster name to discover. Either host and port must be
  defined or cluster.</description>
</property>

After crawling, when I try to index the content using the command

$ bin/nutch index elasticsearch crawl/segments/20161129130824/

srramasw-osx:apache-nutch-1.12 srramasw$ bin/nutch index elasticsearch $s1
Segment dir is complete: crawl/segments/20161129130824.
Indexer: starting at 2016-11-29 16:07:03
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)


Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/srramasw/Tools/apache-nutch-1.12/elasticsearch/current
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)


I searched the web for this problem; many people reported that it could be
caused by an Elasticsearch version mismatch. I made sure I am running
Elasticsearch 1.4.1 locally.
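
To double-check the client side of that mismatch theory, a tiny snippet like
the following (illustrative only; the class name is mine) can be run with the
elasticsearch jar that ships with the indexer-elastic plugin on the classpath,
to print the client version for comparison against the 1.4.1 server:

import org.elasticsearch.Version;

public class EsVersionCheck {
  public static void main(String[] args) {
    // Version.CURRENT reports the version of the elasticsearch jar
    // that is actually on the classpath.
    System.out.println("Client jar version: " + Version.CURRENT);
  }
}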

Any idea what causes this error?


Thanks
Srini
