Hi All,
I am using Nutch 1.9 to crawl websites and index the data into Solr. I
have a few questions about plugins and about running the Nutch job on AWS EMR.
I have created a plugin implementing HtmlParseFilter and IndexingFilter. I did
a trial run on our local system using Cygwin and was able to call the plugin
successfully. However, I am having trouble running my plugin on a Hadoop cluster.
Below are the steps I followed to package my plugin:
· Downloaded the source code of Nutch 1.9 and placed my plugin source inside
  the src\plugin folder of the Nutch source tree.
· Wrote my build.xml inside the parser.
· Added three entries for my plugin to the apache-nutch-1.9/build.xml file.
· Added the plugin to apache-nutch-1.9/conf/nutch-site.xml under the
  property <name>plugin.includes</name>.
· Added my plugin to apache-nutch-1.9/conf/parse-plugins.xml under the
  mime types <mimeType name="text/html">,
  <mimeType name="application/xhtml+xml">, and <mimeType name="text/xml">.
· Added my plugin to the build file apache-nutch-1.9/src/plugin/build.xml.
· Compiled my source code using the ant script.
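
One way to double-check the packaging steps above is to look inside the job
file that deploy mode submits to Hadoop. The following is only a minimal
sketch under some assumptions: "myplugin" is a placeholder for the real plugin
id, the job file path assumes the default output location of the ant build,
and the two helper function names are made up for illustration. It checks
whether the plugin folder was packaged into the job file and whether a
nutch-site.xml bundled in it mentions the plugin.

```shell
#!/bin/sh
# Placeholder values (assumptions): replace with the real plugin id and path.
PLUGIN_ID="myplugin"
JOB_FILE="runtime/deploy/apache-nutch-1.9.job"

# Succeeds if an archive listing read from stdin contains the plugin folder.
listing_has_plugin() {
    grep -q "plugins/$1/"
}

# Succeeds if configuration text read from stdin mentions the plugin id.
# This is a loose textual match, not a real XML parse.
config_mentions_plugin() {
    grep -q "$1"
}

# Only inspect the job file if it exists at the assumed location.
if [ -f "$JOB_FILE" ]; then
    if unzip -l "$JOB_FILE" | listing_has_plugin "$PLUGIN_ID"; then
        echo "plugin folder found in job file"
    else
        echo "plugin folder NOT found in job file"
    fi

    # Print any bundled nutch-site.xml to stdout and search it for the id.
    if unzip -p "$JOB_FILE" '*nutch-site.xml' 2>/dev/null \
            | config_mentions_plugin "$PLUGIN_ID"; then
        echo "bundled nutch-site.xml mentions the plugin"
    else
        echo "bundled nutch-site.xml does NOT mention the plugin"
    fi
fi
```

If either check fails, the job file was probably built before the plugin or
the configuration change was in place and would need to be rebuilt.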
I have set up an EMR cluster with 1 master, 1 core, and 1 task node to start
testing. I moved the apache-nutch-1.9 folder to the master node of the EMR
cluster. I tried running the job in deploy mode from the runtime/deploy folder.
My plugin was not called, although its name was printed at the end of the crawl
in the list of installed plugins. I tried enabling debug logs on the cluster in
Nutch's log4j.properties, but that did not give much information about the
parsing stage, and my plugin's log messages did not appear.
bin/crawl s3://nutchtest/urls/ nutch_deploymode http://solrurl:4040/solr/testcollection/ 2
I tried running the same crawl in local mode from the runtime/local directory.
My plugin was picked up, and the new attributes specified in the plugin were
indexed into Solr.
bin/crawl urls/ nutch_local http://solrurl:4040/solr/testcollection/ 2
Could you please help me or point out what I am doing wrong here?
Does running the bin/crawl script from the deploy directory execute the Nutch
job on the Hadoop cluster? Since the org.apache.nutch.crawl.Crawl class has been
deprecated, I cannot run the command below. Is there a different command that I
can use to run Nutch in a clustered environment?
hadoop jar apache-nutch-1.9.job org.apache.nutch.crawl.Crawl -solr http://solrurl:4040/solr/testcollection/ 2
I am using Nutch 1.9, Amazon Linux AMI version 2.4.11 (Hadoop 1.0.3), and Solr
version 4.7.2.
Regards,
Lavanya