Hi All,
I am using Nutch 1.9 to crawl websites and index the data into Solr. I 
have a few questions about plugins and about running the Nutch job on AWS EMR.
I have created a plugin extending HTMLParseFilter and IndexingFilter. In a 
trial run on our local system using Cygwin, the plugin was called successfully. 
However, I am having trouble running the plugin on a Hadoop cluster.
Below are the steps I followed to package my plugin:
·        Downloaded the source code of Nutch 1.9 and placed my plugin source inside the src\plugin folder of the Nutch source tree.
·        Wrote my build.xml inside the parser.
·        Added my plugin to the apache-nutch-1.9/build.xml file (three entries).
·        Added the plugin to apache-nutch-1.9/conf/nutch-site.xml under the property <name>plugin.includes</name>.
·        Added my plugin to apache-nutch-1.9/conf/parse-plugins.xml under the MIME types <mimeType name="text/html">, <mimeType name="application/xhtml+xml">, and <mimeType name="text/xml">.
·        Added my plugin to the build file apache-nutch-1.9/src/plugin/build.xml.
·        Compiled my source code using the ant script.
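To illustrate the plugin.includes step, the entry in conf/nutch-site.xml looks roughly like the sketch below. Note that `myplugin` is only a placeholder for the plugin id, and the other ids are assumed to be the stock Nutch 1.9 defaults, so the actual value in my setup may differ.

```xml
<!-- conf/nutch-site.xml: sketch only; "myplugin" is a hypothetical plugin id -->
<property>
  <name>plugin.includes</name>
  <!-- assumed default plugin list, with the custom plugin appended at the end -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|myplugin</value>
</property>
```

The value is a regular expression matched against plugin ids, which is why the custom id is added as one more alternative separated by `|`.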
I have set up an EMR cluster with 1 master, 1 core, and 1 task node to start 
testing. I moved the apache-nutch-1.9 folder to the master node of the EMR cluster. 
I tried running the job in deploy mode from the runtime/deploy folder. My plugin was 
not called, although at the end of the crawl its name was printed in the list of 
installed plugins. I tried enabling debug logs on the cluster in Nutch's 
log4j.properties, but that did not give much information about the parsing stage, 
and my plugin's log messages did not appear. 
bin/crawl s3://nutchtest/urls/nutch_deploymode  
http://solrurl:4040/solr/testcollection/ 2
I tried running the same crawl in local mode from the runtime/local directory. My 
plugin got picked up, and the documents were indexed into Solr with the new 
attributes specified in the plugin. 
bin/crawl urls/ nutch_local http://solrurl:4040/solr/testcollection/ 2
Could you please help me or point out what I am doing wrong here? 
Does running the bin/crawl script in deploy mode execute the Nutch job on the 
Hadoop cluster? Since the org.apache.nutch.crawl.Crawl class has been deprecated, I 
cannot run the command below. Is there a different command that I can use to run 
Nutch in a clustered environment? 
hadoop jar apache-nutch-1.9.job org.apache.nutch.crawl.Crawl  -solr 
http://solrurl:4040/solr/testcollection/  2
I am using Nutch 1.9, Amazon Linux AMI version 2.4.11 (Hadoop 1.0.3), and Solr 
version 4.7.2.  
Regards,
Lavanya
