Hi Lavanya,

I think you should include in your message what you added to the build.xml file in the root directory of the Nutch source code and to the one inside src/plugin. Also, have you checked that the plugin is actually being shipped inside the job file?
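
One way to check this is to list the contents of the job file, which is an ordinary zip archive built by ant. Below is a minimal Python sketch of that check. The function name find_plugin_entries and the plugin id "my-plugin" are placeholders I made up for illustration, not anything that ships with Nutch, and the script makes no assumption about where inside the archive the plugins directory lives.

```python
import sys
import zipfile


def find_plugin_entries(job_path, plugin_id):
    """Return archive entries that have plugin_id as one of their path components."""
    with zipfile.ZipFile(job_path) as job:
        # Split each entry on "/" so that only an exact directory or file name
        # match counts, rather than any substring match.
        return [name for name in job.namelist()
                if plugin_id in name.split("/")]


if __name__ == "__main__" and len(sys.argv) == 3:
    # Both arguments are supplied by the caller: the job file and your plugin id.
    job_path, plugin_id = sys.argv[1], sys.argv[2]
    entries = find_plugin_entries(job_path, plugin_id)
    if entries:
        print("\n".join(entries))
    else:
        print("No entries found for plugin '%s' in %s" % (plugin_id, job_path))
```

Saved under a name of your choice (say check_job.py), it could be run as `python check_job.py apache-nutch-1.9.job my-plugin`, with the path adjusted to wherever your job file is. If nothing is listed, the plugin was probably not packaged into the job file, and the build.xml entries would be the first place to look.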

Regards,

----- Original Message -----
From: "Lavanya Thirumalaisami" <[email protected]>
To: "Nutch User MailList" <[email protected]>
Sent: Thursday, May 7, 2015 7:14:58 AM
Subject: [MASSMAIL]Nutch 1.9 Plugins

Hi All,

I am using Nutch 1.9 to crawl through websites and index data into Solr. I have a few queries regarding plugins and running the Nutch job in AWS EMR.

I have created a plugin extending HTMLParseFilter and IndexingFilter. I did a trial run on our local system using Cygwin and was able to call the plugin successfully. I am having trouble running my plugin in a Hadoop cluster. Below are the steps I followed to package my plugin:

· Downloaded the source code of Nutch 1.9 and placed my plugin source inside the src\plugin folder of the Nutch source code.
· Wrote my build.xml inside the parser.
· Added three entries for my plugin to the apache-nutch-1.9/build.xml file, as below.
· Added the plugin to apache-nutch-1.9/conf/nutch-site.xml under the property <name>plugin.includes</name>.
· Added my plugin to apache-nutch-1.9/conf/parse-plugins.xml under the mime types <mimeType name="text/html">, <mimeType name="application/xhtml+xml">, and <mimeType name="text/xml">.
· Added my plugin to the build file at apache-nutch-1.9/src/plugin/build.xml.
· Compiled my source code using the ant script.

I have set up an EMR cluster with 1 master, 1 core, and 1 task node to start the testing. I moved the apache-nutch-1.9 folder to the master node of the EMR cluster.

I tried running the job in deploy mode from the runtime/deploy folder. My plugin was not called, although at the end of the crawl its name was printed in the list of installed plugins. I tried enabling debug logs on the cluster in Nutch's log4j.properties, but that did not give much information about the parsing stage, and my plugin's log output was not fetched.

bin/crawl s3://nutchtest/urls/nutch_deploymode http://solrurl:4040/solr/testcollection/ 2

I tried running the same in local mode from the runtime/local directory. My plugin got picked up, and the data was indexed into Solr with the new attributes specified in the plugin.

bin/crawl urls/ nutch_local http://solrurl:4040/solr/testcollection/ 2

Could you please help me or point out what I am doing wrong here? Does running the bin/crawl script in deploy mode execute the Nutch job in the Hadoop cluster? Since the org.apache.nutch.crawl.Crawl class has been deprecated, I cannot run the command below. Is there a different command I can use to run Nutch in a clustered environment?

hadoop jar apache-nutch-1.9.job org.apache.nutch.crawl.Crawl -solr http://solrurl:4040/solr/testcollection/ 2

I am using Nutch 1.9, Amazon Linux AMI version 2.4.11 (Hadoop 1.0.3), and Solr version 4.7.2.

Regards,
Lavanya

