Re: [MASSMAIL]Nutch 1.9 Plugins

Jorge Luis Betancourt González Thu, 07 May 2015 05:53:50 -0700

Hi Lavanya,

I think you should post what you added to the build.xml file in the root 
directory of the Nutch source code and to the one inside src/plugin. Also, 
have you checked that the plugin is actually being shipped inside the job 
file?
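
One way to run that check, as a minimal sketch: the .job file is an ordinary 
zip archive, so its entries can be listed and searched for the plugin's 
directory. The function name find_plugin_entries and the plugin id "my-plugin" 
are placeholders of mine, not names from your setup, and the sketch only 
assumes that each plugin sits in a directory of the form plugins/<plugin-id>/ 
somewhere inside the archive.

```python
import zipfile


def find_plugin_entries(job_path, plugin_id):
    """Return the archive entries that live under plugins/<plugin_id>/.

    job_path  -- path to the Nutch .job file (a plain zip archive)
    plugin_id -- hypothetical plugin directory name, e.g. "my-plugin"
    """
    # Match on the directory marker only, so the check does not depend on
    # whatever prefix the build puts in front of the plugins folder.
    marker = "plugins/" + plugin_id + "/"
    with zipfile.ZipFile(job_path) as job:
        return [name for name in job.namelist() if marker in name]
```

Calling find_plugin_entries("runtime/deploy/apache-nutch-1.9.job", "my-plugin") 
should list the plugin.xml and the plugin jar if the plugin was packaged; an 
empty list would suggest that the build never copied it into the job file.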

Regards,

----- Original Message -----
From: "Lavanya Thirumalaisami" <[email protected]>
To: "Nutch User MailList" <[email protected]>
Sent: Thursday, May 7, 2015 7:14:58 AM
Subject: [MASSMAIL]Nutch 1.9 Plugins

Hi All,

I am using Nutch 1.9 to crawl websites and index the data into Solr. I have a 
few questions about plugins and about running the Nutch job on AWS EMR.

I have created a plugin extending HtmlParseFilter and IndexingFilter. I did a 
trial run on our local system using Cygwin and was able to call the plugin 
successfully, but I am having trouble running my plugin on a Hadoop cluster.

These are the steps I followed to package my plugin:

· Downloaded the source code of Nutch 1.9 and placed my plugin source inside 
  the src\plugin folder of the Nutch source code.
· Wrote my build.xml inside the parser.
· Added three entries for my plugin to the apache-nutch-1.9/build.xml file.
· Added the plugin to apache-nutch-1.9/conf/nutch-site.xml under the property 
  <name>plugin.includes</name>.
· Added my plugin to apache-nutch-1.9/conf/parse-plugins.xml under the mime 
  types <mimeType name="text/html">, <mimeType name="application/xhtml+xml"> 
  and <mimeType name="text/xml">.
· Added my plugin to the build file apache-nutch-1.9/src/plugin/build.xml.
· Compiled my source code using the ant script.

I have set up an EMR cluster with 1 master, 1 core and 1 task node to start 
testing, and moved the apache-nutch-1.9 folder to the master node of the EMR 
cluster. I tried running the job in deploy mode from the runtime/deploy 
folder. My plugin was not called, although its name was printed at the end of 
the crawl in the list of installed plugins. I tried enabling debug logs on the 
cluster in the Nutch log4j.properties, but that did not give much information 
about the parsing stage, and my plugin's logs did not show up.

bin/crawl s3://nutchtest/urls/ nutch_deploymode http://solrurl:4040/solr/testcollection/ 2

I tried running the same crawl in local mode from the runtime/local directory. 
My plugin got picked up, and the documents were indexed into Solr with the new 
attributes specified in the plugin.

bin/crawl urls/ nutch_local http://solrurl:4040/solr/testcollection/ 2

Could you please help me or point out what I am doing wrong here? Does running 
the bin/crawl script from the deploy directory execute the Nutch job on the 
Hadoop cluster? Since the org.apache.nutch.crawl.Crawl class has been 
deprecated, I cannot run the command below. Is there a different command I can 
use to run Nutch in a clustered environment?

hadoop jar apache-nutch-1.9.job org.apache.nutch.crawl.Crawl -solr http://solrurl:4040/solr/testcollection/ 2

I am using Nutch 1.9, Amazon Linux AMI version 2.4.11 (Hadoop 1.0.3) and Solr 
version 4.7.2.

Regards,
Lavanya
