Hi Jorge,
I have checked the job file, and the plugin folder is present inside the
classes/plugin folder.
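For anyone who wants to repeat this check, here is a minimal sketch. It assumes
that `unzip` is installed, that the job file is at
runtime/deploy/apache-nutch-1.9.job, and that the plugin id is lhg-parser; the
helper name plugin_entries is made up for illustration. The job file is a zip
archive, so listing it and filtering on the plugin path is enough.

```shell
#!/bin/sh
# Sketch only: show the entries of a Nutch job file that belong to one plugin.
# Assumed (not taken from the thread): unzip is available, and the job file
# path below is where the build placed it.

# Filter an archive listing read from stdin down to the lines for plugin "$1".
# Accepts both classes/plugin/<id>/ and classes/plugins/<id>/.
plugin_entries() {
  grep -E "classes/plugins?/$1/"
}

JOB="${JOB:-runtime/deploy/apache-nutch-1.9.job}"
if [ -f "$JOB" ]; then
  # unzip -l prints one line per archive entry, with the path at the end.
  unzip -l "$JOB" | plugin_entries lhg-parser \
    || echo "lhg-parser not found in $JOB"
fi
```

If nothing is printed for the plugin, the job file was probably built before
the plugin was added to the build files.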
Below are the entries made in build.xml in the root directory:
<packageset dir="${plugins.dir}/lhg-parser/src/java"/>
<packageset dir="${plugins.dir}/lhg-parser/src/java"/>
<source path="${plugins.dir}/lhg-parser/src/java/" />
Entries in build.xml in the plugin folder:
<ant dir="lhg-parser" target="deploy"/>
<ant dir="lhg-parser" target="clean"/>
Regards,
Lavanya
On Thursday, 7 May 2015, 22:53, Jorge Luis Betancourt González
<[email protected]> wrote:
Hi Lavanya,
I think you should post in your message what you added to the build.xml file
(the one in the root directory of the Nutch source code) and to the one inside
src/plugin. Also, have you checked that the plugin is being shipped inside the
job file?
Regards,
----- Original Message -----
From: "Lavanya Thirumalaisami" <[email protected]>
To: "Nutch User MailList" <[email protected]>
Sent: Thursday, May 7, 2015 7:14:58 AM
Subject: [MASSMAIL]Nutch 1.9 Plugins
Hi All,
I am using Nutch 1.9 to crawl through websites and index data into Solr. I
have a few queries regarding plugins and running the Nutch job in AWS EMR.
I have created a plugin extending HTMLParseFilter and IndexingFilter. I have
done a trial run on our local system using Cygwin and I was able to call the
plugin successfully. I am having trouble running my plugin in a Hadoop cluster.
Below are the steps I followed to package my plugin:
· Downloaded the source code of Nutch 1.9 and placed my plugin source inside
  the src\plugin folder of the Nutch source code.
· Wrote my build.xml inside the parser.
· Added my plugin to the apache-nutch-1.9/build.xml file, three entries as
  below.
· Added the plugin to apache-nutch-1.9/conf/nutch-site.xml under the property
  <name>plugin.includes</name>.
· Added my plugin to apache-nutch-1.9/conf/parse-plugins.xml under the
  mimetypes <mimeType name="text/html">,
  <mimeType name="application/xhtml+xml">, <mimeType name="text/xml">.
· Added my plugin to the build file under
  apache-nutch-1.9/src/plugin/build.xml.
· Compiled my source code using the ant script.
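As a rough way to double-check the plugin.includes step, here is a small
sketch. It assumes the configuration file is at conf/nutch-site.xml and that
the <value> element follows within three lines of the <name> element; the
helper name plugin_included is hypothetical. Since plugin.includes holds a
regular expression, a plain text match like this is only an approximation.

```shell
#!/bin/sh
# Sketch only: is plugin id "$2" mentioned near plugin.includes in file "$1"?
# This is a plain text match on the <name> line and the three lines after it,
# not a real XML parse, and it does not evaluate the regular expression.
plugin_included() {
  grep -A 3 '<name>plugin.includes</name>' "$1" | grep -- "$2" >/dev/null
}

CONF="${CONF:-conf/nutch-site.xml}"   # assumed location of the config file
if [ -f "$CONF" ]; then
  if plugin_included "$CONF" lhg-parser; then
    echo "lhg-parser is mentioned in plugin.includes"
  else
    echo "lhg-parser is NOT mentioned in plugin.includes"
  fi
fi
```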
I have set up an EMR cluster with 1 master, 1 core and 1 task node to start
the testing. I moved the apache-nutch-1.9 folder to the master node of the EMR
cluster.
I tried running the job in deploy mode from the runtime/deploy folder. My
plugin was not called, although at the end of the crawl its name was printed in
the list of installed plugins. I tried enabling debug logs on the cluster in
the Nutch log4j.properties. It did not give much information about the parsing
stage, and my plugin's log messages did not appear.
bin/crawl s3://nutchtest/urls/ nutch_deploymode
http://solrurl:4040/solr/testcollection/ 2
I tried running the same in local mode from the runtime/local directory. My
plugin got picked up, and the data was indexed in Solr with the new attributes
specified in the plugin.
bin/crawl urls/ nutch_local http://solrurl:4040/solr/testcollection/ 2
Could you please help me or point out what I am doing wrong here?
Does running the bin/crawl script in deploy mode execute the Nutch job in the
Hadoop cluster? Since the org.apache.nutch.crawl.Crawl class has been
deprecated, I cannot run the command below. Is there a different command which
I can use to run Nutch in a clustered environment?
hadoop jar apache-nutch-1.9.job org.apache.nutch.crawl.Crawl -solr
http://solrurl:4040/solr/testcollection/ 2
I am using Nutch 1.9, Amazon Linux AMI version 2.4.11 (Hadoop 1.0.3) and Solr
version 4.7.2.
Regards,
Lavanya