Dear Community,

I'm running the Nutch Inject job programmatically from within IntelliJ, with 
the target (YARN) cluster's configuration and the Nutch configuration on the 
classpath. In addition, the HADOOP and NUTCH CONF and HOME directories are set 
and point to distributions that I have on my local machine.

When I start the program, the Nutch injector connects to YARN 2.8.0 and the 
inject job starts correctly. However, during the initialization (setup) phase 
of the mapper (InjectMapper), an exception is thrown:

Caused by: java.lang.IllegalArgumentException: plugin.folders is not defined
    at org.apache.nutch.plugin.PluginManifestParser.parsePluginFolder(PluginManifestParser.java:78)
    at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:71)
    at org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:99)
    at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:117)
    at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:70)

On each YARN NodeManager, a Nutch distribution is installed whose configuration 
(nutch-site.xml) defines the key "plugin.folders" as an absolute path to the 
plugin folders. On the YARN side, I've set up additional environment variables 
for the NodeManagers, as follows:

<property>
<name>yarn.nodemanager.admin-env</name>
<value>MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX,NUTCH_CONF_DIR=/opt/apache-nutch-1.13/conf/,NUTCH_HOME=/opt/apache-nutch-1.13/</value>
</property>
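
For reference, the "plugin.folders" entry in the NodeManagers' nutch-site.xml 
mentioned above has roughly the following shape. This is only a sketch: the 
path shown is an assumption derived from the NUTCH_HOME value above, not copied 
from the actual file.

```xml
<!-- Sketch of the nutch-site.xml entry on the NodeManagers.
     The value is an assumed path (NUTCH_HOME + "/plugins"). -->
<property>
<name>plugin.folders</name>
<value>/opt/apache-nutch-1.13/plugins</value>
</property>
```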

Besides the YARN settings, I have also set the MapReduce child environment 
variables:

<property>
<name>mapred.child.env</name>
<value>NUTCH_HOME=/opt/apache-nutch-1.13,NUTCH_CONF_DIR=/opt/apache-nutch-1.13/conf</value>
</property>

I've also tried running the program with "plugin.folders" defined as a JVM 
parameter (supplied with -D), without success.

I'm probably missing something. How should I define "plugin.folders" when the 
inject job is submitted to, and runs on, a remote cluster?

Thanks for helping me out.

Zoltán
