> is to pass configuration parameters programmatically

That shouldn't be difficult, as all Nutch tools get the configuration from the class NutchConfiguration.
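[Editor's note: the override behavior discussed in this thread - file-based defaults read first, programmatic set() calls shipped in job.xml winning over them - can be illustrated without a Hadoop installation. The LayeredConf class below is a made-up stand-in for org.apache.hadoop.conf.Configuration, not a real Nutch or Hadoop class; only the property name plugin.folders comes from the thread.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for Hadoop's Configuration: resources are applied in order
// (nutch-default.xml, nutch-site.xml, then the values from programmatic
// set() calls that end up in the submitted job.xml), and a later source
// overrides an earlier one.
class LayeredConf {
    private final Map<String, String> props = new LinkedHashMap<>();

    // Corresponds to loading an XML resource such as nutch-default.xml.
    void addResource(Map<String, String> resource) {
        props.putAll(resource);
    }

    // Corresponds to Configuration.set(key, value) in client code; these
    // values are shipped with a remote submission inside job.xml.
    void set(String key, String value) {
        props.put(key, value);
    }

    String get(String key) {
        return props.get(key);
    }

    public static void main(String[] args) {
        LayeredConf conf = new LayeredConf();
        // nutch-default.xml defines plugin.folders (value "plugins")
        conf.addResource(Map.of("plugin.folders", "plugins"));
        // a programmatic set() wins over the file-based default
        conf.set("plugin.folders", "/opt/apache-nutch-1.13/plugins");
        System.out.println(conf.get("plugin.folders"));
    }
}
```

This is why setting the parameter on the client-side configuration object, before constructing the tool, survives a remote launch even when the remote side never reads the local nutch-site.xml.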
Thanks,
Sebastian

On 07/19/2017 06:40 PM, Zoltán Zvara wrote:
> Hi Sebastian,
>
> Thanks for your tips. I have switched on debugging for YARN and kept
> "launch_container.sh" around for a few minutes to be able to examine it.
> The HADOOP and NUTCH CONF and HOME directories were correctly set for the
> AM as well as for MR.YarnChild. The CLASSPATH was set correctly to the
> Nutch configuration, so nutch-site.xml should be picked up. As I've
> realized, a "job.xml" is attached to the submission from my remote
> computer, which includes any parameter set by the remote JVM through a
> Hadoop Configuration. This means the only way to configure such a remote
> launch is to pass configuration parameters programmatically.
>
> For example:
>
>     val hConf = new HadoopConfiguration()
>     hConf.set(..., ...)
>     hConf.set(..., ...)
>
>     val injector = new Injector(hConf)
>     injector.inject(...)
>
> The above is just pseudo code. Sorry if there are any mistakes.
>
> Cheers,
> Zoltán
>
> On 2017-07-19 17:43:13, Sebastian Nagel <[email protected]> wrote:
> Hi Zoltán,
>
> A warning ahead: personally, I've never tried to control a Nutch launch
> remotely, so I don't know a solution.
>
> If the property "plugin.folders" is not known, this means Nutch also
> didn't read nutch-default.xml, where it is defined. I would start by
> looking at the classpath: check whether it contains the configuration
> folder (local mode) or the apache-nutch-*.job file (distributed mode).
>
> Note that the environment variable NUTCH_CONF_DIR is used only by
> bin/nutch - the path is added to the classpath. Loading the configuration
> files (nutch-site.xml and nutch-default.xml) is delegated to Hadoop.
> Similarly, NUTCH_HOME is only used to find the Nutch installation or the
> job file.
>
> To analyze the problem, try changing
>
>     log4j.logger.org.apache.hadoop=WARN
>
> to INFO or DEBUG.
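[Editor's note: the classpath check Sebastian suggests can be done from plain Java - a configuration file is only loadable if the directory containing it is on the classpath, which is what bin/nutch achieves by prepending NUTCH_CONF_DIR. The sketch below uses a temporary directory as a stand-in for the Nutch conf/ directory; ConfOnClasspathCheck is a made-up helper, not part of Nutch.]

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

// Checks whether a given class loader can find a named resource, the same
// lookup Hadoop ultimately performs when loading nutch-site.xml.
class ConfOnClasspathCheck {
    static URL locate(ClassLoader loader, String resource) {
        return loader.getResource(resource);
    }

    public static void main(String[] args) throws Exception {
        // A temp directory stands in for /opt/apache-nutch-1.13/conf/.
        Path confDir = Files.createTempDirectory("nutch-conf");
        Files.writeString(confDir.resolve("nutch-site.xml"),
                "<configuration></configuration>");

        // Without the conf dir on the classpath, the lookup comes up empty.
        System.out.println(locate(
                ConfOnClasspathCheck.class.getClassLoader(), "nutch-site.xml"));

        // With the conf dir added (as bin/nutch does via NUTCH_CONF_DIR),
        // the resource is found.
        try (URLClassLoader withConf = new URLClassLoader(
                new URL[] { confDir.toUri().toURL() })) {
            System.out.println(locate(withConf, "nutch-site.xml"));
        }
    }
}
```

Running the same getResource("nutch-site.xml") probe inside the failing JVM is a quick way to confirm whether the configuration folder really made it onto the classpath.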
>
> Best,
> Sebastian
>
> On 07/18/2017 08:50 PM, Zoltán Zvara wrote:
>> Dear Community,
>>
>> I'm running the Inject job programmatically, from within IntelliJ, where
>> the target cluster's (YARN) configuration and the Nutch configuration are
>> on the classpath. In addition, the HADOOP and NUTCH CONF and HOME
>> directories are set - to distributions that I have on my local machine.
>>
>> Starting the program, Nutch Inject connects to YARN 2.8.0 and the inject
>> job starts correctly. However, during the initialization (setup) phase of
>> the mapper (InjectMapper), an exception is thrown:
>>
>>     Caused by: java.lang.IllegalArgumentException: plugin.folders is not defined
>>       at org.apache.nutch.plugin.PluginManifestParser.parsePluginFolder(PluginManifestParser.java:78)
>>       at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:71)
>>       at org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:99)
>>       at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:117)
>>       at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:70)
>>
>> On the YARN NodeManagers, a Nutch distribution is sitting with a
>> configuration (nutch-site.xml) whose key "plugin.folders" points to the
>> plugin folder by an absolute path. As for YARN, I've set up additional
>> environment variables for the NMs, as follows:
>>
>>     <property>
>>       <name>yarn.nodemanager.admin-env</name>
>>       <value>MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX,NUTCH_CONF_DIR=/opt/apache-nutch-1.13/conf/,NUTCH_HOME=/opt/apache-nutch-1.13/</value>
>>     </property>
>>
>> In addition, I have set the MR environment variables as well:
>>
>>     <property>
>>       <name>mapred.child.env</name>
>>       <value>NUTCH_HOME=/opt/apache-nutch-1.13,NUTCH_CONF_DIR=/opt/apache-nutch-1.13/conf</value>
>>     </property>
>>
>> I've also tried to run the program with JVM parameters, supplied with -D,
>> to define "plugin.folders".
>>
>> Probably I'm missing something. How should I define "plugin.folders" when
>> the inject job is submitted and run remotely?
>>
>> Thanks for helping me out.
>>
>> Zoltán
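[Editor's note on the -D attempt in the original message: a JVM flag like java -Dplugin.folders=... only sets a Java system property, which Hadoop does not copy into the job configuration. Hadoop tools run through ToolRunner instead accept "-D key=value" as program arguments, consumed by GenericOptionsParser and written into the Configuration. The sketch below is not GenericOptionsParser itself; it is a made-up, minimal imitation of its -D handling to show the distinction.]

```java
import java.util.HashMap;
import java.util.Map;

// Minimal imitation of how Hadoop's GenericOptionsParser treats
// "-D key=value" program arguments: they are copied into the job
// configuration, unlike JVM system properties, which are ignored.
class GenericOptionsSketch {
    static Map<String, String> parse(String[] args) {
        Map<String, String> conf = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                String[] kv = args[++i].split("=", 2);
                if (kv.length == 2) {
                    conf.put(kv[0], kv[1]);
                }
            }
        }
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = parse(new String[] {
                "-D", "plugin.folders=/opt/apache-nutch-1.13/plugins",
                "urls/", "crawldb/" });
        System.out.println(conf.get("plugin.folders"));
    }
}
```

So when launching a Nutch tool through ToolRunner, "-D plugin.folders=..." belongs among the tool's arguments, not among the JVM flags.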

