> is to pass configuration parameters programmatically

That shouldn't be difficult, as all Nutch tools get the configuration from the class NutchConfiguration.
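[Editor's note: the override behavior discussed in this thread - file-based defaults read first, programmatic set() calls shipped in job.xml winning over them - can be illustrated without a Hadoop installation. The LayeredConf class below is a made-up stand-in for org.apache.hadoop.conf.Configuration, not a real Nutch or Hadoop class; only the property name plugin.folders comes from the thread.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for Hadoop's Configuration: resources are applied in order
// (nutch-default.xml, nutch-site.xml, then the values from programmatic
// set() calls that end up in the submitted job.xml), and a later source
// overrides an earlier one.
class LayeredConf {
    private final Map<String, String> props = new LinkedHashMap<>();

    // Corresponds to loading an XML resource such as nutch-default.xml.
    void addResource(Map<String, String> resource) {
        props.putAll(resource);
    }

    // Corresponds to Configuration.set(key, value) in client code; these
    // values are shipped with a remote submission inside job.xml.
    void set(String key, String value) {
        props.put(key, value);
    }

    String get(String key) {
        return props.get(key);
    }

    public static void main(String[] args) {
        LayeredConf conf = new LayeredConf();
        // nutch-default.xml defines plugin.folders (value "plugins")
        conf.addResource(Map.of("plugin.folders", "plugins"));
        // a programmatic set() wins over the file-based default
        conf.set("plugin.folders", "/opt/apache-nutch-1.13/plugins");
        System.out.println(conf.get("plugin.folders"));
    }
}
```

This is why setting the parameter on the client-side configuration object, before constructing the tool, survives a remote launch even when the remote side never reads the local nutch-site.xml.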
Thanks,
Sebastian

On 07/19/2017 06:40 PM, Zoltán Zvara wrote:
> Hi Sebastian,
>
> Thanks for your tips. I have switched on debugging for YARN and kept
> "launch_container.sh" around for a few minutes to be able to examine it.
> The HADOOP and NUTCH CONF and HOME directories were correctly set for the
> AM as well as for MR.YarnChild. The CLASSPATH was set correctly to the
> Nutch configuration, so nutch-site.xml should be picked up. As I've
> realized, a "job.xml" is attached to the submission from my remote
> computer, which includes any parameter set by the remote JVM through a
> Hadoop Configuration. This means the only way to configure such a remote
> launch is to pass configuration parameters programmatically.
>
> For example:
>
>     val hConf = new HadoopConfiguration()
>     hConf.set(..., ...)
>     hConf.set(..., ...)
>
>     val injector = new Injector(hConf)
>     injector.inject(...)
>
> The above is just pseudo code. Sorry if there are any mistakes.
>
> Cheers,
> Zoltán
>
> On 2017-07-19 17:43:13, Sebastian Nagel <[email protected]> wrote:
> Hi Zoltán,
>
> A warning ahead: personally, I've never tried to control a Nutch launch
> remotely, so I don't know a solution.
>
> If the property "plugin.folders" is not known, this means Nutch also
> didn't read nutch-default.xml, where it is defined. I would start by
> looking at the classpath: check whether it contains the configuration
> folder (local mode) or the apache-nutch-*.job file (distributed mode).
>
> Note that the environment variable NUTCH_CONF_DIR is used only by
> bin/nutch - the path is added to the classpath. Loading the configuration
> files (nutch-site.xml and nutch-default.xml) is delegated to Hadoop.
> Similarly, NUTCH_HOME is only used to find the Nutch installation or the
> job file.
>
> To analyze the problem, try changing
>
>     log4j.logger.org.apache.hadoop=WARN
>
> to INFO or DEBUG.
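[Editor's note: the classpath check Sebastian suggests can be done from plain Java - a configuration file is only loadable if the directory containing it is on the classpath, which is what bin/nutch achieves by prepending NUTCH_CONF_DIR. The sketch below uses a temporary directory as a stand-in for the Nutch conf/ directory; ConfOnClasspathCheck is a made-up helper, not part of Nutch.]

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

// Checks whether a given class loader can find a named resource, the same
// lookup Hadoop ultimately performs when loading nutch-site.xml.
class ConfOnClasspathCheck {
    static URL locate(ClassLoader loader, String resource) {
        return loader.getResource(resource);
    }

    public static void main(String[] args) throws Exception {
        // A temp directory stands in for /opt/apache-nutch-1.13/conf/.
        Path confDir = Files.createTempDirectory("nutch-conf");
        Files.writeString(confDir.resolve("nutch-site.xml"),
                "<configuration></configuration>");

        // Without the conf dir on the classpath, the lookup comes up empty.
        System.out.println(locate(
                ConfOnClasspathCheck.class.getClassLoader(), "nutch-site.xml"));

        // With the conf dir added (as bin/nutch does via NUTCH_CONF_DIR),
        // the resource is found.
        try (URLClassLoader withConf = new URLClassLoader(
                new URL[] { confDir.toUri().toURL() })) {
            System.out.println(locate(withConf, "nutch-site.xml"));
        }
    }
}
```

Running the same getResource("nutch-site.xml") probe inside the failing JVM is a quick way to confirm whether the configuration folder really made it onto the classpath.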
>
> Best,
> Sebastian
>
> On 07/18/2017 08:50 PM, Zoltán Zvara wrote:
>> Dear Community,
>>
>> I'm running the Inject job programmatically, from within IntelliJ, where
>> the target cluster's (YARN) configuration and the Nutch configuration are
>> on the classpath. In addition, the HADOOP and NUTCH CONF and HOME
>> directories are set - to distributions that I have on my local machine.
>>
>> Starting the program, Nutch Inject connects to YARN 2.8.0 and the inject
>> job starts correctly. However, during the initialization (setup) phase of
>> the mapper (InjectMapper), an exception is thrown:
>>
>>     Caused by: java.lang.IllegalArgumentException: plugin.folders is not defined
>>       at org.apache.nutch.plugin.PluginManifestParser.parsePluginFolder(PluginManifestParser.java:78)
>>       at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:71)
>>       at org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:99)
>>       at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:117)
>>       at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:70)
>>
>> On the YARN NodeManagers, a Nutch distribution is sitting with a
>> configuration (nutch-site.xml) whose key "plugin.folders" points to the
>> plugin folder by an absolute path. As for YARN, I've set up additional
>> environment variables for the NMs, as follows:
>>
>>     <property>
>>       <name>yarn.nodemanager.admin-env</name>
>>       <value>MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX,NUTCH_CONF_DIR=/opt/apache-nutch-1.13/conf/,NUTCH_HOME=/opt/apache-nutch-1.13/</value>
>>     </property>
>>
>> In addition, I have set the MR environment variables as well:
>>
>>     <property>
>>       <name>mapred.child.env</name>
>>       <value>NUTCH_HOME=/opt/apache-nutch-1.13,NUTCH_CONF_DIR=/opt/apache-nutch-1.13/conf</value>
>>     </property>
>>
>> I've also tried to run the program with JVM parameters, supplied with -D,
>> to define "plugin.folders".
>>
>> Probably I'm missing something. How should I define "plugin.folders" when
>> the inject job is submitted and run remotely?
>>
>> Thanks for helping me out.
>>
>> Zoltán
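[Editor's note on the -D attempt in the original message: a JVM flag like java -Dplugin.folders=... only sets a Java system property, which Hadoop does not copy into the job configuration. Hadoop tools run through ToolRunner instead accept "-D key=value" as program arguments, consumed by GenericOptionsParser and written into the Configuration. The sketch below is not GenericOptionsParser itself; it is a made-up, minimal imitation of its -D handling to show the distinction.]

```java
import java.util.HashMap;
import java.util.Map;

// Minimal imitation of how Hadoop's GenericOptionsParser treats
// "-D key=value" program arguments: they are copied into the job
// configuration, unlike JVM system properties, which are ignored.
class GenericOptionsSketch {
    static Map<String, String> parse(String[] args) {
        Map<String, String> conf = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                String[] kv = args[++i].split("=", 2);
                if (kv.length == 2) {
                    conf.put(kv[0], kv[1]);
                }
            }
        }
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = parse(new String[] {
                "-D", "plugin.folders=/opt/apache-nutch-1.13/plugins",
                "urls/", "crawldb/" });
        System.out.println(conf.get("plugin.folders"));
    }
}
```

So when launching a Nutch tool through ToolRunner, "-D plugin.folders=..." belongs among the tool's arguments, not among the JVM flags.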

