To run Nutch fetches on multiple sites you can set a NUTCH_CONF_DIR
environment variable, as described in the FAQ:

http://wiki.apache.org/nutch/FAQ

" How can I force fetcher to use custom nutch-config?

   - Create a new sub-directory under $NUTCH_HOME/conf, like conf/myconfig
   - Copy these files from $NUTCH_HOME/conf to the new directory:
   common-terms.utf8, mime-types.*, nutch-conf.xsl, nutch-default.xml,
   regex-normalize.xml, regex-urlfilter.txt
   - Modify the nutch-default.xml to suit your needs
   - Set the NUTCH_CONF_DIR environment variable to point to the directory
   you created
   - Run $NUTCH_HOME/bin/nutch so that it picks up the NUTCH_CONF_DIR
   environment variable. Check the command output for the lines showing
   where the configs are loaded, to verify they are really loaded from your
   custom directory.
   - Happy using."
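The FAQ steps above can be sketched as a few shell commands. This is only a
sketch: the NUTCH_HOME default below is a hypothetical install location, and
"myconfig" is just the example directory name from the FAQ.

```shell
# Sketch of the FAQ steps; adjust NUTCH_HOME to your actual install.
NUTCH_HOME="${NUTCH_HOME:-$HOME/nutch}"   # hypothetical location

# 1. Create a new sub-directory under $NUTCH_HOME/conf
mkdir -p "$NUTCH_HOME/conf/myconfig"

# 2. Copy the config files listed in the FAQ (skip any that are missing)
for f in common-terms.utf8 mime-types.* nutch-conf.xsl nutch-default.xml \
         regex-normalize.xml regex-urlfilter.txt; do
  if [ -e "$NUTCH_HOME/conf/$f" ]; then
    cp "$NUTCH_HOME/conf/$f" "$NUTCH_HOME/conf/myconfig/"
  fi
done

# 3. Edit $NUTCH_HOME/conf/myconfig/nutch-default.xml to suit your needs,
#    then point NUTCH_CONF_DIR at the new directory:
export NUTCH_CONF_DIR="$NUTCH_HOME/conf/myconfig"

# 4. Run nutch and watch the output for which config directory is loaded:
# "$NUTCH_HOME/bin/nutch" fetch ...
```

With this in place, each crawl can export a different NUTCH_CONF_DIR before
invoking bin/nutch, so the configs never have to be edited in place.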


This worked fine when I wasn't using a Hadoop cluster. Once I started a
Hadoop cluster, though, it ignored the regex-urlfilter.txt. When I included
the regex-urlfilter.txt with the Hadoop configuration files and restarted
the cluster, the Nutch fetch followed the rules. The problem with this is
that if you are doing two different fetches with different
regex-urlfilter.txt files, you need to stop the HDFS cluster, put in the new
configuration files, and restart the cluster.
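For reference, the workaround described above amounts to something like the
following sketch. The HADOOP_CONF_DIR and NUTCH_CONF_DIR defaults are
assumptions (adjust to your layout), and the stop/start scripts shown are
the stock Hadoop cluster scripts.

```shell
# Sketch of the per-fetch workaround: ship the filter file with the Hadoop
# config so every node sees it after a restart. Paths are assumptions.
HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-$HOME/hadoop/conf}"          # hypothetical
NUTCH_CONF_DIR="${NUTCH_CONF_DIR:-$HOME/nutch/conf/myconfig}"    # hypothetical

mkdir -p "$HADOOP_CONF_DIR"
if [ -e "$NUTCH_CONF_DIR/regex-urlfilter.txt" ]; then
  cp "$NUTCH_CONF_DIR/regex-urlfilter.txt" "$HADOOP_CONF_DIR/"
fi

# Then restart the cluster so the nodes reread their configuration:
# "$HADOOP_HOME/bin/stop-all.sh" && "$HADOOP_HOME/bin/start-all.sh"
```

Repeating this stop/copy/start cycle for every fetch is exactly the overhead
I am hoping to avoid.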

Is there any other way of setting this up?

Thanks,
Steve
