Hi Sol,
Of course, you could provide a separate package for every crawl.
In local mode, though, it's easier to point NUTCH_CONF_DIR to the right
directory. It can even be a list of directories separated by ':' which are
searched in order (config files are actually looked up on the Java classpath).
For example, one could define a shell function for Nutch:
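nutch () {
  # the assignments must be on the same command line as the nutch call,
  # otherwise they are not passed into its environment
  NUTCH_LOG_DIR=./logs NUTCH_CONF_DIR=./conf:$NUTCH_HOME/conf \
    "$NUTCH_HOME/bin/nutch" "$@"
}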
Any config file found in ./conf/ (usually just nutch-site.xml) takes
precedence over the file of the same name in $NUTCH_HOME/conf/.
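
A minimal sketch of a per-crawl layout that such a function would pick up,
assuming one working directory per crawl. The directory name crawl-news and
the example.org pattern are made up for illustration; the +/- line format
follows the regex-urlfilter.txt shipped with Nutch.

```shell
# Hypothetical per-crawl working directory (name is illustrative only)
crawl_dir=crawl-news
mkdir -p "$crawl_dir/conf" "$crawl_dir/logs"

# Crawl-specific URL filter: '+' accepts, '-' rejects, first match wins.
# The pattern below is a placeholder, not taken from a real crawl.
cat > "$crawl_dir/conf/regex-urlfilter.txt" <<'EOF'
+^https?://([a-z0-9-]+\.)*example\.org/
-.
EOF
```

Calling the nutch function from inside crawl-news/ should then find
./conf/regex-urlfilter.txt before the copy in $NUTCH_HOME/conf/.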
For your specific use case, see also:
<property>
<name>urlfilter.regex.file</name>
<value>regex-urlfilter.txt</value>
<description>Name of file on CLASSPATH containing regular expressions
used by urlfilter-regex (RegexURLFilter) plugin.</description>
</property>
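
If each crawl should keep its filter rules under a different file name, this
property could be overridden in a crawl-local nutch-site.xml. A sketch,
assuming ./conf comes first in NUTCH_CONF_DIR; the file name
regex-urlfilter-news.txt is hypothetical.

```shell
# Assumption: run from the crawl's working directory.
# Note: this overwrites any existing ./conf/nutch-site.xml.
mkdir -p conf
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>urlfilter.regex.file</name>
    <value>regex-urlfilter-news.txt</value>
  </property>
</configuration>
EOF
```

Since the first nutch-site.xml found on the classpath is most likely the only
one loaded, any other site-wide settings would probably have to be copied into
this file as well.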
This would also work in cluster mode, as you can set or override properties
on the command line when launching Nutch.
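
A rough sketch of such an override, assuming the tool accepts Hadoop-style -D
options placed directly after the tool name and the Nutch 1.x argument order
for inject. The helper crawl_inject, the DRY_RUN switch, and all file and path
names are hypothetical.

```shell
# Sketch: choose the URL filter file per crawl via a -D property override.
# crawl_inject and DRY_RUN are made-up names, not part of Nutch.
crawl_inject () {
  filter_file=$1   # must still be findable on the classpath
  crawldb=$2
  seeds=$3
  # -D options go before the tool's own positional arguments
  set -- "$NUTCH_HOME/bin/nutch" inject \
    "-Durlfilter.regex.file=$filter_file" "$crawldb" "$seeds"
  if [ -n "$DRY_RUN" ]; then
    echo "$@"      # print the command instead of running it
  else
    "$@"
  fi
}
```

For example, DRY_RUN=1 crawl_inject regex-urlfilter-news.txt crawl-news/crawldb urls/news
prints the command line without starting Nutch. In cluster mode the named file
presumably has to be packaged with the job so that it is on the classpath there.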
Sebastian
On 11/08/2017 03:55 PM, Sol Lederman wrote:
> Hi,
>
> I need to have different regex-urlfilter.txt files for different crawls.
> Since the file lives in conf and I don't see a way to point nutch inject to
> a different file or a different conf directory, I assume I should just swap
> in a different regex-urlfilter.txt file every time I do a crawl.
>
> Does that sound right?
>
> Thanks.
>
> Sol
>