Hi, Nutch Gurus, I have a use case that I need to implement and I hope that someone can help.
I have a situation where I need to build URLs dynamically and pass them to the respective filter. I want to pass the following newly constructed rules string to the filter implementation associated with regex-urlfilter.txt, for it to parse:

    # URLs to be excluded
    -http://foo[aZ-zZ0-9]\.mydomain.com
    -https://foo[aZ-zZ0-9]\.mydomain.com
    # URL to be crawled
    +http://newfoo[aZ-zZ0-9]\.mydomain.com
    +https://newfoo[aZ-zZ0-9]\.mydomain.com

From Nutch's RegexURLFilter.java implementation (with my own debug statements added), we have the following setup:

    public static final String URLFILTER_REGEX_FILE = "urlfilter.regex.file";
    public static final String URLFILTER_REGEX_RULES = "urlfilter.regex.rules";

    /**
     * Rules specified as a config property will override rules specified
     * as a config file.
     */
    protected Reader getRulesReader(Configuration conf) throws IOException {
        String stringRules = conf.get(URLFILTER_REGEX_RULES);
        LOG.debug("The string rules = " + stringRules);
        if (stringRules != null) {
            LOG.debug("The string rules are not null. Returning a String Reader object.");
            return new StringReader(stringRules);
        }
        String fileRules = conf.get(URLFILTER_REGEX_FILE);
        LOG.debug("The fileRules rules = " + fileRules);
        LOG.debug("Getting the rules as an input stream.");
        return conf.getConfResourceAsReader(fileRules);
    }

I have a TimerTask implementation that, based on certain conditions, updates the Configuration object:

    public class MyTask extends TimerTask {
        private Configuration configuration;

        // Getter and setter.

        @Override
        public void run() {
            // Some backend logic that involves constructing the URL if updated.
            String urlFilterRegexRules = new StringBuilder(. . . .
                ).toString();

            Map<String, Object> argsMap = new HashMap<>();
            // The Random seed must be a long; 1e8 is a double literal.
            Random random = new Random((long) 1e8);
            long num = random.nextLong();
            argsMap.put(NUTCH.ARGS_SEEDDIR, "/tmp/seed" + num + ".txt");

            // Set the rules as a config property so that they override the file.
            this.configuration.set(RegexURLFilter.URLFILTER_REGEX_RULES, urlFilterRegexRules);

            InjectorJob job = new InjectorJob(this.configuration);
            job.run(argsMap);
        }
    }

From the logs:

    2014-08-28 13:55:36 DEBUG org.apache.nutch.urlfilter.regex.RegexURLFilter:71 - The string rules = null
    2014-08-28 13:55:36 DEBUG org.apache.nutch.urlfilter.regex.RegexURLFilter:77 - The fileRules rules = regex-urlfilter.txt
    2014-08-28 13:55:36 DEBUG org.apache.nutch.urlfilter.regex.RegexURLFilter:78 - Getting the rules as an input stream.

So the filter still sees urlfilter.regex.rules as null and falls back to the file. What am I doing wrong? Any advice would be greatly appreciated.

My modified crawler main method (Crawler.java):

    public static void main(String[] args) {
        Configuration configuration = NutchConfiguration.create();
        Timer timer = new Timer();
        MyTask myTask = new MyTask();
        myTask.setConfiguration(configuration);
        // Re-run the task every 4 hours.
        timer.scheduleAtFixedRate(myTask, 0, 4 * 60 * 60 * 1000);
        ToolRunner.run(myTask.getConfiguration(), crawler, args);
    }

Thanks,
Kartik
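
P.S. To make clear what I expect to happen, here is a small stand-alone sketch of the lookup order described in the getRulesReader Javadoc. It uses no Nutch or Hadoop classes: a plain Map stands in for Configuration, and openRulesFile is a made-up placeholder for conf.getConfResourceAsReader. It only illustrates the precedence and is not the real filter code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

class RulesPrecedenceSketch {
    static final String URLFILTER_REGEX_FILE = "urlfilter.regex.file";
    static final String URLFILTER_REGEX_RULES = "urlfilter.regex.rules";

    // Hypothetical placeholder for conf.getConfResourceAsReader(fileName):
    // returns fixed content instead of reading a real file.
    static Reader openRulesFile(String fileName) {
        return new StringReader("# rules from " + fileName + "\n");
    }

    // Same order as getRulesReader: the rules property, when set, wins over the file.
    static Reader getRulesReader(Map<String, String> conf) {
        String stringRules = conf.get(URLFILTER_REGEX_RULES);
        if (stringRules != null) {
            return new StringReader(stringRules);
        }
        return openRulesFile(conf.get(URLFILTER_REGEX_FILE));
    }

    // Helper that reads the first line so the chosen source is visible.
    static String firstLine(Reader reader) throws IOException {
        try (BufferedReader br = new BufferedReader(reader)) {
            return br.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> conf = new HashMap<>();
        conf.put(URLFILTER_REGEX_FILE, "regex-urlfilter.txt");
        // Only the file property is set, so the file placeholder is used.
        System.out.println(firstLine(getRulesReader(conf)));

        conf.put(URLFILTER_REGEX_RULES, "+http://newfoo[aZ-zZ0-9]\\.mydomain.com");
        // The rules property is now set on the same map, so the string is used.
        System.out.println(firstLine(getRulesReader(conf)));
    }
}
```

With only the file property set, the first line comes from the placeholder file; once the rules property is set on the same map, the string takes over. That second behaviour is what I expected after my TimerTask runs, but the logs show the rules property as null.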

