Hi Felix,

assuming that every test crawl runs on its own, without sharing resources with other test crawls (except the Nutch packages), you can write a separate index-writers.xml for every test, place it in a separate directory, and point NUTCH_CONF_DIR to that directory. This works only in local mode (that is, as long as the tests do not run on a Hadoop cluster).
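A minimal sketch of preparing such per-test directories, in case it helps. The BASE location, the default for NUTCH_HOME and the helper conf_dir_for are assumptions made up for illustration. The script only creates the directories and prints the command line, it does not start a crawl:

```shell
#!/bin/sh
# Sketch only: BASE, the NUTCH_HOME default and conf_dir_for are hypothetical.
set -eu

BASE="${BASE:-$(mktemp -d)}"             # root holding one directory per test
NUTCH_HOME="${NUTCH_HOME:-/opt/nutch}"   # assumed install location

# Build the NUTCH_CONF_DIR value for one test: the test-specific directory
# first, the default Nutch configuration as fallback.
conf_dir_for() {
  printf '%s\n' "$BASE/$1/conf:$NUTCH_HOME/conf"
}

for t in test1 test2; do
  # One conf/ and one logs/ directory per test. The test-specific
  # index-writers.xml still has to be placed into conf/ by hand.
  mkdir -p "$BASE/$t/conf" "$BASE/$t/logs"
  # Print the command instead of running it; append your usual crawl arguments.
  echo "NUTCH_CONF_DIR=$(conf_dir_for "$t") NUTCH_LOG_DIR=$BASE/$t/logs $NUTCH_HOME/bin/crawl"
done
```

Keeping a separate logs/ directory per test means each crawl writes its own hadoop.log.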
The directory layout may look like this:

  .../
   |- test1/
   |   `- conf/
   |       |- index-writers.xml
   |       `- regex-urlfilter.txt
   |- test2/
   |   `- conf/
   |       |- index-writers.xml
   ...

Now you run the test crawls with NUTCH_CONF_DIR set as an environment variable:

  NUTCH_CONF_DIR=.../test1/conf:$NUTCH_HOME/conf $NUTCH_HOME/bin/crawl

and

  NUTCH_CONF_DIR=.../test2/conf:$NUTCH_HOME/conf $NUTCH_HOME/bin/crawl

Configuration files are then picked first from test1/conf/ (or test2/conf/, respectively) and, if not found there, from $NUTCH_HOME/conf or from the class path. This also allows you to test different URL filter rules etc. You may also set NUTCH_LOG_DIR for each test so that each one logs into its own hadoop.log file.

That's the easiest way I see so far. Unfortunately, the file names themselves are not configurable for the index writer and exchange configuration files. I've opened
  https://issues.apache.org/jira/browse/NUTCH-2718
to get this resolved.

Best,
Sebastian

On 5/22/19 11:19 AM, Felix von Zadow wrote:
>
> Hello dear list!
>
> I have a problem with the new IndexWriter mechanism in 1.15. Hopefully
> someone can point out to me what I should do differently.
>
> I have a couple of test systems running different versions of a web
> application and there is a separate SOLR core for each of them. There is a
> single VM that crawls and indexes content from scratch for every test system
> that has been redeployed. So up until 1.14 I would simply specify the target
> core (solr.server.url) when calling bin/crawl. Say, today I have redeployed
> test_system_1, so I call bin/crawl to update the SOLR core test_system_1.
>
> Now with 1.15 I cannot explicitly choose a target index anymore, so I tried
> the following: In index-writers.xml, I specified an IndexWriter for each of
> my systems/cores. In order to choose which IndexWriter to use, I specified an
> exchange for every test system in exchanges.xml. It maps the host name (unique
> to each test system) to the correct IndexWriter (and therefore the correct
> core).
> This leaves me with two problems though:
>
> 1. I only ever want to index to one specific core during one crawl cycle and
> I already KNOW its name. However, the Exchange expressions are evaluated for
> every single document I'm indexing. The expression evaluates fine though, so
> it "works" and this being a test environment, I could live with it.
>
> 2. All IndexWriters referenced by ANY of the Exchanges must actually
> reference existing cores, even when only one of the IndexWriters is ever
> actually being used. If any of the referenced cores does NOT exist, Nutch
> will get a 404 for the non-existing core during the indexing phase and break.
> I assume Nutch checks all referenced IndexWriters before starting indexing
> just to be sure they are all available.
>
> Problem #2 is the crux for me since I can't reliably guarantee that all
> (unrelated) cores are available during a certain crawl (and why should I need
> to?).
>
> It's possible that my design is broken or my use case uncommon. But it seems
> to me that I should be able to somewhat easily achieve what I could with
> 1.14, i.e. explicitly choose the target core for each call of bin/crawl. A
> solution would of course be to set up a separate crawling VM for each test
> system, each with a single IndexWriter. But that can't be the way to go.
>
> Grateful for any kind of pointer towards a solution!
>
> Felix