Hi Sebastian! Thank you for your suggestion and detailed explanation!
Putting my index-writers.xml in a separate directory for each test system but leaving the rest in a common directory does the trick! Being able to configure the file names would sure be nice, but for now I don't mind having separate directories.

Felix

> From: Sebastian Nagel
>
> Hi Felix,
>
> assuming that every test crawl runs on its own, not sharing resources with
> other test crawls (except the Nutch packages): you may just write a separate
> index-writers.xml for every test, place it in a separate directory and point
> NUTCH_CONF_DIR to this directory. This works only in local mode (assuming
> that the tests do not run on a Hadoop cluster).
>
> This may look like:
>   .../
>    |- test1/
>    |   `- conf/
>    |       |- index-writers.xml
>    |       `- regex-urlfilter.txt
>    |- test2/
>    |   `- conf/
>    |       |- index-writers.xml
>    ...
>
> Now you run the test crawls with NUTCH_CONF_DIR as environment variable:
>   NUTCH_CONF_DIR=.../test1/conf:$NUTCH_HOME/conf $NUTCH_HOME/bin/crawl
> and
>   NUTCH_CONF_DIR=.../test2/conf:$NUTCH_HOME/conf $NUTCH_HOME/bin/crawl
>
> Configuration files are then first picked from test1/conf/ (resp. test2/conf/)
> and, if not found there, from $NUTCH_HOME/conf or from the class path.
>
> This also allows testing different URL filter rules etc.
>
> You may also set NUTCH_LOG_DIR for each test to log into different
> hadoop.log files.
>
> That's the easiest way I see so far. Unfortunately, the file names themselves
> are not configurable for the index writers and exchanges configuration files.
> I've opened
>   https://issues.apache.org/jira/browse/NUTCH-2718
> to get this resolved.
>
> Best,
> Sebastian
>
> On 5/22/19 11:19 AM, Felix von Zadow wrote:
> >
> > Hello dear list!
> >
> > I have a problem with the new IndexWriter mechanism in 1.15. Hopefully
> > someone can point out to me what I should do differently.
> >
> > I have a couple of test systems running different versions of a web
> > application and there is a separate SOLR core for each of them. There is
> > a single VM that crawls and indexes content from scratch for every test
> > system that has been redeployed. So up until 1.14 I would simply specify
> > the target core (solr.server.url) when calling bin/crawl. Say, today I
> > have redeployed test_system_1, so I call bin/crawl to update the SOLR
> > core test_system_1.
> >
> > Now with 1.15 I cannot explicitly choose a target index anymore, so I
> > tried the following: In index-writers.xml, I specified an IndexWriter for
> > each of my systems/cores. In order to choose which IndexWriter to use, I
> > specified an exchange for every test system in exchanges.xml. It maps the
> > host name (unique to each test system) to the correct IndexWriter (and
> > therefore the correct core). This leaves me with two problems though:
> >
> > 1. I only ever want to index to one specific core during one crawl cycle
> > and I already KNOW its name. However, the Exchange expressions are
> > evaluated for every single document I'm indexing. The expression
> > evaluates fine though, so it "works" and, this being a test environment,
> > I could live with it.
> >
> > 2. All IndexWriters referenced by ANY of the Exchanges must actually
> > reference existing cores, even when only one of the IndexWriters is ever
> > actually being used. If any of the referenced cores does NOT exist, Nutch
> > will get a 404 for the non-existing core during the indexing phase and
> > break. I assume Nutch checks all referenced IndexWriters before starting
> > indexing just to be sure they are all available.
> >
> > Problem #2 is the crux for me since I can't reliably guarantee that all
> > (unrelated) cores are available during a certain crawl (and why should I
> > need to?).
> >
> > It's possible that my design is broken or my use case uncommon. But it
> > seems to me that I should be able to somewhat easily achieve what I could
> > with 1.14, i.e. explicitly choose the target core for each call of
> > bin/crawl. A solution would of course be to set up a separate crawling VM
> > for each test system, each with a single IndexWriter. But that can't be
> > the way to go.
> >
> > Grateful for any kind of pointer towards a solution!
> >
> > Felix
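
For reference, the per-test setup from the quoted reply could be wrapped in a small shell helper roughly like the one below. This is only a sketch under assumptions: TESTS_ROOT (the elided ".../" parent directory), the logs/ subdirectory, and both function names are hypothetical and not part of Nutch. Only NUTCH_HOME, NUTCH_CONF_DIR, NUTCH_LOG_DIR and bin/crawl come from the thread. It applies to local mode only, as noted in the reply.

```shell
#!/bin/sh
# Sketch only: TESTS_ROOT, the logs/ directory and the function names
# are hypothetical; adjust them to your own layout.

# Print the colon-separated configuration path for one test system:
# the test-specific conf directory first, the shared Nutch conf second.
nutch_conf_dir_for() {
    test_name=$1
    printf '%s/%s/conf:%s/conf\n' "$TESTS_ROOT" "$test_name" "$NUTCH_HOME"
}

# Run bin/crawl for one test system. The first argument is the test
# name; all remaining arguments are passed to bin/crawl unchanged.
run_test_crawl() {
    test_name=$1
    shift
    NUTCH_CONF_DIR=$(nutch_conf_dir_for "$test_name") \
    NUTCH_LOG_DIR="$TESTS_ROOT/$test_name/logs" \
        "$NUTCH_HOME/bin/crawl" "$@"
}
```

With TESTS_ROOT and NUTCH_HOME exported, a call such as `run_test_crawl test1` followed by the usual bin/crawl arguments would then pick up test1/conf/index-writers.xml and fall back to $NUTCH_HOME/conf for everything else.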

