> Putting my index-writers.xml in a separate directory for each test system
> but leaving the rest in a common directory does the trick!

Great! Thanks for the notice!

> Being able to configure the file names would sure be nice but for now I
> don't mind having separate directories.

It's a rather trivial improvement. But we'll do it. :)

On 5/27/19 11:46 AM, Felix von Zadow wrote:
>
> Hi Sebastian!
>
> Thank you for your suggestion and detailed explanation!
>
> Putting my index-writers.xml in a separate directory for each test system
> but leaving the rest in a common directory does the trick!
> Being able to configure the file names would sure be nice but for now I
> don't mind having separate directories.
>
> Felix
>
>> From: Sebastian Nagel
>>
>> Hi Felix,
>>
>> assuming that every test crawl runs on its own, not sharing resources
>> with other test crawls (except the Nutch packages): you may just write
>> a separate index-writers.xml for every test, place it in a separate
>> directory, and point NUTCH_CONF_DIR to this directory. This works only
>> in local mode (assuming that the tests do not run on a Hadoop cluster).
>>
>> This may look like:
>>   .../
>>   |- test1/
>>   |  `- conf/
>>   |     |- index-writers.xml
>>   |     `- regex-urlfilter.txt
>>   |- test2/
>>   |  `- conf/
>>   |     |- index-writers.xml
>>   ...
>>
>> Now you run the test crawls with NUTCH_CONF_DIR as an environment
>> variable:
>>   NUTCH_CONF_DIR=.../test1/conf:$NUTCH_HOME/conf $NUTCH_HOME/bin/crawl
>> and
>>   NUTCH_CONF_DIR=.../test2/conf:$NUTCH_HOME/conf $NUTCH_HOME/bin/crawl
>>
>> Configuration files are then first picked from test1/conf/ (resp.
>> test2/conf/) and, if not found there, from $NUTCH_HOME/conf or from the
>> class path.
>>
>> This also allows testing different URL filter rules etc.
>>
>> You may also set NUTCH_LOG_DIR for each test to log into different
>> hadoop.log files.
>>
>> That's the easiest way I see so far. Unfortunately, the file names
>> themselves are not configurable for index writers and exchanges
>> configuration files. I've opened
>> https://issues.apache.org/jira/browse/NUTCH-2718
>> to get this resolved.
>>
>> Best,
>> Sebastian
>>
>> On 5/22/19 11:19 AM, Felix von Zadow wrote:
>>>
>>> Hello dear list!
>>>
>>> I have a problem with the new IndexWriter mechanism in 1.15. Hopefully
>>> someone can point out to me what I should do differently.
>>>
>>> I have a couple of test systems running different versions of a web
>>> application, and there is a separate Solr core for each of them. There
>>> is a single VM that crawls and indexes content from scratch for every
>>> test system that has been redeployed. So up until 1.14 I would simply
>>> specify the target core (solr.server.url) when calling bin/crawl. Say,
>>> today I have redeployed test_system_1, so I call bin/crawl to update
>>> the Solr core test_system_1.
>>>
>>> Now with 1.15 I cannot explicitly choose a target index anymore, so I
>>> tried the following: in index-writers.xml, I specified an IndexWriter
>>> for each of my systems/cores. In order to choose which IndexWriter to
>>> use, I specified an exchange for every test system in exchanges.xml.
>>> It maps the host name (unique to each test system) to the correct
>>> IndexWriter (and therefore the correct core). This leaves me with two
>>> problems though:
>>>
>>> 1. I only ever want to index to one specific core during one crawl
>>> cycle, and I already KNOW its name. However, the exchange expressions
>>> are evaluated for every single document I'm indexing. The expression
>>> evaluates fine though, so it "works", and this being a test
>>> environment, I could live with it.
>>>
>>> 2. All IndexWriters referenced by ANY of the exchanges must actually
>>> reference existing cores, even when only one of the IndexWriters is
>>> ever actually used. If any of the referenced cores does NOT exist,
>>> Nutch will get a 404 for the non-existing core during the indexing
>>> phase and break. I assume Nutch checks all referenced IndexWriters
>>> before starting indexing just to be sure they are all available.
>>>
>>> Problem #2 is the crux for me since I can't reliably guarantee that
>>> all (unrelated) cores are available during a certain crawl (and why
>>> should I need to?).
>>>
>>> It's possible that my design is broken or my use case uncommon. But it
>>> seems to me that I should be able to somewhat easily achieve what I
>>> could with 1.14, i.e. explicitly choose the target core for each call
>>> of bin/crawl. A solution would of course be to set up a separate
>>> crawling VM for each test system, each with a single IndexWriter. But
>>> that can't be the way to go.
>>>
>>> Grateful for any kind of pointer towards a solution!
>>>
>>> Felix
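For reference, the host-based routing Felix set up would look roughly like the exchanges.xml fragment below. This is a sketch from memory, not verified against a 1.15 install: the exchange and writer ids, the class name, and the expression syntax are all assumptions — check the exchanges.xml shipped in your Nutch distribution's conf/ directory for the exact schema.

```xml
<!-- Sketch only: ids, class name, and expression syntax are assumptions. -->
<exchanges>
  <exchange id="exchange_test1"
            class="org.apache.nutch.exchange.jexl.JexlExchange">
    <writers>
      <!-- route matching documents to the writer for this test system -->
      <writer id="indexer_solr_test1"/>
    </writers>
    <params>
      <!-- evaluated once per indexed document, which is problem #1 above;
           the host name is a hypothetical example -->
      <param name="expr"
             value="doc.getFieldValue('host')=='test1.example.org'"/>
    </params>
  </exchange>
</exchanges>
```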
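The per-test layout Sebastian describes can be sketched as a small shell script. This is only an illustration of the directory structure and the NUTCH_CONF_DIR override, not part of the thread: the base path, the test names, and the empty placeholder index-writers.xml files are assumptions, and the bin/crawl arguments are elided just as in the original mail.

```shell
#!/bin/sh
# Sketch of the per-test configuration layout from the thread above.
# NUTCH_HOME, BASE, and the test names are assumptions; adjust to your setup.
set -e

NUTCH_HOME="${NUTCH_HOME:-/opt/nutch}"
BASE="./nutch-tests"

for t in test1 test2; do
  mkdir -p "$BASE/$t/conf"
  # Each test directory carries only the files that differ per test;
  # everything not found here falls back to $NUTCH_HOME/conf or the class path.
  : > "$BASE/$t/conf/index-writers.xml"  # placeholder; fill with real writer config
done

# Launching a crawl for one test system (commented out, since it requires a
# real Nutch installation). Files in test1/conf shadow the shared ones:
# NUTCH_CONF_DIR="$BASE/test1/conf:$NUTCH_HOME/conf" \
# NUTCH_LOG_DIR="$BASE/test1/logs" \
#   "$NUTCH_HOME/bin/crawl" ...
```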