Nutch 1.15 IndexWriter -- how to explicitly choose one?

Felix von Zadow Wed, 22 May 2019 02:29:38 -0700

Hello dear list!

I have a problem with the new IndexWriter mechanism in 1.15. Hopefully someone 
can point out to me what I should do differently.


I have a couple of test systems running different versions of a web application 
and there is a separate SOLR core for each of them. There is a single VM that 
crawls and indexes content from scratch for every test system that has been 
redeployed. So up until 1.14 I would simply specify the target core 
(solr.server.url) when calling bin/crawl. Say, today I have redeployed 
test_system_1, so I call bin/crawl to update the SOLR core test_system_1.

Now with 1.15 I cannot explicitly choose a target index anymore, so I tried the 
following: In index-writers.xml, I specified an IndexWriter for each of my 
systems/cores. In order to choose which IndexWriter to use, I specified an 
exchange for every test system in exchanges.xml. It maps the host name (unique 
to each test system) to the correct IndexWriter (and therefore the correct 
core). This leaves me with two problems though:

1. I only ever want to index to one specific core during one crawl cycle and I 
already KNOW its name. However, the Exchange expressions are evaluated for 
every single document I'm indexing. The expression evaluates fine though, so it 
"works" and, this being a test environment, I could live with it.

2. All IndexWriters referenced by ANY of the Exchanges must actually reference 
existing cores, even when only one of the IndexWriters is ever actually being 
used. If any of the referenced cores does NOT exist, Nutch will get a 404 for 
the non-existing core during the indexing phase and break. I assume Nutch 
checks all referenced IndexWriters before starting indexing just to be sure 
they are all available.

Problem #2 is the crux for me since I can't reliably guarantee that all 
(unrelated) cores are available during a certain crawl (and why should I need 
to?).
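
For illustration, the setup described above might look roughly like this in 
exchanges.xml. This is only a sketch: the exchange ids, the writer ids, the 
host names and the use of the 'host' document field (which assumes a plugin 
such as index-basic adds it) are placeholders, not the actual configuration. 
Each writer id is assumed to match an IndexWriter defined in 
index-writers.xml that points at the corresponding core.

```xml
<!-- Hypothetical sketch; namespace declarations on the root element omitted. -->
<exchanges>
  <!-- Route documents from test system 1 to the writer for core test_system_1 -->
  <exchange id="exchange_test_system_1"
            class="org.apache.nutch.exchange.jexl.JexlExchange">
    <writers>
      <writer id="indexer_solr_test_system_1" />
    </writers>
    <params>
      <param name="expr"
             value="doc.getFieldValue('host')=='test1.example.org'" />
    </params>
  </exchange>

  <!-- Same pattern for test system 2; its core must also exist (problem 2) -->
  <exchange id="exchange_test_system_2"
            class="org.apache.nutch.exchange.jexl.JexlExchange">
    <writers>
      <writer id="indexer_solr_test_system_2" />
    </writers>
    <params>
      <param name="expr"
             value="doc.getFieldValue('host')=='test2.example.org'" />
    </params>
  </exchange>
</exchanges>
```

With a layout like this, the expression of each exchange is evaluated per 
document (problem 1), and every writer listed has to be reachable even if 
only one of them ever receives documents (problem 2).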


It's possible that my design is broken or my use case uncommon. But it seems to 
me that I should be able to somewhat easily achieve what I could with 1.14, 
i.e. explicitly choose the target core for each call of bin/crawl. A solution 
would of course be to set up a separate crawling VM for each test system, each 
with a single IndexWriter. But that can't be the way to go.

Grateful for any kind of pointer towards a solution!

Felix
