Re: Nutch 1.15 IndexWriter -- how to explicitly choose one?

Sebastian Nagel Wed, 22 May 2019 07:34:30 -0700

Hi Felix,

assuming that every test crawl runs on its own and shares no resources with
other test crawls (except the Nutch packages), you can write a separate
index-writers.xml for every test, place it in a separate directory, and point
NUTCH_CONF_DIR to this directory.
This works only in local mode (that is, as long as the tests do not run on a
Hadoop cluster).


This may look like:
 .../
 |- test1/
 |  `- conf/
 |     |- index-writers.xml
 |     `- regex-urlfilter.txt
 |- test2/
 |  `- conf/
 |     |- index-writers.xml
 ...

Now you run the test crawls with NUTCH_CONF_DIR set as an environment variable:
 NUTCH_CONF_DIR=.../test1/conf:$NUTCH_HOME/conf  $NUTCH_HOME/bin/crawl
and
 NUTCH_CONF_DIR=.../test2/conf:$NUTCH_HOME/conf  $NUTCH_HOME/bin/crawl

Configuration files are then picked first from test1/conf/ (resp. test2/conf/)
and, if not found there, from $NUTCH_HOME/conf or from the class path.

This also allows you to test different URL filter rules etc.

You may also set NUTCH_LOG_DIR for each test to log into different hadoop.log
files.
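
A minimal sketch of how this could be wrapped in a small shell helper. The
function names (conf_path_for, run_test_crawl) and the logs/ subdirectory are
made up for illustration and are not part of Nutch; NUTCH_HOME is assumed to
point to the Nutch installation. All arguments after the test directory are
handed to bin/crawl unchanged.

```shell
#!/bin/sh
# Sketch only: helper names and the logs/ directory are hypothetical.
# Assumes NUTCH_HOME points to the Nutch installation (local mode).

# Build the NUTCH_CONF_DIR value for one test directory:
# the test-specific conf/ comes first, the default conf/ is the fallback.
conf_path_for() {
  printf '%s/conf:%s/conf\n' "$1" "$NUTCH_HOME"
}

# Run bin/crawl for one test directory with its own configuration and
# its own log directory. Remaining arguments go to bin/crawl as given.
run_test_crawl() {
  test_dir=$1
  shift
  mkdir -p "$test_dir/logs"
  NUTCH_CONF_DIR=$(conf_path_for "$test_dir") \
  NUTCH_LOG_DIR="$test_dir/logs" \
    "$NUTCH_HOME/bin/crawl" "$@"
}
```

After sourcing the file, a test crawl would be started by calling
run_test_crawl with the test directory (e.g. .../test1) followed by the usual
bin/crawl arguments.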


That's the easiest way I see so far. Unfortunately, the file names of the
index writers and exchanges configuration files are not themselves
configurable. I've opened
  https://issues.apache.org/jira/browse/NUTCH-2718
to get this resolved.


Best,
Sebastian


On 5/22/19 11:19 AM, Felix von Zadow wrote:
> 
> Hello dear list!
> 
> I have a problem with the new IndexWriter mechanism in 1.15. Hopefully 
> someone can point out to me what I should do differently.
> 
> I have a couple of test systems running different versions of a web 
> application and there is a separate SOLR core for each of them. There is a 
> single VM that crawls and indexes content from scratch for every test system 
> that has been redeployed. So up until 1.14 I would simply specify the target 
> core (solr.server.url) when calling bin/crawl. Say, today I have redeployed 
> test_system_1, so I call bin/crawl to update the SOLR core test_system_1.
> 
> Now with 1.15 I cannot explicitly choose a target index anymore, so I tried 
> the following: In index-writers.xml, I specified an IndexWriter for each of 
> my systems/cores. In order to choose which IndexWriter to use, I specified an 
> exchange for every test system in exchanges.xml. It maps the host name (unique 
> to each test system) to the correct IndexWriter (and therefore the correct 
> core). This leaves me with two problems though:
> 
> 1. I only ever want to index to one specific core during one crawl cycle and 
> I already KNOW its name. However, the Exchange expressions are evaluated for 
> every single document I'm indexing. The expression evaluates fine though, so 
> it "works" and this being a test environment, I could live with it.
> 
> 2. All IndexWriters referenced by ANY of the Exchanges must actually 
> reference existing cores, even when only one of the IndexWriters is ever 
> actually being used. If any of the referenced cores does NOT exist, Nutch 
> will get a 404 for the non-existing core during the indexing phase and break. 
> I assume Nutch checks all referenced IndexWriters before starting indexing 
> just to be sure they are all available.
> 
> Problem #2 is the crux for me since I can't reliably guarantee that all 
> (unrelated) cores are available during a certain crawl (and why should I need 
> to?).
> 
> 
> It's possible that my design is broken or my use case uncommon. But it seems 
> to me that I should be able to somewhat easily achieve what I could with 
> 1.14, i.e. explicitly choose the target core for each call of bin/crawl. A 
> solution would of course be to set up a separate crawling VM for each test 
> system, each with a single IndexWriter. But that can't be the way to go.
> 
> Grateful for any kind of pointer towards a solution!
> 
> Felix
> 
> 
>
