> Putting my index-writers.xml in a separate directory for each test system
> but leaving the rest in a common directory does the trick!

Great! Thanks for the notice!

> Being able to configure the file names would sure be nice but for now I
> don't mind having separate directories.

It's a rather trivial improvement. But we'll do it. :)

On 5/27/19 11:46 AM, Felix von Zadow wrote:
>
> Hi Sebastian!
>
> Thank you for your suggestion and detailed explanation!
>
> Putting my index-writers.xml in a separate directory for each test system
> but leaving the rest in a common directory does the trick!
> Being able to configure the file names would sure be nice but for now I
> don't mind having separate directories.
>
> Felix
>
>> From: Sebastian Nagel
>>
>> Hi Felix,
>>
>> assuming that every test crawl runs on its own, not sharing resources
>> with other test crawls (except the Nutch packages): you may just write
>> a separate index-writers.xml for every test, place it in a separate
>> directory, and point NUTCH_CONF_DIR to this directory. This works only
>> in local mode (assuming that the tests do not run on a Hadoop cluster).
>>
>> This may look like:
>>   .../
>>   |- test1/
>>   |  `- conf/
>>   |     |- index-writers.xml
>>   |     `- regex-urlfilter.txt
>>   |- test2/
>>   |  `- conf/
>>   |     |- index-writers.xml
>>   ...
>>
>> Now you run the test crawls with NUTCH_CONF_DIR as an environment
>> variable:
>>   NUTCH_CONF_DIR=.../test1/conf:$NUTCH_HOME/conf $NUTCH_HOME/bin/crawl
>> and
>>   NUTCH_CONF_DIR=.../test2/conf:$NUTCH_HOME/conf $NUTCH_HOME/bin/crawl
>>
>> Configuration files are then first picked from test1/conf/ (resp.
>> test2/conf/) and, if not found there, from $NUTCH_HOME/conf or from the
>> class path.
>>
>> This also allows testing different URL filter rules etc.
>>
>> You may also set NUTCH_LOG_DIR for each test to log into different
>> hadoop.log files.
>>
>> That's the easiest way I see so far. Unfortunately, the file names
>> themselves are not configurable for index writers and exchanges
>> configuration files. I've opened
>> https://issues.apache.org/jira/browse/NUTCH-2718
>> to get this resolved.
>>
>> Best,
>> Sebastian
>>
>> On 5/22/19 11:19 AM, Felix von Zadow wrote:
>>>
>>> Hello dear list!
>>>
>>> I have a problem with the new IndexWriter mechanism in 1.15. Hopefully
>>> someone can point out to me what I should do differently.
>>>
>>> I have a couple of test systems running different versions of a web
>>> application, and there is a separate Solr core for each of them. There
>>> is a single VM that crawls and indexes content from scratch for every
>>> test system that has been redeployed. So up until 1.14 I would simply
>>> specify the target core (solr.server.url) when calling bin/crawl. Say,
>>> today I have redeployed test_system_1, so I call bin/crawl to update
>>> the Solr core test_system_1.
>>>
>>> Now with 1.15 I cannot explicitly choose a target index anymore, so I
>>> tried the following: in index-writers.xml, I specified an IndexWriter
>>> for each of my systems/cores. In order to choose which IndexWriter to
>>> use, I specified an exchange for every test system in exchanges.xml.
>>> It maps the host name (unique to each test system) to the correct
>>> IndexWriter (and therefore the correct core). This leaves me with two
>>> problems though:
>>>
>>> 1. I only ever want to index to one specific core during one crawl
>>> cycle, and I already KNOW its name. However, the exchange expressions
>>> are evaluated for every single document I'm indexing. The expression
>>> evaluates fine though, so it "works", and this being a test
>>> environment, I could live with it.
>>>
>>> 2. All IndexWriters referenced by ANY of the exchanges must actually
>>> reference existing cores, even when only one of the IndexWriters is
>>> ever actually used. If any of the referenced cores does NOT exist,
>>> Nutch will get a 404 for the non-existing core during the indexing
>>> phase and break. I assume Nutch checks all referenced IndexWriters
>>> before starting indexing just to be sure they are all available.
>>>
>>> Problem #2 is the crux for me since I can't reliably guarantee that
>>> all (unrelated) cores are available during a certain crawl (and why
>>> should I need to?).
>>>
>>> It's possible that my design is broken or my use case uncommon. But it
>>> seems to me that I should be able to somewhat easily achieve what I
>>> could with 1.14, i.e. explicitly choose the target core for each call
>>> of bin/crawl. A solution would of course be to set up a separate
>>> crawling VM for each test system, each with a single IndexWriter. But
>>> that can't be the way to go.
>>>
>>> Grateful for any kind of pointer towards a solution!
>>>
>>> Felix
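For reference, the host-based routing Felix set up would look roughly like the exchanges.xml fragment below. This is a sketch from memory, not verified against a 1.15 install: the exchange and writer ids, the class name, and the expression syntax are all assumptions — check the exchanges.xml shipped in your Nutch distribution's conf/ directory for the exact schema.

```xml
<!-- Sketch only: ids, class name, and expression syntax are assumptions. -->
<exchanges>
  <exchange id="exchange_test1"
            class="org.apache.nutch.exchange.jexl.JexlExchange">
    <writers>
      <!-- route matching documents to the writer for this test system -->
      <writer id="indexer_solr_test1"/>
    </writers>
    <params>
      <!-- evaluated once per indexed document, which is problem #1 above;
           the host name is a hypothetical example -->
      <param name="expr"
             value="doc.getFieldValue('host')=='test1.example.org'"/>
    </params>
  </exchange>
</exchanges>
```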
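The per-test layout Sebastian describes can be sketched as a small shell script. This is only an illustration of the directory structure and the NUTCH_CONF_DIR override, not part of the thread: the base path, the test names, and the empty placeholder index-writers.xml files are assumptions, and the bin/crawl arguments are elided just as in the original mail.

```shell
#!/bin/sh
# Sketch of the per-test configuration layout from the thread above.
# NUTCH_HOME, BASE, and the test names are assumptions; adjust to your setup.
set -e

NUTCH_HOME="${NUTCH_HOME:-/opt/nutch}"
BASE="./nutch-tests"

for t in test1 test2; do
  mkdir -p "$BASE/$t/conf"
  # Each test directory carries only the files that differ per test;
  # everything not found here falls back to $NUTCH_HOME/conf or the class path.
  : > "$BASE/$t/conf/index-writers.xml"  # placeholder; fill with real writer config
done

# Launching a crawl for one test system (commented out, since it requires a
# real Nutch installation). Files in test1/conf shadow the shared ones:
# NUTCH_CONF_DIR="$BASE/test1/conf:$NUTCH_HOME/conf" \
# NUTCH_LOG_DIR="$BASE/test1/logs" \
#   "$NUTCH_HOME/bin/crawl" ...
```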