AW: Nutch 1.15 IndexWriter -- how to explicitly choose one?

Felix von Zadow Mon, 27 May 2019 02:47:36 -0700

Hi Sebastian!

Thank you for your suggestion and detailed explanation!


Putting my index-writers.xml in a separate directory for each test system but 
leaving the rest in a common directory does the trick!
Being able to configure the file names would sure be nice but for now I don't 
mind having separate directories.
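
For anyone finding this thread later, here is a minimal sketch of such a per-system setup as a shell helper. TEST_ROOT, the default paths, the logs/ directory and both function names are assumptions made for illustration and are not part of Nutch; only NUTCH_CONF_DIR, NUTCH_LOG_DIR and bin/crawl come from Sebastian's mail quoted below. As he notes, this applies to local mode only.

```shell
# Sketch only: TEST_ROOT, the default paths and the function names are hypothetical.
TEST_ROOT="${TEST_ROOT:-/srv/nutch-tests}"
NUTCH_HOME="${NUTCH_HOME:-/opt/nutch}"

# Print the configuration search path for one test system:
# the per-system conf directory first, then the shared Nutch conf directory.
conf_dir_for() {
    printf '%s/%s/conf:%s/conf\n' "$TEST_ROOT" "$1" "$NUTCH_HOME"
}

# Run bin/crawl for one test system. The first argument is the system name;
# all remaining arguments are passed to bin/crawl unchanged.
crawl_system() {
    system="$1"
    shift
    NUTCH_CONF_DIR="$(conf_dir_for "$system")" \
    NUTCH_LOG_DIR="$TEST_ROOT/$system/logs" \
        "$NUTCH_HOME/bin/crawl" "$@"
}
```

Calling crawl_system with a system name followed by the usual bin/crawl arguments would then pick up that system's index-writers.xml and fall back to the common directory for everything else.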

Felix

> From: Sebastian Nagel
> 
> Hi Felix,
> 
> assuming that every test crawl runs on its own, without sharing resources
> with other test crawls (except the Nutch packages), you can write a separate
> index-writers.xml for every test, place it in a separate directory, and
> point NUTCH_CONF_DIR to this directory. This works only in local mode
> (assuming that the tests do not run on a Hadoop cluster).
> 
> This may look like:
>  .../
>  |- test1/
>  |  `- conf/
>  |     |- index-writers.xml
>  |     `- regex-urlfilter.txt
>  |- test2/
>  |  `- conf/
>  |     |- index-writers.xml
>  ...
> 
> Now you run the test crawls with NUTCH_CONF_DIR as an environment variable:
>  NUTCH_CONF_DIR=.../test1/conf:$NUTCH_HOME/conf $NUTCH_HOME/bin/crawl
> and
>  NUTCH_CONF_DIR=.../test2/conf:$NUTCH_HOME/conf $NUTCH_HOME/bin/crawl
> 
> Configuration files are then first picked from test1/conf/ (resp. test2/conf/)
> and, if not found there, from $NUTCH_HOME/conf or from the class path.
> 
> This also allows testing different URL filter rules etc.
> 
> You may also set NUTCH_LOG_DIR for each test to log into different
> hadoop.log files.
> 
> 
> That's the easiest way I see so far. Unfortunately, the file names themselves
> are not configurable for the index writer and exchange configuration files.
> I've opened
>   https://issues.apache.org/jira/browse/NUTCH-2718
> to get this resolved.
> 
> 
> Best,
> Sebastian
> 
> 
> On 5/22/19 11:19 AM, Felix von Zadow wrote:
> >
> > Hello dear list!
> >
> > I have a problem with the new IndexWriter mechanism in 1.15. Hopefully
> > someone can point out to me what I should do differently.
> >
> > I have a couple of test systems running different versions of a web
> > application and there is a separate SOLR core for each of them. There is a
> > single VM that crawls and indexes content from scratch for every test
> > system that has been redeployed. So up until 1.14 I would simply specify
> > the target core (solr.server.url) when calling bin/crawl. Say, today I have
> > redeployed test_system_1, so I call bin/crawl to update the SOLR core
> > test_system_1.
> >
> > Now with 1.15 I cannot explicitly choose a target index anymore, so I tried
> > the following: In index-writers.xml, I specified an IndexWriter for each of
> > my systems/cores. In order to choose which IndexWriter to use, I specified
> > an exchange for every test system in exchanges.xml. It maps the host name
> > (unique to each test system) to the correct IndexWriter (and therefore the
> > correct core). This leaves me with two problems though:
> >
> > 1. I only ever want to index to one specific core during one crawl cycle
> > and I already KNOW its name. However, the Exchange expressions are
> > evaluated for every single document I'm indexing. The expression evaluates
> > fine though, so it "works" and this being a test environment, I could live
> > with it.
> >
> > 2. All IndexWriters referenced by ANY of the Exchanges must actually
> > reference existing cores, even when only one of the IndexWriters is ever
> > actually being used. If any of the referenced cores does NOT exist, Nutch
> > will get a 404 for the non-existing core during the indexing phase and
> > break. I assume Nutch checks all referenced IndexWriters before starting
> > indexing just to be sure they are all available.
> >
> > Problem #2 is the crux for me since I can't reliably guarantee that all
> > (unrelated) cores are available during a certain crawl (and why should I
> > need to?).
> >
> >
> > It's possible that my design is broken or my use case uncommon. But it
> > seems to me that I should be able to somewhat easily achieve what I could
> > with 1.14, i.e. explicitly choose the target core for each call of
> > bin/crawl. A solution would of course be to set up a separate crawling VM
> > for each test system, each with a single IndexWriter. But that can't be the
> > way to go.
> >
> > Grateful for any kind of pointer towards a solution!
> >
> > Felix
> >
> >
> >
