Hi,

Nutch loads each configuration file from the Java class path and picks the first
file found there (ignoring any later files with the same name).

If you run multiple crawls with different configurations, just place a
crawl-specific configuration directory at the front of the class path. If you use
bin/nutch and/or bin/crawl, this is done by pointing the environment variable
NUTCH_CONFIG_DIR to that directory. There's also the variable NUTCH_LOG_DIR,
which sets the directory for the full log files (hadoop.log).
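The "first file found on the class path wins" lookup can be sketched as a tiny
shell helper. This is illustrative only, not Nutch code; the function name
resolve_conf and all paths are made up for the demo:

```shell
#!/bin/sh
# Emulate first-match-wins lookup: given a file name and a colon-separated
# directory list (like NUTCH_CONFIG_DIR), print the first match and stop.
resolve_conf() {
  name="$1"
  dirs="$2"
  old_ifs="$IFS"
  IFS=':'
  for d in $dirs; do           # split on ':' like a class path
    if [ -f "$d/$name" ]; then
      IFS="$old_ifs"
      echo "$d/$name"          # first hit wins; later dirs are ignored
      return 0
    fi
  done
  IFS="$old_ifs"
  return 1                     # not found in any directory
}

# Demo with temporary directories standing in for crawlB/conf and $NUTCH_HOME/conf:
tmp="$(mktemp -d)"
mkdir -p "$tmp/crawlB/conf" "$tmp/nutch/conf"
touch "$tmp/crawlB/conf/nutch-site.xml"                               # crawl-specific override
touch "$tmp/nutch/conf/nutch-site.xml" "$tmp/nutch/conf/regex-urlfilter.txt"
path="$tmp/crawlB/conf:$tmp/nutch/conf"
resolve_conf nutch-site.xml "$path"       # the crawlB copy shadows the default
resolve_conf regex-urlfilter.txt "$path"  # falls through to the default dir
rm -rf "$tmp"
```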

Let's say you have

crawlA/
  conf/
    nutch-site.xml
    index-writers.xml
    regex-urlfilter.txt
  logs/
crawlB/
  conf/
    nutch-site.xml
    index-writers.xml
  logs/

and your Nutch installation (binary package or runtime/local when building from 
source):
$NUTCH_HOME/
  bin/...
  conf/
    nutch-site.xml
    index-writers.xml
    regex-urlfilter.txt
    suffix-urlfilter.txt
    ...
  lib/...
  plugins/...

export NUTCH_CONFIG_DIR=$PWD/crawlB/conf:$NUTCH_HOME/conf
export NUTCH_LOG_DIR=$PWD/crawlB/logs

Now you run
 $NUTCH_HOME/bin/nutch
or
 $NUTCH_HOME/bin/crawl

The crawl-specific configuration files are then:
 $PWD/crawlB/conf/nutch-site.xml
 $PWD/crawlB/conf/index-writers.xml
All remaining config files are picked up from
 $NUTCH_HOME/conf
or, if not found there, from the Nutch job jar.
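To the multi-core question quoted below: with this setup, each crawl directory
carries its own index-writers.xml that contains only the writer for its own
Solr core. A hedged sketch of what crawlB/conf/index-writers.xml could look
like (writer id, core name, and URL are placeholders; compare against the
default conf/index-writers.xml shipped with Nutch for the full parameter and
mapping list):

```xml
<writers xmlns="http://lucene.apache.org/nutch"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <writer id="indexer_solr_crawlB"
          class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
    <parameters>
      <param name="type" value="http"/>
      <!-- placeholder core name; point each crawl at its own core -->
      <param name="url" value="http://localhost:8983/solr/coreB"/>
      <param name="commitSize" value="1000"/>
      <param name="auth" value="false"/>
    </parameters>
    <mapping>
      <copy/>
      <rename/>
      <remove/>
    </mapping>
  </writer>
</writers>
```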

Note: this is how it works for a local Nutch installation
(Hadoop's local mode). On Hadoop (distributed mode) the
principles are similar, but you usually want to build a
job-specific job jar anyway.


Best,
Sebastian

On 12/21/18 1:58 PM, hany.n...@hsbc.com wrote:
> Same issue here. What did you do with URL regex & normalization? These 
> configurations might change from one site to another.
> 
> 
> Kind regards, 
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT 
> Corporate Functions | HSBC Operations, Services and Technology (HOST)
> ul. Kapelanka 42A, 30-347 Kraków, Poland
> __________________________________________________________________ 
> 
> Tie line: 7148 7689 4698 
> External: +48 123 42 0698 
> Mobile: +48 723 680 278 
> E-mail: hany.n...@hsbc.com 
> __________________________________________________________________ 
> 
> -----Original Message-----
> From: Lucas Reyes [mailto:tintanca...@gmail.com] 
> Sent: 20 December 2018 22:39
> To: user@nutch.apache.org
> Subject: nutch 1.15 index multiple cores with solr 7.5
> 
> I'm using nutch 1.15 and solr 7.5 with *the need to index multiple cores*.
> I have created separate crawldb and linkdb for each core, and then updated 
> index-writers.xml with multiple solr writers (each writer_id matching 
> corresponding core's name). Also, param name="url" points to each solr core, 
> but since there's no place to pass a param indicating the writer id nor the 
> solr core, the bin/nutch index command indexes a specific crawldb against all 
> cores. Of course, I need to only index crawldb1 to core1, and so on.
> 
> Any suggestion on resolving this?
> 
> Thanks in advance.
> 
> 
