Hi,
On 06/22/2016 02:34 PM, Jigal van Hemert | alterNET internet BV wrote:
Hi,
...
Well, actually you can (with some restrictions). The environment variable
NUTCH_CONF_DIR can point to the directory where the configuration is
located. I have for example cron jobs like this:
Nice! I didn't know I could do it this way.
I'm using a similar directory structure, except that the nutch jobs live in a
directory separate from the nutch software.
I was thinking of copying the needed files into the global conf dir at each run, but I wasn't satisfied with that because
it forces the runs to be serialised.
I'll test this and try to validate concurrent runs.
Thanks to all
José-Marcio
cd /opt/solr-tomcat/nutchdirectory/
export JAVA_HOME=/usr/lib/jvm/jre
export NUTCH_CONF_DIR=/opt/solr-tomcat/nutchdirectory/configurations/core_one
bin/crawl urls/core_one crawls/core_one 127.0.0.1/solr/core_one 3 > /dev/null 2>&1
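The same job could be wrapped in a small shell function so that each cron entry only passes a job name. This is a minimal sketch, not a tested Nutch setup: the function name run_crawl and its three parameters are made up for illustration, and the bin/crawl arguments simply mirror the job quoted above.

```shell
#!/bin/sh
# run_crawl NUTCH_HOME JOB ROUNDS   (hypothetical helper, not part of nutch)
# Runs one crawl with the configuration directory belonging to JOB.
run_crawl() {
    (
        # The parentheses start a subshell, so the cd does not affect the caller.
        cd "$1" || exit 1
        # The VAR=value prefixes set the variables for this one command only.
        JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/jre}" \
        NUTCH_CONF_DIR="$1/configurations/$2" \
            bin/crawl "urls/$2" "crawls/$2" "127.0.0.1/solr/$2" "$3"
    )
}

# Example call, matching the job above:
#   run_crawl /opt/solr-tomcat/nutchdirectory core_one 3 > /dev/null 2>&1
```

Because the variables are passed only to the bin/crawl process, nothing is left behind in the calling shell after the function returns.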
In the directory where nutch is installed I have extra directories urls,
crawls, configurations where each job has separate subdirectories. All
files from the normal conf directory were copied to each of the
configuration directories and customized for each job.
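A layout like this could be created with a few lines of shell. The sketch below is an assumption about how one might do it, not the commands actually used: make_job_dirs is a hypothetical helper, and it copies the stock conf directory only into empty job directories so that existing customizations are not overwritten.

```shell
#!/bin/sh
# make_job_dirs NUTCH_HOME JOB...   (hypothetical helper, not part of nutch)
# Creates urls/JOB, crawls/JOB and configurations/JOB under NUTCH_HOME and
# seeds each new configuration directory with a copy of NUTCH_HOME/conf.
make_job_dirs() {
    nutch_home=$1
    shift
    for job in "$@"; do
        mkdir -p "$nutch_home/urls/$job" \
                 "$nutch_home/crawls/$job" \
                 "$nutch_home/configurations/$job" || return 1
        # Copy only into an empty directory, to keep per-job customizations.
        if [ -z "$(ls -A "$nutch_home/configurations/$job")" ]; then
            cp -R "$nutch_home/conf/." "$nutch_home/configurations/$job/" || return 1
        fi
    done
}

# Example call:
#   make_job_dirs /opt/solr-tomcat/nutchdirectory core_one core_two
```

After that, files such as regex-urlfilter.txt and nutch-site.xml can be edited separately in each configurations/JOB directory.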
The only restriction is that I haven't been able to make sure that the
environment variables of different cron jobs aren't affecting the other
cron jobs. Therefore I make sure they run in sequence.
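For the environment variables themselves, a quick check is possible without nutch: cron starts each entry in its own shell process, and a variable exported in one process is not visible to its siblings. The sketch below imitates two overlapping jobs with subshells; the job names are taken from the example above and the output files are only there for the demonstration. It says nothing about other resources that concurrent crawls might share (temporary directories, for instance), which would still need testing.

```shell
#!/bin/sh
# Two "jobs" run at the same time, each exporting its own NUTCH_CONF_DIR.
out=$(mktemp -d)
unset NUTCH_CONF_DIR

( export NUTCH_CONF_DIR=configurations/core_one
  sleep 1    # still running while the second job sets its own value
  echo "$NUTCH_CONF_DIR" > "$out/core_one" ) &

( export NUTCH_CONF_DIR=configurations/core_two
  echo "$NUTCH_CONF_DIR" > "$out/core_two" ) &

wait
cat "$out/core_one" "$out/core_two"
echo "parent: ${NUTCH_CONF_DIR:-unset}"
```

Each job reports its own directory and the parent shell still has the variable unset, so the exports do not leak between processes.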
With only 4 jobs (for the development, test, acceptance and production
stages of a website) they are all executed during the night, so there is no
problem.
Markus
-----Original message-----
From: Jose-Marcio Martins da Cruz <[email protected]>
Sent: Tuesday 21st June 2016 11:50
To: [email protected]
Subject: nutch 1.12 - different options for each crawldb
Hello,
I'm using nutch 1.12/solr to index the sites of our organisation, and I'd
like to divide them into different classes,
e.g. public and private servers.
This works fine with different crawldb databases, each one with its own
set of seeds.
But I'd like to have different configuration files, e.g.,
regex-urlfilter.txt, nutch-site.xml, ... or, eventually, to have
one "conf" directory per crawldb.
Is this possible and, if so, how can I do it?
Thanks for your help.
Regards
--
Sent from my typewriter.
---------------------------------------------------------------
Spam : Classement statistique de messages électroniques -
Une approche pragmatique
At Amazon.fr: http://amzn.to/LEscRu or http://bit.ly/SpamJM
---------------------------------------------------------------
Jose Marcio MARTINS DA CRUZ http://www.j-chkmail.org
Ecole des Mines de Paris http://bit.ly/SpamJM
60, bd Saint Michel 75272 - PARIS CEDEX 06