I have a single Nutch 2.x install with Solr, and it indexes a group of
sites fine.
Now I have a totally separate set of sites, and want to index these to a
separate Solr core so that searches in one group can't pick up results
from the other.
I see how to use the NUTCH_CONF_DIR environment variable to swap in a
different config for each call to 'crawl', so I can give each group a
different set of filters, and 'crawl' already takes the destination Solr
core as an argument.
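To make that setup concrete, here is a minimal Python sketch of the
per-group invocation. Everything named in it is an assumption for
illustration: the conf directories, seed directories, Solr URLs and
crawl id are placeholders, and the argument order (seed dir, crawl id,
Solr URL, rounds) should be checked against the usage message of your
own bin/crawl.

```python
import os

# Hypothetical per-group settings: every path and URL here is a placeholder.
GROUPS = {
    "group_a": {
        "conf_dir": "/opt/nutch/conf-group-a",
        "seed_dir": "urls/group_a",
        "solr_url": "http://localhost:8983/solr/group_a",
    },
    "group_b": {
        "conf_dir": "/opt/nutch/conf-group-b",
        "seed_dir": "urls/group_b",
        "solr_url": "http://localhost:8983/solr/group_b",
    },
}


def crawl_invocation(group, crawl_id="mycrawl", rounds=2, base_env=None):
    """Return (env, argv) for one 'crawl' call of the given group.

    Nothing is executed here; the caller decides how to run argv.
    """
    cfg = GROUPS[group]
    # Copy the environment so the caller's environment is not modified.
    env = dict(os.environ if base_env is None else base_env)
    # Point Nutch at this group's config directory (its own filter files).
    env["NUTCH_CONF_DIR"] = cfg["conf_dir"]
    # Assumed argument order: seed dir, crawl id, Solr URL, number of rounds.
    argv = ["bin/crawl", cfg["seed_dir"], crawl_id, cfg["solr_url"], str(rounds)]
    return env, argv
```

The pair could then be passed to something like
subprocess.run(argv, env=env, check=True), once per group.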
But I'm still finding (from a faceted search for 'host') that sites from
the other group are entering the Solr index.
I found an old mailing list post that talked about adding "-D
urlfilter.regex.file=regex-urlfilter-index.txt" to the "nutch index"
call in bin/crawl and then putting a regexp list of the hosts that
should be added to Solr into $NUTCH_CONF_DIR/regex-urlfilter-index.txt,
but this doesn't seem to be obeyed (documents that do not match the
expressions are in the Solr index).
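For clarity about what the filter file described above is meant to
express, here is a small Python sketch that emulates the usual
regex-urlfilter semantics: each rule line starts with '+' (accept) or
'-' (reject), the first rule whose pattern is found in the URL decides,
and a URL matching no rule is rejected. The host names and file
contents are hypothetical, and Python's 're' stands in for Java's regex
engine, so this is only a way to sanity-check a rule list, not Nutch's
own implementation.

```python
import re

# Hypothetical contents of regex-urlfilter-index.txt for one group of sites.
EXAMPLE_RULES = r"""
# accept only hosts under site-a.example (placeholder domain)
+^https?://([a-z0-9-]+\.)*site-a\.example/
# reject everything else
-.
"""


def load_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return True if the first matching rule is an accept rule."""
    for accept, pattern in rules:
        if pattern.search(url):  # partial match anywhere in the URL
            return accept
    return False  # no rule matched: treat the URL as rejected


rules = load_rules(EXAMPLE_RULES)
```

With these rules, accepts(rules, url) is the check I would expect the
index step to apply to each document's URL before it is sent to Solr.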
I don't need a separate HBase or something, do I? I'm happy to share the
in/out link data and fetches in HBase between sites, just not the
eventual index.
--
*Tom Chiverton*
Lead Developer