I have a single Nutch 2.x install with Solr, and it indexes a group of sites fine.

Now I have a completely separate set of sites that I want to index into a separate Solr core, so that searches in one group can't pick up results from the other.


I see how to use the NUTCH_CONF_DIR environment variable to swap in a different config for each call to 'crawl', so I can give each group a different set of filters, and 'crawl' already takes the destination Solr core as an argument.


But I'm still finding (from a faceted search for 'host') that sites from the other group are entering the Solr index.
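For reference, the check I'm describing can be sketched roughly as below. This is only an illustration: the core URL and the allowed host list are placeholders, and it assumes the documents carry a 'host' field and that Solr returns facet values in its default flat [value, count, value, count, ...] JSON layout.

```python
from urllib.parse import urlencode


def host_facet_url(core_url):
    """Build a Solr select URL that returns only per-host facet counts."""
    params = {
        "q": "*:*",              # match every document in the core
        "rows": 0,               # no documents needed, only the facet counts
        "facet": "true",
        "facet.field": "host",   # assumes the index has a 'host' field
        "facet.mincount": 1,
        "wt": "json",
    }
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def facet_counts(response, field="host"):
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


def unexpected_hosts(counts, allowed_hosts):
    """Return the indexed hosts that are not in the allowed set."""
    return sorted(host for host in counts if host not in allowed_hosts)
```

Fetching host_facet_url(...) for each core and passing the decoded JSON through facet_counts and unexpected_hosts lists the hosts that have leaked into that core.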


I found an old mailing list post that talked about adding "-D urlfilter.regex.file=regex-urlfilter-index.txt" to the "nutch index" call in bin/crawl, and then putting a regexp list of the hosts that should be added to Solr into $NUTCH_CONF_DIR/regex-urlfilter-index.txt. This doesn't seem to be obeyed, though: documents that do not match the expression are in the Solr index.
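To rule out a mistake in the rules file itself, here is a small sketch of how I understand the regex URL filter to evaluate it: each non-comment line is '+' or '-' followed by a regex, the first rule whose regex is found anywhere in the URL decides, and a URL that matches no rule is rejected. The hosts are placeholders, and Python's regex dialect differs from Java's in places, so this only approximates what Nutch does.

```python
import re

# Hypothetical contents of regex-urlfilter-index.txt; example.org is a placeholder.
EXAMPLE_RULES = r"""
# accept only the hosts that belong in this core
+^https?://(www\.)?example\.org/
# reject everything else
-.
"""


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL matching no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False
```

With these rules, accepts(parse_rules(EXAMPLE_RULES), url) is true for URLs on example.org and false for any other host. If the file behaves as intended here, then presumably the index step is not reading it at all.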


I don't need a separate HBase or something, do I? I'm happy to share the in/out link data and fetches in HBase between sites, just not the eventual index.


--
*Tom Chiverton*
Lead Developer
e:      [email protected] <mailto:[email protected]>
p:      0161 817 2922
t:      @extravision <http://www.twitter.com/extravision>
w:      www.extravision.com <http://www.extravision.com/>

Extravision - email worth seeing <http://www.extravision.com/>
Registered in the UK at: 107 Timber Wharf, 33 Worsley Street, Manchester, M15 4LD.
Company Reg No: 05017214 VAT: GB 824 5386 19

