Have a look at the subcollection plugin. I haven't used it myself, but I
think it does what you need.
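
To make the idea concrete, here is a minimal sketch of deriving a
per-site label from a URL prefix, which is the kind of value you would
want stored in a separate index field. It is not the plugin's code or
configuration: the class name SiteLabeler, the method labelFor, and the
prefix table are all hypothetical, and the URLs and labels are simply
the ones from the message quoted below.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Nutch or Solr.
final class SiteLabeler {
    // URL prefix -> label. The trailing slash matters: without it,
    // ".../site1" would also match ".../site12/...".
    private static final Map<String, String> PREFIX_TO_LABEL = new LinkedHashMap<>();
    static {
        PREFIX_TO_LABEL.put("http://parentsite.mydomain.com/site1/", "Site1");
        // Second URL exactly as written in the quoted message.
        PREFIX_TO_LABEL.put("http://parentsite.mydomain.com/site12/", "Site2");
    }

    // Returns the label of the first matching prefix, or null if none matches.
    static String labelFor(String url) {
        for (Map.Entry<String, String> entry : PREFIX_TO_LABEL.entrySet()) {
            if (url.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }
}
```

The Drupal checkboxes could then filter on the field holding that label
instead of on 'site', which is identical for both sub-sites.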

Julien
-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com

On 26 August 2010 19:03, Nemani, Raj <[email protected]> wrote:

> All,
>
>
>
> I was wondering if I can force a constant value into one of the fields
> defined in Nutch's schema.  Here is the scenario.
>
>
>
> I have two sub-sites that I would like to crawl separately.  Something
> like
>
>
>
> http://parentsite.mydomain.com/site1/index.php
>
>
>
> http://parentsite.mydomain.com/site12/index.php
>
>
>
>
>
> I am sending the results of both crawls to the same Solr/Lucene index.
> The index is used by a Drupal website to provide search results to the
> user.
>
>
>
> The user has checkboxes on the Drupal website to search for either
> site 1 results or site 2 results.
>
>
>
> Here is the problem.  There is no way for me to differentiate between
> site1 and site2 documents in the index.
>
>
>
> One of the schema fields in the Nutch document is called 'site'.
> Ideally this would have been a good field for me to use to
> differentiate between the documents in the index.  But for the
> sub-sites I am crawling, the 'site' field value will be set to
> "parentsite.mydomain.com" because both URLs have the same site value.
>
>
>
> That is the reason for my question.  Can I set the value of the 'site'
> field to "Site1" for the site 1 URL crawl and to "Site2" for the
> site 2 URL crawl?
>
>
>
> Hope I have explained the scenario clearly.  If what I am thinking is
> not possible, can I achieve my ultimate objective in some other way?
>
>
>
> Thanks so much in advance
>
> Raj
