All,

 

I was wondering if I can force a constant value into one of the fields
defined in Nutch's schema.  Here is the scenario.

 

I have two sub-sites that I would like to crawl separately, something
like:

 

http://parentsite.mydomain.com/site1/index.php

 

http://parentsite.mydomain.com/site2/index.php

 

 

I am sending the results of both crawls to the same Solr/Lucene index.
The index is used by a Drupal website to provide search results to the
user.

 

The user has checkboxes on the Drupal website to restrict the search to
either Site1 results or Site2 results.

 

Here is the problem.  There is no way for me to differentiate between
site1 and site2 documents in the index.

 

One of the schema fields populated from the Nutch document is called
'site'.  Ideally this would have been a good field for me to use to
differentiate between the documents in the index.  But for the sub-sites
I am crawling, the 'site' field will be set to "parentsite.mydomain.com"
in both cases, because both URLs share the same host name.

 

That is the reason I am asking this question.  Can I set the value of the
'site' field to "Site1" for the Site1 crawl and "Site2" for the Site2 crawl?
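
To make the goal concrete, here is a rough sketch of the mapping I have
in mind.  It is plain Python and not Nutch code; the function name
site_label and the SITE_LABELS table are placeholders I made up for
illustration, and I am assuming the first path segment of the URL is
enough to tell the sub-sites apart.

```python
from urllib.parse import urlparse

# Hypothetical table: first URL path segment -> label wanted in the 'site' field.
SITE_LABELS = {
    "site1": "Site1",
    "site2": "Site2",
}


def site_label(url, default=None):
    """Return the label for the URL's first path segment, or default if unknown."""
    path = urlparse(url).path
    # Drop empty pieces produced by leading or doubled slashes.
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        return default
    # Exact match on the segment, so "site12" is not mistaken for "site1".
    return SITE_LABELS.get(segments[0], default)


if __name__ == "__main__":
    print(site_label("http://parentsite.mydomain.com/site1/index.php"))
```

Whatever the mechanism on the Nutch side turns out to be, the value
ending up in the index should behave like the return value above.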

 

I hope I have explained the scenario clearly.  If what I am describing is
not possible, can I achieve the same objective in some other way?

 

Thanks so much in advance

Raj

 

 
