All,

I was able to find the steps to set this plugin up, so I am good there. I do have one question. I am running Nutch 1.1 and I believe I have set everything up correctly. I can see the Subcollection plugin getting registered in hadoop.log, but I cannot find the "subcollection" field in the index (viewed using Luke).
Based on some of the emails in the list archives, there are no known problems with this plugin in 1.1. I will include my subcollections.xml and the plugin.includes (from nutch-site.xml) below. But my question is: is there any special trick to enabling logging for plugins? This is what I added to log4j.properties to turn on logging for the subcollection plugin classes:

log4j.logger.org.apache.nutch.collection.CollectionManager=INFO,cmdstdout
log4j.logger.org.apache.nutch.searcher.subcollection.SubcollectionQueryFilter=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter=INFO,cmdstdout

I even tried DRFA in place of cmdstdout, hoping that I would see the log statements from these classes in hadoop.log, but nothing seems to work. Other classes set up similarly (as shown below) seem to work fine and produce log statements in cmdstdout.

I am a .NET dev and have used log4net, so I could be missing something with log4j.

-----Original Message-----
From: Nemani, Raj [mailto:[email protected]]
Sent: Friday, August 27, 2010 4:14 PM
To: [email protected]
Subject: RE: Setting the Nutchschema field to a constant value

Thank you Julien. I was trying to look for some documentation on how to set this plugin up. Can anybody point me to a link where the setup is documented? I appreciate your help.

Raj

-----Original Message-----
From: Julien Nioche [mailto:[email protected]]
Sent: Friday, August 27, 2010 4:42 AM
To: [email protected]
Subject: Re: Setting the Nutchschema field to a constant value

Have a look at the subcollection plugin - I haven't used it myself but I think it does what you need.

Julien

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com

On 26 August 2010 19:03, Nemani, Raj <[email protected]> wrote:

> All,
>
> I was wondering if I can force a constant value into one of the fields
> defined in Nutch's schema. Here is the scenario.
>
> I have two sub-sites that I would like to crawl separately. Something
> like
>
> http://parentsite.mydomain.com/site1/index.php
>
> http://parentsite.mydomain.com/site12/index.php
>
> I am sending the results of the crawl to the same Solr/Lucene index.
> The index is used by a Drupal website to provide search results to the
> user.
>
> The user has checkboxes on the Drupal website to search for either site 1
> search results or site 2 search results.
>
> Here is the problem. There is no way for me to differentiate between
> site 1 and site 2 documents in the index.
>
> One of the schema fields generated by the Nutch document is called
> 'site'. Ideally this should have been a good field for me to use to
> differentiate between the documents in the index. But for the sub-sites
> I am crawling, the 'site' field value will be set to
> "parentsite.mydomain.com" because both the URLs have the same site value.
>
> That is the reason for my question. Can I set the value of the 'site'
> field to "Site1" for the site 1 URL crawl and "Site2" for the site 2 URL
> crawl?
>
> Hope I have explained the scenario clearly. If what I am thinking is
> not possible, then can I achieve my ultimate objective in any other way?
>
> Thanks so much in advance
>
> Raj

