Hi Magnus

> -----Original Message-----
> From: Magnús Skúlason [mailto:[email protected]]
> Sent: Wednesday, 13 June 2012 1:57 AM
> To: [email protected]
> Subject: focused crawl extended with user generated content
>
> Hi,
>
> I am using nutch for a focused crawl vertical search engine, so far I
> am only extracting information to be stored in the index in the crawl
> process. However I would like to allow users to edit and extend the
> content showed on my site. Like adding a better description, adding
> tags and sorting items into categories.
>
> What would be the best approach to do that? If I simply store the
> additional information in the index what happens next time when a page
> is re indexed? Would the user generated content be overwritten?

If you store your additional information as extra fields that you add to Nutch documents before sending them to Solr, then yes, this content will be overwritten. You can instead store it separately from your Nutch documents, even in the same Solr index. Then it will not be overwritten by Nutch, but it will be less trivial to search and retrieve together with the Nutch index entries.

> If so what would be the best way to prevent that? creating a solr plugin
> (that would not re index documents that have been modified externally)
> or should I maybe store the user generated content in a database
> instead and flash the index with the information from the database
> after each crawl if changed? Something completely different?

Should you decide to add your extra information to Nutch documents, you can do it in a Nutch indexing filter plugin. You will have to add it each time you re-index your documents. To do that, you can either maintain it separately in a database (which could be the same Solr index, just with different IDs), or get it from the old document in Solr that is about to be replaced and copy it to the new document. Which option is best depends on what you are trying to achieve.

> Are there already some plugins for nutch or solr to do something like
> this?

AFAIK, there are none that do exactly this, but the index-more plugin will give you an example of how to add extra fields. You will also have to extend the Solr schema (see schema.xml) and the Nutch->Solr mapping (see solrindex-mapping.xml).

Regards,

Arkadi

>
> Any thoughts and / or best practices on this would be greatly
> appreciated :)
>
> best regards,
> Magnus
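
P.S. To make the "copy from the old document" option above a bit more concrete, here is a rough sketch of the merge step only. It is plain Python with dictionaries standing in for documents, not Nutch or Solr API. The field names (user_description, user_tags, user_category) and the function merge_user_fields are hypothetical names made up for illustration. In a real setup this logic would probably live in a Nutch indexing filter plugin, and old_doc would come from a lookup against Solr or your database.

```python
# Hypothetical user-editable fields; these names are assumptions,
# not part of any Nutch or Solr schema.
USER_FIELDS = ("user_description", "user_tags", "user_category")


def merge_user_fields(new_doc, old_doc, user_fields=USER_FIELDS):
    """Return a copy of new_doc with user-edited fields carried over from old_doc.

    new_doc: dict built from the fresh crawl (would otherwise overwrite old_doc).
    old_doc: dict for the currently indexed document, or None if there is none.
    """
    merged = dict(new_doc)  # do not mutate the freshly crawled document
    if old_doc is None:
        return merged
    for field in user_fields:
        if field in old_doc:
            # User content wins for these fields; crawled fields are untouched.
            merged[field] = old_doc[field]
    return merged


# Example: the page is re-crawled, and the user's tags survive re-indexing.
old = {
    "id": "http://example.com/item/1",
    "content": "old crawl text",
    "user_tags": ["boats", "sale"],
}
new = {"id": "http://example.com/item/1", "content": "fresh crawl text"}
merged = merge_user_fields(new, old)
```

Whether you read the old values from Solr or from a separate database, the merge itself stays the same; only the source of old_doc changes.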

