Hi, thanks for the replies.

"or get it from the old document in Solr about to be replaced, and copy to the new document."
--> This sounds a lot like something I would want to do. What would be the best way of doing this? Writing a Solr updateRequestProcessorChain plugin (or some other type of plugin), maybe? It would check whether there is already a similar document with fields that should be preserved and, if so, copy them over to the new document.

Does Solr even support index filters like this? I am not that familiar with writing Solr extensions. Would it be very resource-expensive on the Solr side?

I would rather do this than query for similar documents on the Nutch side before indexing them. That sounds like a lot of overhead, especially since I expect only a minority of documents to have this kind of stored information.

best regards,
Magnus

On Wed, Jun 13, 2012 at 1:00 AM, <[email protected]> wrote:
> Hi Magnus
>
>> -----Original Message-----
>> From: Magnús Skúlason [mailto:[email protected]]
>> Sent: Wednesday, 13 June 2012 1:57 AM
>> To: [email protected]
>> Subject: focused crawl extended with user generated content
>>
>> Hi,
>>
>> I am using Nutch for a focused-crawl vertical search engine. So far I
>> am only extracting information to be stored in the index in the crawl
>> process. However, I would like to allow users to edit and extend the
>> content shown on my site, like adding a better description, adding
>> tags and sorting items into categories.
>>
>> What would be the best approach to do that? If I simply store the
>> additional information in the index, what happens the next time a page
>> is re-indexed? Would the user-generated content be overwritten?
>
> If you store your additional information as extra fields that you add to
> Nutch documents before sending them to Solr, yes, this content will be
> overwritten. You can store it separately from your Nutch document, even in
> the same Solr index. Then it will not be overwritten by Nutch, but will be
> less trivial to search and retrieve together with Nutch index entries.
>
>> If so, what would be the best way to prevent that? Creating a Solr plugin
>> (that would not re-index documents that have been modified externally),
>> or should I maybe store the user-generated content in a database
>> instead and flash the index with the information from the database
>> after each crawl if changed? Something completely different?
>
> Should you decide to add your extra information to Nutch documents, you can
> do it in a Nutch index filter plugin. You will have to add it each time you
> re-index your documents. To do that, you can either maintain it separately in
> a database (including the same Solr index, just different Ids), or get it from
> the old document in Solr about to be replaced, and copy to the new document.
>
> What exactly is optimal to do depends on what you are trying to achieve.
>
>> Are there already some plugins for Nutch or Solr to do something like
>> this?
>
> AFAIK, there are none to do exactly this, but the index-more plugin will give
> you an example of how to add extra fields. You will also have to extend the Solr
> schema (see schema.xml) and the Nutch->Solr mapping (see solrindex-mapping.xml).
>
> Regards,
>
> Arkadi
>>
>> Any thoughts and / or best practices on this would be greatly
>> appreciated :)
>>
>> best regards,
>> Magnus
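
For reference, here is a minimal sketch of the "copy from the old document" idea discussed in this thread. It is only an illustration under stated assumptions, not a working Solr plugin: documents are modelled as plain Java maps, and the class name, method name and the field names (user_description, user_tags, user_category) are all hypothetical. In a real Solr extension, logic like this would presumably sit in a custom UpdateRequestProcessor that looks up the existing document by its unique key before the new one is added; that lookup is not shown here.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class PreserveFieldsSketch {

    // Hypothetical names of user-edited fields; not taken from any real schema.
    static final Set<String> PRESERVED =
            Set.of("user_description", "user_tags", "user_category");

    /**
     * Returns a copy of newDoc in which every preserved field that is present
     * in oldDoc and absent from newDoc has been carried over.
     * oldDoc may be null (the page is being indexed for the first time).
     */
    static Map<String, Object> mergePreserved(Map<String, Object> oldDoc,
                                              Map<String, Object> newDoc,
                                              Set<String> preserved) {
        Map<String, Object> merged = new LinkedHashMap<>(newDoc);
        if (oldDoc == null) {
            return merged; // nothing to preserve
        }
        for (String field : preserved) {
            Object oldValue = oldDoc.get(field);
            // Only copy when the crawler did not supply its own value.
            if (oldValue != null && !merged.containsKey(field)) {
                merged.put(field, oldValue);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Document currently in the index, with a user-added field.
        Map<String, Object> oldDoc = new HashMap<>();
        oldDoc.put("id", "http://example.com/page");
        oldDoc.put("content", "old crawled text");
        oldDoc.put("user_tags", "solr,nutch");

        // Freshly crawled document about to replace it.
        Map<String, Object> newDoc = new HashMap<>();
        newDoc.put("id", "http://example.com/page");
        newDoc.put("content", "new crawled text");

        Map<String, Object> merged = mergePreserved(oldDoc, newDoc, PRESERVED);
        System.out.println(merged);
    }
}
```

The merge itself is cheap; the cost that matters is the extra lookup of the old document on every add, which is paid even when, as expected here, only a minority of documents carry user-generated fields.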

