Hi Magnus

> -----Original Message-----
> From: Magnús Skúlason [mailto:[email protected]]
> Sent: Wednesday, 13 June 2012 1:57 AM
> To: [email protected]
> Subject: focused crawl extended with user generated content
> 
> Hi,
> 
> I am using nutch for a focused crawl vertical search engine, so far I
> am only extracting information to be stored in the index in the crawl
> process. However I would like to allow users to edit and extend the
> content shown on my site. Like adding a better description, adding
> tags and sorting items into categories.
> 
> What would be the best approach to do that? If I simply store the
> additional information in the index what happens next time when a page
> is re indexed? Would the user generated content be overwritten?

If you store the additional information as extra fields added to the Nutch
documents before they are sent to Solr, then yes, it will be overwritten on
re-indexing. You can instead store it separately from the Nutch documents,
even in the same Solr index. Nutch will then not overwrite it, but it will be
harder to search and retrieve together with the Nutch index entries.
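
To illustrate the "stored separately" option, here is a minimal sketch in plain Java. It does not call Solr at all: the "ugc:" ID prefix, the class name and the in-memory map are all hypothetical, and it assumes pages are keyed by their URL. It only shows the extra lookup step that a separate companion document forces on you at display time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CompanionDocs {

    // Hypothetical prefix that keeps user-content IDs apart from page IDs,
    // so a re-crawl of the page never touches the companion entry.
    static final String PREFIX = "ugc:";

    static String companionId(String pageUrl) {
        return PREFIX + pageUrl;
    }

    // For every page hit, look up the companion entry (if any) and
    // append its description to the line shown to the user.
    static List<String> render(List<String> hitUrls, Map<String, String> companions) {
        List<String> lines = new ArrayList<>();
        for (String url : hitUrls) {
            String description = companions.get(companionId(url));
            lines.add(description == null ? url : url + " | " + description);
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, String> companions = new HashMap<>();
        companions.put(companionId("http://example.com/a"), "Better description");

        List<String> hits = new ArrayList<>();
        hits.add("http://example.com/a");
        hits.add("http://example.com/b");

        for (String line : render(hits, companions)) {
            System.out.println(line);
        }
    }
}
```

The cost is visible in render(): searching on the user fields and on the crawled fields in one query is no longer automatic, because they live in different entries.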

> If so what would be the best way to prevent that? creating a solr plugin
> (that would not re index documents that have been modified externally)
> or should I maybe store the user generated content in a database
> instead and flash the index with the information from the database
> after each crawl if changed? Something completely different?

Should you decide to add your extra information to the Nutch documents, you can
do it in a Nutch indexing filter plugin. You will have to add it each time the
documents are re-indexed. To do that, you can either maintain it separately in a
database (or in the same Solr index, under different IDs), or read it from the
old Solr document that is about to be replaced and copy it to the new document.
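
The merge step itself can be sketched as follows. This is plain Java with no Nutch or Solr dependency, so the class, the map-based "document" and the "user_" field prefix are all hypothetical. In a real Nutch 1.x plugin the same logic would live in the filter() method of an IndexingFilter and add the fields to the NutchDocument it receives.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UserContentMerge {

    // Returns a copy of the freshly crawled fields with the user fields for
    // this URL added. The store stands in for a database (or the old Solr
    // document); it is keyed by page URL, which is an assumption.
    static Map<String, List<String>> merge(
            String url,
            Map<String, List<String>> crawled,
            Map<String, Map<String, List<String>>> store) {

        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : crawled.entrySet()) {
            merged.put(e.getKey(), new ArrayList<>(e.getValue()));
        }

        Map<String, List<String>> userFields = store.get(url);
        if (userFields != null) {
            for (Map.Entry<String, List<String>> e : userFields.entrySet()) {
                // Hypothetical "user_" prefix so user data never clobbers
                // a field produced by the crawl.
                merged.computeIfAbsent("user_" + e.getKey(), k -> new ArrayList<>())
                      .addAll(e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<String>> crawled = new LinkedHashMap<>();
        crawled.put("title", List.of("Crawled title"));

        Map<String, List<String>> userFields = new LinkedHashMap<>();
        userFields.put("tags", List.of("cars", "used"));

        Map<String, Map<String, List<String>>> store = new LinkedHashMap<>();
        store.put("http://example.com/item/1", userFields);

        System.out.println(merge("http://example.com/item/1", crawled, store));
    }
}
```

Because the merge runs on every re-index, the user content survives as long as the store does, whichever store you pick.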

What exactly is optimal to do depends on what you are trying to achieve. 

> Are there already some plugins for nutch or solr to do something like
> this?

AFAIK, there are none that do exactly this, but the index-more plugin is an
example of how to add extra fields. You will also have to extend the Solr
schema (see schema.xml) and the Nutch->Solr mapping (see solrindex-mapping.xml).

Regards,

Arkadi
> 
> Any thoughts and / or best practices on this would be greatly
> appreciated :)
> 
> best regards,
> Magnus
