Hi,

Thanks for the replies.

"or get it from the old document in Solr about to be replaced, and
copy to the new document."
--> this sounds a lot like something I would want to do. What would be
the best way of doing this? Writing a Solr updateRequestProcessorChain
(or some other type) plugin, maybe? It would check if there is already
a similar document with fields that should be preserved and, if so, copy
them over to the new document. Does Solr even support index filters
like this? I am not that familiar with writing Solr extensions. Would
it be very resource-expensive on the Solr side?
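
To make the idea concrete, here is a rough sketch of the merge logic I
have in mind. It is plain Python, not real Solr plugin code: a dict
stands in for the index, and the field names and the
merge_preserved_fields helper are made up for illustration.

```python
# Hypothetical field names; in a real setup these would be the
# user-editable fields defined in schema.xml.
PRESERVED_FIELDS = ("user_description", "user_tags", "user_category")


def merge_preserved_fields(new_doc, old_doc, preserved=PRESERVED_FIELDS):
    """Return a copy of new_doc with preserved fields carried over from old_doc.

    old_doc is the document currently in the index with the same id,
    or None if the page has not been indexed before.
    """
    merged = dict(new_doc)
    if old_doc is None:
        return merged
    for field in preserved:
        # Copy only when the old document has the field and the
        # freshly crawled document does not set it itself.
        if field in old_doc and field not in merged:
            merged[field] = old_doc[field]
    return merged


# A dict stands in for the Solr index, keyed by document id.
index = {
    "http://example.com/a": {
        "id": "http://example.com/a",
        "content": "old crawl text",
        "user_tags": ["hiking", "iceland"],
    }
}

# A freshly crawled version of the same page, without user content.
crawled = {"id": "http://example.com/a", "content": "new crawl text"}
result = merge_preserved_fields(crawled, index.get(crawled["id"]))
index[crawled["id"]] = result
```

With this approach the extra cost per update would be one lookup of
the old document by id, plus copying a handful of fields when it
exists.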

I would rather do this than query for similar documents on the Nutch
side before indexing them. That sounds like a lot of overhead, especially
since I expect only a minority of documents to have this kind of stored
information.

best regards,
Magnus

On Wed, Jun 13, 2012 at 1:00 AM,  <[email protected]> wrote:
> Hi Magnus
>
>> -----Original Message-----
>> From: Magnús Skúlason [mailto:[email protected]]
>> Sent: Wednesday, 13 June 2012 1:57 AM
>> To: [email protected]
>> Subject: focused crawl extended with user generated content
>>
>> Hi,
>>
>> I am using Nutch for a focused-crawl vertical search engine. So far I
>> am only extracting information to be stored in the index in the crawl
>> process. However, I would like to allow users to edit and extend the
>> content shown on my site, for example by adding a better description,
>> adding tags and sorting items into categories.
>>
>> What would be the best approach to do that? If I simply store the
>> additional information in the index, what happens the next time a page
>> is re-indexed? Would the user generated content be overwritten?
>
> If you store your additional information as extra fields that you add to 
> Nutch documents before sending them to Solr, yes, this content will be 
> overwritten. You can store it separately from your Nutch document, even in 
> the same Solr index. Then it will not be overwritten by Nutch, but will be 
> less trivial to search and retrieve together with Nutch index entries.
>
>> If so, what would be the best way to prevent that? Creating a Solr plugin
>> (that would not re-index documents that have been modified externally),
>> or should I maybe store the user generated content in a database
>> instead and flash the index with the information from the database
>> after each crawl if changed? Something completely different?
>
> Should you decide to add your extra information to Nutch documents, you can
> do it in a Nutch index filter plugin. You will have to add it each time you
> re-index your documents. To do that, you can either maintain it separately in
> a database (including the same Solr index, just with different IDs), or get it
> from the old document in Solr about to be replaced, and copy to the new document.
>
> What exactly is optimal to do depends on what you are trying to achieve.
>
>> Are there already some plugins for nutch or solr to do something like
>> this?
>
> AFAIK, there are none to do exactly this, but the index-more plugin will give
> you an example of how to add extra fields. You will also have to extend the Solr
> schema (see schema.xml) and the Nutch->Solr mapping (see solrindex-mapping.xml).
>
> Regards,
>
> Arkadi
>>
>> Any thoughts and / or best practices on this would be greatly
>> appreciated :)
>>
>> best regards,
>> Magnus
