Sure, no problem. We’re using Solr 3.6, so we wrote a custom UpdateRequestProcessor. This extension point, provided by Solr, lets you plug custom logic into the process of ingesting your data (documents) into Solr.
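
To make the idea concrete, here is a minimal sketch in plain Java of the logic we plug in: look up the already indexed document by its uniqueKey, read the stored counter, and set the incremented value on the incoming document. This is not our plugin's code and it uses no Solr classes. The class name CounterMergeSketch, the field names "id" and "count", and the in-memory map standing in for the stored fields of the index are all assumptions made for illustration. In a real plugin this logic would live in the processAdd(AddUpdateCommand) method of an UpdateRequestProcessor subclass.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch only: no Solr classes are used and all names are hypothetical.
class CounterMergeSketch {
    static final String UNIQUE_KEY = "id";   // assumed uniqueKey field name
    static final String COUNTER = "count";   // assumed stored counter field

    // Stand-in for the stored fields of the index, keyed by uniqueKey value.
    private final Map<String, Map<String, Object>> index = new HashMap<>();

    // Mirrors what the processor does for each incoming document.
    Map<String, Object> processAdd(Map<String, Object> incoming) {
        String key = String.valueOf(incoming.get(UNIQUE_KEY));

        // 1. "Query" the index by uniqueKey and read the stored counter, if any.
        Map<String, Object> existing = index.get(key);
        long previous = 0L;
        if (existing != null && existing.get(COUNTER) instanceof Number) {
            previous = ((Number) existing.get(COUNTER)).longValue();
        }

        // 2. Set the counter on a copy of the new document from the stored value.
        Map<String, Object> merged = new HashMap<>(incoming);
        merged.put(COUNTER, previous + 1);

        // 3. In Solr the old document is then replaced (delete + add under the
        //    hood); here we simply overwrite the map entry.
        index.put(key, merged);
        return merged;
    }

    public static void main(String[] args) {
        CounterMergeSketch sketch = new CounterMergeSketch();
        Map<String, Object> doc = new HashMap<>();
        doc.put(UNIQUE_KEY, "http://example.com/");
        sketch.processAdd(doc);                              // first indexing
        Map<String, Object> second = sketch.processAdd(doc); // same URL again
        System.out.println(second.get(COUNTER));             // prints 2
    }
}
```

The incoming document does not need to carry the counter at all, since the value is always derived from what is already stored for that key.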

Basically, this plugin fires a query against the index using the uniqueKey field, reads the stored values from the matching document, and uses them to update the values of the new document that is going into the index. Say you’re incrementing a counter: you send a new document with or without the counter field, the plugin uses the uniqueKey field to query the index in Solr, retrieves the stored value of the count field, and adds or sets the count field of the new document accordingly. Of course, this by itself would just add a new document to the index, but combined with the dedup component, the document gets updated.

The inverted index in Solr does not keep the original value of a field unless you define that field as stored. In our case we do, which is why the plugin can retrieve the original value (say, a counter) and increment it.

Of course, this is working now with Solr 3.6. Solr 4 has a few new features, including scripts that are evaluated at index/query time and let you plug in your logic in a very similar way. I haven’t tested this approach yet, but I suppose it could work (I also don’t know whether these scripts can retrieve data from the index, i.e. fire a query).

One more thing: this approach is not SolrCloud ready. I suppose it could work, but my only concern is latency: if you need to request data from the whole cluster for each document, this could become a bottleneck.

I don’t know your specific requirements, but I hope this helps. If you share more of your architecture, I’m sure more people could help.

Regards,

On Jul 7, 2014, at 8:45 AM, Ali Nazemian <[email protected]> wrote:

> Dear Jorge,
> Hi,
> Could you please tell me more about this solr plugin? Do you have that?
> Regards.
>
> On Wed, Jul 2, 2014 at 9:44 AM, Jorge Luis Betancourt Gonzalez <[email protected]> wrote:
>
>> Some time ago, for a very particular use case, we abstracted this
>> responsibility into a custom Solr plugin for a few stored fields. It would
>> handle this case (not just updating a date field, but also keeping a
>> counter of how many times a URL is indexed). Of course you need stored
>> fields for this, and under the hood a document still gets deleted and added.
>>
>> On Jul 1, 2014, at 9:54 AM, Markus Jelsma <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> NutchIndexAction is indeed prepared to handle updates, but the methods
>>> are not implemented. In the case of Solr, it still does an internal
>>> add/delete for updated documents, and to do so you must have all fields
>>> stored="true". So in almost all cases it is more efficient not to store
>>> all fields and to send some additional data over the wire. You can
>>> implement it though.
>>>
>>> Markus
>>>
>>> -----Original message-----
>>>> From: Ali Nazemian <[email protected]>
>>>> Sent: Tuesday 1st July 2014 15:31
>>>> To: [email protected]
>>>> Subject: Changing nutch for update documents instead of add new ones
>>>>
>>>> Dears,
>>>> Hi,
>>>> I am going to make some changes to Nutch's default behavior. I want to
>>>> change the Nutch Solr index (indexWriter class) so that, instead of
>>>> adding new documents to Solr, old documents are updated. I saw an
>>>> "update" method inside this class. Is it implemented for this purpose?
>>>> If not, what is the purpose of this method? Another question: would
>>>> doing this (changing indexWriter to update documents instead of adding
>>>> them) affect my performance for whole-web crawling?
>>>> Best regards.
>>>>
>>>> --
>>>> A.Nazemian
>>
>> VII International Summer School at UCI, June 30 to July 11, 2014. See www.uci.cu
>
> --
> A.Nazemian

