Dear Jorge,

Thank you very much for your kind reply. Unfortunately, I am using Solr in distributed mode (SolrCloud with multiple shards), so I probably cannot use your solution. Because of the sharding, I am also unable to use a cross-document/cross-core join, unless I go with a join from the very beginning. I have already posted a question about my problem on Stack Overflow. Would you please take a look at it to understand the application and the main problem?
http://stackoverflow.com/questions/24608766/solr-overwrite-on-just-some-fields-with-duplicate-uniquekeys

Best regards.

On Mon, Jul 7, 2014 at 7:58 PM, Jorge Luis Betancourt Gonzalez <[email protected]> wrote:

> Sure, no problem.
>
> We’re using Solr 3.6, so we updated a custom UpdateRequestProcessor. This
> extension point provided by Solr allows you to plug custom logic into the
> ingestion process of your data (documents) into Solr.
>
> Basically, this plugin fires a query against the index using the uniqueKey
> field, then reads the values from the matching document and uses them to
> update the values of the new document that is going into the index. So,
> let’s say you’re incrementing a counter: you send a new document with or
> without the counter field, the plugin uses the uniqueKey field to fire a
> query against the inverted index in Solr, retrieves the stored value of the
> count field, and adds/sets the count field of the new document accordingly.
> Of course, by itself this would just add a new document to the index, but
> combined with the dedup component, the document gets updated.
>
> The inverted index in Solr does not know the original value of any field;
> instead, you define the field in question as stored (which in our case we
> do, and this is why the plugin can retrieve the original value, let’s say
> it is a counter, and increment it).
>
> Of course, this is working now with Solr 3.6. In Solr 4 there are a few new
> features, including scripts that get evaluated at index/query time, which
> allow you to plug in your logic in a very similar way. I haven’t tested this
> approach yet, but I suppose it could work (I also don’t know if you can
> retrieve data from the index with these scripts, i.e. fire a query).
> One more thing: this approach is not SolrCloud-ready. I suppose it could
> work, but my only concern is latency: if you need to request data from the
> whole cluster for each document, this could become a bottleneck.
>
> I don’t know your specific requirements, but I hope this helps. If you share
> more of your architecture, I’m sure more people could help.
>
> Regards,
>
> On Jul 7, 2014, at 8:45 AM, Ali Nazemian <[email protected]> wrote:
>
> > Dear Jorge,
> > Hi,
> > Could you please tell me more about this Solr plugin? Do you have it?
> > Regards.
> >
> > On Wed, Jul 2, 2014 at 9:44 AM, Jorge Luis Betancourt Gonzalez <[email protected]> wrote:
> >
> >> Some time ago, for a very particular use case, we abstracted this
> >> responsibility into a custom Solr plugin for a few stored fields. It
> >> would handle this case (not just updating a date field, but also keeping
> >> a counter of how many times a URL is indexed). Of course, you need stored
> >> fields for this, and under the hood a document still gets deleted and
> >> added.
> >>
> >> On Jul 1, 2014, at 9:54 AM, Markus Jelsma <[email protected]> wrote:
> >>
> >>> Hi,
> >>>
> >>> NutchIndexAction is indeed prepared to handle updates, but the methods
> >>> are not implemented. In the case of Solr, it still does an internal
> >>> add/delete for updated documents, and to do so you must have all fields
> >>> stored="true". So in almost all cases it is more efficient not to store
> >>> all fields and to send some additional data over the wire. You can
> >>> implement it, though.
> >>>
> >>> Markus
> >>>
> >>> -----Original message-----
> >>>> From: Ali Nazemian <[email protected]>
> >>>> Sent: Tuesday 1st July 2014 15:31
> >>>> To: [email protected]
> >>>> Subject: Changing nutch for update documents instead of add new ones
> >>>>
> >>>> Dears,
> >>>> Hi,
> >>>> I am going to make some changes to the default Nutch behavior.
> >>>> I want to change the Nutch Solr indexer (the indexWriter class) so that,
> >>>> instead of adding new documents to Solr, old documents are updated. I
> >>>> saw an "update" method inside this class. Is it implemented for this
> >>>> purpose? If not, what is the purpose of this method? Another question:
> >>>> would such a change (making indexWriter update documents instead of
> >>>> adding them) affect performance for whole-web crawling?
> >>>> Best regards.
> >>>>
> >>>> --
> >>>> A.Nazemian
> >>>>
> >>
> >> VII Escuela Internacional de Verano (7th International Summer School) at
> >> UCI, June 30 to July 11, 2014. See www.uci.cu
> >
> > --
> > A.Nazemian
>
--
A.Nazemian
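
For readers following the thread, the read-merge-write flow Jorge describes (look up the existing document by uniqueKey, read its stored values, carry them into the incoming document, then overwrite) can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual plugin and not Solr API: the `index` dict stands in for the Solr index, and the names `merge_with_stored`, `process_add`, `keep_fields`, `count`, `first_seen`, and `fetched` are all hypothetical.

```python
# Minimal simulation of a "read stored values, merge, overwrite" update step.
# ASSUMPTION: a plain dict (uniqueKey -> stored fields) stands in for the
# Solr index; a real plugin would query Solr for the stored document instead.


def merge_with_stored(new_doc, stored_doc, counter_field="count", keep_fields=()):
    """Return a copy of new_doc enriched with values from the stored document.

    - The counter is set to the stored counter + 1 (or 1 for a new document).
    - Fields listed in keep_fields keep their stored value if one exists.
    """
    merged = dict(new_doc)
    previous_count = 0
    if stored_doc is not None:
        previous_count = stored_doc.get(counter_field, 0)
        for name in keep_fields:
            if name in stored_doc:
                merged[name] = stored_doc[name]
    merged[counter_field] = previous_count + 1
    return merged


def process_add(index, new_doc, unique_key="id", keep_fields=()):
    """Look up the existing document by unique key, merge, and overwrite it."""
    stored = index.get(new_doc[unique_key])
    merged = merge_with_stored(new_doc, stored, keep_fields=keep_fields)
    # Overwriting the entry mirrors the delete + add that happens under the hood.
    index[new_doc[unique_key]] = merged
    return merged


if __name__ == "__main__":
    index = {}
    url = "http://example.com/"
    process_add(index, {"id": url, "first_seen": "2014-07-01", "fetched": "2014-07-01"},
                keep_fields=("first_seen",))
    process_add(index, {"id": url, "first_seen": "2014-07-07", "fetched": "2014-07-07"},
                keep_fields=("first_seen",))
    print(index[url])
```

In this sketch the second add keeps the original `first_seen`, takes the newer `fetched`, and bumps `count`, which is the "overwrite only some fields" behaviour asked about. As Jorge notes, in a sharded setup the lookup per document is the part that may become a latency concern.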

