Re: Changing nutch for update documents instead of add new ones

Ali Nazemian Mon, 07 Jul 2014 22:51:29 -0700

Dear Jorge,
Thank you very much for your helpful reply. Unfortunately I am running Solr
in distributed mode (SolrCloud with multiple shards), so I probably cannot
use your solution. Because of the sharding I am also unable to use a
cross-document/cross-core join, unless I go with joins from the start.
I have already posted a question about my problem on Stack Overflow. Would you
please take a look at it to understand the application and the main
problem?


http://stackoverflow.com/questions/24608766/solr-overwrite-on-just-some-fields-with-duplicate-uniquekeys


Best regards.


On Mon, Jul 7, 2014 at 7:58 PM, Jorge Luis Betancourt Gonzalez <
[email protected]> wrote:

> Sure, no problem.
>
> We’re using Solr 3.6, so we updated a custom UpdateRequestProcessor. This
> extension point provided by Solr lets you plug custom logic into the
> ingestion process of your data (documents) into Solr.
>
> Basically this plugin fires a query using the uniqueKey field against the
> index, then reads the values from that doc and uses them to update
> the values of the new document that is going into the index. So let’s say
> you’re incrementing a counter: you send a new document with or without the
> counter field, then the plugin uses the uniqueKey field to fire a query
> against the inverted index in Solr, retrieves the stored value of the
> count field in the index, and adds/sets the count field of the new document
> accordingly. Of course, by itself this would just add a new document to
> the index, but combined with the dedup component, the document gets
> updated.
>
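The read-then-merge step described above can be sketched roughly as follows. This is only an illustration in Python, not the actual plugin (which is a Java UpdateRequestProcessor): a plain dict stands in for the stored fields of the index, and the names `merge_with_stored`, `add_document`, and the `count` field are hypothetical.

```python
from typing import Any, Dict

# Hypothetical in-memory stand-in for the stored fields of a Solr index,
# keyed by the uniqueKey value. A real plugin would query Solr instead.
Index = Dict[str, Dict[str, Any]]


def merge_with_stored(index: Index, new_doc: Dict[str, Any],
                      unique_key: str = "id",
                      counter_field: str = "count") -> Dict[str, Any]:
    """Return a copy of new_doc whose counter continues from the stored value."""
    stored = index.get(new_doc[unique_key])  # lookup by uniqueKey
    previous = stored.get(counter_field, 0) if stored else 0
    merged = dict(new_doc)
    merged[counter_field] = previous + 1
    return merged


def add_document(index: Index, new_doc: Dict[str, Any],
                 unique_key: str = "id") -> None:
    """Merge, then overwrite the old entry (the 'delete and add' step)."""
    merged = merge_with_stored(index, new_doc, unique_key)
    index[merged[unique_key]] = merged


index: Index = {}
add_document(index, {"id": "http://example.com/", "title": "first crawl"})
add_document(index, {"id": "http://example.com/", "title": "second crawl"})
print(index["http://example.com/"])
```

After the two calls the entry holds the fields of the second document with the counter at 2. The overwrite in `add_document` plays the role that the dedup component plays in the description above.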
> The inverted index in Solr does not know the original value of any field;
> instead, you must define the field in question as stored (which in our case
> we do, and this is why the plugin can retrieve the original value, let’s say
> of a counter, and increment it).
>
> Of course this is working now with Solr 3.6. In Solr 4 there are a few new
> features, including scripts that get evaluated at index/query time and
> allow you to plug in your logic in a very similar way. I haven’t tested this
> approach yet, but I suppose it could work (I also don’t know whether you can
> retrieve data from the index with these scripts, i.e. fire a query). One more
> thing: this approach is not SolrCloud ready. Although I suppose it could
> work, my only concern is latency: if for each document you need to request
> data from the whole cluster, this could be a bottleneck.
> I don’t know your specific requirements, but I hope this helps. If you share
> more of your architecture, I’m sure more people could help.
>
> Regards,
>
> On Jul 7, 2014, at 8:45 AM, Ali Nazemian <[email protected]> wrote:
>
> > Dear Jorge,
> > Hi,
> > Could you please tell me more about this Solr plugin? Do you have it?
> > Regards.
> >
> >
> > On Wed, Jul 2, 2014 at 9:44 AM, Jorge Luis Betancourt Gonzalez <
> > [email protected]> wrote:
> >
> >> Some time ago, for a very particular use case, we abstracted this
> >> responsibility into a custom Solr plugin for a few stored fields. It would
> >> handle this case (not just updating a date field, but also keeping a
> >> counter of how many times a URL is indexed). Of course you need stored
> >> fields for this, and under the hood a document still gets deleted and
> >> added.
> >>
> >> On Jul 1, 2014, at 9:54 AM, Markus Jelsma <[email protected]>
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> NutchIndexAction is indeed prepared to handle updates, but the methods
> >>> are not implemented. In the case of Solr, it still does an internal
> >>> add/delete for updated documents, and to do so, you must have all fields
> >>> stored="true". So in almost all cases, it is more efficient not to store
> >>> all fields and send some additional data over the wire. You can implement
> >>> it though.
> >>>
> >>> Markus
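The internal add/delete and the stored="true" requirement mentioned above match how Solr 4 atomic updates behave. As a rough sketch, assuming Solr 4 or later and its JSON update syntax with the `set` and `inc` modifiers, a partial-update payload could be built like this. The helper `atomic_update` and the field names `last_indexed` and `index_count` are hypothetical, and the uniqueKey field is assumed to be `id`.

```python
import json
from typing import Any, Dict


def atomic_update(doc_id: str, set_fields: Dict[str, Any],
                  inc_fields: Dict[str, int]) -> Dict[str, Any]:
    """Build one atomic-update document: uniqueKey plus per-field modifiers."""
    doc: Dict[str, Any] = {"id": doc_id}  # assumes the uniqueKey field is "id"
    for name, value in set_fields.items():
        doc[name] = {"set": value}   # replace the stored value
    for name, amount in inc_fields.items():
        doc[name] = {"inc": amount}  # add to the stored numeric value
    return doc


# Serialize a one-document batch; sending it to Solr is left out here.
payload = json.dumps([
    atomic_update("http://example.com/",
                  set_fields={"last_indexed": "2014-07-01T00:00:00Z"},
                  inc_fields={"index_count": 1})
])
print(payload)
```

The payload only names the fields that change; Solr rebuilds the rest of the document from its stored fields, which is why every field has to be stored for this to be lossless.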
> >>>
> >>> -----Original message-----
> >>>> From:Ali Nazemian <[email protected]>
> >>>> Sent: Tuesday 1st July 2014 15:31
> >>>> To: [email protected]
> >>>> Subject: Changing nutch for update documents instead of add new ones
> >>>>
> >>>> Dear all,
> >>>> Hi,
> >>>> I am going to make some changes to Nutch's default behavior. I want to
> >>>> change the Nutch Solr indexer (the indexWriter class) so that, instead of
> >>>> adding new documents to Solr, old documents are updated. I saw an "update"
> >>>> method inside this class. Is it implemented for this purpose? If not, what
> >>>> is the purpose of this method? Another question: would doing this (changing
> >>>> indexWriter to update documents instead of adding them) affect the
> >>>> performance of whole-web crawling?
> >>>> Best regards.
> >>>>
> >>>> --
> >>>> A.Nazemian
> >>>>
> >>
> >> VII International Summer School at UCI, June 30 to July 11, 2014.
> >> See www.uci.cu
> >>
> >
> >
> >
> > --
> > A.Nazemian
>
>



-- 
A.Nazemian
