Re: Changing nutch for update documents instead of add new ones

Jorge Luis Betancourt Gonzalez Mon, 07 Jul 2014 08:29:26 -0700

Sure, no problem.

We’re using Solr 3.6, so we implemented this as a custom
UpdateRequestProcessor. This extension point, provided by Solr, lets you plug
custom logic into the ingestion process of your data (documents) into Solr.

Basically, this plugin fires a query against the index using the uniqueKey
field, reads the values from the matching document, and uses them to update the
values of the new document that is going into the index. So let’s say you’re
incrementing a counter: you send a new document with or without the counter
field, the plugin uses the uniqueKey field to fire a query against the inverted
index in Solr, retrieves the stored value of the count field, and adds/sets the
count field of the new document accordingly. Of course, this by itself would
just add a new document to the index, but combined with the dedup component,
the document gets updated.
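
For illustration only, the counter flow described above can be simulated in
plain Java. This is a minimal sketch, not the actual plugin: it uses no Solr
classes, a HashMap stands in for the stored fields of the index, and the names
CounterMergeSketch, index, "id" and "count" are assumptions made up for the
example.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java simulation of the "look up by uniqueKey, merge, then replace" flow.
// No Solr API is used; every name here is hypothetical.
final class CounterMergeSketch {

    static final String UNIQUE_KEY = "id";   // assumed uniqueKey field name
    static final String COUNTER = "count";   // assumed stored counter field

    // Stand-in for the stored fields of the index, keyed by uniqueKey.
    private final Map<String, Map<String, Object>> storedDocs = new HashMap<>();

    // Mimics what happens to one incoming document.
    Map<String, Object> index(Map<String, Object> incoming) {
        String key = (String) incoming.get(UNIQUE_KEY);
        Map<String, Object> doc = new HashMap<>(incoming);

        // Step 1: "query" the index by uniqueKey and read the stored counter.
        long previous = 0L;
        Map<String, Object> existing = storedDocs.get(key);
        if (existing != null && existing.get(COUNTER) instanceof Number) {
            previous = ((Number) existing.get(COUNTER)).longValue();
        }

        // Step 2: set the counter on the new document, ignoring any value sent.
        doc.put(COUNTER, previous + 1);

        // Step 3: the new document replaces the old one (simulated by put).
        storedDocs.put(key, doc);
        return doc;
    }

    int size() {
        return storedDocs.size();
    }

    public static void main(String[] args) {
        CounterMergeSketch sketch = new CounterMergeSketch();
        Map<String, Object> page = new HashMap<>();
        page.put(UNIQUE_KEY, "http://example.com/");

        System.out.println(sketch.index(page).get(COUNTER)); // prints 1
        System.out.println(sketch.index(page).get(COUNTER)); // prints 2
        System.out.println(sketch.size());                   // prints 1
    }
}
```

Indexing the same URL twice leaves a single document whose count is 2, which
is the behavior the plugin is meant to give.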

The inverted index in Solr does not know the original value of a field unless
you define the field in question as stored (which we do in our case; this is
why the plugin can retrieve the original value of, let’s say, a counter and
increment it).

Of course, this is working now with Solr 3.6. In Solr 4 there are a few new
features, including scripts that get evaluated at index/query time and allow
you to plug in your logic in a very similar way. I haven’t tested this approach
yet, but I suppose it could work (I also don’t know if you can retrieve data
from the index with these scripts, i.e. fire a query). One more thing: this
approach is not SolrCloud-ready. I suppose it could work, but my only
concern is latency: if you need to request data from the whole cluster for
each document, this could be a bottleneck. I don’t know your specific
requirements, but I hope this helps. If you share more of your architecture, I’m
sure more people could help.

Regards,

On Jul 7, 2014, at 8:45 AM, Ali Nazemian <[email protected]> wrote:

> Dear Jorge,
> Hi,
> Could you please tell me more about this solr plugin? Do you have that?
> Regards.
> 
> 
> On Wed, Jul 2, 2014 at 9:44 AM, Jorge Luis Betancourt Gonzalez <
> [email protected]> wrote:
> 
>> Some time ago, for a very particular use case, we abstracted this
>> responsibility into a custom Solr plugin for a few stored fields. It would
>> handle this case (not just updating a date field, but also keeping a
>> counter of how many times a URL is indexed). Of course you need stored
>> fields for this, and under the hood a document still gets deleted and added.
>> 
>> On Jul 1, 2014, at 9:54 AM, Markus Jelsma <[email protected]>
>> wrote:
>> 
>>> Hi,
>>> 
>>> NutchIndexAction is indeed prepared to handle updates but the methods
>>> are not implemented. In case of Solr, it still does an internal add/delete
>>> for updated documents, and to do so, you must have all fields
>>> stored="true". So in almost all cases, it is more efficient not to store
>>> all fields and send some additional data over the wire. You can implement
>>> it though.
>>> 
>>> Markus
>>> 
>>> -----Original message-----
>>>> From:Ali Nazemian <[email protected]>
>>>> Sent: Tuesday 1st July 2014 15:31
>>>> To: [email protected]
>>>> Subject: Changing nutch for update documents instead of add new ones
>>>> 
>>>> Dears,
>>>> Hi,
>>>> I am going to make some changes to the default behavior of Nutch. I want
>>>> to change the Nutch Solr index (indexWriter class) so that, instead of
>>>> adding new documents to Solr, old documents are updated. I saw an "update"
>>>> method inside this class. Is that implemented for this purpose? If not,
>>>> what is the purpose of this method? Another question: would doing such a
>>>> thing (changing indexWriter to update documents instead of adding them)
>>>> affect my performance for whole-web crawling?
>>>> Best regards.
>>>> 
>>>> --
>>>> A.Nazemian
>>>> 
>> 
>> VII Escuela Internacional de Verano en la UCI del 30 de junio al 11 de
>> julio de 2014. Ver www.uci.cu
>> 
> 
> 
> 
> -- 
> A.Nazemian

