Re: Deduplication in 1.4

Martijn v Groningen Thu, 26 Nov 2009 10:07:17 -0800

Two sites that use field-collapsing:
1) www.ilocal.nl
2) www.welke.nl
I'm not sure what you mean by double-tripping. The sites mentioned
do not have performance problems caused by field collapsing.


The patch currently supports only quasi-distributed field collapsing
(as I have described on the Solr wiki). I don't know of a distributed
field-collapsing algorithm that works properly without making the
search slow.

Martijn
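
For readers following the thread: the roll-up behaviour asked about in the
original question quoted below (return only the best-ranked document per
'duplicate_group_id') can be illustrated with a minimal sketch. This is not
the SOLR-236 implementation and not a Solr API; the function name
collapse_by_field and the sample documents are hypothetical, and the sketch
assumes the results are already sorted by relevancy.

```python
from typing import Dict, Iterable, List


def collapse_by_field(ranked_docs: Iterable[Dict], field: str) -> List[Dict]:
    """Keep only the first (highest-ranked) document for each value of `field`.

    Hypothetical helper for illustration; assumes `ranked_docs` is already
    ordered by relevancy, best match first.
    """
    seen = set()
    collapsed = []
    for doc in ranked_docs:
        key = doc.get(field)
        # Assumption of this sketch: documents without the field are kept as-is.
        if key is None:
            collapsed.append(doc)
            continue
        # A better-ranked document with the same value was already kept.
        if key in seen:
            continue
        seen.add(key)
        collapsed.append(doc)
    return collapsed


# Hypothetical result list, already sorted by relevancy score.
results = [
    {"id": "a", "duplicate_group_id": "g1", "score": 3.2},
    {"id": "b", "duplicate_group_id": "g2", "score": 2.9},
    {"id": "c", "duplicate_group_id": "g1", "score": 2.5},
    {"id": "d", "score": 1.1},
]

collapsed = collapse_by_field(results, "duplicate_group_id")
print([doc["id"] for doc in collapsed])
```

Here document "c" is dropped because "a" shares its group and ranks higher.
The deduplication feature discussed in the thread works at index time
instead, which is why it does not produce this query-time roll-up.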

2009/11/26 Otis Gospodnetic <otis_gospodne...@yahoo.com>:
> Hi Martijn,
>
>
> ----- Original Message ----
>
>> From: Martijn v Groningen <martijn.is.h...@gmail.com>
>> To: solr-user@lucene.apache.org
>> Sent: Thu, November 26, 2009 3:19:40 AM
>> Subject: Re: Deduplication in 1.4
>>
>> Field collapsing has been used by many in their production
>> environment.
>
> Got any pointers to public sites you know use it?  I know of a high traffic 
> site that used an early version, and it caused performance problems.  Is 
> double-tripping still required?
>
>> Over the last few months the stability of the patch grew as
>> quite a few bugs were fixed. The only big feature missing currently is
>> caching of the collapsing algorithm. I'm currently working on that and
>
> Is it also full distributed-search-ready?
>
>> I will put it in a new patch in the coming days.  So yes, the
>> patch is very near being production ready.
>
> Thanks,
> Otis
>
>> Martijn
>>
>> 2009/11/26 KaktuChakarabati :
>> >
>> > Hey Otis,
>> > Yep, I realized this myself after playing some with the dedupe feature
>> > yesterday.
>> > So it does look like Field collapsing is what I need pretty much.
>> > Any idea on how close it is to being production-ready?
>> >
>> > Thanks,
>> > -Chak
>> >
>> > Otis Gospodnetic wrote:
>> >>
>> >> Hi,
>> >>
>> >> As far as I know, the point of deduplication in Solr (
>> >> http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate
>> >> document before indexing it in order to avoid duplicates in the index in
>> >> the first place.
>> >>
>> >> What you are describing is closer to the field collapsing patch in SOLR-236.
>> >>
>> >>  Otis
>> >> --
>> >> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>> >> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>> >>
>> >>
>> >>
>> >> ----- Original Message ----
>> >>> From: KaktuChakarabati
>> >>> To: solr-user@lucene.apache.org
>> >>> Sent: Tue, November 24, 2009 5:29:00 PM
>> >>> Subject: Deduplication in 1.4
>> >>>
>> >>>
>> >>> Hey,
>> >>> I've been trying to find some documentation on using this feature in 1.4,
>> >>> but the Wiki page is a little sparse.
>> >>> Specifically, here's what I'm trying to do:
>> >>>
>> >>> I have a field, say 'duplicate_group_id', that I'll populate based on an
>> >>> offline document deduplication process I have.
>> >>>
>> >>> All I want is for Solr to compute a 'duplicate_signature' field based on
>> >>> this one at update time, so that when I search for documents later, all
>> >>> documents with the same original 'duplicate_group_id' value will be rolled up
>> >>> (e.g. I'll just get the first one that came back according to relevancy).
>> >>>
>> >>> I enabled the deduplication processor and put it into the updater, but I'm
>> >>> not seeing any difference in the returned results (i.e. results with the same
>> >>> duplicate_id are returned separately).
>> >>>
>> >>> Is there anything I need to supply at query time for this to take effect?
>> >>> What should the behaviour be? Is there a working example of this?
>> >>>
>> >>> Anything would be helpful.
>> >>>
>> >>> Thanks,
>> >>> Chak
>> >>> --
>> >>> View this message in context:
>> >>> http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html
>> >>> Sent from the Solr - User mailing list archive at Nabble.com.
>> >>
>> >>
>> >>
>> >
>> > --
>> > View this message in context:
>> > http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html
>> > Sent from the Solr - User mailing list archive at Nabble.com.
>> >
>> >
>
>
