Re: How to remove duplicate tokens from solr

Rajdeep Sahoo Thu, 17 Sep 2020 12:35:25 -0700

But I am not sure why this type of search string is causing high CPU
utilization.


On Fri, 18 Sep, 2020, 12:49 am Rahul Goswami, <rahul196...@gmail.com> wrote:

> Is this for a phrase search? If yes, then the position of each token would
> matter too, and I am not sure which token you would want to remove, e.g.
> "tshirt hat tshirt".
> Also, are you looking to save space and want this at index time? Or do you
> just want to remove duplicates from the search string?
>
> If this is at search time AND is not a phrase search, there are a couple
> of approaches I can think of:
>
> 1) You could handle this in the application layer and pass only the
> deduplicated string to Solr.
> 2) You could write a custom search component and configure it in the
> first-components list to process the search string and remove duplicates
> before it hits the default search components. See:
> https://lucene.apache.org/solr/guide/7_7/requesthandlers-and-searchcomponents-in-solrconfig.html#first-components-and-last-components
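
For the application-layer option, a minimal sketch in Python might look like
the following. The function name dedupe_query_terms is hypothetical, and the
sketch assumes a plain, whitespace-separated, non-phrase query with no Solr
query syntax (quotes, field prefixes, boolean operators). It also assumes
terms should be compared case-insensitively, which only makes sense if the
field's analyzer lowercases tokens.

```python
def dedupe_query_terms(query: str) -> str:
    """Drop repeated whitespace-separated terms, keeping the first
    occurrence of each term in its original order."""
    seen = set()
    unique_terms = []
    for term in query.split():
        # Assumption: the field analyzer lowercases tokens, so "Tshirt"
        # and "tshirt" are treated as the same term here.
        key = term.lower()
        if key not in seen:
            seen.add(key)
            unique_terms.append(term)
    return " ".join(unique_terms)


# The deduplicated string is what the application would send to Solr.
print(dedupe_query_terms(" tshirt tshirt tshirt tshirt tshirt tshirt"))  # tshirt
print(dedupe_query_terms("tshirt hat tshirt"))  # tshirt hat
```

Because this discards term positions, it is not suitable for phrase
searches, where "tshirt hat tshirt" and "tshirt hat" are different queries.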
>
> However, if this is for search, I would still evaluate whether writing
> those extra lines of code is worth the investment. I say so because my
> assumption is that, for duplicated tokens in the search string, Lucene
> would be smart enough not to fetch the doc IDs again, so you should not
> need to worry about spending computation resources re-evaluating the same
> tokens. (Someone correct me if I am wrong!)
>
> -Rahul
>
> On Thu, Sep 17, 2020 at 2:56 PM Rajdeep Sahoo <rajdeepsahoo2...@gmail.com>
> wrote:
>
> > If someone is searching with " tshirt tshirt tshirt tshirt tshirt tshirt"
> > we need to remove the duplicates and search with tshirt.
> >
> >
> > On Fri, 18 Sep, 2020, 12:19 am Alexandre Rafalovitch, <arafa...@gmail.com>
> > wrote:
> >
> > > This is not quite enough information.
> > > There is
> > > https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#remove-duplicates-token-filter
> > > but it has specific limitations.
> > >
> > > What is the problem that you are trying to solve that you feel is due
> > > to duplicate tokens? Why are they duplicates? Is it about storage or
> > > relevancy?
> > >
> > > Regards,
> > >    Alex.
> > >
> > > On Thu, 17 Sep 2020 at 14:35, Rajdeep Sahoo <rajdeepsahoo2...@gmail.com>
> > > wrote:
> > > >
> > > > Hi team,
> > > > Is there any way to remove duplicate tokens from Solr? Is there any
> > > > filter for this?
> > >
> >
>
