But I am not sure why this type of search string is causing high CPU utilization.

On Fri, 18 Sep, 2020, 12:49 am Rahul Goswami, <rahul196...@gmail.com> wrote:

> Is this for a phrase search? If yes then the position of the token would
> matter too and not sure which token would you want to remove. "eg
> "tshirt hat tshirt".
> Also, are you looking to save space and want this at index time? Or just
> want to remove duplicates from the search string?
>
> If this is at search time AND is not a phrase search, there are a couple
> approaches I could think of :
>
> 1) You could either handle this in the application layer to only pass the
> deduplicated string before it hits solr
> 2) You can write a custom search component and configure it in the
> <first-components> list to process the search string and remove duplicates
> before it hits the default search components. See here (
>
> https://lucene.apache.org/solr/guide/7_7/requesthandlers-and-searchcomponents-in-solrconfig.html#first-components-and-last-components
> ).
>
> However if for search, I would still evaluate if writing those extra lines
> of code is worth the investment. I say so since my assumption is that for
> duplicated tokens in search string, lucene would have the intelligence to
> not fetch the doc ids again, so you should not be worried about spending
> computation resources to reevaluate the same tokens (Someone correct me if
> I am wrong!)
>
> -Rahul
>
> On Thu, Sep 17, 2020 at 2:56 PM Rajdeep Sahoo <rajdeepsahoo2...@gmail.com>
> wrote:
>
> > If someone is searching with " tshirt tshirt tshirt tshirt tshirt tshirt"
> > we need to remove the duplicates and search with tshirt.
> >
> >
> > On Fri, 18 Sep, 2020, 12:19 am Alexandre Rafalovitch, <arafa...@gmail.com>
> > wrote:
> >
> > > This is not quite enough information.
> > > There is
> > >
> > > https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#remove-duplicates-token-filter
> > > but it has specific limitations.
> > >
> > > What is the problem that you are trying to solve that you feel is due
> > > to duplicate tokens? Why are they duplicates? Is it about storage or
> > > relevancy?
> > >
> > > Regards,
> > >    Alex.
> > >
> > > On Thu, 17 Sep 2020 at 14:35, Rajdeep Sahoo <rajdeepsahoo2...@gmail.com>
> > > wrote:
> > > >
> > > > Hi team,
> > > >  Is there any way to remove duplicate tokens from solr. Is there any
> > > > filter for this.
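
For reference, approach 1 in the thread (deduplicating in the application layer before the query reaches Solr) could look roughly like the sketch below. It assumes a plain keyword query with whitespace-separated terms; it is not suitable for phrase queries, boolean operators, or field-qualified terms, where position and repetition can matter. The class and method names (QueryDedup, dedupe) are hypothetical and not part of Solr or Lucene.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not a Solr API: collapses repeated terms in a
// simple whitespace-separated keyword query before it is sent to Solr.
final class QueryDedup {

    private QueryDedup() {
    }

    // Returns the query with exact duplicate tokens removed, keeping the
    // first occurrence of each token in its original order.
    // Matching is case-sensitive and no analysis (stemming etc.) is applied.
    static String dedupe(String query) {
        if (query == null) {
            return "";
        }
        // LinkedHashSet drops duplicates while preserving insertion order.
        Set<String> unique = new LinkedHashSet<>();
        for (String token : query.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                unique.add(token);
            }
        }
        return String.join(" ", unique);
    }
}
```

With this sketch, QueryDedup.dedupe(" tshirt tshirt tshirt tshirt tshirt tshirt") returns "tshirt", and "tshirt hat tshirt" becomes "tshirt hat", which is why it should only be applied when the query is known not to be a phrase search.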