Re: Deduplication of search result with custom with custom sort

2020-10-09 Thread Jigar Shah
My learnings dealing with this problem: we faced a similar problem before, and did the following things: 1) Don't request totalGroupCount, and the response was fast, as computing the group count is an expensive task, provided you can live without groupCount. Although you can approximate pagination up to total
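The advice above can be sketched in plain Java (names and types here are illustrative, not Lucene API): skipping the total group count means the collector can stop as soon as the requested page of unique groups is full, instead of making a full pass over every hit just to count groups.

```java
import java.util.*;

public class DedupPager {
    // A hit with its dedup key (e.g. the docvalue used for grouping) and a score.
    record Hit(String groupKey, double score) {}

    /**
     * Returns the first `limit` hits, keeping only the first hit per group.
     * Assumes `hits` is already in the desired sort order. The loop stops
     * as soon as the page is full; computing a total group count would
     * instead force a scan over every remaining hit.
     */
    static List<Hit> topDeduped(List<Hit> hits, int limit) {
        Set<String> seen = new HashSet<>();
        List<Hit> page = new ArrayList<>();
        for (Hit h : hits) {
            if (seen.add(h.groupKey())) {   // first occurrence of this group
                page.add(h);
                if (page.size() == limit) break;  // early termination
            }
        }
        return page;
    }
}
```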

Re: Deduplication of search result with custom with custom sort

2020-10-09 Thread Diego Ceccarelli (BLOOMBERG/ LONDON)
As Erick said, can you tell us a bit more about the use case? There might be another way to achieve the same result. What are these documents? Why do you need 1000 docs per user? From: java-user@lucene.apache.org At: 10/09/20 14:25:02 To: java-user@lucene.apache.org Subject: Re:

Re: MultiFieldQueryParser on integer and string (8.6.0)

2020-10-09 Thread Stephane Passignat
Hi, it seems I am not raising a lot of interest here... anyway, I'll try again with a simpler question. Is MultiFieldQueryParser usable in 8.6.0? Thanks. Original message From: Stephane Passignat Reply-to: java-user@lucene.apache.org To: java-user@lucene.apache.org Subject:

Re: Deduplication of search result with custom with custom sort

2020-10-09 Thread Dmitry Emets
6_500_000 is the total count of groups in the entire collection; I only return the top 1000 to users. I use Lucene with documents that can share the same docvalue, and I want to deduplicate these documents by that docvalue during search. Also, I sort my documents by multiple fields and

Re: Deduplication of search result with custom with custom sort

2020-10-09 Thread Erick Erickson
This is going to be fairly painful. You need to keep a list 6.5M items long, sorted. Before diving in there, I’d really back up and ask what the use case is. Returning 6.5M docs to a user is useless, so perhaps you’re doing some kind of analytics? In which case, and again assuming you’re using
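One way to avoid keeping a fully sorted 6.5M-item list is a single pass that keeps only the best document per group in a hash map, then a size-bounded min-heap to extract just the top 1000. This is a plain-Java sketch of that idea (field names and the score-only sort are illustrative, not the poster's actual multi-field sort):

```java
import java.util.*;

public class GroupTopN {
    record Doc(String groupKey, double score) {}

    /**
     * Keep only the best doc per group, then the top n groups by score.
     * Uses a HashMap plus a size-bounded min-heap, so only n group heads
     * are ever held in sorted order, instead of all ~6.5M of them.
     */
    static List<Doc> topGroups(Iterable<Doc> docs, int n) {
        Map<String, Doc> best = new HashMap<>();
        for (Doc d : docs)
            best.merge(d.groupKey(), d,
                (a, b) -> a.score() >= b.score() ? a : b);  // best per group

        PriorityQueue<Doc> heap =
            new PriorityQueue<>(Comparator.comparingDouble(Doc::score));
        for (Doc d : best.values()) {
            heap.offer(d);
            if (heap.size() > n) heap.poll();  // evict current worst
        }
        List<Doc> out = new ArrayList<>(heap);
        out.sort(Comparator.comparingDouble(Doc::score).reversed());
        return out;
    }
}
```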

Re: Deduplication of search result with custom with custom sort

2020-10-09 Thread Dmitry Emets
I have 12_000_000 documents and 6_500_000 groups. With sort: around 1 sec without grouping, 2 sec with grouping, and 12 sec with setAllGroups(true). Without sort: around 0.2 sec without grouping, 0.6 sec with grouping, and 10 sec with setAllGroups(true). Thank you, Erick, I will look

Re: Deduplication of search result with custom with custom sort

2020-10-09 Thread Erick Erickson
At the Solr level there is CollapsingQParserPlugin, see: https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html You could perhaps steal some ideas from that if you need this at the Lucene level. Best, Erick > On Oct 9, 2020, at 7:25 AM, Diego Ceccarelli (BLOOMBERG/ LONDON) > wrote:
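For reference, the collapse behaviour from the linked guide is expressed as a filter query; `dedupField` below is a placeholder for whichever docvalue field you deduplicate on:

```
q=<your query>
fq={!collapse field=dedupField}
expand=true    (optional: also return the collapsed duplicates per group)
```

CollapsingQParserPlugin keeps one document per value of the collapse field before ranking, which is why it tends to be cheaper than full result grouping.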

Re: Deduplication of search result with custom with custom sort

2020-10-09 Thread Diego Ceccarelli (BLOOMBERG/ LONDON)
How many documents are in the collection, how many groups, and how long is it taking with grouping vs. no grouping? Also, if you remove the custom sort, is it still slow? From: java-user@lucene.apache.org At: 10/09/20 12:27:25 To: Diego Ceccarelli (BLOOMBERG/ LONDON ) ,

Re: Deduplication of search result with custom with custom sort

2020-10-09 Thread Dmitry Emets
Yes, it is. On Fri, Oct 9, 2020 at 14:25, Diego Ceccarelli (BLOOMBERG/ LONDON) < dceccarel...@bloomberg.net> wrote: > Is the field that you are using to dedupe stored as a docvalue? > > From: java-user@lucene.apache.org At: 10/09/20 12:18:04 To: > java-user@lucene.apache.org > Subject: Deduplication of

Re: Deduplication of search result with custom with custom sort

2020-10-09 Thread Diego Ceccarelli (BLOOMBERG/ LONDON)
Is the field that you are using to dedupe stored as a docvalue? From: java-user@lucene.apache.org At: 10/09/20 12:18:04 To: java-user@lucene.apache.org Subject: Deduplication of search result with custom with custom sort Hi, I need to deduplicate search results by specific field and I have no

Deduplication of search result with custom with custom sort

2020-10-09 Thread Dmitry Emets
Hi, I need to deduplicate search results by a specific field and I have no idea how to implement this properly. I have tried grouping with setGroupDocsLimit(1) and it gives me the expected results, but the performance is not very good. I think that I need something like DiversifiedTopDocsCollector, but
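In plain Java terms, grouping with a group-docs limit of 1 effectively computes the following: sort by the custom multi-field order, then keep only the first (best) document seen for each dedup key. The record fields and comparator below are hypothetical stand-ins for the poster's actual docvalue and sort fields:

```java
import java.util.*;
import java.util.stream.*;

public class DedupByField {
    // dedupKey stands in for the docvalue field; primary/secondary for the sort fields.
    record Doc(String dedupKey, int primary, double secondary) {}

    /** Sort by the custom order, then keep the first doc per dedup key. */
    static List<Doc> dedup(List<Doc> docs) {
        Comparator<Doc> order = Comparator
            .comparingInt(Doc::primary)                                    // ascending
            .thenComparing(Comparator.comparingDouble(Doc::secondary)
                                     .reversed());                         // descending
        Set<String> seen = new HashSet<>();
        // Stateful filter is fine here because the stream is sequential.
        return docs.stream()
            .sorted(order)
            .filter(d -> seen.add(d.dedupKey()))  // first per key wins
            .collect(Collectors.toList());
    }
}
```

The cost Dmitry observes comes from the fact that grouping still has to track every group it encounters, not just the survivors, so the `seen` set here grows with the number of distinct keys matched.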