Re: Can solr do the equivalent of "select distinct(field)"?

Aleksander Stensby Thu, 17 Dec 2009 11:17:55 -0800

Thanks for your reply Erik!

My suggested query is actually very fast once we add
facet.mincount=1 (when searching within a limited set of documents).
The setback seems to be in the sharding of our data, and that puzzles me a
little bit...

I can't really see why SOLR is so slow at doing this.
The scenario:

Let's say we have two servers (s1 and s2).
If I query the following:
q=threadid:33&facet=true&facet.field=author&limit=-1&facet.mincount=0&rows=0
directly on either server, the response is lightning fast (<10ms).
So, in theory I could query them directly, concatenate the results myself and
get that done pretty fast.
But if I introduce the shards parameter, the response time balloons to between
15000ms and 20000ms!
shards=s1:8983/solr,s2:8983/solr
My initial thought is that I MUST be doing something wrong here?

So I try the following:
Run the query on server s1, with the shards param shards=s1:8983/solr:
the response time goes from sub-10ms to between 5000ms and 10000ms!
Same results if I run the query on s2, and the same if I use shards=s2:8983/solr.

Is there really that much overhead in running a distributed facet field
query with Solr? Has anyone else experienced this?

On the other hand, running regular distributed queries without faceting is
lightning fast (so I can't really see that this is a network problem or
anything either). It can't possibly be, since I tried running a facet query
on s1 with s1 as the shards param, and that is still as slow as if the
shards param pointed to a different server...

Any insight into this would be greatly appreciated! (I would like to avoid
having to hack together our own solution for concatenating results...)
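
For reference, the fallback of querying each server directly and combining the facet counts could look roughly like the sketch below. It assumes each server returns Solr's default JSON facet layout (a flat list of value, count, value, count, ...). The function names and the two sample responses are hypothetical, made up for illustration, and the HTTP calls themselves are left out.

```python
from collections import Counter


def facet_counts(response, field):
    """Convert Solr's flat [value, count, value, count, ...] list to a Counter."""
    flat = response["facet_counts"]["facet_fields"][field]
    return Counter(dict(zip(flat[0::2], flat[1::2])))


def merge_shard_facets(responses, field):
    """Sum per-value facet counts across the responses of several servers."""
    total = Counter()
    for response in responses:
        total.update(facet_counts(response, field))
    # Drop values that matched no document (what facet.mincount=1 would do).
    return {value: count for value, count in total.items() if count > 0}


# Hypothetical parsed JSON responses from s1 and s2 (not real data).
s1 = {"facet_counts": {"facet_fields": {"author": ["alice", 3, "bob", 1, "carol", 0]}}}
s2 = {"facet_counts": {"facet_fields": {"author": ["bob", 2, "dave", 4]}}}

merged = merge_shard_facets([s1, s2], "author")
distinct_authors = len(merged)  # number of unique authors in the thread
```

The number of distinct authors is then simply the number of keys left after the merge; an author who appears on both servers is counted once.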

Cheers,
 Aleks

On Thu, Dec 17, 2009 at 7:36 PM, Erik Hatcher <erik.hatc...@gmail.com> wrote:

>
> On Dec 17, 2009, at 11:59 AM, Aleksander Stensby wrote:
>
>> A follow-up question on this, Hoss:
>> If I have a set of documents, let's say this email thread. Each email has a
>> unique author. All emails in the thread are indexed with "threadid=33". If I
>> want to count the number of unique authors in this email thread, I could go
>> along the lines you mention at the end:
>> rows=0&threadid=33&facet=true&facet.field=author&limit=-1
>> then count all returned facets. This works, but becomes infeasible when the
>> number of unique author values in the index is large. Right?
>> So the limit=-1 solution just doesn't work for such fields. But it would work
>> well for "category" if the number of unique categories is low...
>> It's almost faster to retrieve all entries from the thread and count the
>> number of unique authors programmatically... But obviously, I don't want
>> to do that!
>>
>> So, how would you go about finding the number of unique authors in this
>> scenario?
>>
>
> One possible solution is "tree" faceting:
> https://issues.apache.org/jira/browse/SOLR-792
>
>    &facet.tree=threadid,author
>
> Could be a LARGE amount of data though!
>
>        Erik
>
>
