bq:   Of the 10k docs, most have a unique near duplicate hash value, so there
are about 10k unique values for the field that I'm grouping on.

I suspect (but don't know the grouping code well) that this is the issue.
You're getting the top N groups, right? But in the general case, you can't
ensure that the top N from shard1 has any relation to the top N from shard2.
So I _suspect_ that the code returns all of the groups. Say that group 5 has
3 docs on shard1 but 3,000 docs on shard2. To get the true top N, you need
to collate all the values from all the groups; you can't just return the top
10 groups from each shard and get correct counts.
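
To make that concrete, here's a toy sketch in Python. It is not Solr's
grouping code; the function names, the shard contents, and the idea of
ranking groups by doc count are all made up for illustration. It just shows
that merging each shard's local top N can give different counts than merging
everything first.

```python
from collections import Counter

def top_n(counts, n):
    """Return the n (group, count) pairs with the largest counts."""
    # Ties are broken by group name so the result is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

def merge_truncated(shards, n):
    """Merge only each shard's local top n (the shortcut that loses docs)."""
    merged = Counter()
    for shard in shards:
        merged.update(dict(top_n(shard, n)))
    return top_n(merged, n)

def merge_full(shards, n):
    """Merge every group from every shard, then take the top n."""
    merged = Counter()
    for shard in shards:
        merged.update(shard)
    return top_n(merged, n)

# Hypothetical per-shard doc counts; "g5" is small on shard1, huge on shard2.
shard1 = {"g1": 10, "g2": 9, "g5": 3}
shard2 = {"g5": 3000, "g3": 8, "g1": 1}

print(merge_truncated([shard1, shard2], 2))  # counts miss the cut-off docs
print(merge_full([shard1, shard2], 2))       # true counts
```

With N=2, the truncated merge never sees shard1's 3 docs for g5 or shard2's
1 doc for g1, so both counts come out low.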

Since your group cardinality is about 1K/shard (~1,000 docs per shard, most
with a unique hash), you're pushing 10 packets each containing about 1K
entries back to the originating shard, which has to combine/sort them all to
get the true top N. At least that's my theory.

Your situation is special in that you say that your groups don't appear on
more than one shard, so you'd probably have to write something that bypassed
this behavior and returned only the top N from each shard, if I'm right.
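
Here's a rough Python sketch of why that shortcut would be safe in your
case. Again, this is a toy model with hypothetical names and data, not
anything that exists in Solr: if no group key lives on more than one shard,
each shard's local counts are already final, so merging the per-shard top N
lists gives the same answer as merging everything.

```python
import heapq

def shard_top_n(counts, n):
    """Local top n (group, count) pairs for one shard."""
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])

def merge_disjoint_top_n(shards, n):
    """Global top n, assuming no group appears on more than one shard."""
    # Each shard contributes at most n candidates instead of all its groups.
    candidates = [pair for shard in shards for pair in shard_top_n(shard, n)]
    return heapq.nlargest(n, candidates, key=lambda kv: kv[1])

# Hypothetical shards with disjoint group keys.
shards = [{"a": 7, "b": 2}, {"c": 5, "d": 9}, {"e": 1}]

print(merge_disjoint_top_n(shards, 2))
```

The originating node then only has to look at (shards x N) entries rather
than every group from every shard.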

But that raises the question of why you're doing this. What purpose is
served by grouping on documents that probably only have 1 member?

Best,
Erick


On Wed, Nov 13, 2013 at 2:46 PM, David Anthony Troiano <
dtroi...@basistech.com> wrote:

> Hello,
>
> I'm hitting a performance issue when using field collapsing in a
> distributed Solr setup and I'm wondering if others have seen it and if
> anyone has an idea to work around it.
>
> I'm using field collapsing to deduplicate documents that have the same near
> duplicate hash value, and deduplicating at query time (as opposed to
> filtering at index time) is a requirement.  I have a sharded setup with 10
> cores (not SolrCloud), each having ~1000 documents.  Of the 10k docs,
> most have a unique near duplicate hash value, so there are about 10k unique
> values for the field that I'm grouping on.  The grouping parameters that
> I'm using are:
>
> group=true
> group.field=<near dupe hash field>
> group.main=true
>
> I'm attempting distributed queries (&shards=s1,s2,...,s10) where the only
> difference is the absence or presence of these three grouping parameters
> and I'm consistently seeing a marked difference in performance (as a
> representative data point, 200ms latency without grouping and 1600ms with
> grouping).  Interestingly, if I put all 10k docs on the same core and query
> that core independently with and without grouping, I don't see much of a
> latency difference, so the performance degradation seems to exist only in
> the sharded setup.
>
> Is there a known performance issue when field collapsing in a sharded setup
> (perhaps one that only manifests when the grouping field has many unique
> values), or
> have other people observed this?  Any ideas for a workaround?  Note that
> docs in my sharded setup can only have the same signature if they're in the
> same shard, so perhaps that can be used to boost perf, though I don't see
> an exposed way to do so.
>
> A follow-on question is whether we're likely to see the same issue if /
> when we move to SolrCloud.
>
> Thanks,
> Dave
>
