The ReducerStream's buffer can grow too large if document groups are very
large, but it only needs to hold one group in memory at a time. The
RollupStream, in trunk, has a grouping implementation that doesn't hang on
to all the Tuples from a group. You could also implement a custom stream
that does exactly what you need.
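
To make the memory difference concrete, here is a rough sketch in plain Java
of grouping that keeps only running aggregates per group rather than
buffering every record. It is not the RollupStream code and does not use the
Solr API: Row, GroupStats and rollup are made-up names, and the sketch
assumes the input is already sorted by the grouping key.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java sketch; Row, GroupStats and rollup are hypothetical names,
// not classes or methods from the Solr Streaming API.
final class RollupSketch {

    // One input record: a grouping key plus a single numeric field.
    static final class Row {
        final String key;
        final double value;

        Row(String key, double value) {
            this.key = key;
            this.value = value;
        }
    }

    // Aggregates emitted once per group.
    static final class GroupStats {
        final String key;
        final long count;
        final double sum;

        GroupStats(String key, long count, double sum) {
            this.key = key;
            this.count = count;
            this.sum = sum;
        }
    }

    // Assumes rows arrive sorted by key, so each group is contiguous.
    // Only the current key, a counter and a sum are held in memory.
    static void rollup(Iterator<Row> sortedRows, Consumer<GroupStats> out) {
        String currentKey = null;
        long count = 0;
        double sum = 0.0;
        while (sortedRows.hasNext()) {
            Row row = sortedRows.next();
            // A key change closes the previous group: emit it and reset.
            if (count > 0 && !row.key.equals(currentKey)) {
                out.accept(new GroupStats(currentKey, count, sum));
                count = 0;
                sum = 0.0;
            }
            currentKey = row.key;
            count++;
            sum += row.value;
        }
        // Emit the final group, if there was any input at all.
        if (count > 0) {
            out.accept(new GroupStats(currentKey, count, sum));
        }
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("a", 1.0), new Row("a", 2.0), new Row("b", 5.0));
        rollup(rows.iterator(),
                g -> System.out.println(g.key + " count=" + g.count + " sum=" + g.sum));
    }
}
```

Memory use here is constant per group no matter how many records the group
contains, which is the property you want when groups are very large. The
trade-off is that the individual records are gone once they have been
folded into the aggregates.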

The AnalyticsQuery is much more efficient because the data is left in place.
The Streaming API has considerable streaming overhead. But it's the Stream
"shuffling" that gives you the power to do things like fully distributed
grouping.

How many records are processed in a typical query and what type of response
time do you need?

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Sep 3, 2015 at 3:25 PM, tedsolr <tsm...@sciquest.com> wrote:

> Thanks Joel, that link looks promising. The CloudSolrStream bypasses my
> issue of multiple shards. Perhaps the ReducerStream would provide what I
> need. At first glance I worry that the buffer would grow too large - if
> it's really holding the values for all the fields in each document
> (Tuple.getMaps()). I use a Map in my DelegatingCollector to store the
> "unique" docs, but I only keep the docId, my stats, and the ordinals for
> each field. Would you expect the new streams API to perform as well as my
> implementation of an AnalyticsQuery and a DelegatingCollector?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Merging-documents-from-a-distributed-search-tp4226802p4227034.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
