Right now we’re sharding the collection because we hit performance issues in 
the past with legacy Solr (i.e. a single Solr core), and we’re also 
experimenting a bit to see which replication factor we can get away with (in 
terms of resources and cost). Unfortunately, Parallel SQL isn’t yet an option 
for us because it lacks support for point fields, which we use in our schema 
(https://issues.apache.org/jira/browse/SOLR-10427).
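
For context, by point fields I mean the solr.*PointField types, i.e. 
declarations along these lines (made-up names, not our actual schema):

  <fieldType name="pint" class="solr.IntPointField" docValues="true"/>
  <field name="some_numeric_field" type="pint" indexed="true" stored="true"/>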

Thanks for pointing me to the parallel function. What I don’t understand, 
though, is this: if I don’t use the parallel decorator, is my query not 
distributed across my cluster nodes (e.g. I have four shards and no replicas)?
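
To check my understanding of the docs, this is roughly the kind of expression 
I have in mind (just a sketch: the collection name, field name, filter and 
worker count below are placeholders, and I gather the inner search needs 
qt="/export", docValues on the fields involved, and partitionKeys so each 
worker gets a disjoint slice of the results):

parallel(mycollection,
         rollup(
           search(mycollection,
                  q="*:*",
                  fq="<my filters>",
                  fl="facet_field",
                  sort="facet_field asc",
                  qt="/export",
                  partitionKeys="facet_field"),
           over="facet_field",
           count(*)),
         workers="4",
         sort="facet_field asc")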


> On 22 Feb 2018, at 03:01, Joel Bernstein <joels...@gmail.com> wrote:
> 
> With Streaming Expressions you have options for speeding up large
> aggregations.
> 
> 1) Shard
> 2) Use the parallel function to run the aggregation in parallel.
> 3) Add more replicas
> 
> When you use the parallel function the same aggregation can be pulled from
> every shard and every shard replica in the cluster.
> 
> The parallel SQL interface supports a map_reduce aggregation mode where you
> can specify the number of parallel workers. If a SQL GROUP BY query works
> for you, that might be the easiest way to go. The docs have good coverage of
> this topic.
> 
> 
> 
> 
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/
> 
> On Wed, Feb 21, 2018 at 8:43 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> 
>> On 2/21/2018 12:08 PM, Alfonso Muñoz-Pomer Fuentes wrote:
>>> Some more details about my collection:
>>> - Approximately 200M documents
>>> - 1.2M different values in the field I’m faceting over
>>> 
>>> The query I’m doing is over a single bucket; after applying q and fq, the
>>> 1.2M values are reduced to at most 60K (often about half that). From your
>>> replies I assume I’m not going to hit a bottleneck any time soon. Thanks a
>>> lot.
>> 
>> Two hundred million documents is going to be a pretty big index even if
>> the documents are small.  The server is going to need a lot of spare
>> memory (not assigned to programs) for good general performance.
>> 
>> As I understand it, facet performance is going to be heavily determined
>> by the 1.2 million unique values in the field you're using.  Facet
>> performance is probably going to be very similar whether your query
>> matches 60K or 1 million.
>> 
>> Thanks,
>> Shawn
>> 
>> 

--
Alfonso Muñoz-Pomer Fuentes
Senior Lead Software Engineer @ Expression Atlas Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Tel: +44 (0) 1223 49 2633
Skype: amunozpomer
