Once you've scaled up the export from collection4 you can test the
performance of the join by moving the NullStream around the join.

parallel(null(innerJoin(collection3, collection4)))

Again you'll want to test with different numbers of workers and replicas to
see where you max out performance of the join.
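
As a rough sketch (not something posted in this thread), the three
measurements could be generated with a small Python helper like the one
below: single-node export, parallel export, then the parallel join. The join
field "join_key", the worker collection name "workers", and the worker count
are hypothetical placeholders. The expression parameters (qt="/export",
partitionKeys, on, workers, sort) follow the streaming expression syntax as I
understand it, so check them against the reference guide for your Solr
version.

```python
# Sketch only: builds streaming-expression strings, it does not talk to Solr.
# "join_key" and the worker collection "workers" are hypothetical names.

def search(collection, key, partitioned=False):
    """search() over the /export handler, sorted on the join key."""
    params = [collection, 'q="*:*"', f'fl="{key}"',
              f'sort="{key} asc"', 'qt="/export"']
    if partitioned:
        # partitionKeys splits the tuples across workers by key, so that
        # matching keys from both sides of a join reach the same worker.
        params.append(f'partitionKeys="{key}"')
    return "search(" + ", ".join(params) + ")"


def null(stream):
    """Wrap a stream in null(), which consumes tuples and reports timing."""
    return f"null({stream})"


def inner_join(left, right, key):
    """innerJoin() of two streams that are both sorted on the join key."""
    return f'innerJoin({left}, {right}, on="{key}")'


def parallel(stream, worker_collection, workers, key):
    """Wrap a stream in parallel(); parallel() expects a sort parameter."""
    return (f'parallel({worker_collection}, {stream}, '
            f'workers="{workers}", sort="{key} asc")')


def benchmark_steps(key="join_key", worker_collection="workers", workers=4):
    """Return the three expressions to time, in the order discussed."""
    export_c4 = search("collection4", key, partitioned=True)
    export_c3 = search("collection3", key, partitioned=True)
    return [
        # 1. Baseline: export from collection4 on a single node.
        null(search("collection4", key)),
        # 2. Parallel export from collection4.
        parallel(null(export_c4), worker_collection, workers, key),
        # 3. Parallel join, with the null stream moved around the join.
        parallel(null(inner_join(export_c3, export_c4, key)),
                 worker_collection, workers, key),
    ]


if __name__ == "__main__":
    for expr in benchmark_steps():
        print(expr)
```

Each printed expression can then be sent as the expr parameter to a
collection's /stream handler, varying the worker count and the number of
replicas between runs and comparing the timings that the null stream reports.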


Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Jun 2, 2017 at 10:25 AM, Joel Bernstein <joels...@gmail.com> wrote:

> innerJoin(intersect(innerJoin(collection1, collection2),
>                     innerJoin(collection3, collection4)),
>           collection5)
>
> Let's focus on:
>
> innerJoin(collection3, collection4)
>
> The first thing to focus on is how fast the export from collection4 is.
> You can test this with the NullStream using the following construct:
>
> null(search(collection4))
>
> The null stream will eat all the tuples and report back timing
> information. This will isolate the performance of the export from
> collection4.
>
> Once you have a baseline for how fast you can export from a single node,
> you can test with parallel export from a single node:
>
> parallel(null(search(collection4)))
>
> Then you can add replicas for collection4 and increase workers.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Jun 1, 2017 at 11:51 PM, Susmit Shukla <shukla.sus...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Which version of Solr are you on?
>> Increasing memory may not be useful, as the streaming API does not keep
>> data in memory (except maybe for hash joins).
>> Increasing replicas (not sharding) and pushing the join computation onto a
>> worker Solr cluster with #workers > 1 would definitely make things faster.
>> Are you limiting your results at some cutoff? If yes, then SOLR-10698
>> <https://issues.apache.org/jira/browse/SOLR-10698> can be a useful fix.
>> Also, the binary response format for streaming would be faster (probably
>> available in 6.5).
>>
>>
>>
>> On Thu, Jun 1, 2017 at 3:04 PM, thiaga rajan <
>> ecethiagu2...@yahoo.co.in.invalid> wrote:
>>
>> > We are working on a proposal and feel that the streaming API along with
>> > the export handler will best fit our use cases. We already have a
>> > structure in Solr in which we use graph queries to produce a
>> > hierarchical structure. Now, from that structure, we need to join a
>> > couple more collections. We have 5 different collections:
>> >
>> > Collection 1 - 800k records
>> > Collection 2 - 200k records
>> > Collection 3 - 7k records
>> > Collection 4 - 6 million records
>> > Collection 5 - 150k records
>> >
>> > We are using the strategy below:
>> >
>> > innerJoin(intersect(innerJoin(collection1, collection2),
>> >                     innerJoin(collection3, collection4)),
>> >           collection5)
>> >
>> > Performance becomes too slow once we include collection 4. With just
>> > collections 1, 2, and 5, the results come back in 2 seconds. The moment
>> > I include collection 4 in the query, I see a performance impact. I
>> > believe exporting the large result set from collection 4 is causing the
>> > issue. Currently I am using a single-shard collection with no replicas.
>> > I am thinking of increasing the memory as a first option to improve
>> > performance, as processing doc values needs more memory. If that does
>> > not work, I can try the parallel stream or sharding. Kindly advise if
>> > there is anything else I am missing.
>>
>
>
