Re: Solr 6 Distributed Join

Joel Bernstein Thu, 17 Dec 2015 07:34:20 -0800

Below is an example of nested joins where the innerJoin is done in parallel
using the parallel function. The partitionKeys parameter needs to be added
to the searches when the parallel function is used to partition the results
across worker nodes.


hashJoin(
    parallel(workerCollection,
        innerJoin(
            search(users, q="*:*", fl="userId, full_name, hometown",
                   sort="userId asc", zkHost="zk2:2345", qt="/export",
                   partitionKeys="userId"),
            search(reviews, q="*:*", fl="userId, review, score",
                   sort="userId asc", zkHost="zk1:2345", qt="/export",
                   partitionKeys="userId"),
            on="userId"
        ),
        workers="20",
        zkHost="zk1:2345",
        sort="userId asc"
    ),
    hashed=search(restaurants, q="city:nyc", fl="restaurantId, restaurantName",
                  sort="restaurantId asc", zkHost="zk1:2345", qt="/export"),
    on="restaurantId"
)
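
To illustrate why partitionKeys has to match the join key, here is a minimal Python sketch of the idea. It is not Solr code: the function names, the crc32 hash, and the sample tuples are all assumptions made for illustration, and Solr's own partitioning differs in detail. The point is only that when both sides are partitioned on the join key, every worker sees all the tuples it needs and can join independently.

```python
import zlib
from collections import defaultdict

def partition(tuples, key, workers):
    """Assign each tuple to a worker by hashing its partition key."""
    buckets = defaultdict(list)
    for t in tuples:
        # crc32 is a stand-in for whatever hash the real system uses.
        worker = zlib.crc32(str(t[key]).encode("utf-8")) % workers
        buckets[worker].append(t)
    return buckets

def local_join(left, right, key):
    """Join the tuples that one worker received (dict lookup, for brevity)."""
    by_key = defaultdict(list)
    for r in right:
        by_key[r[key]].append(r)
    return [{**l, **r} for l in left for r in by_key.get(l[key], [])]

def parallel_join(left, right, key, workers):
    """Partition both sides on the join key, join per worker, merge results."""
    left_parts = partition(left, key, workers)
    right_parts = partition(right, key, workers)
    joined = []
    for w in range(workers):
        joined.extend(local_join(left_parts.get(w, []), right_parts.get(w, []), key))
    # Re-sort the merged worker output on the join key.
    return sorted(joined, key=lambda t: t[key])

# Made-up sample tuples.
users = [{"userId": "u1", "full_name": "Ann"}, {"userId": "u2", "full_name": "Bob"}]
reviews = [{"userId": "u1", "score": 5}, {"userId": "u2", "score": 3},
           {"userId": "u1", "score": 4}]
result = parallel_join(users, reviews, "userId", workers=4)
```

Because tuples with the same userId always land on the same worker, the result is the same as a single-worker join. Partitioning the two sides on different fields would break that guarantee.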


Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Dec 17, 2015 at 10:29 AM, Joel Bernstein <joels...@gmail.com> wrote:

> The innerJoin joins two streams sorted by the same join keys (merge join).
> If a third stream has the same join keys you can nest innerJoins. But all
> three streams need to be sorted by the same join keys to nest innerJoins
> (merge joins).
>
> innerJoin(innerJoin(...),
>                 search(...),
>                 on...)
>
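
As a rough illustration of why every input must share the same sort order, here is a small Python sketch of a merge join. It is a simplification under stated assumptions: the function name and sample tuples are invented, and the inputs are materialized as lists rather than streamed. The output is again sorted on the join key, which is what makes nesting possible.

```python
def merge_join(left, right, key):
    """Inner-join two sequences of dicts that are both sorted ascending by key."""
    left, right = list(left), list(right)
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1          # left is behind, advance it
        elif lk > rk:
            j += 1          # right is behind, advance it
        else:
            # Find the run of right tuples that share this key ...
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # ... and pair every left tuple with that key against the run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out

# Made-up sample tuples, each list sorted by userId.
a = [{"userId": "u1", "a": "x"}, {"userId": "u2", "a": "y"}]
b = [{"userId": "u1", "b": "p"}, {"userId": "u1", "b": "q"}, {"userId": "u3", "b": "r"}]
c = [{"userId": "u1", "c": "z"}]

joined = merge_join(a, b, "userId")
nested = merge_join(joined, c, "userId")   # works because joined is still sorted
```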
> If the third stream is joined on a different key you can nest inside a
> hashJoin which doesn't require streams to be sorted on the join key. For
> example:
>
> hashJoin(innerJoin(...),
>                 hashed=search(...),
>                 on..)
>
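
The following Python sketch shows, under simplifying assumptions, why a hash join does not need its inputs sorted on the join key: the hashed side is loaded into an in-memory index, and the other side is streamed past it in whatever order it arrives. The function name and the sample tuples are invented for illustration.

```python
from collections import defaultdict

def hash_join(stream, hashed, key):
    """Join a stream against an in-memory index built from the hashed side."""
    index = defaultdict(list)
    for t in hashed:
        if key in t:
            index[t[key]].append(t)
    for t in stream:
        # Tuples without the join field simply find no match.
        for match in index.get(t.get(key), []):
            yield {**t, **match}

# Made-up sample tuples: the stream is sorted by userId, not by restaurantId.
reviews = [
    {"userId": "u1", "restaurantId": "r9"},
    {"userId": "u2", "restaurantId": "r1"},
    {"userId": "u3", "restaurantId": "r5"},
]
restaurants = [
    {"restaurantId": "r1", "restaurantName": "Luigi's"},
    {"restaurantId": "r9", "restaurantName": "Noodle Bar"},
]

result = list(hash_join(reviews, restaurants, "restaurantId"))
```

The trade-off is memory: the hashed side has to fit in memory, so it is usually the smaller of the two inputs.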
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Dec 17, 2015 at 9:28 AM, Akiel Ahmed <ahmed...@uk.ibm.com> wrote:
>
>> Hi again,
>>
>> I got the join to work. A teammate pointed out that one of the search
>> functions in the innerJoin query was missing a field used in the join;
>> adding the e1 field to the fl parameter of the second search function gave
>> the result I expected:
>>
>>
>> http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(search(gettingstarted, fl="id", q=text:John, sort="id asc",zkHost="localhost:9983",qt="/export"), search(gettingstarted, fl="id,e1", q=text:Friends, sort="id asc",zkHost="localhost:9983",qt="/export"), on="id=e1")
>>
>> I am still interested in whether we can specify a join, using an arbitrary
>> number of searches.
>>
>> Cheers
>>
>> Akiel
>>
>>
>>
>> From:   Akiel Ahmed/UK/IBM@IBMGB
>> To:     solr-user@lucene.apache.org
>> Date:   16/12/2015 17:05
>> Subject:        Re: Solr 6 Distributed Join
>>
>>
>>
>> Hi Dennis,
>>
>> Thank you for your help. I used your explanation to construct an innerJoin
>> query; I think I am getting further, but I didn't get the results I
>> expected. The following describes what I did. Is there any chance you can
>> tell me where I am going wrong?
>>
>> Solr 6 Developer Builds: #2738 and #2743
>>
>> 1. Modified server/solr/configsets/basic_configs/conf/managed-schema so
>> it reads:
>>
>> <?xml version="1.0" encoding="UTF-8" ?>
>> <schema name="search" version="1.5">
>>   <uniqueKey>id</uniqueKey>
>>   <field name="id" type="id" indexed="true" stored="true" required="true"
>>          multiValued="false" docValues="true"/>
>>   <field name="_version_" type="solr_version" indexed="true" stored="true"
>>          required="false" multiValued="false" docValues="true"/>
>>   <field name="type" type="id" indexed="true" stored="true"
>>          required="false" multiValued="false" docValues="true"/>
>>   <field name="e1" type="id" indexed="true" stored="true" required="false"
>>          multiValued="false" docValues="true"/>
>>   <field name="e2" type="id" indexed="true" stored="true" required="false"
>>          multiValued="false" docValues="true"/>
>>   <field name="text" type="free_text" indexed="true" stored="true"
>>          required="false" multiValued="false"/>
>>   <fieldType name="id" class="solr.StrField" sortMissingLast="true"/>
>>   <fieldType name="solr_version" class="solr.TrieLongField"
>> precisionStep="0" positionIncrementGap="0"/>
>>   <fieldType name="free_text" class="solr.TextField"
>> positionIncrementGap="100">
>>     <analyzer>
>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>       <filter class="solr.LowerCaseFilterFactory"/>
>>       <filter class="solr.WordDelimiterFilterFactory"
>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
>> catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
>>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>> words="lang/stopwords_en.txt"/>
>>     </analyzer>
>>   </fieldType>
>> </schema>
>>
>> 2. Modified server/solr/configsets/basic_configs/conf/solrconfig.xml,
>> adding the following near the bottom of the file so it is the last
>> request handler:
>>
>>   <requestHandler name="/stream" class="solr.StreamHandler">
>>         <lst name="invariants">
>>                 <str name="wt">json</str>
>>                 <str name="distrib">false</str>
>>         </lst>
>>   </requestHandler>
>>
>> 3. Used solr -e cloud to set up a SolrCloud instance, picking all the
>> defaults except that I chose basic_configs.
>>
>> 4. After Solr was running, I ingested the following data via the Solr Web
>> UI (/update handler, Document Type = CSV):
>> id,type,e1,e2,text
>> 1,ABC,,,John Smith
>> 2,ABC,,,Jane Smith
>> 3,ABC,,,MiKe Smith
>> 4,ABC,,,John Doe
>> 5,ABC,,,Jane Doe
>> 6,ABC,,,MiKe Doe
>> 7,ABC,,,John Smith
>> 8,DEF,,,Chicken Burger
>> 9,DEF,,,Veggie Burger
>> 10,DEF,,,Beef Burger
>> 11,DEF,,,Chicken Donar
>> 12,DEF,,,Chips
>> 13,DEF,,,Drink
>> 20,GHI,1,2,Friends
>> 21,GHI,3,4,Friends
>> 22,GHI,5,6,Friends
>> 23,GHI,7,6,Friends
>> 24,GHI,6,4,Friends
>> 25,JKL,1,8,Order
>> 26,JKL,2,9,Order
>> 27,JKL,3,10,Order
>> 28,JKL,4,11,Order
>> 29,JKL,5,12,Order
>> 30,JKL,6,13,Order
>>
>> 5. Navigating to the following URL in a browser returned the expected
>> result:
>> http://localhost:8983/solr/gettingstarted/select?q={!join from=id to=e1}text:John&fl="id"
>>
>> <response>
>> ...
>>   <result>
>>     <doc>
>>       <str name="id">20</str>
>>       <str name="e1">1</str>
>>       <str name="e2">2</str>
>>       ...
>>     </doc>
>>     <doc>
>>       <str name="id">28</str>
>>       <str name="e1">4</str>
>>       <str name="e2">11</str>
>>       ...
>>     </doc>
>>     <doc>
>>       <str name="id">23</str>
>>       <str name="e1">7</str>
>>       <str name="e2">6</str>
>>       ...
>>     </doc>
>>   </result>
>> </response>
>>
>> 6. Navigating to the following URL in a browser does NOT return what I
>> expected:
>>
>> http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(search(gettingstarted, fl="id", q=text:John, sort="id asc",zkHost="localhost:9983",qt="/export"), search(gettingstarted, fl="id", q=text:Friends, sort="id asc",zkHost="localhost:9983",qt="/export"), on="id=e1")
>>
>> {"result-set":{"docs":[
>> {"EOF":true,"RESPONSE_TIME":124}]}}
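
The empty result follows from the fix reported later in the thread: the second search lists only fl="id", so its tuples carry no e1 field for on="id=e1" to compare against. The Python sketch below reproduces that with a subset of the sample rows above. It is an illustration only: the helper names are invented, and the join is a plain nested loop rather than Solr's merge join.

```python
# A subset of the sample rows ingested in step 4.
docs = [
    {"id": "1", "text": "John Smith"},
    {"id": "4", "text": "John Doe"},
    {"id": "7", "text": "John Smith"},
    {"id": "20", "e1": "1", "e2": "2", "text": "Friends"},
    {"id": "21", "e1": "3", "e2": "4", "text": "Friends"},
    {"id": "23", "e1": "7", "e2": "6", "text": "Friends"},
]

def search(docs, word, fl):
    """Return matching docs reduced to the fields listed in fl."""
    return [{f: d[f] for f in fl if f in d}
            for d in docs if word in d["text"].split()]

def inner_join(left, right, left_key, right_key):
    """Pair tuples where left[left_key] equals right[right_key]."""
    return [(l, r) for l in left for r in right
            if right_key in r and l[left_key] == r[right_key]]

johns = search(docs, "John", ["id"])

# Second search returns only "id": no tuple has e1, so nothing can match.
broken = inner_join(johns, search(docs, "Friends", ["id"]), "id", "e1")

# Adding e1 to fl makes the join field available on the right-hand side.
fixed = inner_join(johns, search(docs, "Friends", ["id", "e1"]), "id", "e1")
```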
>>
>>
>> I also have a join-related question. Is there any chance I can specify a
>> query that joins more than two searches? For example:
>>
>> innerJoin(search(gettingstarted, fl="id", q=text:John, ...) as s1,
>>           search(gettingstarted, fl="id", q=text:Chicken, ...) as s2,
>>           search(gettingstarted, fl="id", q=text:Order, ...) as s3,
>>           on="s1.id=s3.e1",
>>           on="s2.id=s3.e2")
>>
>> Sorry if the query does not make sense, but given the data above my
>> intention is to find a single result made up of 3 documents:
>> s1.id=1,s2.id=8,s3.id=25
>> Is that possible? If yes, will Solr 6 support an arbitrary number of
>> queries and associated joins?
>>
>> Cheers
>>
>> Akiel
>>
>>
>>
>> From:   Dennis Gove <dpg...@gmail.com>
>> To:     Akiel Ahmed/UK/IBM@IBMGB, solr-user@lucene.apache.org
>> Date:   11/12/2015 15:34
>> Subject:        Re: Solr 6 Distributed Join
>>
>>
>>
>> Akiel,
>>
>> Without seeing your full URL, I assume that you're missing the
>> stream=innerJoin(.....) part of it. A full sample URL would look like this:
>> http://localhost:8983/solr/careers/stream?stream=innerJoin(search(careers, fl="personId,companyId,title", q=companyId:*, sort="companyId asc",zkHost="localhost:2181",qt="/export"),search(companies, fl="id,companyName", q=*:*, sort="id asc",zkHost="localhost:2181",qt="/export"),on="companyId=id")
>>
>> This example will return a join of career records with the company name
>> for all career records with a non-null companyId.
>>
>> And the pieces have the following meaning:
>> http://localhost:8983/solr/careers/stream?  - you have a collection
>>     called careers available on localhost:8983 and you're hitting its
>>     stream handler.
>> ?stream=  - you are passing the stream parameter to the stream handler.
>> zkHost="localhost:2181"  - there is a ZooKeeper instance running on
>>     localhost:2181 where Solr can get clusterstate information. Note that
>>     since you're sending the request to the careers collection, this param
>>     is not required in the search(careers....) part but is required in the
>>     search(companies....) part. For simplicity I usually just provide it
>>     for all.
>> qt="/export"  - tells Solr to use the export handler. This assumes all
>>     your fields are in docValues. If you'd rather not use the export
>>     handler, then you probably want to provide the rows=##### param to
>>     tell Solr to return a large number of rows for each underlying search.
>>     Without it Solr will default to, I believe, 10 rows.
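
The sample URL above contains spaces, quotes, and parentheses, which a browser will usually encode for you but a script will not. As a hedged sketch, the Python below builds the same request with the expression percent-encoded. The helper name stream_url is invented for illustration; the host, port, and collection names are the ones from the example and would need to match your own setup.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def stream_url(base, collection, expression):
    """Build a /stream request URL with the expression percent-encoded."""
    return f"{base}/{collection}/stream?" + urlencode({"stream": expression})

# The streaming expression from the sample URL, written out as one string.
expr = (
    'innerJoin('
    'search(careers, fl="personId,companyId,title", q=companyId:*, '
    'sort="companyId asc", zkHost="localhost:2181", qt="/export"),'
    'search(companies, fl="id,companyName", q=*:*, '
    'sort="id asc", zkHost="localhost:2181", qt="/export"),'
    'on="companyId=id")'
)

url = stream_url("http://localhost:8983/solr", "careers", expr)

# Decoding the query string gives back the original expression unchanged.
decoded = parse_qs(urlsplit(url).query)["stream"][0]
```

The resulting URL can be passed to any HTTP client. Only the stream parameter is encoded; the path is left as-is.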
>>
>> CCing the user list so others can see this as well.
>>
>> We're working on additional documentation for Streaming Aggregation and
>> Expressions. The page can be found at
>> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
>> but it's missing a lot of things we've added recently.
>>
>> - Dennis
>>
>> On Fri, Dec 11, 2015 at 9:51 AM, Akiel Ahmed <ahmed...@uk.ibm.com> wrote:
>>
>> > Hi,
>> >
>> > Sorry, this is out of the blue. I have joined the Solr mailing list, but
>> > I don't know if that is the correct place to ask my question. If you
>> > are not the best person to talk to, can you please point me in the
>> > right direction?
>> >
>> > I want to try using the Solr 6 distributed joins but can't find enough
>> > material on the web to make it work. I have added the stream handler to
>> > my solrconfig.xml (see below), and when issuing an inner join query (see
>> > below) I get an error: the local param named stream is missing, so I
>> > get a NullPointerException. Is there a way to play with the join via the
>> > Solr web UI, or if not, do you have a code snippet via a SolrJ client
>> > that performs a join?
>> >
>> > solrconfig.xml
>> >
>> > <requestHandler name="/stream" class="solr.StreamHandler">
>> >         <lst name="invariants">
>> >                 <str name="wt">json</str>
>> >                 <str name="distrib">false</str>
>> >         </lst>
>> > </requestHandler>
>> >
>> > query
>> > innerJoin(
>> >         search(getting_started, _search_field:john),
>> >         search(getting_started, _search_field:friends),
>> >         on="id=_link_from_id")
>> >
>> > Cheers
>> >
>> > Akiel
>> > Unless stated otherwise above:
>> > IBM United Kingdom Limited - Registered in England and Wales with number
>> > 741598.
>> > Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>> >
>>
>
