RE: Joining on multi valued fields
Hi Yonik

So I tested the join using the sample data below and the latest trunk. I still got the same behaviour. HOWEVER! In this case it was nothing to do with the patch or Solr version. It was the tokeniser splitting G1 into G and 1.

So thank you for a nice patch and your suggestions. I do have a couple of questions for you: at what level does the join happen, and what do you expect the performance penalty to be? We might use this extensively if the performance penalty isn't great.

Thanks again,
Matt

-----Original Message-----
From: Fowler, Matthew (Markets Eikon)
Sent: 03 August 2011 15:04
To: yo...@lucidimagination.com
Cc: solr-user@lucene.apache.org
Subject: RE: Joining on multi valued fields

No I haven't. I will get the latest out of the trunk and report back.

Cheers again,
Matt

This email was sent to you by Thomson Reuters, the global news and information company. Any views expressed in this message are those of the individual sender, except where the sender specifically states them to be the views of Thomson Reuters.
Re: Joining on multi valued fields
On Thu, Aug 4, 2011 at 11:21 AM, matthew.fow...@thomsonreuters.com wrote:
> Hi Yonik
>
> So I tested the join using the sample data below and the latest trunk.
> I still got the same behaviour. HOWEVER! In this case it was nothing to
> do with the patch or solr version. It was the tokeniser splitting G1
> into G and 1.

Ah, glad you figured it out!

> So thank you for a nice patch and your suggestions. I do have a couple
> of questions for you: At what level does the join happen and what do
> you expect the performance penalty to be. We might use this extensively
> if the performance penalty isn't great.

With the current implementation, the performance is proportional to the number of unique terms in the fields being joined.

-Yonik
http://www.lucidimagination.com
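The two points in this exchange (the tokeniser splitting G1, and join cost growing with the number of unique terms) can be illustrated with a small Python sketch. It is not Solr's implementation: `keyword_tokens`, `split_tokens`, `build_index`, `join` and the document ids `cat-1` to `cat-3` are hypothetical names made up for illustration, and the splitting rule (break at letter/digit boundaries) is only an assumption about how the analyser behaved. The code values are the ones from the example posted in the thread.

```python
import re
from collections import defaultdict


def keyword_tokens(value):
    """Keep the whole value as one term (what an untokenised field would do)."""
    return [value]


def split_tokens(value):
    """Assumed analyser behaviour: split at letter/digit boundaries, 'G1' -> ['G', '1']."""
    return re.findall(r"[A-Za-z]+|[0-9]+", value)


def build_index(docs, field, tokenize):
    """Map each term of `field` to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        values = doc.get(field, [])
        if isinstance(values, str):
            values = [values]  # treat single-valued and multi-valued fields alike
        for value in values:
            for term in tokenize(value):
                index[term].add(doc_id)
    return index


def join(docs, from_field, to_field, from_ids, tokenize):
    """Ids of docs whose `to_field` shares a term with `from_field` of a doc in `from_ids`."""
    from_index = build_index(docs, from_field, tokenize)
    to_index = build_index(docs, to_field, tokenize)
    result = set()
    # One pass over every unique term of the from field, so the work in this
    # sketch grows with the number of unique terms.
    for term, postings in from_index.items():
        if postings & from_ids:
            result |= to_index.get(term, set())
    return result


# Hypothetical ids; field values copied from the example in the thread.
DOCS = {
    "rcs100": {"pid_rcs": "G1"},
    "cat-1": {"code": ["G1", "G7U", "GK", "ME7", "ME8", "MN", "MR"]},
    "cat-2": {"code": ["G1", "G7U", "GK", "ME7", "ME8", "MN", "MR"]},
    "cat-3": {"code": ["A2", "A5", "A9", "AN", "B125", "B126", "B130",
                       "BL63", "G41", "GK", "MZ"]},
}

print(sorted(join(DOCS, "pid_rcs", "code", {"rcs100"}, keyword_tokens)))
print(sorted(join(DOCS, "pid_rcs", "code", {"rcs100"}, split_tokens)))
```

With whole-value terms, only the two documents containing G1 come back. With the splitting analyser, the shared term G also pulls in the document containing G41, which is consistent with the three results reported.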
RE: Joining on multi valued fields
Hi Yonik

Sorry for my late reply. I have been trying to get to the bottom of this but I'm getting inconsistent behaviour. Here's an example:

Query = pi:rcs100 - Here I am going to use pid_rcs as the join value

  <result name="response" numFound="1" start="0">
    <doc>
      <str name="pi">rcs100</str>
      <str name="ct">rcs</str>
      <str name="pid_rcs">G1</str>
      <str name="name_rcs">Emerging Market Countries</str>
      <str name="definition_rcs">All business events relating to companies and other issuers of securities.</str>
    </doc>
  </result>
  </response>

Query = code:G1 - See how many docs have G1 in their code field. Notice that code is multi valued

  <result name="response" numFound="2" start="0">
    <doc>
      <str name="ct">cat</str>
      <date name="maindocdate">2011-04-22T05:48:57Z</date>
      <str name="pi">nCIF3wGpXk+1029782</str>
      <arr name="code">
        <str>G1</str>
        <str>G7U</str>
        <str>GK</str>
        <str>ME7</str>
        <str>ME8</str>
        <str>MN</str>
        <str>MR</str>
      </arr>
    </doc>
    <doc>
      <str name="ct">cat</str>
      <date name="maindocdate">2011-04-22T05:48:57Z</date>
      <str name="pi">nCIF7YcLP+1029782</str>
      <arr name="code">
        <str>G1</str>
        <str>G7U</str>
        <str>GK</str>
        <str>ME7</str>
        <str>ME8</str>
        <str>MN</str>
        <str>MR</str>
      </arr>
    </doc>
  </result>
  </response>

Now for the join:

http://10.15.39.137:8983/solr/file/select?q={!join from=pid_rcs to=code}pi:rcs100

  <result name="response" numFound="3" start="0">
    <doc>
      <str name="ct">cat</str>
      <date name="maindocdate">2011-04-22T05:48:57Z</date>
      <str name="pi">nCIF3wGpXk+1029782</str>
      <arr name="code">
        <str>G1</str>
        <str>G7U</str>
        <str>GK</str>
        <str>ME7</str>
        <str>ME8</str>
        <str>MN</str>
        <str>MR</str>
      </arr>
    </doc>
    <doc>
      <str name="ct">cat</str>
      <date name="maindocdate">2011-04-22T05:48:57Z</date>
      <str name="pi">nCIF7YcLP+1029782</str>
      <arr name="code">
        <str>G1</str>
        <str>G7U</str>
        <str>GK</str>
        <str>ME7</str>
        <str>ME8</str>
        <str>MN</str>
        <str>MR</str>
      </arr>
    </doc>
    <doc>
      <str name="ct">cat</str>
      <date name="maindocdate">2011-04-22T05:48:58Z</date>
      <str name="pi">nCN1763203+1029782</str>
      <arr name="code">
        <str>A2</str>
        <str>A5</str>
        <str>A9</str>
        <str>AN</str>
        <str>B125</str>
        <str>B126</str>
        <str>B130</str>
        <str>BL63</str>
        <str>G41</str>
        <str>GK</str>
        <str>MZ</str>
      </arr>
    </doc>
  </result>
  </response>

So as you can see I get back 3 results when only 2 match the criteria, i.e. docs where G1 is present in the multi-valued code field. Why should the last document be included in the result of the join?

Thank you,
Matt

-----Original Message-----
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: 01 August 2011 18:28
To: solr-user@lucene.apache.org
Subject: Re: Joining on multi valued fields

On Mon, Aug 1, 2011 at 12:58 PM, matthew.fow...@thomsonreuters.com wrote:
> I have been using the JOIN patch
> https://issues.apache.org/jira/browse/SOLR-2272 with great success.
>
> However I have hit a case where it doesn't seem to be working. It
> doesn't seem to work when joining to a multi-valued field.

That should work (and the unit tests do test with multi-valued fields). Can you come up with a simple example where you are not getting the expected results?

-Yonik
http://www.lucidimagination.com
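The join URL above is shown with raw braces, spaces and `!` in the query string; some HTTP clients expect these to be percent-encoded. A minimal Python sketch of building such a request URL follows. The helper name `join_select_url` and the `http://localhost:8983/solr/select` base URL are assumptions for illustration only; the `{!join from=... to=...}` syntax and the field names are taken from the message.

```python
from urllib.parse import urlencode


def join_select_url(base_url, from_field, to_field, inner_query):
    """Build a select URL whose q parameter is a percent-encoded join query."""
    # Local-params prefix as written in the thread: {!join from=X to=Y}
    local_params = "{!join from=%s to=%s}" % (from_field, to_field)
    # urlencode percent-encodes the braces, '!', '=' and ':' and turns spaces into '+'
    return base_url + "?" + urlencode({"q": local_params + inner_query})


# Hypothetical base URL; substitute the address of your own Solr instance.
url = join_select_url("http://localhost:8983/solr/select",
                      "pid_rcs", "code", "pi:rcs100")
print(url)
```

Encoding the whole `q` value in one step avoids having to escape the local-params prefix by hand.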
Re: Joining on multi valued fields
Hmmm, if these are real responses from a Solr server at rest (i.e. documents not being changed between queries) then what you show definitely looks like a bug.

That's interesting, since TestJoin implements a random test that should cover cases like this pretty well.

I assume you are using a version of trunk (4.0-dev) and not just the actual patch attached to the JIRA issue (which IIRC had at least one bug... SOLR-2521). Have you tried a more recent version of trunk?

-Yonik
http://www.lucidimagination.com

On Wed, Aug 3, 2011 at 7:00 AM, matthew.fow...@thomsonreuters.com wrote:
> Hi Yonik
>
> Sorry for my late reply. I have been trying to get to the bottom of
> this but I'm getting inconsistent behaviour.
> [...]
> So as you can see I get back 3 results when only 2 match the criteria.
> i.e. docs where G1 is present in multi valued code field. Why should
> the last document be included in the result of the join?
RE: Joining on multi valued fields
No I haven't. I will get the latest out of the trunk and report back.

Cheers again,
Matt

-----Original Message-----
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: 03 August 2011 14:51
To: Fowler, Matthew (Markets Eikon)
Cc: solr-user@lucene.apache.org
Subject: Re: Joining on multi valued fields

Hmmm, if these are real responses from a Solr server at rest (i.e. documents not being changed between queries) then what you show definitely looks like a bug.

That's interesting, since TestJoin implements a random test that should cover cases like this pretty well.

I assume you are using a version of trunk (4.0-dev) and not just the actual patch attached to the JIRA issue (which IIRC had at least one bug... SOLR-2521). Have you tried a more recent version of trunk?

-Yonik
http://www.lucidimagination.com
Joining on multi valued fields
Hi List

I have been using the JOIN patch https://issues.apache.org/jira/browse/SOLR-2272 with great success.

However I have hit a case where it doesn't seem to be working. It doesn't seem to work when joining to a multi-valued field.

Has anyone any experience using the patch to join on multi-valued fields?

Thanks,
Matt
Re: Joining on multi valued fields
On Mon, Aug 1, 2011 at 12:58 PM, matthew.fow...@thomsonreuters.com wrote:
> I have been using the JOIN patch
> https://issues.apache.org/jira/browse/SOLR-2272 with great success.
>
> However I have hit a case where it doesn't seem to be working. It
> doesn't seem to work when joining to a multi-valued field.

That should work (and the unit tests do test with multi-valued fields). Can you come up with a simple example where you are not getting the expected results?

-Yonik
http://www.lucidimagination.com