RE: Joining on multi valued fields

2011-08-04 Thread matthew . fowler
Hi Yonik

So I tested the join using the sample data below and the latest trunk. I still 
got the same behaviour.

HOWEVER! In this case it was nothing to do with the patch or solr version. It 
was the tokeniser splitting G1 into G and 1.

So thank you for a nice patch and your suggestions.

I do have a couple of questions for you: At what level does the join happen and 
what do you expect the performance penalty to be. We might use this extensively 
if the performance penalty isn't great.

Thanks again,

Matt

-Original Message-
From: Fowler, Matthew (Markets Eikon) 
Sent: 03 August 2011 15:04
To: yo...@lucidimagination.com
Cc: solr-user@lucene.apache.org
Subject: RE: Joining on multi valued fields

No I haven't. I will get the latest out of the trunk and report back.

Cheers again,

Matt

-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: 03 August 2011 14:51
To: Fowler, Matthew (Markets Eikon)
Cc: solr-user@lucene.apache.org
Subject: Re: Joining on multi valued fields

Hmmm, if these are real responses from a solr server at rest (i.e.
documents not being changed between queries) then what you show
definitely looks like a bug.
That's interesting, since TestJoin implements a random test that
should cover cases like this pretty well.

I assume you are using a version of trunk (4.0-dev) and not just the
actual attached to the JIRA issue (which IIRC had at least one bug...
SOLR-2521).
Have you tried a more recent version of trunk?

-Yonik
http://www.lucidimagination.com



On Wed, Aug 3, 2011 at 7:00 AM,  matthew.fow...@thomsonreuters.com wrote:
 Hi Yonik

 Sorry for my late reply. I have been trying to get to the bottom of this
 but I'm getting inconsistent behaviour. Here's an example:

 Query = pi:rcs100     -       Here going to use pid_rcs as join
 value

 result name=response numFound=1 start=0
  doc
  str name=pircs100/str
  str name=ctrcs/str
  str name=pid_rcsG1/str
  str name=name_rcsEmerging Market Countries/str
  str name=definition_rcsAll business events relating to companies
 and other issuers of securities./str
  /doc
  /result
  /response

 Query = code:G1       -       See how many docs have G1 in their
 code field. Notice that code is multi valued

 - result name=response numFound=2 start=0
 - doc
  str name=ctcat/str
  date name=maindocdate2011-04-22T05:48:57Z/date
  str name=pinCIF3wGpXk+1029782/str
 - arr name=code
  strG1/str
  strG7U/str
  strGK/str
  strME7/str
  strME8/str
  strMN/str
  strMR/str
  /arr
  /doc
 - doc
  str name=ctcat/str
  date name=maindocdate2011-04-22T05:48:57Z/date
  str name=pinCIF7YcLP+1029782/str
 - arr name=code
  strG1/str
  strG7U/str
  strGK/str
  strME7/str
  strME8/str
  strMN/str
  strMR/str
  /arr
  /doc
  /result
  /response

 Now for the join: http://10.15.39.137:8983/solr/file/select?q={!join
 from=pid_rcs to=code}pi:rcs100

 - result name=response numFound=3 start=0
 - doc
  str name=ctcat/str
  date name=maindocdate2011-04-22T05:48:57Z/date
  str name=pinCIF3wGpXk+1029782/str
 - arr name=code
  strG1/str
  strG7U/str
  strGK/str
  strME7/str
  strME8/str
  strMN/str
  strMR/str
  /arr
  /doc
 - doc
  str name=ctcat/str
  date name=maindocdate2011-04-22T05:48:57Z/date
  str name=pinCIF7YcLP+1029782/str
 - arr name=code
  strG1/str
  strG7U/str
  strGK/str
  strME7/str
  strME8/str
  strMN/str
  strMR/str
  /arr
  /doc
 - doc
  str name=ctcat/str
  date name=maindocdate2011-04-22T05:48:58Z/date
  str name=pinCN1763203+1029782/str
 - arr name=code
  strA2/str
  strA5/str
  strA9/str
  strAN/str
  strB125/str
  strB126/str
  strB130/str
  strBL63/str
  strG41/str
  strGK/str
  strMZ/str
  /arr
  /doc
  /result
  /response

 So as you can see I get back 3 results when only 2 match the criteria.
 i.e. docs where G1 is present in multi valued code field. Why should
 the last document be included in the result of the join?

 Thank you,

 Matt


 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
 Seeley
 Sent: 01 August 2011 18:28
 To: solr-user@lucene.apache.org
 Subject: Re: Joining on multi valued fields

 On Mon, Aug 1, 2011 at 12:58 PM,  matthew.fow...@thomsonreuters.com
 wrote:
 I have been using the JOIN patch
 https://issues.apache.org/jira/browse/SOLR-2272 with great success.

 However I have hit a case where it doesn't seem to be working. It
 doesn't seem to work when joining to a multi-valued field.

 That should work (and the unit tests do test with multi-valued fields).
 Can you come up with a simple example where you are not getting the
 expected results?

 -Yonik
 http://www.lucidimagination.com

 This email was sent to you by Thomson Reuters, the global news and 
 information company. Any views expressed in this message are those of the 
 individual sender, except where the sender specifically states them to be the 
 views of Thomson Reuters.


This email was sent to you by Thomson Reuters, the global news and information

Re: Joining on multi valued fields

2011-08-04 Thread Yonik Seeley
On Thu, Aug 4, 2011 at 11:21 AM,  matthew.fow...@thomsonreuters.com wrote:
 Hi Yonik

 So I tested the join using the sample data below and the latest trunk. I 
 still got the same behaviour.

 HOWEVER! In this case it was nothing to do with the patch or solr version. It 
 was the tokeniser splitting G1 into G and 1.

Ah, glad you figured it out!

 So thank you for a nice patch and your suggestions.

 I do have a couple of questions for you: At what level does the join happen 
 and what do you expect the performance penalty to be. We might use this 
 extensively if the performance penalty isn't great.

With the current implementation, the performance is proportional to
the number of unique terms in the fields being joined.

-Yonik
http://www.lucidimagination.com


RE: Joining on multi valued fields

2011-08-03 Thread matthew . fowler
Hi Yonik

Sorry for my late reply. I have been trying to get to the bottom of this
but I'm getting inconsistent behaviour. Here's an example:

Query = pi:rcs100 -   Here going to use pid_rcs as join
value

result name=response numFound=1 start=0
 doc
  str name=pircs100/str 
  str name=ctrcs/str 
  str name=pid_rcsG1/str 
  str name=name_rcsEmerging Market Countries/str 
  str name=definition_rcsAll business events relating to companies
and other issuers of securities./str 
  /doc
  /result
  /response

Query = code:G1   -   See how many docs have G1 in their
code field. Notice that code is multi valued 

- result name=response numFound=2 start=0
- doc
  str name=ctcat/str 
  date name=maindocdate2011-04-22T05:48:57Z/date 
  str name=pinCIF3wGpXk+1029782/str 
- arr name=code
  strG1/str 
  strG7U/str 
  strGK/str 
  strME7/str 
  strME8/str 
  strMN/str 
  strMR/str 
  /arr
  /doc
- doc
  str name=ctcat/str 
  date name=maindocdate2011-04-22T05:48:57Z/date 
  str name=pinCIF7YcLP+1029782/str 
- arr name=code
  strG1/str 
  strG7U/str 
  strGK/str 
  strME7/str 
  strME8/str 
  strMN/str 
  strMR/str 
  /arr
  /doc
  /result
  /response

Now for the join: http://10.15.39.137:8983/solr/file/select?q={!join
from=pid_rcs to=code}pi:rcs100

- result name=response numFound=3 start=0
- doc
  str name=ctcat/str 
  date name=maindocdate2011-04-22T05:48:57Z/date 
  str name=pinCIF3wGpXk+1029782/str 
- arr name=code
  strG1/str 
  strG7U/str 
  strGK/str 
  strME7/str 
  strME8/str 
  strMN/str 
  strMR/str 
  /arr
  /doc
- doc
  str name=ctcat/str 
  date name=maindocdate2011-04-22T05:48:57Z/date 
  str name=pinCIF7YcLP+1029782/str 
- arr name=code
  strG1/str 
  strG7U/str 
  strGK/str 
  strME7/str 
  strME8/str 
  strMN/str 
  strMR/str 
  /arr
  /doc
- doc
  str name=ctcat/str 
  date name=maindocdate2011-04-22T05:48:58Z/date 
  str name=pinCN1763203+1029782/str 
- arr name=code
  strA2/str 
  strA5/str 
  strA9/str 
  strAN/str 
  strB125/str 
  strB126/str 
  strB130/str 
  strBL63/str 
  strG41/str 
  strGK/str 
  strMZ/str 
  /arr
  /doc
  /result
  /response

So as you can see I get back 3 results when only 2 match the criteria.
i.e. docs where G1 is present in multi valued code field. Why should
the last document be included in the result of the join?

Thank you,

Matt


-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
Seeley
Sent: 01 August 2011 18:28
To: solr-user@lucene.apache.org
Subject: Re: Joining on multi valued fields

On Mon, Aug 1, 2011 at 12:58 PM,  matthew.fow...@thomsonreuters.com
wrote:
 I have been using the JOIN patch
 https://issues.apache.org/jira/browse/SOLR-2272 with great success.

 However I have hit a case where it doesn't seem to be working. It
 doesn't seem to work when joining to a multi-valued field.

That should work (and the unit tests do test with multi-valued fields).
Can you come up with a simple example where you are not getting the
expected results?

-Yonik
http://www.lucidimagination.com

This email was sent to you by Thomson Reuters, the global news and information 
company. Any views expressed in this message are those of the individual 
sender, except where the sender specifically states them to be the views of 
Thomson Reuters.


Re: Joining on multi valued fields

2011-08-03 Thread Yonik Seeley
Hmmm, if these are real responses from a solr server at rest (i.e.
documents not being changed between queries) then what you show
definitely looks like a bug.
That's interesting, since TestJoin implements a random test that
should cover cases like this pretty well.

I assume you are using a version of trunk (4.0-dev) and not just the
actual attached to the JIRA issue (which IIRC had at least one bug...
SOLR-2521).
Have you tried a more recent version of trunk?

-Yonik
http://www.lucidimagination.com



On Wed, Aug 3, 2011 at 7:00 AM,  matthew.fow...@thomsonreuters.com wrote:
 Hi Yonik

 Sorry for my late reply. I have been trying to get to the bottom of this
 but I'm getting inconsistent behaviour. Here's an example:

 Query = pi:rcs100     -       Here going to use pid_rcs as join
 value

 result name=response numFound=1 start=0
  doc
  str name=pircs100/str
  str name=ctrcs/str
  str name=pid_rcsG1/str
  str name=name_rcsEmerging Market Countries/str
  str name=definition_rcsAll business events relating to companies
 and other issuers of securities./str
  /doc
  /result
  /response

 Query = code:G1       -       See how many docs have G1 in their
 code field. Notice that code is multi valued

 - result name=response numFound=2 start=0
 - doc
  str name=ctcat/str
  date name=maindocdate2011-04-22T05:48:57Z/date
  str name=pinCIF3wGpXk+1029782/str
 - arr name=code
  strG1/str
  strG7U/str
  strGK/str
  strME7/str
  strME8/str
  strMN/str
  strMR/str
  /arr
  /doc
 - doc
  str name=ctcat/str
  date name=maindocdate2011-04-22T05:48:57Z/date
  str name=pinCIF7YcLP+1029782/str
 - arr name=code
  strG1/str
  strG7U/str
  strGK/str
  strME7/str
  strME8/str
  strMN/str
  strMR/str
  /arr
  /doc
  /result
  /response

 Now for the join: http://10.15.39.137:8983/solr/file/select?q={!join
 from=pid_rcs to=code}pi:rcs100

 - result name=response numFound=3 start=0
 - doc
  str name=ctcat/str
  date name=maindocdate2011-04-22T05:48:57Z/date
  str name=pinCIF3wGpXk+1029782/str
 - arr name=code
  strG1/str
  strG7U/str
  strGK/str
  strME7/str
  strME8/str
  strMN/str
  strMR/str
  /arr
  /doc
 - doc
  str name=ctcat/str
  date name=maindocdate2011-04-22T05:48:57Z/date
  str name=pinCIF7YcLP+1029782/str
 - arr name=code
  strG1/str
  strG7U/str
  strGK/str
  strME7/str
  strME8/str
  strMN/str
  strMR/str
  /arr
  /doc
 - doc
  str name=ctcat/str
  date name=maindocdate2011-04-22T05:48:58Z/date
  str name=pinCN1763203+1029782/str
 - arr name=code
  strA2/str
  strA5/str
  strA9/str
  strAN/str
  strB125/str
  strB126/str
  strB130/str
  strBL63/str
  strG41/str
  strGK/str
  strMZ/str
  /arr
  /doc
  /result
  /response

 So as you can see I get back 3 results when only 2 match the criteria.
 i.e. docs where G1 is present in multi valued code field. Why should
 the last document be included in the result of the join?

 Thank you,

 Matt


 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
 Seeley
 Sent: 01 August 2011 18:28
 To: solr-user@lucene.apache.org
 Subject: Re: Joining on multi valued fields

 On Mon, Aug 1, 2011 at 12:58 PM,  matthew.fow...@thomsonreuters.com
 wrote:
 I have been using the JOIN patch
 https://issues.apache.org/jira/browse/SOLR-2272 with great success.

 However I have hit a case where it doesn't seem to be working. It
 doesn't seem to work when joining to a multi-valued field.

 That should work (and the unit tests do test with multi-valued fields).
 Can you come up with a simple example where you are not getting the
 expected results?

 -Yonik
 http://www.lucidimagination.com

 This email was sent to you by Thomson Reuters, the global news and 
 information company. Any views expressed in this message are those of the 
 individual sender, except where the sender specifically states them to be the 
 views of Thomson Reuters.



RE: Joining on multi valued fields

2011-08-03 Thread matthew . fowler
No I haven't. I will get the latest out of the trunk and report back.

Cheers again,

Matt

-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: 03 August 2011 14:51
To: Fowler, Matthew (Markets Eikon)
Cc: solr-user@lucene.apache.org
Subject: Re: Joining on multi valued fields

Hmmm, if these are real responses from a solr server at rest (i.e.
documents not being changed between queries) then what you show
definitely looks like a bug.
That's interesting, since TestJoin implements a random test that
should cover cases like this pretty well.

I assume you are using a version of trunk (4.0-dev) and not just the
actual attached to the JIRA issue (which IIRC had at least one bug...
SOLR-2521).
Have you tried a more recent version of trunk?

-Yonik
http://www.lucidimagination.com



On Wed, Aug 3, 2011 at 7:00 AM,  matthew.fow...@thomsonreuters.com wrote:
 Hi Yonik

 Sorry for my late reply. I have been trying to get to the bottom of this
 but I'm getting inconsistent behaviour. Here's an example:

 Query = pi:rcs100     -       Here going to use pid_rcs as join
 value

 result name=response numFound=1 start=0
  doc
  str name=pircs100/str
  str name=ctrcs/str
  str name=pid_rcsG1/str
  str name=name_rcsEmerging Market Countries/str
  str name=definition_rcsAll business events relating to companies
 and other issuers of securities./str
  /doc
  /result
  /response

 Query = code:G1       -       See how many docs have G1 in their
 code field. Notice that code is multi valued

 - result name=response numFound=2 start=0
 - doc
  str name=ctcat/str
  date name=maindocdate2011-04-22T05:48:57Z/date
  str name=pinCIF3wGpXk+1029782/str
 - arr name=code
  strG1/str
  strG7U/str
  strGK/str
  strME7/str
  strME8/str
  strMN/str
  strMR/str
  /arr
  /doc
 - doc
  str name=ctcat/str
  date name=maindocdate2011-04-22T05:48:57Z/date
  str name=pinCIF7YcLP+1029782/str
 - arr name=code
  strG1/str
  strG7U/str
  strGK/str
  strME7/str
  strME8/str
  strMN/str
  strMR/str
  /arr
  /doc
  /result
  /response

 Now for the join: http://10.15.39.137:8983/solr/file/select?q={!join
 from=pid_rcs to=code}pi:rcs100

 - result name=response numFound=3 start=0
 - doc
  str name=ctcat/str
  date name=maindocdate2011-04-22T05:48:57Z/date
  str name=pinCIF3wGpXk+1029782/str
 - arr name=code
  strG1/str
  strG7U/str
  strGK/str
  strME7/str
  strME8/str
  strMN/str
  strMR/str
  /arr
  /doc
 - doc
  str name=ctcat/str
  date name=maindocdate2011-04-22T05:48:57Z/date
  str name=pinCIF7YcLP+1029782/str
 - arr name=code
  strG1/str
  strG7U/str
  strGK/str
  strME7/str
  strME8/str
  strMN/str
  strMR/str
  /arr
  /doc
 - doc
  str name=ctcat/str
  date name=maindocdate2011-04-22T05:48:58Z/date
  str name=pinCN1763203+1029782/str
 - arr name=code
  strA2/str
  strA5/str
  strA9/str
  strAN/str
  strB125/str
  strB126/str
  strB130/str
  strBL63/str
  strG41/str
  strGK/str
  strMZ/str
  /arr
  /doc
  /result
  /response

 So as you can see I get back 3 results when only 2 match the criteria.
 i.e. docs where G1 is present in multi valued code field. Why should
 the last document be included in the result of the join?

 Thank you,

 Matt


 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
 Seeley
 Sent: 01 August 2011 18:28
 To: solr-user@lucene.apache.org
 Subject: Re: Joining on multi valued fields

 On Mon, Aug 1, 2011 at 12:58 PM,  matthew.fow...@thomsonreuters.com
 wrote:
 I have been using the JOIN patch
 https://issues.apache.org/jira/browse/SOLR-2272 with great success.

 However I have hit a case where it doesn't seem to be working. It
 doesn't seem to work when joining to a multi-valued field.

 That should work (and the unit tests do test with multi-valued fields).
 Can you come up with a simple example where you are not getting the
 expected results?

 -Yonik
 http://www.lucidimagination.com

 This email was sent to you by Thomson Reuters, the global news and 
 information company. Any views expressed in this message are those of the 
 individual sender, except where the sender specifically states them to be the 
 views of Thomson Reuters.


This email was sent to you by Thomson Reuters, the global news and information 
company. Any views expressed in this message are those of the individual 
sender, except where the sender specifically states them to be the views of 
Thomson Reuters.


Joining on multi valued fields

2011-08-01 Thread matthew . fowler
Hi List

 

I have been using the JOIN patch
https://issues.apache.org/jira/browse/SOLR-2272 with great success.

 

However I have hit a case where it doesn't seem to be working. It
doesn't seem to work when joining to a multi-valued field.

 

Has anyone any experience using the patch to join on multi-valued
fields?

 

Thanks,

 

Matt


This email was sent to you by Thomson Reuters, the global news and information 
company. Any views expressed in this message are those of the individual 
sender, except where the sender specifically states them to be the views of 
Thomson Reuters.

Re: Joining on multi valued fields

2011-08-01 Thread Yonik Seeley
On Mon, Aug 1, 2011 at 12:58 PM,  matthew.fow...@thomsonreuters.com wrote:
 I have been using the JOIN patch
 https://issues.apache.org/jira/browse/SOLR-2272 with great success.

 However I have hit a case where it doesn't seem to be working. It
 doesn't seem to work when joining to a multi-valued field.

That should work (and the unit tests do test with multi-valued fields).
Can you come up with a simple example where you are not getting the
expected results?

-Yonik
http://www.lucidimagination.com