Hi Edward,

Thanks a lot for your reply!

Subquery is what I had in mind, too, for designs 1) and 3) to bring
back the other side of the relationship. Except that I always queries
movies and subqueried users.

If I do it the other way around, like you did, then I'm able to
filter. I can't quite filter by date (didn't manage to get parameter
substitution AND date parsing to work), but if I index dates are UNIX
timestamps (in a plong field), then it works with something like:

q=family:Smith&fl=*,movies:[subquery]&movies.q={!terms f=id
v=$row.watched_movies}&movies.fq={!frange l=$row.born}release_date

The only trouble is, now I can't facet on movie metadata and I can't
get the total number of movies. Or maybe I can and I don't know how?
Besides going through all results (e.g. via a streaming expression).

Still, for getting top N results, this should work. Thanks again for the idea!

Sharding is doable, in a few ways that I can see:
- if data is "normalized", then the "subquery" side has to be a single
shard replicated to all nodes. This won't work if the subquery looks
at movies, because there are tons of movies
- but I'm thinking I could use the new XCJF query parser and have
everything sharded: https://issues.apache.org/jira/browse/SOLR-13749
- or, if I denormalize like in 3) then I could throw everything in one
collection and route by movie ID. Or keep movies and users in separate
collections and still route by movie ID. But this means more expensive
updates/indexing :(

Best regards,
Radu

On Wed, Feb 5, 2020 at 1:00 AM Edward Ribeiro <edward.ribe...@gmail.com> wrote:
>
> Just for the sake of an imagined scenario, you could use the [subquery] doc
> transformer. A query like the one below:
>
> /select?q=family: Smith&fq=watched_movies:[* TO *]&fl=*,
> movies:[subquery]&movies.q={!terms f=id v=$row.watched_movies}
>
> Would bring back the results below:
>
> { "responseHeader":{
>     "status":0,
>     "QTime":0,
>     "params":{
>       "movies.q":"{!terms f=id v=$row.watched_movies}",
>       "q":"family: Smith",
>       "fl":"*, movies:[subquery]",
>       "fq":"watched_movies:[* TO *]"}},
>   "response":{"numFound":2,"start":0,"docs":[
>       {
>         "id":"user_1",
>         "name":["Jane"],
>         "family":["Smith"],
>         "born":["1990-01-01T00:00:00Z"],
>         "watched_movies":["1",
>           "3"],
>         "_version_":1657646162820202496,
>         "movies":{"numFound":2,"start":0,"docs":[
>             {
>               "id":"1",
>               "title":["Rambo 1"],
>               "release_date":["1978-01-01T00:00:00Z"],
>               "_version_":1657646123722997760},
>             {
>               "id":"3",
>               "title":["300 Spartaaaaaans"],
>               "release_date":["2005-01-01T00:00:00Z"],
>               "_version_":1657646123726143488}]
>         }},
>       {
>         "id":"user_2",
>         "title":["Joe"],
>         "family":["Smith"],
>         "born":["1970-01-01T00:00:00Z"],
>         "watched_movies":["2"],
>         "_version_":1657646162827542528,
>         "movies":{"numFound":1,"start":0,"docs":[
>             {
>               "id":"2",
>               "title":["Rambo 5"],
>               "release_date":["1998-01-01T00:00:00Z"],
>               "_version_":1657646123725094912}]
>         }}]
>   }}
>
> But I wasn't able to filter on date (I could filter a specific date
> using movies.fq={!term f=release_date v=2005-01-01T00:00:00Z} but not on
> range) nor could I perform facets in the children of the above example. It
> probably only works on a single node too. Finally, there a couple of
> parameters that can be important but that I ommited for the sake of brevity
> and clarity: movies.limit=100 and movies.sort=release_date DESC
>
>
> Best,
> Edward
>
>
>
> On Tue, Feb 4, 2020 at 11:17 AM Radu Gheorghe <radu.gheor...@sematext.com>
> wrote:
> >
> > Hello Solr users,
> >
> > How would you design a filtered join scenario?
> >
> > Say I have a bunch of movies (excuse any inaccuracies, this is an
> > imagined scenario):
> >
> > curl -XPOST -H 'Content-Type: application/json'
> > 'localhost:8983/solr/test/update?commitWithin=1000' --data-binary '
> > [{
> > "id": "1",
> > "title": "Rambo 1",
> > "release_date": "1978-01-01"
> > },
> > {
> > "id": "2",
> > "title": "Rambo 5",
> > "release_date": "1998-01-01"
> > },
> > {
> > "id": "3",
> >     "title": "300 Spartaaaaaans",
> > "release_date": "2005-01-01"
> > }]'
> >
> > And a bunch of users of certain families who watched those movies:
> >
> > curl -XPOST -H 'Content-Type: application/json'
> > 'localhost:8983/solr/test/update?commitWithin=1000' --data-binary '
> > [{
> > "id": "user_1",
> > "name": "Jane",
> > "family": "Smith",
> > "born": "1990-01-01",
> > "watched_movies": ["1", "3"]
> > },
> > {
> > "id": "user_2",
> > "title": "Joe",
> > "family": "Smith",
> > "born": "1970-01-01",
> > "watched_movies": ["2"]
> > },
> > {
> > "id": "user_3",
> > "title": "Radu",
> > "family": "Gheorghe,
> > "born": "1985-01-01",
> > "watched_movies": ["1", "2", "3"]
> > }]'
> >
> > They don't have to be in the same collection. The important question
> > is how to get:
> > - movies watched by user of family Smith
> > - after they were born
> > - including the matching users
> > - I'd like to be able to facet on movie metadata, but I don't need to
> > facet on user metadata, just to be able to retrieve those fields
> >
> > The above query should bring back Rambo 5 and 300, with Joe and Jane
> > respectively. I wouldn't get Rambo 1, because although Jane watched
> > it, the movie was released before she was born.
> >
> > Here are some options that I have in mind:
> > 1) using the join query parser (or the newer XCJF) to do the join
> > itself. Then have some sort of plugin pull the "born" value or each
> > corresponding user (via some subquery) and filter movies afterwards.
> > Normalized, but likely painfully slow
> >
> > 2) similar approach with 1), in a streaming expression. Again,
> > normalized, but slow (we're talking billions of movies, millions of
> > users). And limited support for facets.
> >
> > 3) have some sort of denormalization. For example, pre-compute
> > matching users for every movie, then just use join/XCJF to do the
> > actual join. This makes indexing/updates expensive and potentially
> > complicated
> >
> > 4) normalization with nested documents. This is best for searches, but
> > pretty much a no-go for indexing/updates. In this imaginary use-case,
> > there are binge-watchers who might watch a billion movies in a week,
> > making us reindex everything
> >
> > Do you see better ways?
> >
> > Thanks in advance and best regards,
> > Radu

Reply via email to