Re: Handling intersection facets of many values

2014-11-20 Thread Toke Eskildsen
On Wed, 2014-11-19 at 23:53 +0100, Peter Sturge wrote:
 Yes, the 'lots-of-booleans' thing is a bit prohibitive as it won't
 realistically scale to large value sets.

"Large" is extremely relative in Solr land, but I would be wary of
going beyond 10K.

 127.0.0.1:8983/solr/net/select?q=*:*&fl=dest&fl=src&facet=true&fq={!join
 from=addr to=dest
 fromIndex=targets}*&facet.field=src&facet.field=dest&facet.mincount=1&facet.limit=-1&facet.sort=count&rows=0

Ah! fromIndex. I missed that. Thanks for following up with the full
solution.

- Toke Eskildsen, State and University Library, Denmark




Re: Handling intersection facets of many values

2014-11-20 Thread Michael Sokolov
If you're willing to write some Java, you can do something more efficient
by intersecting two terms enumerations. This works with constant memory
for any number of values in two fields: it is basically like intersecting
any two sorted lists, where you leapfrog between them. I have an example if
you're interested (I was finding compounds by indexing shingles and
intersecting with regular word terms), but there isn't any support for
using it in a query, or as part of Solr; it's just an offline kind of
thing you can run against your index.
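
For illustration only, here is a minimal Python sketch of the sorted-list
intersection described above. It is not the Java code mentioned in this
message: the function name intersect_sorted is hypothetical, and it assumes
both inputs are already sorted and free of duplicates, as the terms of a
field would be. It steps one value at a time; a real terms enumeration could
seek ahead to the other side's current value instead.

```python
from typing import Iterable, Iterator


def intersect_sorted(a: Iterable[str], b: Iterable[str]) -> Iterator[str]:
    """Yield the values present in both sorted, duplicate-free inputs.

    Only the current value of each input is held, so memory use is
    constant regardless of how many values each side contains.
    """
    it_a, it_b = iter(a), iter(b)
    x = next(it_a, None)
    y = next(it_b, None)
    while x is not None and y is not None:
        if x == y:
            yield x                  # value occurs on both sides
            x = next(it_a, None)
            y = next(it_b, None)
        elif x < y:
            x = next(it_a, None)     # left side is behind: advance it
        else:
            y = next(it_b, None)     # right side is behind: advance it


if __name__ == "__main__":
    addr_terms = ["a", "b", "c", "d", "e", "f"]
    dest_terms = ["a", "c", "e", "g", "z"]
    print(list(intersect_sorted(addr_terms, dest_terms)))
```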


-Mike


On 11/19/2014 5:53 PM, Peter Sturge wrote:

Hi Toke,
Yes, the 'lots-of-booleans' thing is a bit prohibitive as it won't
realistically scale to large value sets.

I've been wrestling with joins this evening and have managed to get these
working - and it works very nicely - and across cores (although not shards
yet afaik)!

For anyone looking to do this sort of facet intersecting, here's my query:
127.0.0.1:8983/solr/net/select?q=*:*&fl=dest&fl=src&facet=true&fq={!join
from=addr to=dest
fromIndex=targets}*&facet.field=src&facet.field=dest&facet.mincount=1&facet.limit=-1&facet.sort=count&rows=0

Thanks,
Peter


On Wed, Nov 19, 2014 at 9:23 PM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:


Peter Sturge [peter.stu...@gmail.com] wrote:

I guess you mean take the 1k or so values and build a boolean query from
them?

Not really. Let me try again:

1) Perform a facet call with facet.limit=-1 on dest to get the relevant
dest values.
The result will always be 1000 values or less. Take those values and
construct a filter query a OR b OR c.

2) Perform a facet call on addr with the original query + the newly
constructed filter query.
The facet response should now contain the intersection.

1000 is a bit close to the default limit for boolean queries, so you might
want to raise that.


I'm also looking at creating a custom QueryParser that would build the
relevant DocLists, then intersect them and return the values, [...]

You are describing a Join in Solr and that would likely solve your
problem, but it does not work across cores. Is it possible to have both the
addr and dest data in the same core?

- Toke Eskildsen





Handling intersection facets of many values

2014-11-19 Thread Peter Sturge
Hi Solr Group,

Got an interesting use case (to me, at least), perhaps someone could give
some insight on how best to achieve this?

I've got a core that has about 7 million entries, with a field called 'addr'.
By definition, every entry has a unique 'addr' value, so there are 7 million
unique values for this field.
I then have another core with ~20 million entries. These have a field called
'dest', and there may be, say, around 800-1000 unique values for 'dest'.
There's always a value, but the number of unique values varies.

So, the problem is this:
What is the best/only/most efficient way to construct a search whereby I
get back an (ideally faceted) list of values for 'dest' that occur in
'addr'?
Can I do this with just faceting (e.g. facet query or similar)? Or do I
need grouping?
Note, I don't actually need the documents themselves, only the list of
unique values that are the intersection of 'dest' and 'addr'.

Can anyone help shed some light on how best to do this?

Many thanks,
Peter


RE: Handling intersection facets of many values

2014-11-19 Thread Toke Eskildsen
Peter Sturge [peter.stu...@gmail.com] wrote:

[addr 7M unique, dest 1K unique]

 What is the best/only/most efficient way to construct a search whereby I
 get back an (ideally faceted) list of values for 'dest' that occur in
 'addr'?

I assume the actual values are defined by a query? As the number of possible 
values in dest is not that large, extracting those first and then using them as 
a filter when searching for addr seems like a fairly efficient way of solving 
the problem.

- Toke Eskildsen


Re: Handling intersection facets of many values

2014-11-19 Thread Peter Sturge
Hi Toke,
Thanks for your input.

I guess you mean take the 1k or so values and build a boolean query from
them?
If that's not what you mean, my apologies..
I'd thought of doing that - the trouble I had was that
the unique values could be 20k, or 15,167, or any arbitrary and potentially
high-ish number - it's not really known and can/will change over time. I
believe a boolean query with more than 1024 clauses can blow up the query, so
scalability is a concern.
The other issue is how this would yield the unique facet values -
e.g. dest=8.8.8.8 (17) [i.e. 8.8.8.8 is in the 'addr' list and occurs 17
times in entries with a 'dest' field] - in fact, I need the unique
value(s) ('8.8.8.8') more than I need the count ('17').

I could get the facet list of 'dest' values, then trawl through each one,
but this will be a complicated and time-consuming client-side operation.
I'm also looking at creating a custom QueryParser that would build the
relevant DocLists, then intersect them and return the values, but I
wouldn't want to reinvent the wheel if possible. Given that facets already
build unique term lists, it seems so close - I guess it's like taking two
facet lists (1 for addr, 1 for dest), intersecting them and returning the
result:

List 1:
a
b
c
d
e
f

List 2:
a
a
g
z
c
c
c
e

Resultant intersection:
a (2)
c (3)
e (1)
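
For reference, the same intersection can be written client-side in a few
lines of Python. This is only a sketch of the desired result, not a Solr
feature: the function name intersect_with_counts is hypothetical, and it
assumes both lists fit in memory on the client.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def intersect_with_counts(
    unique_values: Iterable[str], occurrences: Iterable[str]
) -> List[Tuple[str, int]]:
    """Count how often each value of `occurrences` appears,
    keeping only values that also exist in `unique_values`."""
    wanted = set(unique_values)
    counts = Counter(v for v in occurrences if v in wanted)
    return sorted(counts.items())


if __name__ == "__main__":
    list1 = ["a", "b", "c", "d", "e", "f"]             # like 'addr'
    list2 = ["a", "a", "g", "z", "c", "c", "c", "e"]   # like 'dest'
    for value, count in intersect_with_counts(list1, list2):
        print(f"{value} ({count})")
```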


Thanks,
Peter



On Wed, Nov 19, 2014 at 7:16 PM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:

 Peter Sturge [peter.stu...@gmail.com] wrote:

 [addr 7M unique, dest 1K unique]

  What is the best/only/most efficient way to construct a search whereby I
  get back an (ideally faceted) list of values for 'dest' that occur in
  'addr'?

 I assume the actual values are defined by a query? As the number of
 possible values in dest is not that large, extracting those first and then
 using them as a filter when searching for addr seems like a fairly
 efficient way of solving the problem.

 - Toke Eskildsen



RE: Handling intersection facets of many values

2014-11-19 Thread Toke Eskildsen
Peter Sturge [peter.stu...@gmail.com] wrote:
 I guess you mean take the 1k or so values and build a boolean query from
 them?

Not really. Let me try again:

1) Perform a facet call with facet.limit=-1 on dest to get the relevant dest 
values.
The result will always be 1000 values or less. Take those values and construct 
a filter query a OR b OR c.

2) Perform a facet call on addr with the original query + the newly constructed 
filter query.
The facet response should now contain the intersection.

1000 is a bit close to the default limit for boolean queries, so you might want 
to raise that.
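
A rough Python sketch of step 2, building the OR filter query from the
values returned by step 1. The helper names (escape_term,
build_filter_query, step2_params) are hypothetical, and the quoting shown is
an assumption about what the field's values need; it only builds the request
parameters and does not contact Solr.

```python
from typing import Sequence
from urllib.parse import urlencode


def escape_term(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_filter_query(field: str, values: Sequence[str]) -> str:
    """Build field:("a" OR "b" OR "c") from the step 1 facet values."""
    if not values:
        raise ValueError("no values to filter on")
    return f"{field}:(" + " OR ".join(escape_term(v) for v in values) + ")"


def step2_params(original_query: str, dest_values: Sequence[str]) -> str:
    """URL-encoded parameters for the facet call on addr (step 2)."""
    params = [
        ("q", original_query),
        ("fq", build_filter_query("addr", dest_values)),
        ("rows", "0"),
        ("facet", "true"),
        ("facet.field", "addr"),
        ("facet.mincount", "1"),
        ("facet.limit", "-1"),
    ]
    return urlencode(params)


if __name__ == "__main__":
    print(build_filter_query("addr", ["8.8.8.8", "10.0.0.1"]))
    print(step2_params("*:*", ["8.8.8.8", "10.0.0.1"]))
```

With around 1000 values the filter gets close to the boolean clause limit
mentioned above, so that limit would need raising before sending a query
built this way.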

 I'm also looking at creating a custom QueryParser that would build the
 relevant DocLists, then intersect them and return the values, [...]

You are describing a Join in Solr and that would likely solve your problem, but 
it does not work across cores. Is it possible to have both the addr and dest 
data in the same core?

- Toke Eskildsen


Re: Handling intersection facets of many values

2014-11-19 Thread Peter Sturge
Hi Toke,
Yes, the 'lots-of-booleans' thing is a bit prohibitive as it won't
realistically scale to large value sets.

I've been wrestling with joins this evening and have managed to get these
working - and it works very nicely - and across cores (although not shards
yet afaik)!

For anyone looking to do this sort of facet intersecting, here's my query:
127.0.0.1:8983/solr/net/select?q=*:*&fl=dest&fl=src&facet=true&fq={!join
from=addr to=dest
fromIndex=targets}*&facet.field=src&facet.field=dest&facet.mincount=1&facet.limit=-1&facet.sort=count&rows=0
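
Since the braces, spaces and '!' in the join filter need URL-encoding when
the request is sent from code, here is a small Python sketch that assembles
the same parameters. The function name join_facet_url and the http:// scheme
are assumptions; the parameter values are copied from the query above.

```python
from urllib.parse import urlencode


def join_facet_url(base: str = "http://127.0.0.1:8983/solr/net/select") -> str:
    """Return the cross-core join + facet request as an encoded URL."""
    params = [
        ("q", "*:*"),
        ("fl", "dest"),
        ("fl", "src"),
        ("facet", "true"),
        # Join from 'addr' in the 'targets' core to 'dest' in this core.
        ("fq", "{!join from=addr to=dest fromIndex=targets}*"),
        ("facet.field", "src"),
        ("facet.field", "dest"),
        ("facet.mincount", "1"),
        ("facet.limit", "-1"),
        ("facet.sort", "count"),
        ("rows", "0"),
    ]
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(join_facet_url())
```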

Thanks,
Peter


On Wed, Nov 19, 2014 at 9:23 PM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:

 Peter Sturge [peter.stu...@gmail.com] wrote:
  I guess you mean take the 1k or so values and build a boolean query from
  them?

 Not really. Let me try again:

 1) Perform a facet call with facet.limit=-1 on dest to get the relevant
 dest values.
 The result will always be 1000 values or less. Take those values and
 construct a filter query a OR b OR c.

 2) Perform a facet call on addr with the original query + the newly
 constructed filter query.
 The facet response should now contain the intersection.

 1000 is a bit close to the default limit for boolean queries, so you might
 want to raise that.

  I'm also looking at creating a custom QueryParser that would build the
  relevant DocLists, then intersect them and return the values, [...]

 You are describing a Join in Solr and that would likely solve your
 problem, but it does not work across cores. Is it possible to have both the
 addr and dest data in the same core?

 - Toke Eskildsen