Re: Queries for many terms

2015-11-03 Thread Alan Woodward
TermsQuery works by pulling the postings lists for each term and OR-ing them 
together to create a bitset, which is very memory-efficient but means that you 
don't know at doc collection time which term has actually matched.

For your case you probably want to create a SpanOrQuery, and then iterate 
through the resulting Spans in a specialised Collector.  Depending on how many 
terms you want, though, you may end up requiring a lot of memory for the search.
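The trade-off described above can be illustrated outside Lucene. A minimal, Lucene-free Python sketch (the real implementation would walk Spans inside a Java Collector; the postings model, doc IDs, and all names here are invented for illustration):

```python
# Postings are modelled as term -> sorted list of doc IDs (invented data).

def or_bitset(postings):
    """TermsQuery-style: OR everything into one doc-ID set.
    Memory-efficient, but term identity is lost."""
    docs = set()
    for doc_ids in postings.values():
        docs.update(doc_ids)
    return docs

def matched_term_counts(postings):
    """Walk each term's postings separately, so we know, per document,
    how many of the query terms actually matched."""
    counts = {}
    for term, doc_ids in postings.items():
        for doc in doc_ids:
            counts[doc] = counts.get(doc, 0) + 1
    return counts

postings = {
    "apple":  [0, 2, 5],
    "banana": [2, 3],
    "cherry": [2, 5, 7],
}

combined = or_bitset(postings)          # which docs matched at all
per_doc = matched_term_counts(postings)  # doc 2 matched all 3 terms
```

The second function is the expensive part: it keeps every term's postings distinct rather than collapsing them, which is exactly the extra cost a Spans-based approach pays.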

Alan Woodward
www.flax.co.uk


On 2 Nov 2015, at 17:14, Upayavira wrote:

> I have a scenario where I want to search for documents that contain many
> terms (maybe 100s or 1000s), and then know the number of terms that
> matched. I'm happy to implement this as a query object/parser.
> 
> I understand that Lucene isn't well suited to this scenario. Any
> suggestions as to how to make this more efficient? Does the TermsQuery
> work differently from the BooleanQuery regarding large numbers of terms?
> 
> Upayavira



Re: Queries for many terms

2015-11-02 Thread Erick Erickson
Or a really simple-minded approach: just use the frequency
as a ratio of numFound to estimate the matched terms.

Doesn't work of course if you need precise counts.

On Mon, Nov 2, 2015 at 9:50 AM, Doug Turnbull
 wrote:
> How precise do you need to be?
>
> I wonder if you could efficiently approximate "number of matches" by
> getting the document frequency of each term. I realize this is an
> approximation, but the highest document frequency would be your floor.
>
> Let's say you have terms t1, t2, and t3 ... tn. t1 has highest doc freq, tn
> lowest.
>
> OK, the following algorithm could refine your floor:
> - count = t1.docfreq
> - Then issue a query for NOT t1; this eliminates many candidate documents
> and improves performance
> - Build a Bloom filter or other set-membership data structure for t2...tn
> https://en.wikipedia.org/wiki/Bloom_filter
> - In a PostFilter (?) or Lucene Collector (?), scan each collected/returned
> document and do a set-membership test against the Bloom filter. If it's a
> member, increment your count.
>
> It's O(numDocs that don't match t1)
>
> This is me just thinking out loud, but maybe it'll trigger thoughts in
> others...
> -Doug
>
>
> On Mon, Nov 2, 2015 at 12:14 PM, Upayavira  wrote:
>
>> I have a scenario where I want to search for documents that contain many
>> terms (maybe 100s or 1000s), and then know the number of terms that
>> matched. I'm happy to implement this as a query object/parser.
>>
>> I understand that Lucene isn't well suited to this scenario. Any
>> suggestions as to how to make this more efficient? Does the TermsQuery
>> work differently from the BooleanQuery regarding large numbers of terms?
>>
>> Upayavira
>>
>
>
>
> --
> Doug Turnbull | Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983
> Author: Relevant Search
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.


Re: Queries for many terms

2015-11-02 Thread Doug Turnbull
How precise do you need to be?

I wonder if you could efficiently approximate "number of matches" by
getting the document frequency of each term. I realize this is an
approximation, but the highest document frequency would be your floor.

Let's say you have terms t1, t2, and t3 ... tn. t1 has highest doc freq, tn
lowest.

OK, the following algorithm could refine your floor:
- count = t1.docfreq
- Then issue a query for NOT t1; this eliminates many candidate documents
and improves performance
- Build a Bloom filter or other set-membership data structure for t2...tn
https://en.wikipedia.org/wiki/Bloom_filter
- In a PostFilter (?) or Lucene Collector (?), scan each collected/returned
document and do a set-membership test against the Bloom filter. If it's a
member, increment your count.

It's O(numDocs that don't match t1)
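The set-membership step above can be sketched as a toy, Lucene-free Bloom filter in Python; the filter parameters, term names, doc-frequency number, and candidate document are all invented for illustration:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k hash positions per item, stored in one big int."""
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = 0

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # May report false positives, never false negatives.
        return all(self.bits >> pos & 1 for pos in self._positions(item))

# Build the filter over t2..tn (here just three invented terms).
bf = BloomFilter()
for term in ["t2", "t3", "t4"]:
    bf.add(term)

# Start the count at t1's (invented) doc freq, then, for one candidate
# document that did NOT match t1, test its terms against the filter.
count = 5
candidate_doc_terms = ["t3", "zebra"]
count += sum(bf.might_contain(t) for t in candidate_doc_terms)
```

Note the Bloom filter trades exactness for memory: a false positive would over-count by one, which matters only if precise counts are required.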

This is me just thinking out loud, but maybe it'll trigger thoughts in
others...
-Doug


On Mon, Nov 2, 2015 at 12:14 PM, Upayavira  wrote:

> I have a scenario where I want to search for documents that contain many
> terms (maybe 100s or 1000s), and then know the number of terms that
> matched. I'm happy to implement this as a query object/parser.
>
> I understand that Lucene isn't well suited to this scenario. Any
> suggestions as to how to make this more efficient? Does the TermsQuery
> work differently from the BooleanQuery regarding large numbers of terms?
>
> Upayavira
>



-- 
Doug Turnbull | Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983
Author: Relevant Search


Re: Queries for many terms

2015-11-02 Thread Upayavira
Let's say we're trying to do document-to-document matching (not with
MLT). We have a shingling analysis chain. The query is itself a document,
which is shingled too. We then look up those shingles in the index. The
percentage of shingles found is, in some sense, a measure of how similar
the documents are.

We could also index the number of shingles in each document and fold
that into the overall score, since a short document might be entirely
contained in a larger one without really being a match.
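A toy, Lucene-free Python sketch of that scoring idea (word-based shingling and the length factor are a simplification for illustration, not the actual analysis chain):

```python
def shingles(text, size=2):
    """Word shingles of the given size, as a set."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def overlap_score(query_doc, indexed_doc, size=2):
    q = shingles(query_doc, size)
    d = shingles(indexed_doc, size)
    if not q or not d:
        return 0.0
    # Fraction of the query document's shingles found in the indexed doc.
    found = len(q & d) / len(q)
    # Length factor: a short doc fully contained in a longer one is
    # penalised, so containment alone doesn't score as a full match.
    length_factor = min(len(q), len(d)) / max(len(q), len(d))
    return found * length_factor

score = overlap_score("the quick brown fox",
                      "the quick brown fox jumps")
```

Here all 3 of the query's shingles are found, but the length factor (3/4) discounts the score to 0.75.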

I would code this in Java as a query object (because some clever person
called Doug wrote an excellent blog post on how to write query objects),
so really the important part is how to do the matching on many terms
efficiently.

Upayavira

On Mon, Nov 2, 2015, at 06:47 PM, Erick Erickson wrote:
> Or a really simple-minded approach: just use the frequency
> as a ratio of numFound to estimate the matched terms.
> 
> Doesn't work of course if you need precise counts.
> 
> On Mon, Nov 2, 2015 at 9:50 AM, Doug Turnbull
>  wrote:
> > How precise do you need to be?
> >
> > I wonder if you could efficiently approximate "number of matches" by
> > getting the document frequency of each term. I realize this is an
> > approximation, but the highest document frequency would be your floor.
> >
> > Let's say you have terms t1, t2, and t3 ... tn. t1 has highest doc freq, tn
> > lowest.
> >
> > OK, the following algorithm could refine your floor:
> > - count = t1.docfreq
> > - Then issue a query for NOT t1; this eliminates many candidate documents
> > and improves performance
> > - Build a Bloom filter or other set-membership data structure for t2...tn
> > https://en.wikipedia.org/wiki/Bloom_filter
> > - In a PostFilter (?) or Lucene Collector (?), scan each collected/returned
> > document and do a set-membership test against the Bloom filter. If it's a
> > member, increment your count.
> >
> > It's O(numDocs that don't match t1)
> >
> > This is me just thinking out loud, but maybe it'll trigger thoughts in
> > others...
> > -Doug
> >
> >
> > On Mon, Nov 2, 2015 at 12:14 PM, Upayavira  wrote:
> >
> >> I have a scenario where I want to search for documents that contain many
> >> terms (maybe 100s or 1000s), and then know the number of terms that
> >> matched. I'm happy to implement this as a query object/parser.
> >>
> >> I understand that Lucene isn't well suited to this scenario. Any
> >> suggestions as to how to make this more efficient? Does the TermsQuery
> >> work differently from the BooleanQuery regarding large numbers of terms?
> >>
> >> Upayavira
> >>
> >
> >
> >
> > --
> > Doug Turnbull | Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983
> > Author: Relevant Search


Queries for many terms

2015-11-02 Thread Upayavira
I have a scenario where I want to search for documents that contain many
terms (maybe 100s or 1000s), and then know the number of terms that
matched. I'm happy to implement this as a query object/parser.

I understand that Lucene isn't well suited to this scenario. Any
suggestions as to how to make this more efficient? Does the TermsQuery
work differently from the BooleanQuery regarding large numbers of terms?

Upayavira