Let's say we're trying to do document-to-document matching (not with
MLT). We have a shingling analysis chain. The query is a document, which
is itself shingled. We then look up those shingles in the index. The
percentage of query shingles found is, in some sense, a measure of how
similar the documents are.

We could also index the number of shingles in each document and factor
that into the overall score, since a short document might be entirely
contained within a larger one without really being a match.

I would code this in Java as a query object (because some clever person
called Doug wrote an excellent blog post on how to write query objects),
so really the important part is how to match against many terms
efficiently.
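A rough sketch of the idea in plain Java (the word-n-gram shingling here
is just a stand-in for a real Lucene shingling analysis chain, and the
length-penalty formula is an illustrative assumption, not a worked query
implementation):

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class ShingleOverlap {

    // Build word n-gram shingles, mimicking what a shingling
    // analysis chain would emit for a document.
    static Set<String> shingles(String text, int n) {
        String[] words = text.toLowerCase().split("\\s+");
        Set<String> out = new LinkedHashSet<>();
        for (int i = 0; i + n <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return out;
    }

    // Fraction of query shingles that also occur in the candidate document.
    static double overlap(Set<String> query, Set<String> doc) {
        if (query.isEmpty()) return 0.0;
        long found = query.stream().filter(doc::contains).count();
        return (double) found / query.size();
    }

    // Penalise a short document fully contained in a long one by
    // comparing shingle counts (one possible formula, as suggested above).
    static double score(Set<String> query, Set<String> doc) {
        double lengthRatio = (double) Math.min(query.size(), doc.size())
                           / Math.max(query.size(), doc.size());
        return overlap(query, doc) * lengthRatio;
    }

    public static void main(String[] args) {
        Set<String> q = shingles("the quick brown fox", 2);            // 3 shingles
        Set<String> d = shingles("the quick brown fox jumps high", 2); // 5 shingles
        System.out.println("overlap = " + overlap(q, d)); // all query shingles found
        System.out.println("score   = " + score(q, d));   // discounted by length ratio
    }
}
```

Here all three query shingles are found (overlap 1.0), but the shingle-count
ratio pulls the score down because the query is much shorter than the document.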

Upayavira

On Mon, Nov 2, 2015, at 06:47 PM, Erick Erickson wrote:
> Or a really simple-minded approach: just use the frequency
> as a ratio of numFound to estimate terms.
> 
> Doesn't work of course if you need precise counts.
> 
> On Mon, Nov 2, 2015 at 9:50 AM, Doug Turnbull
> <dturnb...@opensourceconnections.com> wrote:
> > How precise do you need to be?
> >
> > I wonder if you could efficiently approximate "number of matches" by
> > getting the document frequency of each term. I realize this is an
> > approximation, but the highest document frequency would be your floor.
> >
> > Let's say you have terms t1, t2, and t3 ... tn. t1 has highest doc freq, tn
> > lowest.
> >
> > OK, the following algorithm could refine your floor:
> > - count = t1.docfreq
> > - Then issue a query for NOT t1, this eliminates many candidate documents
> > to improve performance
> > - Build a bloom filter or other set-membership data structure for t2...tn
> > https://en.wikipedia.org/wiki/Bloom_filter
> > - In a PostFilter(?) Lucene Collector(?) scan each collected/returned
> > document and do a set membership test against the bloom filter. If member,
> > then increment your count.
> >
> > It's O(numDocs that don't match t1)
> >
> > This is me just thinking out loud, but maybe it'll trigger thoughts in
> > others...
> > -Doug
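Doug's steps above could be sketched in plain Java like this, with an
in-memory corpus standing in for the index and a toy Bloom filter (the
bit-array size, hash scheme, and sample terms are arbitrary assumptions;
a real implementation would scan Lucene's postings instead):

```java
import java.util.BitSet;
import java.util.List;
import java.util.Set;

public class ApproxMatchCount {

    // Minimal Bloom filter: no false negatives, small chance of false positives.
    static class Bloom {
        static final int SIZE = 4096;
        final BitSet bits = new BitSet(SIZE);

        void add(String s) {
            for (int seed = 0; seed < 3; seed++) bits.set(index(s, seed));
        }
        boolean mightContain(String s) {
            for (int seed = 0; seed < 3; seed++) {
                if (!bits.get(index(s, seed))) return false;
            }
            return true;
        }
        private int index(String s, int seed) {
            return Math.floorMod(s.hashCode() * 31 + seed * 0x9E3779B9, SIZE);
        }
    }

    // count starts at docfreq(t1); then only the docs that do NOT match t1
    // are scanned, bumping the count when any of their terms might be
    // one of t2..tn according to the set-membership test.
    static int countMatches(List<Set<String>> docs, String t1, List<String> rest) {
        Bloom bloom = new Bloom();
        rest.forEach(bloom::add);
        int count = 0;
        for (Set<String> doc : docs) {
            if (doc.contains(t1)) {
                count++;                                    // the t1.docfreq floor
            } else if (doc.stream().anyMatch(bloom::mightContain)) {
                count++;                                    // refined via bloom filter
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = List.of(
                Set.of("apache", "lucene"),
                Set.of("apache"),
                Set.of("lucene", "foo"),
                Set.of("bar"),
                Set.of("solr"));
        // t1 = "apache" (highest docfreq); rest = t2..tn
        System.out.println(countMatches(docs, "apache", List.of("lucene", "solr")));
    }
}
```

Because a Bloom filter has no false negatives, this count can only
overshoot the true number of matching documents, never undershoot it, so
t1's document frequency stays a hard floor.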
> >
> >
> > On Mon, Nov 2, 2015 at 12:14 PM, Upayavira <u...@odoko.co.uk> wrote:
> >
> >> I have a scenario where I want to search for documents that contain many
> >> terms (maybe 100s or 1000s), and then know the number of terms that
> >> matched. I'm happy to implement this as a query object/parser.
> >>
> >> I understand that Lucene isn't well suited to this scenario. Any
> >> suggestions as to how to make this more efficient? Does the TermsQuery
> >> work differently from the BooleanQuery regarding large numbers of terms?
> >>
> >> Upayavira
> >>
> >
> >
> >
> > --
> > Doug Turnbull | Search Relevance Consultant | OpenSource Connections
> > <http://opensourceconnections.com>, LLC | 240.476.9983
> > Author: Relevant Search <http://manning.com/turnbull>
