Hi Matt,

I think your concern about performance is spot-on, though.

The combinatorial explosion would be at query time, not at index time - my 
solution has a single token indexed per document. My suggested query-time 
filter would generate the following number of output terms, where C(n,k) is the 
combination of n things taken k at a time, n is the number of input query 
terms, and k is the number of concatenated input query terms forming one output 
query term:

    C(n,1) + C(n,2) + ... + C(n,n-1) + C(n,n)

For small queries this would not be a problem:

        1 input query term -> 1 output query term
        2 input query terms -> 3 output query terms
        3 input query terms -> 7 output query terms
        4 input query terms -> 15 output query terms

But for larger queries, it could be fairly expensive:

        10 input query terms -> 1,023 output query terms
        ...
        15 input query terms -> 32,767 output query terms

This is exactly (2^n - 1) output query terms, where n is the number of input 
terms.
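
A rough sketch of where that count comes from, in Python rather than as real 
Lucene/Solr analysis filters. The names index_token and query_tokens are 
hypothetical, and the case-insensitive sort order is an assumption picked to 
match the examples earlier in the thread:

```python
from itertools import combinations


def _sort_key(term):
    # Assumed "consistent" ordering: case-insensitive, ties broken by the raw term.
    return (term.lower(), term)


def index_token(text):
    """Hypothetical index-time step: one token made of all terms, sorted."""
    return " ".join(sorted(set(text.split()), key=_sort_key))


def query_tokens(query):
    """Hypothetical query-time step: one token per non-empty term combination."""
    terms = sorted(set(query.split()), key=_sort_key)
    return [
        " ".join(combo)
        for k in range(1, len(terms) + 1)   # k = terms joined per output token
        for combo in combinations(terms, k)  # C(n, k) tokens for each k
    ]
```

With this sketch, query_tokens("samsung android GPS") yields 7 tokens, and a 
document matches only if its index_token is one of them, so "samsung android" 
would match while "nokia android" would not.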

32k query terms might be too slow to be functional.

Steve

> -----Original Message-----
> From: Matthew Hall [mailto:mh...@informatics.jax.org]
> Sent: Tuesday, October 26, 2010 3:51 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How do I this in Solr?
> 
> Bah.. nope this would miss documents that only match a subset of the
> given terms.
> 
> I'm going to have to go with Steven's approach as the right choice here.
> 
> Matt
> 
> On 10/26/2010 3:44 PM, Matthew Hall wrote:
> > Indeed, I'd missed the second part of his requirements, my AND
> > solution is sadly insufficient to this task.
> >
> > The combinatorial part of your solution worries me a bit though Steven,
> > because his documents that are on the larger side of his corpus would
> > likely slow down query performance a bit while the filter calculates
> > all of the possibilities for a given document.
> >
> > I'm wondering if a slightly hybrid approach would be valid:
> >
> > Have a filter that calculates the total number of terms for a given
> > document.  And then add a clause into your query at runtime that would
> > match what the filter would come up with:
> >
> > So:
> >
> > text:"Nokia" AND text:"Mobile" AND text:"GPS" AND termCount: 3
> >
> > Something like that anyhow.
> >
> > Matt
> >
> > On 10/26/2010 3:35 PM, Dennis Gearon wrote:
> >> I'm the LAST person anyone will ever need to worry about flame
> >> baiting. You did notice that I retracted what I said and supported
> >> your point of view?
> >>
> >> Sorry if my cryptic comment sounded critical. I was wrong, you were
> >> right :-)
> >> Dennis Gearon
> >>
> >> Signature Warning
> >> ----------------
> >> It is always a good idea to learn from your own mistakes. It is
> >> usually a better idea to learn from others’ mistakes, so you do not
> >> have to make them yourself. from
> >> 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
> >>
> >> EARTH has a Right To Life,
> >>    otherwise we all die.
> >>
> >>
> >> --- On Tue, 10/26/10, Steven A Rowe<sar...@syr.edu>  wrote:
> >>
> >>> From: Steven A Rowe<sar...@syr.edu>
> >>> Subject: RE: How do I this in Solr?
> >>> To: "solr-user@lucene.apache.org"<solr-user@lucene.apache.org>
> >>> Date: Tuesday, October 26, 2010, 12:27 PM
> >>> Hi Dennis,
> >>>
> >>> You wrote:
> >>>> If Solr is like Google, once documents matching only the ANDed items
> >>>> in the query ran out, then those that had only two of the terms, then
> >>>> only 1 of the terms, and then those close to it would start showing up.
> >>> [...]
> >>>> Plus, if he wants terms that contain ONLY those words, and no others, an
> >>>> ANDed query would not do that, right? ANDed queries return results that
> >>>> must have ALL the terms listed, and could have lots of other words, right?
> >>>
> >>> This is *exactly* what I just said: ANDed queries (i.e.,
> >>> requiring all query terms) will not satisfy Varun's
> >>> requirements.
> >>>
> >>> Your participation in this thread looks an awful lot like
> >>> flame-baiting: Someone else asks a question, I answer with a
> >>> possible solution, you give a one-word "overkill" response,
> >>> I say why it's not overkill.  You then ask if anybody
> >>> knows the answer to the original question, and then parrot
> >>> my response to your "overkill" statement.  Really????
> >>>
> >>> Get your shit together or shut up.  Please.
> >>>
> >>> Steve
> >>>
> >>>> -----Original Message-----
> >>>> From: Dennis Gearon [mailto:gear...@sbcglobal.net]
> >>>> Sent: Tuesday, October 26, 2010 3:14 PM
> >>>> To: solr-user@lucene.apache.org
> >>>> Subject: RE: How do I this in Solr?
> >>>>
> >>>>
> >>>>
> >>>> Dennis Gearon
> >>>>
> >>>>
> >>>>
> >>>> --- On Tue, 10/26/10, Steven A Rowe<sar...@syr.edu> wrote:
> >>>>> From: Steven A Rowe<sar...@syr.edu>
> >>>>> Subject: RE: How do I this in Solr?
> >>>>> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> >>>>> Date: Tuesday, October 26, 2010, 12:10 PM
> >>>>> Dennis,
> >>>>>
> >>>>> Do you mean to say that you read my earlier post, and
> >>>>> disagree that it would solve the problem?  Or have you
> >>>>> simply not read it?
> >>>>>
> >>>>> Steve
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Dennis Gearon [mailto:gear...@sbcglobal.net]
> >>>>>> Sent: Tuesday, October 26, 2010 3:00 PM
> >>>>>> To: solr-user@lucene.apache.org
> >>>>>> Subject: RE: How do I this in Solr?
> >>>>>>
> >>>>>> Good point. Since I might need such a query myself someday, how *IS* that
> >>>>>> done?
> >>>>>>
> >>>>>>
> >>>>>> Dennis Gearon
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --- On Tue, 10/26/10, Steven A Rowe<sar...@syr.edu> wrote:
> >>>>>>> From: Steven A Rowe<sar...@syr.edu>
> >>>>>>> Subject: RE: How do I this in Solr?
> >>>>>>> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> >>>>>>> Date: Tuesday, October 26, 2010, 11:46 AM
> >>>>>>> Um, maybe I'm way off base, but when Varun said:
> >>>>>>>
> >>>>>>>> If I search with the text "samsung andriod GPS",
> >>>>>>>> search results should only conain "samsung", "GPS",
> >>>>>>>> "andriod" and "samsung andriod".
> >>>>>>>
> >>>>>>> I interpreted that to mean that hit documents should
> >>>>>>> contain terms from the query, and nothing else. Making
> >>>>>>> all terms required doesn't do this.
> >>>>>>>
> >>>>>>> Steve
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Matthew Hall [mailto:mh...@informatics.jax.org]
> >>>>>>>> Sent: Tuesday, October 26, 2010 2:30 PM
> >>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>> Subject: Re: How do I this in Solr?
> >>>>>>>>
> >>>>>>>> Um.. you could change your default clause to AND rather than or.
> >>>>>>>> That should do the trick.
> >>>>>>>>
> >>>>>>>> Matt
> >>>>>>>>
> >>>>>>>> On 10/26/2010 2:26 PM, Dennis Gearon wrote:
> >>>>>>>>> Overkill?
> >>>>>>>>>
> >>>>>>>>> Dennis Gearon
> >>>>>>>>>> I can't think of a way to do it without writing new
> >>>>>>>>>> analysis filters.
> >>>>>>>>>>
> >>>>>>>>>> But I think you could do what you want with two filters
> >>>>>>>>>> (this is untested):
> >>>>>>>>>>
> >>>>>>>>>> 1. An index-time filter that outputs a single token
> >>>>>>>>>> consisting of all of the input tokens, sorted in a
> >>>>>>>>>> consistent way, e.g.:
> >>>>>>>>>>
> >>>>>>>>>>       "mobile with GPS" -> "GPS mobile with"
> >>>>>>>>>>       "samsung android" -> "android samsung"
> >>>>>>>>>>
> >>>>>>>>>> 2. A query-time filter that outputs one token per input
> >>>>>>>>>> term combination, sorted in the same consistent way as the
> >>>>>>>>>> index-time filter, e.g.:
> >>>>>>>>>>
> >>>>>>>>>>       "samsung andriod GPS"
> >>>>>>>>>>         ->
> >>>>>>>>>>            "samsung","android","GPS",
> >>>>>>>>>>            "android samsung","GPS samsung","android GPS"
> >>>>>>>>>>            "android GPS samsung"
> >>>>>>>>>>
> >>>>>>>>>> Steve
> >>>>>>>>>>
> >>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>> From: Varun Gupta [mailto:varun.vgu...@gmail.com]
> >>>>>>>>>>> Sent: Tuesday, October 26, 2010 9:08 AM
> >>>>>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>>>>> Subject: How do I this in Solr?
> >>>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> I have lot of small documents (each containing 1 to 15
> >>>>>>>>>>> words) indexed in Solr. For the search query, I want the
> >>>>>>>>>>> search results to contain only those documents that satisfy
> >>>>>>>>>>> this criteria "All of the words of the search result
> >>>>>>>>>>> document are present in the search query"
> >>>>>>>>>>>
> >>>>>>>>>>> For example:
> >>>>>>>>>>> If I have the following documents indexed: "nokia n95",
> >>>>>>>>>>> "GPS", "android", "samsung", "samsung andriod",
> >>>>>>>>>>> "nokia andriod", "mobile with GPS"
> >>>>>>>>>>>
> >>>>>>>>>>> If I search with the text "samsung andriod GPS", search
> >>>>>>>>>>> results should only conain "samsung", "GPS", "andriod" and
> >>>>>>>>>>> "samsung andriod".
> >>>>>>>>>>>
> >>>>>>>>>>> Is there a way to do this in Solr.
> >>>>>>>>>>> --
> >>>>>>>>>>> Thanks
> >>>>>>>>>>> Varun Gupta
> >>>>>>>
> >
> >
> 
> 
> --
> Matthew Hall
> Software Engineer
> Mouse Genome Informatics
> mh...@informatics.jax.org
> (207) 288-6012
> 
