Phrase-based (vs. Word-Based) Proximity Search

2007-11-12 Thread Chris Harris
I gather that the standard Solr query parser uses the same syntax for
proximity searches as Lucene, and that Lucene syntax is described at

  http://lucene.apache.org/java/docs/queryparsersyntax.html#Proximity%20Searches

This syntax lets me look for terms that are within x words of each
other. Their example is that

  "jakarta apache"~10

will find documents where "jakarta" and "apache" occur within 10 words
of one another.

What I would like to do is find documents where *phrases*, not just
terms, are within x words of each other. I want to be able to say
things like

  Find the documents where the phrases "apache jakarta" and
  "sun microsystems" occur within ten words of one another.

If I gave such a search, I would *not* want it to count as a match if,
for instance, "apache" appeared near "microsystems" but "apache"
wasn't followed immediately by "jakarta", or "microsystems" wasn't
preceded immediately by "sun". I would also not want it to match if
"apache jakarta" appeared, but "sun microsystems" did not appear.
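
To make the intended semantics concrete, here is a minimal plain-Java
sketch of the matching rule described above. It is not Lucene or Solr
code. The class and method names are hypothetical, tokenization is a
naive whitespace split, and "within x words" is assumed to mean "at
most x tokens between the end of one phrase and the start of the
other, in either order". Phrases are assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of phrase-to-phrase proximity; not a Lucene API.
class PhraseProximity {

    // Naive tokenizer (assumption): lowercase and split on whitespace.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // Token positions at which `phrase` occurs as consecutive tokens in `doc`.
    static List<Integer> phraseStarts(List<String> doc, List<String> phrase) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i + phrase.size() <= doc.size(); i++) {
            if (doc.subList(i, i + phrase.size()).equals(phrase)) {
                starts.add(i);
            }
        }
        return starts;
    }

    // True if some occurrence of `a` and some occurrence of `b` do not
    // overlap and have at most `maxGap` tokens between them.
    static boolean phrasesNear(List<String> doc, List<String> a,
                               List<String> b, int maxGap) {
        for (int startA : phraseStarts(doc, a)) {
            int endA = startA + a.size();   // exclusive end of phrase a
            for (int startB : phraseStarts(doc, b)) {
                int endB = startB + b.size();   // exclusive end of phrase b
                // Tokens between the two phrases; negative means overlap.
                int gap = (startA < startB) ? startB - endA : startA - endB;
                if (gap >= 0 && gap <= maxGap) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = tokenize(
            "the apache jakarta project was once hosted by sun microsystems engineers");
        System.out.println(phrasesNear(doc, tokenize("apache jakarta"),
                                       tokenize("sun microsystems"), 10));
    }
}
```

Under this rule a document containing "apache" near "microsystems",
but without the full phrases, produces no phrase occurrences for at
least one side and so does not match.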

Is there any way to do such a search currently? I suppose it might work to say

  "apache jakarta sun microsystems"~10 +"apache jakarta" +"sun microsystems"

but that seems like an unfortunate hack. In any case it's not really
something I can expect my users to be able to type in by themselves.
In our current query language (which I'm hoping to wean our users off
of), they can type

  apache jakarta near/10 sun microsystems

which I believe is more intuitive.
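
If the workaround query above were acceptable, users would not need to
type it themselves: a front end could rewrite the near/N form
mechanically. The following is only a sketch of that rewriting step in
plain Java. The class and method names are hypothetical, it handles
exactly one "<phrase> near/<N> <phrase>" expression, it does no
escaping of Lucene special characters, and whether the generated query
matches exactly the intended documents is not established here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical rewriter from the near/N user syntax to the workaround query.
class NearQueryTranslator {

    // Matches: <left phrase> near/<N> <right phrase>
    private static final Pattern NEAR = Pattern.compile(
        "^\\s*(.+?)\\s+near/(\\d+)\\s+(.+?)\\s*$", Pattern.CASE_INSENSITIVE);

    static String toLuceneSyntax(String userQuery) {
        Matcher m = NEAR.matcher(userQuery);
        if (!m.matches()) {
            throw new IllegalArgumentException(
                "expected '<phrase> near/<N> <phrase>': " + userQuery);
        }
        String left = m.group(1);
        String slop = m.group(2);
        String right = m.group(3);
        // One sloppy phrase over all terms, plus both exact phrases as required clauses.
        return "\"" + left + " " + right + "\"~" + slop
             + " +\"" + left + "\" +\"" + right + "\"";
    }

    public static void main(String[] args) {
        System.out.println(toLuceneSyntax("apache jakarta near/10 sun microsystems"));
    }
}
```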

Any ideas?

Chris


Re: Phrase-based (vs. Word-Based) Proximity Search

2007-11-12 Thread Ken Krugler

Hi Chris,


I'd thought that span queries would allow you to do this type of 
thing, but they're not supported (currently) by the standard query 
parser.


E.g. check out the SpanNearQuery support in (recent) Lucene releases:

http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/spans/SpanNearQuery.html

I would recommend re-posting this on the Lucene user list.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
If you can't find it, you can't fix it