Re: Open-ended range queries

2004-06-11 Thread Erik Hatcher
On Jun 10, 2004, at 10:37 PM, Terry Steichen wrote:
> Speaking for myself, only a small number of my code modules currently
> treat null as the open-ended range query term parameter.  If the syntax
> change from 'null' --> '*' was deemed otherwise desirable and the syntax
> transition made very clearly, I could personally adjust to it without
> too much difficulty.
>
> I agree that the proposed '*' syntax does seem more logical.  If a
> change to that syntax were made such that the old null syntax for the
> upper bound was retained for backward compatibility, such a transition
> would be completely painless.

Just to clarify, since Terry's response implies this is not understood:
there is *nothing* special about null currently.  It is simply being
treated as term text.  So adding special * handling would NOT change how
null currently works.

In June of 2002 (!) null and NULL (and nULL, Null, etc) were 
removed as being special from what I see in the diff.

Furthermore, to achieve the proposed * handling, you can do this 
yourself now by subclassing QueryParser and overriding getRangeQuery:

  protected Query getRangeQuery(String field, Analyzer analyzer,
                                String part1, String part2,
                                boolean inclusive)
      throws ParseException {
    // A "*" on either side means that end of the range is open (null term).
    return new RangeQuery(
        "*".equals(part1) ? null : new Term(field, part1),
        "*".equals(part2) ? null : new Term(field, part2),
        inclusive);
  }

(a little more is needed if you want to keep the date range handling).
Note, you cannot do field:[* TO *] to make it wide-open - RangeQuery 
does not allow this.
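
For anyone who wants to try the mapping without a Lucene build at hand, here is a minimal, Lucene-free sketch of the same idea. The OpenEndedRange class and its toBound and parse methods are hypothetical names made up for illustration, not Lucene API. It maps "*" to a null bound, leaves any other text (including the literal word null) alone, and rejects a range that is open at both ends, mirroring the restriction noted above.

```java
/** Hypothetical stand-in for illustration only; this is not Lucene code. */
final class OpenEndedRange {
    final String field;
    final String lower;      // null means "no lower bound"
    final String upper;      // null means "no upper bound"
    final boolean inclusive;

    private OpenEndedRange(String field, String lower, String upper, boolean inclusive) {
        this.field = field;
        this.lower = lower;
        this.upper = upper;
        this.inclusive = inclusive;
    }

    /** Maps "*" to null; any other text, including "null", is kept as term text. */
    static String toBound(String part) {
        return "*".equals(part) ? null : part;
    }

    /** Builds a range, refusing one that is open at both ends. */
    static OpenEndedRange parse(String field, String part1, String part2, boolean inclusive) {
        String lower = toBound(part1);
        String upper = toBound(part2);
        if (lower == null && upper == null) {
            throw new IllegalArgumentException("At least one end of the range must be bounded");
        }
        return new OpenEndedRange(field, lower, upper, inclusive);
    }

    public static void main(String[] args) {
        OpenEndedRange r = parse("pub_date", "20040501", "*", true);
        System.out.println(r.field + ": lower=" + r.lower + ", upper=" + r.upper);
    }
}
```

In a real QueryParser subclass the two bounds would be wrapped in Term objects as in the override above; the sketch only isolates the decision of when a bound becomes null.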

My proposal is this (_after_ 1.4 goes final):
  - Add the above logic to QueryParser.
  - Modify RangeQuery.toString to output the * when the term is null,
and also if the start term is "" (RangeQuery's constructor modifies the
beginning term to "" if it is null).
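
As a rough sketch of the second bullet, assuming the goal is simply to print * for a missing bound: the RangeToStringSketch class and its render method below are hypothetical and work on plain strings rather than Lucene Terms. A lower bound that is null or empty prints as *, and an upper bound that is null prints as *.

```java
/** Hypothetical illustration of the proposed output format; not Lucene code. */
final class RangeToStringSketch {
    /** Renders field:[lower TO upper] (or {..} when exclusive), using "*" for open ends. */
    static String render(String field, String lower, String upper, boolean inclusive) {
        // An empty lower text is treated like null, since a null start term ends up empty.
        String lowerText = (lower == null || lower.isEmpty()) ? "*" : lower;
        String upperText = (upper == null) ? "*" : upper;

        StringBuilder sb = new StringBuilder();
        sb.append(field).append(':');
        sb.append(inclusive ? '[' : '{');
        sb.append(lowerText).append(" TO ").append(upperText);
        sb.append(inclusive ? ']' : '}');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("pub_date", "", "20040610", true));
        System.out.println(render("s_words", "01000", null, false));
    }
}
```

The point of printing * is that the string form could then be fed back through a parser that understands the same * syntax.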

If there are no objections to this plan, I'll add this as a Bugzilla 
issue as a reminder.  I don't want to touch 1.4's codebase - no point 
in adding a feature at this stage that can already be achieved with the 
simple code above.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Open-ended range queries

2004-06-10 Thread Scott ganyo
At one point it definitely supported null for either term.  I think 
that has been removed/forgotten in the later revisions of the 
QueryParser...

Scott

On Jun 10, 2004, at 1:24 PM, Erik Hatcher wrote:
> On Jun 10, 2004, at 2:13 PM, Terry Steichen wrote:
>> Actually, QueryParser does support open-ended ranges like: [term TO
>> null].  Doesn't work for the lower end of the range (though that's
>> usually less of a problem).
>
> It supports null?  Are you sure?  If so, I'm very confused about it
> because I don't see where in the grammar it has any special handling
> like that.  Could you show an example that demonstrates this?
>
> Erik


Re: Open-ended range queries

2004-06-10 Thread Terry Steichen
Well, I'm using 1.4 RC3 and the null range upper limit works just fine for
searches in two of my fields; one is in the form of a canonical date (e.g.,
20040610) and the other is in the form of a padded word count (e.g., 01500
for 1500).  The syntax would be pub_date:[20040501 TO null] (dates later
than April 30, 2004) and s_words:[01000 TO null] (articles with 1000 or more
words).
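
The padded word count matters because range bounds are compared as text. A small sketch of the padding step follows; the PaddedNumbers class and its pad method are hypothetical names, and it assumes non-negative counts that fit within the chosen width.

```java
import java.util.Locale;

/** Hypothetical helper showing why counts are zero-padded before indexing. */
final class PaddedNumbers {
    /** Left-pads a non-negative value with zeros up to the given width. */
    static String pad(int value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("Only non-negative values are supported");
        }
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT, "%0" + width + "d", value);
    }

    public static void main(String[] args) {
        // Unpadded, "1500" sorts before "999" as text, which breaks range checks.
        System.out.println("1500".compareTo("999") < 0);
        // Padded, "01500" sorts after "00999", matching numeric order.
        System.out.println(pad(1500, 5).compareTo(pad(999, 5)) > 0);
    }
}
```

Values wider than the chosen width are not truncated, so the width has to be picked large enough for the biggest count expected.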

Regards,

Terry

PS: This use of null has worked this way since at least 1.2.  As I recall,
way back when, null also worked as the first term limit (but no longer
does).

- Original Message - 
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, June 10, 2004 2:24 PM
Subject: Re: Open-ended range queries


> On Jun 10, 2004, at 2:13 PM, Terry Steichen wrote:
>> Actually, QueryParser does support open-ended ranges like: [term TO
>> null].  Doesn't work for the lower end of the range (though that's
>> usually less of a problem).
>
> It supports null?  Are you sure?  If so, I'm very confused about it
> because I don't see where in the grammar it has any special handling
> like that.  Could you show an example that demonstrates this?
>
> Erik






Re: Open-ended range queries

2004-06-10 Thread Erik Hatcher
On Jun 10, 2004, at 4:07 PM, Terry Steichen wrote:
> Well, I'm using 1.4 RC3 and the null range upper limit works just
> fine for searches in two of my fields; one is in the form of a
> canonical date (e.g., 20040610) and the other is in the form of a
> padded word count (e.g., 01500 for 1500).  The syntax would be
> pub_date:[20040501 TO null] (dates later than April 30, 2004) and
> s_words:[01000 TO null] (articles with 1000 or more words).

Ah.  It works for you because you have numeric values and lexically
"null" is greater than any of them (digits sort before letters).  It is
still using it as a lexical term value, and not truly making the end
open-ended.

This is why null doesn't work at the beginning for you either.  It's
just being treated as text, just like your numbers are.
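
A quick way to see the effect, using plain String ordering as a stand-in for how term text is compared (the LexicalRangeDemo class and inRange method are made up for this illustration):

```java
/** Hypothetical demo: the word "null" used as an ordinary range bound. */
final class LexicalRangeDemo {
    /** True if value lies in [lower, upper] under plain String ordering. */
    static boolean inRange(String value, String lower, String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        // Upper bound "null": every all-digit value sorts below 'n', so it looks open-ended.
        System.out.println(inRange("20040610", "20040501", "null"));
        // Lower bound "null": no all-digit value sorts at or above 'n', so nothing matches.
        System.out.println(inRange("20040610", "null", "20040701"));
        // Non-numeric text can sort above "null", so the upper end is not really open.
        System.out.println(inRange("zebra", "apple", "null"));
    }
}
```

So the upper-bound trick only holds for fields whose values all sort before the letter n.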

> PS: This use of null has worked this way since at least 1.2.  As I
> recall, way back when, null also worked as the first term limit (but
> no longer does).

If so, then something serious broke.  I've not the time to check the
cvs logs on this, but I cannot imagine that we removed something like
this.  If anyone cares to dig up the diff where we removed/broke this,
I'd be grateful.

Erik


Re: Open-ended range queries

2004-06-10 Thread Scott ganyo
It looks to me like Revision 1.18 broke it.


Re: Open-ended range queries

2004-06-10 Thread Scott ganyo
Well, I do like the *, but apparently there are some people that are  
using this with the null...

Scott

On Jun 10, 2004, at 7:15 PM, Erik Hatcher wrote:
> On Jun 10, 2004, at 4:54 PM, Scott ganyo wrote:
>> It looks to me like Revision 1.18 broke it.
>
> It seems this could be it:
>
> revision 1.18
> date: 2002/06/25 00:05:31;  author: briangoetz;  state: Exp;  lines: +62 -33
> Support for new range query syntax.  The delimiter is " TO ", but is
> optional for backward compatibility with previous syntax.  If the range
> arguments match the format supported by
> DateFormat.getDateInstance(DateFormat.SHORT), then they will be
> converted into the appropriate date strings a la DateField.
>
> Added Field.Keyword constructor for Date-valued arguments.
>
> Optimized DateField.timeToString function.
>
> But geez, June 2002 and no one has complained since?
>
> Given that this is so outdated, I'm not sure what the right course of
> action is.  There are lots more Lucene users now than there were then.
> Would adding NULL back be what folks want?  What about simply an
> asterisk to denote open-endedness?  [* TO term] or [term TO *]
>
> For completeness, here is the diff:

% cvs diff -u -r 1.17 -r 1.18 QueryParser.jj
Index: QueryParser.jj
===================================================================
RCS file: /home/cvs/jakarta-lucene/src/java/org/apache/lucene/queryParser/QueryParser.jj,v
retrieving revision 1.17
retrieving revision 1.18
diff -u -r1.17 -r1.18
--- QueryParser.jj  20 May 2002 15:45:43 -0000  1.17
+++ QueryParser.jj  25 Jun 2002 00:05:31 -0000  1.18
@@ -65,8 +65,11 @@

 import java.util.Vector;
 import java.io.*;
+import java.text.*;
+import java.util.*;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.analysis.*;
+import org.apache.lucene.document.*;
 import org.apache.lucene.search.*;

 /**
@@ -218,35 +221,30 @@

   private Query getRangeQuery(String field,
                               Analyzer analyzer,
-                              String queryText,
+                              String part1,
+                              String part2,
                               boolean inclusive)
   {
-    // Use the analyzer to get all the tokens.  There should be 1 or 2.
-    TokenStream source = analyzer.tokenStream(field,
-                                              new StringReader(queryText));
-    Term[] terms = new Term[2];
-    org.apache.lucene.analysis.Token t;
+    boolean isDate = false, isNumber = false;

-    for (int i = 0; i < 2; i++)
-    {
-      try
-      {
-        t = source.next();
-      }
-      catch (IOException e)
-      {
-        t = null;
-      }
-      if (t != null)
-      {
-        String text = t.termText();
-        if (!text.equalsIgnoreCase("NULL"))
-        {
-          terms[i] = new Term(field, text);
-        }
-      }
+    try {
+      DateFormat df = DateFormat.getDateInstance(DateFormat.SHORT);
+      df.setLenient(true);
+      Date d1 = df.parse(part1);
+      Date d2 = df.parse(part2);
+      part1 = DateField.dateToString(d1);
+      part2 = DateField.dateToString(d2);
+      isDate = true;
     }
-    return new RangeQuery(terms[0], terms[1], inclusive);
+    catch (Exception e) { }
+
+    if (!isDate) {
+      // @@@ Add number support
+    }
+
+    return new RangeQuery(new Term(field, part1),
+                          new Term(field, part2),
+                          inclusive);
   }

   public static void main(String[] args) throws Exception {
@@ -282,7 +280,7 @@
 | <#_WHITESPACE: ( " " | "\t" ) >
 }

-<DEFAULT> SKIP : {
+<DEFAULT, RangeIn, RangeEx> SKIP : {
   <<_WHITESPACE>>
 }

@@ -303,14 +301,28 @@
 | <PREFIXTERM:  <_TERM_START_CHAR> (<_TERM_CHAR>)* "*" >
 | <WILDTERM:  <_TERM_START_CHAR>
               (<_TERM_CHAR> | ( [ "*", "?" ] ))* >
-| <RANGEIN:   "[" ( ~[ "]" ] )+ "]">
-| <RANGEEX:   "{" ( ~[ "}" ] )+ "}">
+| <RANGEIN_START: "[" > : RangeIn
+| <RANGEEX_START: "{" > : RangeEx
 }

 <Boost> TOKEN : {
 <NUMBER:    (<_NUM_CHAR>)+ ( "." (<_NUM_CHAR>)+ )? > : DEFAULT
 }

+<RangeIn> TOKEN : {
+<RANGEIN_TO: "TO">
+| <RANGEIN_END: "]"> : DEFAULT
+| <RANGEIN_QUOTED: "\"" (~["\""])+ "\"">
+| <RANGEIN_GOOP: (~[ " ", "]" ])+ >
+}
+
+<RangeEx> TOKEN : {
+<RANGEEX_TO: "TO">
+| <RANGEEX_END: "}"> : DEFAULT
+| <RANGEEX_QUOTED: "\"" (~["\""])+ "\"">
+| <RANGEEX_GOOP: (~[ " ", "}" ])+ >
+}
+
 // *   Query  ::= ( Clause )*
 // *   Clause ::= ["+", "-"] [<TERM> ":"] ( <TERM> | "(" Query ")" )

@@ -387,7 +399,7 @@

 Query Term(String field) : {
-  Token term, boost=null, slop=null;
+  Token term, boost=null, slop=null, goop1, goop2;
   boolean prefix = false;
   boolean wildcard = false;
   boolean fuzzy = false;
@@ -415,12 +427,29 @@
        else
          q = getFieldQuery(field, analyzer, term.image);
      }
-     | ( term=<RANGEIN> { rangein=true; } | term=<RANGEEX> )
+     | ( <RANGEIN_START> ( goop1=<RANGEIN_GOOP>|goop1=<RANGEIN_QUOTED> )
+         [ <RANGEIN_TO> ] ( goop2=<RANGEIN_GOOP>|goop2=<RANGEIN_QUOTED> )
+         <RANGEIN_END> )
+       [ <CARAT> boost=<NUMBER> ]
+        {
+          if (goop1.kind == RANGEIN_QUOTED)
+            goop1.image = goop1.image.substring(1,


Re: Open-ended range queries

2004-06-10 Thread Terry Steichen
Speaking for myself, only a small number of my code modules currently treat
null as the open-ended range query term parameter.  If the syntax change
from 'null' --> '*' was deemed otherwise desirable and the syntax transition
made very clearly, I could personally adjust to it without too much
difficulty.

I agree that the proposed '*' syntax does seem more logical.  If a change to
that syntax were made such that the old null syntax for the upper bound
was retained for backward compatibility, such a transition would be
completely painless.

Regards,

Terry
