QueryParser and stopwords

2004-01-20 Thread Morus Walter
Hi,

I'm currently trying to get rid of query parser problems with stopwords
(depending on the query, there are ArrayIndexOutOfBoundsExceptions,
e.g. for stop AND nonstop where stop is a stopword and nonstop not).

While this isn't hard to fix (I'll enter a bug and patch in Bugzilla),
there's one issue left that I'm not sure how to deal with:

What should the query parser return for a query string containing only
stopwords?

And when I think about this, there's another one:
stop AND NOT nonstop
creates a boolean query, only containing prohibited terms, which
AFAIK cannot be used in a search. How to deal with this?

Currently it returns an empty BooleanQuery.
I think it would be more useful to return null in this case.

Morus

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Query Term Questions

2004-01-20 Thread Terry Steichen
1) Is there a way to set the query boost factor depending not on the presence of a 
term, but on the presence of two specific terms?  For example, I may want to boost the 
relevance of a document that contains both iraq and clerics, but not boost the 
relevance of documents that contain only one term or the other. (The idea is better 
discrimination than if I simply boosted both terms.)

2) Is it possible to apply (or simulate) a negative query boost factor?  For example, 
I may have a complex query with lots of terms but want to reduce the relevance of a 
matching document that also includes the term iowa. (The idea is to have an easier and 
more discriminating way than simply increasing the relevance of all other terms 
besides iowa.)

3) Is there a way to handle variants of a phrase without OR'ing together the variants? 
 For example, I may want to find documents dealing with North Korea; the terms might 
be north korea or north korean or north koreans - is there a way to handle this 
with a single term using wildcards?

Regards,

Terry

Re: mergeFactor and maxMergeDocs

2004-01-20 Thread Doug Cutting
Chong, Herb wrote:
what effect and what recommendations are valid for Lucene 1.3?

Same as always: use the defaults and call optimize() only when you know 
you won't be changing the index for a while.

If you have lots of RAM, increasing minMergeDocs may increase indexing 
speed, but raising it too high may cause out of memory problems.

You may also see some indexing speedup by increasing the mergeFactor, 
but raising it too high will cause file handle problems.  Calling 
setUseCompoundFile() will enable higher mergeFactor settings before 
encountering file handle problems.

Obsession with indexing performance is not healthy.  Before changing any 
settings convince yourself that indexing performance is a real problem 
for your application.  How often do you re-index from scratch?  Are you 
really having any difficulty keeping up with the rate of change of your 
collection?  Perhaps your development time would be better spent 
focussing on other parts of your application.

Doug



Re: IndexReader.document(int i)

2004-01-20 Thread Doug Cutting
Nicolas Maisonneuve wrote:
I would like to know, in IndexReader.document(int i),
what is this number i?
Is the first document the oldest document indexed
and the last the youngest (so we can sort by date easily)?

Yes, documents with lower numbers were indexed earlier.  As documents 
are deleted the numbers of other, higher documents shift downwards, but 
the order of document numbers always represents the order that documents 
were added to the index.

Doug
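
The numbering behaviour described above can be illustrated with a toy model. This is only a sketch: the class DocNumberDemo and its methods are hypothetical stand-ins, not Lucene classes, and it assumes that a deletion immediately shifts later document numbers down, which is how the reply describes the visible effect.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not Lucene code): the list position plays the role of the
// document number, so insertion order and numbering always agree.
class DocNumberDemo {
    private final List<String> docs = new ArrayList<>();

    // Adds a document and returns the number it was assigned.
    int add(String name) {
        docs.add(name);
        return docs.size() - 1;
    }

    // Removes a document; every later document moves down by one.
    void delete(int docNumber) {
        docs.remove(docNumber);
    }

    String document(int i) {
        return docs.get(i);
    }

    int numDocs() {
        return docs.size();
    }

    public static void main(String[] args) {
        DocNumberDemo index = new DocNumberDemo();
        index.add("oldest");
        index.add("second");
        index.add("third");
        index.add("youngest");
        index.delete(1); // drop "second"
        for (int i = 0; i < index.numDocs(); i++) {
            System.out.println(i + " -> " + index.document(i));
        }
    }
}
```

After the deletion the remaining documents are numbered 0, 1, 2 and still appear in the order they were added, so iterating by document number walks the index from oldest to youngest.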



Re: Ordening documents

2004-01-20 Thread Doug Cutting
Yes, this is correct.

Peter Keegan wrote:
So they are sorted by reverse document number. Is this the 'external'
document number (the one that is adjusted for the segment's base)? If so,
then this means that documents with equal score are returned in the order in
which they were added to the index.  Is this correct?
Thanks,
Peter
- Original Message - 
From: Morus Walter [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Saturday, January 17, 2004 4:57 PM
Subject: Re: Ordening documents



Peter Keegan writes:

What is the returned order for documents with identical scores?

Have a look at the source of the lessThan method in
org.apache.lucene.search.HitQueue:

  protected final boolean lessThan(Object a, Object b) {
    ScoreDoc hitA = (ScoreDoc)a;
    ScoreDoc hitB = (ScoreDoc)b;
    if (hitA.score == hitB.score)
      return hitA.doc > hitB.doc;
    else
      return hitA.score < hitB.score;
  }

Sorting is done by this method.

HTH
Morus
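
To see what this ordering means for results, here is a small standalone sketch. It does not use Lucene: the Hit class is a hypothetical stand-in for ScoreDoc, and a plain comparator replaces the priority queue. It assumes the queue hands back the "least" hit last, so the best-first order is the reverse of lessThan: higher score first, and on equal scores the lower document number first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TieBreakDemo {
    // Hypothetical stand-in for ScoreDoc: a document number and its score.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Best-first order: higher score wins; equal scores fall back to the
    // lower document number, i.e. the document that was added earlier.
    static final Comparator<Hit> BEST_FIRST = (a, b) -> {
        if (a.score == b.score) {
            return Integer.compare(a.doc, b.doc);
        }
        return Float.compare(b.score, a.score);
    };

    // Returns the document numbers in the order they would be presented.
    static List<Integer> rankedDocs(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(BEST_FIRST);
        List<Integer> docs = new ArrayList<>();
        for (Hit h : sorted) {
            docs.add(h.doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(7, 0.5f));
        hits.add(new Hit(2, 0.5f));
        hits.add(new Hit(9, 0.9f));
        hits.add(new Hit(4, 0.5f));
        System.out.println(rankedDocs(hits)); // prints [9, 2, 4, 7]
    }
}
```

The three hits that share a score of 0.5 come out as 2, 4, 7, which is the order in which those documents were added.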

Re: Getting all index fields of an index

2004-01-20 Thread Doug Cutting
Try calling IndexReader.getFieldNames().

Karl Koch wrote:
How can I get a list of all fields in an index from which I know only the
directory string?
Karl





RE: mergeFactor and maxMergeDocs

2004-01-20 Thread Chong, Herb
my job is to measure and benchmark for capacity planning purposes. that means knowing 
how much room i have to work with on the tuning knobs.

Herb...

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 20, 2004 12:18 PM
To: Lucene Users List
Subject: Re: mergeFactor and maxMergeDocs


Obsession with indexing performance is not healthy.  Before changing any 
settings convince yourself that indexing performance is a real problem 
for your application.  How often do you re-index from scratch?  Are you 
really having any difficulty keeping up with the rate of change of your 
collection?  Perhaps your development time would be better spent 
focussing on other parts of your application.




Re: setMaxClauseCount ??

2004-01-20 Thread Doug Cutting
setMaxClauseCount determines the maximum number of clauses, which is not 
your problem here.  Your problem is with required clauses.  There may 
only be a total of 31 required (or prohibited) clauses in a single 
BooleanQuery.  If you need more, then create more BooleanQueries and 
combine them with another BooleanQuery.  Perhaps this could be done 
automatically, but I've never heard anyone encounter this limit before. 
 Do you really mean for 32 different terms to be required?  Do any 
documents actually match this query?

Doug

Karl Koch wrote:
Hi group,

I ran into an IndexOutOfBoundsException:

- java.lang.IndexOutOfBoundsException: More than 32 required/prohibited
clauses in query.

The reason: I have more than 32 BooleanClauses. From the mailing list I got
the info on how to raise the maximum number of clauses before a loop:

...
myBooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);
while (true) {
  Token token = tokenStream.next();
  if (token == null) {
    break;
  }
  myBooleanQuery.add(new TermQuery(new Term(bla, token.termText())), true,
      false);
} ...

However, the error still remains. Why?

Karl
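
The workaround suggested in the reply, spreading the required clauses over several BooleanQueries and combining those with another BooleanQuery, comes down to a grouping step. The sketch below shows only that step in plain Java, without Lucene classes. The names ClauseGroups and partition are hypothetical, and the group size of 31 is taken from the limit quoted in the reply. Each inner list would become one BooleanQuery of required TermQuery clauses, and each of those would in turn be added as a required clause of an outer BooleanQuery.

```java
import java.util.ArrayList;
import java.util.List;

class ClauseGroups {
    // Splits the terms into consecutive groups of at most maxPerGroup entries.
    static List<List<String>> partition(List<String> terms, int maxPerGroup) {
        if (maxPerGroup < 1) {
            throw new IllegalArgumentException("maxPerGroup must be positive");
        }
        List<List<String>> groups = new ArrayList<>();
        for (int start = 0; start < terms.size(); start += maxPerGroup) {
            int end = Math.min(start + maxPerGroup, terms.size());
            groups.add(new ArrayList<>(terms.subList(start, end)));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < 70; i++) {
            terms.add("term" + i);
        }
        // 70 required terms end up in groups of 31, 31 and 8.
        for (List<String> group : partition(terms, 31)) {
            System.out.println(group.size() + " required clauses in this sub-query");
        }
    }
}
```

With one level of nesting this covers up to 31 * 31 = 961 required terms, provided the outer query also stays within the per-query limit. Beyond that the groups themselves would need to be grouped again.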





Re: difference in javadoc and faq similarity expression

2004-01-20 Thread Doug Cutting
Nicolas Maisonneuve wrote:
in the Similarity Javadoc:

score(q,d) = Sum [ tf(t in d) * idf(t) * getBoost(t.field in d) *
             lengthNorm(t.field in d) * coord(q,d) * queryNorm(q) ]

in the FAQ:

score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) *
          coord_q_d

In FAQ        | In Javadoc
1 / norm_q    | queryNorm(q)
1 / norm_d_t  | lengthNorm(t.field in d)
coord_q_d     | coord(q,d)
boost_t       | getBoost(t.field in d)
idf_t         | idf(t)
tf_d          | tf(t in d)

But where is the Javadoc counterpart of tf_q in the FAQ expression?

I think tf_q is always 1.0.  If a term occurs twice in the query then 
Lucene considers them as two terms with tf_q = 1.0 rather than a single 
term with tf_q = 2.0.

Doug



Exception thrown from IndexFiles..help!

2004-01-20 Thread othman el moulat
Hi guys!

I just downloaded the Lucene software and started the simple demo tutorial.
I tried to execute IndexFiles.java in a NetBeans environment on a
Windows XP platform.

I replaced args[0] with the path C:\\lucene-1.2\\src, but I get the
following exception:

 caught a java.lang.ArrayIndexOutOfBoundsException
 with message: null

Please help me get started!

I also want to know how to set the classpath on a Windows XP platform.

Thanks for your help,

othman.





Lucene search result no stable

2004-01-20 Thread Ardor Wei
Hi, experts,

Our application pulls data out of a DB and writes it as
Lucene documents every 5 minutes. We use a loop to keep
updating documents, but we use only one writing thread
and always close the IndexWriter once we finish writing.
Unfortunately, we keep getting IOException and
FileNotFoundException (I guess there is some problem
with the lock time setting inside Lucene, though I am
not sure). Our IndexSearcher doesn't do deletions, so
it should be thread-safe. But the problem is that we sometimes
get the results we expect (e.g. 10), and sometimes we get
nothing. I traced through our code that uses Lucene with
other engineers, and it looks straightforward
and correct. But the exceptions keep being thrown, and
I think the unstable search results are also related to
these exceptions.

What might be the problem? How can we solve it?
Any suggestion or idea will be appreciated.

Thanks.
