Re: code works with 1.3-rc1 but not with 1.3-final??

2004-03-23 Thread Julien Nioche
Or set a large minMergeDocs value on IndexWriter and keep a low
mergeFactor (e.g. 10). You'll have fewer files on your disk and
the indexing should be faster as well.
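
To get a feel for why these two settings change the number of files, here is a rough sketch in Java. It is a simplified model and not Lucene's merge code: it assumes a segment is written every minMergeDocs documents, that every mergeFactor equal-sized segments are merged into one, and that a non-compound segment consists of 7 fixed files plus one norms file per indexed field. The class and method names are made up for illustration.

```java
/** Rough model of on-disk segment and file counts; not Lucene code. */
final class IndexFileEstimate {
    private IndexFileEstimate() {}

    /** Sum of the digits of n written in the given base (base >= 2). */
    static int digitSum(int n, int base) {
        int sum = 0;
        while (n > 0) {
            sum += n % base;
            n /= base;
        }
        return sum;
    }

    /**
     * Assumed model: a segment is flushed every minMergeDocs documents and
     * every mergeFactor segments of equal size are merged into one; any
     * leftover buffered documents form one more segment when flushed.
     */
    static int segments(int numDocs, int minMergeDocs, int mergeFactor) {
        int flushed = numDocs / minMergeDocs;
        int leftover = (numDocs % minMergeDocs) > 0 ? 1 : 0;
        return digitSum(flushed, mergeFactor) + leftover;
    }

    /** Assumed layout: 7 fixed files per segment plus one norms file per indexed field. */
    static int files(int segments, int indexedFields) {
        return segments * (7 + indexedFields);
    }
}
```

Under these assumptions, 884 documents (the count in Dan's log) with a hypothetical 44 indexed fields give `segments(884, 10, 10)` = 17 segments and `files(17, 44)` = 867 files when both settings are 10, against a single segment of 51 files with minMergeDocs at 1000. Real numbers depend on the Lucene version and on when the writer is closed or optimized.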

- Original Message -
From: Matt Quail [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, March 23, 2004 4:22 AM
Subject: Re: code works with 1.3-rc1 but not with 1.3-final??


 Or use IndexWriter.setUseCompoundFile(true) to reduce the number of files
 created by Lucene.

 http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#setUseCompoundFile(boolean)

 =Matt

 Kevin A. Burton wrote:

  Dan wrote:
 
  I have some code that creates a lucene index. It has been working fine
  with lucene-1.3-rc1.jar but I wanted to upgrade to
  lucene-1.3-final.jar. I did this and the indexer breaks. I get the
  following error when running the index with 1.3-final:
 
  Optimizing the index
  IOException: /home/danl001/index-Mar-22-14_31_30/_ni.f43 (Too many
  open files)
  Indexed 884 files in 8 directories
  Index creation took 242 seconds
  %
 
  No... it's you... ;)
 
  Read the FAQ and then run
 
  ulimit -n 100 or so...
 
  You need to increase your file handles.  Chances are you never noticed
  this before, but the problem was still present.  If you're on a Linux box
  you would be amazed to find out that you're only about 200 file handles
  away from running out of your per-user file quota.
 
  You might have to su to root to change this. RedHat is stricter
  because it uses the glibc resource restrictions thingy (whose name
  slips my mind at the moment).
  Debian is configured better here by default.
 
  Also a google query would have solved this for you very quickly ;)..
 
  Kevin
 




 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]







Similarity - position in Field[] effects scoring - how to change?

2004-03-23 Thread Joachim Schreiber
Hello,

I have run into the following problem. Perhaps somebody can help me.

I have an index with different IDs in the same field,
something like

s
s45678565
s87854546

Situation: I have different documents with the entry s in the same
index.


document 1)

s324235678565
s324dssd5678565
s45678324565
s
s8785454324326


document 2)

s324235678565
s
s45678324565
s8785454324326



When I search for   s:   I get both docs, but document 1 has a
better score than document 2.
The position of s in doc 1 is Field[4] and in doc 2 it's Field[2],
so this seems to affect scoring.

How can I disable this behaviour, so that doc 1 gets the same score as doc 2?
Which method do I have to override in DefaultSimilarity?
Does anybody have an idea? Any help is appreciated.

Thanks

yo










Re: Similarity - position in Field[] effects scoring - how to change?

2004-03-23 Thread Terry Steichen
Joachim,

I believe you'll have to replace the default Similarity class with one of
your own.  Not sure exactly what the settings should be - maybe some other
list members can give you specifics.  Otherwise, you'll probably have to
experiment with it.

Regards,

Terry

- Original Message -
From: Joachim Schreiber [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, March 23, 2004 10:05 AM
Subject: Similarity - position in Field[] effects scoring - how to change?





Re: Similarity - position in Field[] effects scoring - how to change?

2004-03-23 Thread Julien Nioche
Joachim,

Why don't you use the explain method of IndexSearcher?
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html

This is the best way to find out why your documents score differently. I suspect the
lengthNorm method, which is applied at indexing time.

Julien


- Original Message -
From: Joachim Schreiber [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, March 23, 2004 4:05 PM
Subject: Similarity - position in Field[] effects scoring - how to change?





RE: Similarity - position in Field[] effects scoring - how to change?

2004-03-23 Thread Joachim Schreiber
Thanks to Daniel, the solution is quite simple.

Use the latest CVS source from HEAD and try the new sorting feature; it
works very well ;-)

This should be documented somewhere, perhaps in the wiki!

Cool new feature!

yo






Re: Similarity - position in Field[] effects scoring - how to change?

2004-03-23 Thread Ype Kingma
On Tuesday 23 March 2004 16:05, Joachim Schreiber wrote:
 Hallo,

 I run in following problem. Perhaps somebody can help me.

 I have a index with different ids in the same field
 something like

 s
 s45678565
 s87854546

 Situation: I have different documents with the entry s in the
 same index.


 document 1)

 s324235678565
 s324dssd5678565
 s45678324565
 s
 s8785454324326


 document 2)

 s324235678565
 s
 s45678324565
 s8785454324326



 when I search for   s:   I receive both docs, but document 1 has
 a better scoring than document 2.

Since the s field of document 2 is shorter, I'd expect document 2 to score 
higher. As mentioned, lengthNorm() is responsible for this.
Something does not add up here. Are the documents in the same index?
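
To make the expectation concrete, here is a small Java sketch. It assumes the default length normalisation is 1/sqrt(numTerms), which is how I remember DefaultSimilarity; the class below is a stand-in written for illustration and not Lucene code. The term counts 5 and 4 are simply the number of entries listed for document 1 and document 2 in the question.

```java
/** Stand-in for the default length normalisation; not Lucene code. */
final class LengthNormSketch {
    private LengthNormSketch() {}

    /** 1 / sqrt(numTerms): fewer terms in the field means a larger norm. */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }
}
```

With these numbers `lengthNorm(4)` is 0.5 and `lengthNorm(5)` is smaller, so on length alone the four-entry document would be favoured. Lucene stores norms in a compressed form, so the values seen in an explain() output may be rounded.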

 The position of s in doc 1 is Field[4] and in doc 2 it's
 Field[2], so this seems to effect scoring.

Lucene's default scoring is independent of absolute term positions.

 How can I disable this behaviour, so doc 1 has the same scoring as doc 2???

Simply ignore the score. The easiest way is to use the low level scoring API
with your own HitCollector. Just make sure not to retrieve document field
values until you have collected all your hits.
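
A minimal sketch of the idea, in plain Java so that it runs without Lucene on the classpath. `HitCallback` is a hypothetical stand-in with the same shape as a collect(doc, score) callback; with the real library you would subclass HitCollector instead. The collector only records document numbers and never looks at the score or at stored fields.

```java
import java.util.BitSet;

/** Stand-in with the shape of a hit-collecting callback; not Lucene's class. */
abstract class HitCallback {
    abstract void collect(int doc, float score);
}

/** Records matching document numbers and ignores their scores. */
final class DocIdCollector extends HitCallback {
    private final BitSet docs = new BitSet();

    @Override
    void collect(int doc, float score) {
        docs.set(doc); // score deliberately unused
    }

    BitSet docs() {
        return docs;
    }
}
```

Once the search has finished, the BitSet can be walked in document-number order and the stored fields fetched only for the hits that are actually displayed.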

 Which method do I have to overwrite in DefaultSimilarity.
 Has anybody any idea, any help.

In which order do you want the resulting documents presented?
The low level API gives them in index order when the query consists
of a single search term, afaik.

Regards,
Ype





Re: Similarity - position in Field[] effects scoring - how to change?

2004-03-23 Thread Joachim Schreiber

 Why don't you use the method explain of IndexSearcher?

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSear
 cher.html

 This is the best way to find why your documents are different. I suspect
the
 lengthNorm  method, which is used at indexation time.

Yes, but I don't think this is a good choice because we would have to retrieve all
the docs.
That is not possible because I have result sets with 300,000 hits and more.


yo





Re: Similarity - position in Field[] effects scoring - how to change?

2004-03-23 Thread Joachim Schreiber
Terry,


 I believe you'll have to replace the default Similarity class with one of
 your own.  Not sure exactly what the settings should be - maybe some other
 list members can give you specifics.  Otherwise, you'll probably have to
 experiment with it.

I tried the new sort feature from CVS and it works well!

But it's interesting: nobody seems to know exactly how scoring works
;-)

thanks

yo






Re: Similarity - position in Field[] effects scoring - how to change?

2004-03-23 Thread Joachim Schreiber
 On Tuesday 23 March 2004 16:05, Joachim Schreiber wrote:
  Hallo,
 
  I run in following problem. Perhaps somebody can help me.
 
  I have a index with different ids in the same field
  something like
 
  s
  s45678565
  s87854546
 
  Situation: I have different documents with the entry s in the
  same index.
 
 
  document 1)
 
  s324235678565
  s324dssd5678565
  s45678324565
  s
  s8785454324326
 
 
  document 2)
 
  s324235678565
  s
  s45678324565
  s8785454324326
 
 
 
  when I search for   s:   I receive both docs, but document 1
has
  a better scoring than document 2.

 Since the s field of document 2 is shorter, I'd expect document 2 to score
 higher. As mentioned, lengthNorm() is responsible for this.
 Something does not add up here. Are the documents in the same index?

  The position of s in doc 1 is Field[4] and in doc 2 it's
  Field[2], so this seems to effect scoring.

 Lucene's default scoring is independent of absolute term positions.


hm...

  How can I disable this behaviour, so doc 1 has the same scoring as doc
2???

 Simply ignore the score. The easiest way is to use the low level scoring
API
 with your own HitCollector. Just make sure not to retrieve document field
 values until you collected all your hits.

Do you think it's possible to order by e.g. a date field without retrieving all
the values from the index?


  Which method do I have to overwrite in DefaultSimilarity.
  Has anybody any idea, any help.

 In which order to you want the resulting documents presented?
 The low level api gives them in index order when the query consists
 of single search term, afaik.

Index order is OK, but not very flexible.

Regards,
yo





Query syntax on Keyword field question

2004-03-23 Thread Chad Small
Hello,
 
How can I format a query to get a hit?
 
I'm using the StandardAnalyzer() at both index and search time.
 
If I'm indexing a field like this:
 
luceneDocument.add(Field.Keyword("category", "HW-NCI_TOPICS"));

I've tried the following with no success:
 
//  String searchArgs = "HW\\-NCI_TOPICS";
//  String searchArgs = "HW\\-NCI_TOPICS".toLowerCase();
//  String searchArgs = "+HW+NCI+TOPICS";
  //this works with .Text field
//  String searchArgs = "+hw+nci+topics";
//  String searchArgs = "hw nci topics";
 
thanks,
chad.


RE: SpanXXQuery Usage

2004-03-23 Thread Jochen Frey
Terry,

With regular queries (non-Span-queries) you cannot request that results of
OR / AND / NOT operations are near to one another (i.e. (A or B) near (C or
D)). The span queries solve that problem by allowing any span query to be
used in a SpanNearQuery (and vice versa). There are other applications for
this as well, but this is one of them.
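
As a toy illustration of "(A or B) near (C or D)", the following Java sketch works on plain lists of term positions. It is not how Lucene implements spans, and `maxDistance` is a made-up parameter that does not claim to match SpanNearQuery's slop semantics; it only shows why nearness has to be evaluated over the combined positions of each OR group.

```java
/** Toy illustration of "(A or B) near (C or D)" over term positions; not Lucene code. */
final class NearSketch {
    private NearSketch() {}

    /** True if some left position and some right position differ by at most maxDistance. */
    static boolean near(int[] left, int[] right, int maxDistance) {
        for (int l : left) {
            for (int r : right) {
                if (Math.abs(l - r) <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Union of two position lists, i.e. the positions matched by "x or y". */
    static int[] or(int[] x, int[] y) {
        int[] out = new int[x.length + y.length];
        System.arraycopy(x, 0, out, 0, x.length);
        System.arraycopy(y, 0, out, x.length, y.length);
        return out;
    }
}
```

For example, with A at position 1, B absent, C at position 40 and D at position 3, `near(or(a, b), or(c, d), 2)` holds because of the A/D pair, while A near C alone does not.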

Hope that helps to get you started. Examples for the use can be found in the
unit tests (TestBasics.java, I believe).

Cheers,
Jochen

-Original Message-
From: Terry Steichen [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 22, 2004 3:37 AM
To: Lucene Users List
Subject: Re: SpanXXQuery Usage

Otis,

Can you give me/us a rough idea of what these are supposed to do?  It's hard
to extrapolate the terse unit test code into much of a general notion.  I
searched the archives with little success.

Regards,

Terry

- Original Message -
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, March 22, 2004 2:46 AM
Subject: Re: SpanXXQuery Usage


 Only in unit tests, so far.

 Otis

 --- Terry Steichen [EMAIL PROTECTED] wrote:
  Is there any documentation (other than that in the source) on how to
  use the new SpanxxQuery features?  Specifically: SpanNearQuery,
  SpanNotQuery, SpanFirstQuery and SpanOrQuery?
 
  Regards,
 
  Terry
 





RE: Query syntax on Keyword field question

2004-03-23 Thread Chad Small
I have since learned that using the TermQuery instead of the MultiFieldQueryParser 
works for the keyword field in question below (HW-NCI_TOPICS).
 
apiQuery = new BooleanQuery();
apiQuery.add(new TermQuery(new Term("category", "HW-NCI_TOPICS")), true, false);
 
This finds a match.
 
I found a message that talked about having to use the Query API when searching 
Keyword fields in the index.  Is this true?
 
Is there not a way to get the MultiFieldQueryParser to find a match on this keyword?
 
thanks,
chad.

-Original Message- 
From: Chad Small 
Sent: Tue 3/23/2004 10:57 AM 
To: [EMAIL PROTECTED] 
Cc: 
Subject: Query syntax on Keyword field question



Re: Similarity - position in Field[] effects scoring - how to change?

2004-03-23 Thread Ype Kingma
Joachim,

...

 you think its possible to order by e.g. date field without retrieving all
 the values from the index??

Yes, the new sorting feature from CVS does that, see Doug's
last note on the subject. (It might have been on lucene-dev,
I didn't keep a copy).

Have fun,
Ype





Search and Update one index with two processes simultaneously

2004-03-23 Thread brad . hendricks
Hello,

Is it possible to have two separate processes, one performing searches and
the other performing updates, on the same index?  I have a system in
production that uses this design and occasionally the search program grinds
to a halt.  I first suspected that this was just a load issue, but there
isn't that much load (peak times average 2-3 requests per second, with
occasional bursts of 10-20 requests) and I can't replicate the problem.
The logs show that when the slowdown occurs we are usually still accepting
search requests at first, but ongoing searches have stopped finishing
(somewhere inside IndexSearcher.search()).  There doesn't seem to be a
single expensive query that might be bringing us to our knees either.  So
I was wondering whether this could be a race condition
caused by our update program, which is a separate program that updates the
index while it is being searched.

Some basic info:

- The search program uses a single IndexSearcher to perform all searches.
- Results are collected with a HitCollector, which uses the same IndexSearcher
  to extract each document. There is a requirement that the documents be
  returned in a specific order, so we have an external structure to determine
  the order once the ID (not the internal ID) has been extracted.
- A separate HitCollector is used for each search.
- The IndexSearcher in the search program is swapped for a new one when the
  update program has finished an update cycle and notifies the search
  program.
- The index is about 90k documents; the average query returns fewer than 100 hits.
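
For reference, the swap step described above can be sketched roughly as follows in Java. This is an illustration of the design and not the production code: `SearcherHolder` is a hypothetical name, `T` stands in for the searcher type, and the sketch says nothing about whether the swap is related to the stalls.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Holds the searcher currently in use; T stands in for the searcher type. */
final class SearcherHolder<T> {
    private final AtomicReference<T> current;

    SearcherHolder(T initial) {
        current = new AtomicReference<>(initial);
    }

    /** Each search calls this once and uses the result for the whole request. */
    T get() {
        return current.get();
    }

    /** Installs a new searcher and returns the old one so the caller can retire it. */
    T swap(T fresh) {
        return current.getAndSet(fresh);
    }
}
```

In such a design the old searcher returned by `swap` would presumably be closed only after the searches still using it have finished.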

Thanks for any information, or just for your opinion.

Brad






Cover density ranking?

2004-03-23 Thread Boris Goldowsky
Since there have been a few discussions recently of overriding various
aspects of Lucene's ranking formula, I got to wondering how difficult it
might be to implement something quite different from the basic tf/idf
ranking system that Lucene has built in.

How difficult would it be to implement something like Cover Density
ranking for Lucene?  Has anyone tried it?  

Cover density is described at http://citeseer.ist.psu.edu/558750.html ,
and is supposed to be particularly good for short queries of the type
that you get in many web applications.

Boris






Re: Cover density ranking?

2004-03-23 Thread Doug Cutting
Boris Goldowsky wrote:
How difficult would it be to implement something like Cover Density
ranking for Lucene?  Has anyone tried it?  

Cover density is described at http://citeseer.ist.psu.edu/558750.html ,
and is supposed to be particularly good for short queries of the type
that you get in many web applications.

I just glanced at the paper, so my analysis may be wrong, but I think 
one could implement cover density ranking in Lucene with spans (only in 
CVS, not in 1.3).  I think spans correspond to covers in this paper. 
But you'd need to alter SpanScorer.java to implement the cover scoring 
described in that paper.  And you'd probably need to use a custom 
Similarity implementation, which disables most other scoring (tf=1.0, 
idf=1.0, etc.), but exaggerates coordination.  Finally, you'd need to 
construct span queries.  Or something like that.

If someone tries this, please tell us how it works.

Doug



Re: Query syntax on Keyword field question

2004-03-23 Thread Erik Hatcher
QueryParser and Field.Keyword fields are a strange mix.  For some 
background, check the archives as this has been covered pretty 
extensively.

A quick answer is yes you can use MFQP and QP with keyword fields, 
however you need to be careful which analyzer you use.  
PerFieldAnalyzerWrapper is a good solution - you'll just need to use an 
analyzer for your keyword field which simply tokenizes the whole string 
as one chunk.  Perhaps such an analyzer should be made part of the 
core?
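
To illustrate the mismatch, here is a small stand-alone Java sketch. The two methods are toy tokenizers written for this example and not Lucene's analyzers: one splits on every non-letter and lower-cases, the other keeps the whole value as one token, which is what an analyzer for a Keyword field needs to do.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy tokenizers for comparison; not Lucene's analyzers. */
final class TokenizeSketch {
    private TokenizeSketch() {}

    /** Splits on anything that is not a letter and lower-cases each piece. */
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Keeps the whole value as one token, as a keyword analyzer would. */
    static List<String> keywordTokens(String text) {
        List<String> tokens = new ArrayList<>();
        tokens.add(text);
        return tokens;
    }
}
```

The index holds the single term HW-NCI_TOPICS for a Keyword field, so a query analyzed into hw, nci and topics cannot match it, while a query term kept whole can.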

	Erik

On Mar 23, 2004, at 12:58 PM, Chad Small wrote:

I have since learned that using the TermQuery instead of the 
MultiFieldQueryParser works for the keyword field in question below 
(HW-NCI_TOPICS).

apiQuery = new BooleanQuery();
apiQuery.add(new TermQuery(new Term(category, HW-NCI_TOPICS)), 
true, false);

This finds a match.

I found a message that talked about having to use the the Query API 
when searching Keyword fields in the index.  Is this true?

Is there not a way to get the MultiFieldQueryParser to find a match on 
this keyword?

thanks,
chad.


RE: Query syntax on Keyword field question

2004-03-23 Thread Chad Small
Thank you, Erik and Incze.  I now understand the issue and I'm trying to create a 
KeywordAnalyzer as suggested in your book excerpt, Erik:
 
http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=6727
 
However, not being all that familiar with the Analyzer framework, I'm not sure how to 
implement the KeywordAnalyzer even though it might be trivial :)  Any hints, code, 
or messages to look at?
 
from message link above
Ok, here is the section from Lucene in Action.  I'll leave the 
development of KeywordAnalyzer as an exercise for the reader (although 
its implementation is trivial, one of the simplest analyzers possible - 
only emit one token of the entire contents).  I hope this helps.

Erik


thanks again,
chad.

-Original Message- 
From: Incze Lajos [mailto:[EMAIL PROTECTED] 
Sent: Tue 3/23/2004 8:08 PM 
To: Lucene Users List 
Cc: 
Subject: Re: Query syntax on Keyword field question



On Tue, Mar 23, 2004 at 08:10:15PM -0500, Erik Hatcher wrote:
 QueryParser and Field.Keyword fields are a strange mix.  For some
 background, check the archives as this has been covered pretty
 extensively.

 A quick answer is yes you can use MFQP and QP with keyword fields,
 however you need to be careful which analyzer you use. 
 PerFieldAnalyzerWrapper is a good solution - you'll just need to use an
 analyzer for your keyword field which simply tokenizes the whole string
 as one chunk.  Perhaps such an analyzer should be made part of the
 core?

   Erik

I've implemented such an analyzer, but it's only a partial solution
if your keyword field contains spaces, as the QP would split
the query, e.g.:

NOTTOKNIZED:(term with spaces*)

would give you no hit even with a non-tokenized field value
"term with spaces and other useful things". The full solution
would be to be able to tell the QP not to split at spaces,
either by a 'do not split till apos' syntax, or by the good ol'
backslash: do\ not\ notice\ these\ spaces.

incze


RE: Query syntax on Keyword field question

2004-03-23 Thread Chad Small
Here is my attempt at a KeywordAnalyzer, although it is not working.  Excuse the length 
of the message, but I wanted to give actual code.
 
package domain.lucenesearch;

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;

/** Emits the entire field value as a single token. */
public class KeywordAnalyzer extends Analyzer
{
   public TokenStream tokenStream(String fieldName, Reader reader)
   {
      return new KeywordTokenizer(reader);
   }

   private static class KeywordTokenizer extends CharTokenizer
   {
      public KeywordTokenizer(Reader in)
      {
         super(in);
      }

      /**
       * Collects all characters.
       */
      protected boolean isTokenChar(char c)
      {
         return true;
      }
   }
}

However, this test fails:
 
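import java.io.StringReader;

import junit.framework.TestCase;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

// Assumed to live in the same package as KeywordAnalyzer.
public class KeywordAnalyzerTest extends TestCase
{
   RAMDirectory directory;
   private IndexSearcher searcher;

   public void setUp() throws Exception
   {
      directory = new RAMDirectory();
      IndexWriter writer = new IndexWriter(directory,
                                           new StandardAnalyzer(),
                                           true);
      Document doc = new Document();
      doc.add(Field.Keyword("category", "HW-NCI_TOPICS"));
      doc.add(Field.Text("description", "Illidium Space Modulator"));
      writer.addDocument(doc);
      writer.close();
      searcher = new IndexSearcher(directory);
   }

   public void testPerFieldAnalyzer() throws Exception
   {
      analyze("HW-NCI_TOPICS");

      PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
      analyzer.addAnalyzer("category", new KeywordAnalyzer());   //|#1
      Query query = QueryParser.parse("category:HW-NCI_TOPICS AND SPACE",
                                      "description",
                                      analyzer);
      Hits hits = searcher.search(query);
      System.out.println("query.ToString = " + query.toString("description"));
      assertEquals("HW-NCI_TOPICS kept as-is",
                   "+category:HW-NCI_TOPICS +space", query.toString("description"));
      assertEquals("doc found!", 1, hits.length());
   }

   // Prints the tokens each analyzer produces for the given text.
   private void analyze(String text) throws Exception
   {
      Analyzer[] analyzers = new Analyzer[]{
         new WhitespaceAnalyzer(),
         new SimpleAnalyzer(),
         new StopAnalyzer(),
         new StandardAnalyzer(),
         new KeywordAnalyzer(),
         //new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS)
      };
      System.out.println("Analzying \"" + text + "\"");
      for (int i = 0; i < analyzers.length; i++)
      {
         Analyzer analyzer = analyzers[i];
         System.out.println("\t" + analyzer.getClass().getName() + ":");
         System.out.print("\t\t");
         TokenStream stream = analyzer.tokenStream("category", new StringReader(text));
         while (true)
         {
            Token token = stream.next();
            if (token == null) break;
            System.out.print("[" + token.termText() + "] ");
         }
         System.out.println("\n");
      }
   }
}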
 
With this output:
 
Analzying HW-NCI_TOPICS
 org.apache.lucene.analysis.WhitespaceAnalyzer:
  [HW-NCI_TOPICS] 
 org.apache.lucene.analysis.SimpleAnalyzer:
  [hw] [nci] [topics] 
 org.apache.lucene.analysis.StopAnalyzer:
  [hw] [nci] [topics] 
 org.apache.lucene.analysis.standard.StandardAnalyzer:
  [hw] [nci] [topics] 
 healthecare.domain.lucenesearch.KeywordAnalyzer:
  [HW-NCI_TOPICS] 
 
query.ToString = category:HW -nci topics +space

junit.framework.ComparisonFailure: HW-NCI_TOPICS kept as-is 
Expected:+category:HW-NCI_TOPICS +space
Actual  :category:HW -nci topics +space
 
See anything?
thanks,
chad.

-Original Message- 
From: Chad Small 
Sent: Tue 3/23/2004 8:48 PM 
To: Lucene Users List 
Cc: 
Subject: RE: Query syntax on Keyword field question



RE: Query syntax on Keyword field question

2004-03-23 Thread Morus Walter
Chad Small writes:
 Here is my attempt at a KeywordAnalyzer - although is not working?  Excuse the 
 length of the message, but wanted to give actual code.
  
 With this output:
  
 Analzying HW-NCI_TOPICS
  org.apache.lucene.analysis.WhitespaceAnalyzer:
   [HW-NCI_TOPICS] 
  org.apache.lucene.analysis.SimpleAnalyzer:
   [hw] [nci] [topics] 
  org.apache.lucene.analysis.StopAnalyzer:
   [hw] [nci] [topics] 
  org.apache.lucene.analysis.standard.StandardAnalyzer:
   [hw] [nci] [topics] 
  healthecare.domain.lucenesearch.KeywordAnalyzer:
   [HW-NCI_TOPICS] 
  
 query.ToString = category:HW -nci topics +space
 
 junit.framework.ComparisonFailure: HW-NCI_TOPICS kept as-is 
 Expected:+category:HW-NCI_TOPICS +space
 Actual  :category:HW -nci topics +space
  

Well, the query parser currently does not allow `-' within words.
So before your analyzer is called, the query parser reads one word HW, a `-'
operator, and one word NCI_TOPICS.
The latter is analyzed as nci topics because it's no longer in the field category,
I guess.

I suggested to change this. See
http://issues.apache.org/bugzilla/show_bug.cgi?id=27491

Either you escape the - using category:HW\-NCI_TOPICS in your query
(untested, and I don't know where the escape character will be removed)
or you apply my suggested change.
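
If you go the escaping route, a small helper along these lines could build the escaped term. It is a sketch only: the class name is made up, the set of special characters is an assumption that should be checked against the query parser syntax of your Lucene version, and whether the 1.3 parser honours the backslash is exactly the untested part.

```java
/** Backslash-escapes characters assumed to be special to the query parser. */
final class QueryEscapeSketch {
    private QueryEscapeSketch() {}

    // Assumed set of special characters; check it against your Lucene version.
    private static final String SPECIAL = "+-!(){}[]^\"~*?:\\&|";

    static String escape(String term) {
        StringBuilder out = new StringBuilder(term.length() + 4);
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\'); // prefix each special character with a backslash
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

Here `"category:" + QueryEscapeSketch.escape("HW-NCI_TOPICS")` yields category:HW\-NCI_TOPICS. The field still needs an analyzer that keeps the value as one token, otherwise the escaped term is split again during analysis.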

Another option for using keywords with query parser might be adding a
keyword syntax to the query parser.
Something like category:key(HW-NCI_TOPICS) or category=HW-NCI_TOPICS.

HTH
Morus
