RE: Lucene with English and Spanish Best Practice?

2004-08-23 Thread Chad Small

Thanks for the info Grant.


As for indexes, do you anticipate adding more fields later in Spanish? 
Is the content just a translation of the English, or do you have
separate content in Spanish?  Are your users querying in only one
language (cross-lingual) or are the Spanish speakers only querying
against Spanish content?

Our fields are pretty much going to be one-for-one between English and Spanish (a 
translation of current content from English to Spanish).  Something like title_en and 
title_sp, body_en and body_sp, keywords_en and keywords_sp.  Our users will be 
querying cross-lingual.  So I see your point; it looks like it would be easier if we 
added the Spanish fields to our current indexes, since then we wouldn't have to filter out 
duplicate results between the English and Spanish indexes.



I am doing Arabic and English (and have done Spanish, French, and
Japanese in the past), although our cross-lingual system supports any
languages that you have resources for.

Did you use Snowball for the Spanish?  Or is there a Lucene Spanish Analyzer 
available (I couldn't find one)?  Or do people just use something like a plain old 
StandardAnalyzer to index and query Spanish content?  I'm a little confused about the 
Snowball project -- is it a multi-language stemming Analyzer for Lucene?  We just use 
plain old Standard and Whitespace Analyzers now for our English content.  Can we just 
use those same Analyzers for Spanish content?  Or would it be better to use the 
Snowball project?
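For intuition, a stemming analyzer maps inflected forms to a common root before indexing, so variants of a word match each other. This toy sketch is not the real Snowball algorithm (which implements a full per-language suffix-stripping procedure); the suffix list is invented purely for illustration:

```java
// Toy illustration only -- NOT the real Snowball algorithm.  A stemmer
// maps inflected forms to a shared root, so "trabajar" and "trabajas"
// index (and therefore match) as the same term; a plain Standard or
// Whitespace analysis would keep them distinct.
public class ToySpanishStemmer {
    // A few hypothetical Spanish suffixes, longest first.
    private static final String[] SUFFIXES = {"amos", "ais", "ar", "er", "ir", "as", "a", "s"};

    public static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : SUFFIXES) {
            // Only strip when a reasonably long root remains.
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 4) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }
}
```

Under this sketch, "trabajar" and "trabajas" both reduce to the same root, while a short word like "casa" passes through unchanged.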

thanks,
chad.
  

-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Saturday, August 21, 2004 2:16 PM
To: [EMAIL PROTECTED]
Subject: Re: Lucene with English and Spanish Best Practice?


 I think the Snowball stuff works well, although I have only used the
English Porter stemmer implementation.

As for indexes, do you anticipate adding more fields later in Spanish? 
Is the content just a translation of the English, or do you have
separate content in Spanish?  Are your users querying in only one
language (cross-lingual) or are the Spanish speakers only querying
against Spanish content?

I am doing Arabic and English (and have done Spanish, French, and
Japanese in the past), although our cross-lingual system supports any
languages that you have resources for.  We lean towards separate
indexes, but mostly b/c they are based on separate content.  The key is
you have to be able to match up the analysis of the query with the
analysis of the index.  Having a mixed index may make this more
difficult.  If you have a mixed index, would you filter out Spanish
results that had hits from an English query?  For instance, what if the
query was a term that is common to both languages (banana, mosquito,
etc.)?  Or are you requiring the user to specify which fields they are
searching against?  I guess we really need to know more about how your
users are going to be interacting.
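The point about matching query-time and index-time analysis can be sketched without Lucene. Here `analyze` is an invented stand-in for whatever analyzer is chosen; the terms only match when the query string goes through the same analysis the index did:

```java
import java.util.HashSet;
import java.util.Set;

// Toy illustration: an index only matches terms that went through the
// SAME analysis at query time.  The analyze() method here is a made-up
// stand-in (lowercase + trim a trailing plural 's'), not a Lucene API.
public class AnalysisMatch {
    static String analyze(String term) {
        String t = term.toLowerCase();
        return t.endsWith("s") ? t.substring(0, t.length() - 1) : t;
    }

    // Build a tiny "index" by analyzing every raw term.
    public static Set<String> index(String... rawTerms) {
        Set<String> idx = new HashSet<>();
        for (String t : rawTerms) idx.add(analyze(t));
        return idx;
    }

    // A query only hits when it is analyzed the same way as the index.
    public static boolean hits(Set<String> index, String rawQuery, boolean analyzeQuery) {
        String q = analyzeQuery ? analyze(rawQuery) : rawQuery;
        return index.contains(q);
    }
}
```

The analyzed query finds the term; the raw, unanalyzed query misses it, which is exactly the mismatch a mixed-language index makes easier to trip over.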

-Grant

 [EMAIL PROTECTED] 8/20/2004 5:27:40 PM 
Hello,

I'm interested in any feedback from anyone who has worked through
implementing Internationalization (I18N) search with Lucene or has ideas
for this requirement.  Currently, we're using Lucene with straight
English and are looking to add Spanish to the mix (with maybe more
languages to follow).  

This is our current IndexWriter setup utilizing the
PerFieldAnalyzerWrapper:

   PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new
StandardAnalyzer());
   analyzer.addAnalyzer(FIELD_TITLE_STARTS_WITH, new
WhitespaceAnalyzer());
   analyzer.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
   IndexWriter writer = new IndexWriter(indexDir, analyzer, create);

Would people suggest we switch this over to Snowball so there are
English and Spanish Analyzers and IndexWriters?  Something like this:

PerFieldAnalyzerWrapper analyzerEnglish = new
PerFieldAnalyzerWrapper(new SnowballAnalyzer("English"));
analyzerEnglish.addAnalyzer(FIELD_TITLE_STARTS_WITH, new
WhitespaceAnalyzer());
analyzerEnglish.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
IndexWriter writerEnglish = new IndexWriter(indexDir, analyzerEnglish,
create);

PerFieldAnalyzerWrapper analyzerSpanish = new
PerFieldAnalyzerWrapper(new SnowballAnalyzer("Spanish"));
analyzerSpanish.addAnalyzer(FIELD_TITLE_STARTS_WITH, new
WhitespaceAnalyzer());
analyzerSpanish.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
IndexWriter writerSpanish = new IndexWriter(indexDir, analyzerSpanish,
create);


Are multiple indexes or mirrors of each index then usually created for
every language?  We currently have 4 indexes that are all English. 
Would we then create 4 more that are Spanish?  Then at search time we
would determine the language and which set of indexes to search against,
English or Spanish.

Or another approach could be to add a Spanish field to the existing 4
indexes since most of the indexes have only one field that will be
translated from English to Spanish.


thanks a bunch,
chad.



RE: spanish stemmer

2004-08-23 Thread Chad Small
Do you mind sharing how you implemented your SpanishAnalyzer using Snowball?

Sorry I can't help with your question.  I am trying to implement Snowball Spanish or a 
Spanish Analyzer in Lucene.

thanks,
chad.

-Original Message-
From: Ernesto De Santis [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 8:30 AM
To: Lucene Users List
Subject: spanish stemmer 


Hello

I use the Snowball jar to implement my SpanishAnalyzer. I found that words
ending in 'bol' are not stripped.
For example:

In Spanish, for basketball you can say 'basquet' or 'basquetbol', but to the
SpanishStemmer they are different words.
The same goes for 'voley' and 'voleybol'.

Not so with 'futbol' (football): we don't say 'fut' for 'futbol'. But 'fut'
doesn't exist in Spanish anyway.

Do you think I am correct?

Can this be changed?

Ernesto.


---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.737 / Virus Database: 491 - Release Date: 11/08/2004


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





RE: spanish stemmer

2004-08-23 Thread Chad Small
Excellent Ernesto.  

Was there a reason you used your own stop word list and not just the default 
constructor SnowballAnalyzer("Spanish")?

thanks,
chad.

-Original Message-
From: Ernesto De Santis [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 2:03 PM
To: Lucene Users List
Subject: Re: spanish stemmer 


Yes, it is quite easy.

You need to write a wrapper for the Spanish Snowball initialization.

analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);

Below is the complete code.

Bye, Ernesto.


--
public class SpanishAnalyzer extends Analyzer {

    private static SnowballAnalyzer analyzer;

    private String SPANISH_STOP_WORDS[] = {
        "un", "una", "unas", "unos", "uno", "sobre", "todo", "también", "tras",
        "otro", "algún", "alguno", "alguna", "algunos", "algunas", "ser", "es",
        "soy", "eres", "somos", "sois", "estoy", "esta", "estamos", "estais",
        "estan", "en", "para", "atras", "porque", "por qué", "estado", "estaba",
        "ante", "antes", "siendo", "ambos", "pero", "por", "poder", "puede",
        "puedo", "podemos", "podeis", "pueden", "fui", "fue", "fuimos", "fueron",
        "hacer", "hago", "hace", "hacemos", "haceis", "hacen", "cada", "fin",
        "incluso", "primero", "desde", "conseguir", "consigo", "consigue",
        "consigues", "conseguimos", "consiguen", "ir", "voy", "va", "vamos",
        "vais", "van", "vaya", "bueno", "ha", "tener", "tengo", "tiene",
        "tenemos", "teneis", "tienen", "el", "la", "lo", "las", "los", "su",
        "aqui", "mio", "tuyo", "ellos", "ellas", "nos", "nosotros", "vosotros",
        "vosotras", "si", "dentro", "solo", "solamente", "saber", "sabes",
        "sabe", "sabemos", "sabeis", "saben", "ultimo", "largo", "bastante",
        "haces", "muchos", "aquellos", "aquellas", "sus", "entonces", "tiempo",
        "verdad", "verdadero", "verdadera", "cierto", "ciertos", "cierta",
        "ciertas", "intentar", "intento", "intenta", "intentas", "intentamos",
        "intentais", "intentan", "dos", "bajo", "arriba", "encima", "usar",
        "uso", "usas", "usa", "usamos", "usais", "usan", "emplear", "empleo",
        "empleas", "emplean", "empleamos", "empleais", "valor", "muy", "era",
        "eras", "eramos", "eran", "modo", "bien", "cual", "cuando", "donde",
        "mientras", "quien", "con", "entre", "sin", "trabajo", "trabajar",
        "trabajas", "trabaja", "trabajamos", "trabajais", "trabajan", "podria",
        "podrias", "podriamos", "podrian", "podriais", "yo", "aquel", "mi",
        "de", "a", "e", "i", "o", "u"};

    public SpanishAnalyzer() {
        analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);
    }

    public SpanishAnalyzer(String stopWords[]) {
        analyzer = new SnowballAnalyzer("Spanish", stopWords);
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return analyzer.tokenStream(fieldName, reader);
    }
}
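Stripped of the Lucene plumbing, what the custom stop list buys you is that listed tokens are dropped before indexing. This sketch shows the mechanism with only a handful of the words from the list above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of stop-word filtering outside Lucene: lowercase the text,
// split on whitespace, drop any token on the stop list.  Only a few
// words from the full Spanish list are shown here.
public class StopFilterSketch {
    private static final Set<String> SPANISH_STOP_WORDS = new HashSet<>(
            Arrays.asList("un", "una", "el", "la", "de", "en", "para"));

    public static List<String> filter(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!SPANISH_STOP_WORDS.contains(token)) kept.add(token);
        }
        return kept;
    }
}
```

So "un libro en la mesa" indexes only the content-bearing tokens "libro" and "mesa".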



- Original Message - 
From: Chad Small [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, August 23, 2004 3:49 PM
Subject: RE: spanish stemmer


Do you mind sharing how you implemented your SpanishAnalyzer using Snowball?

Sorry I can't help with your question.  I am trying to implement Snowball
Spanish or a Spanish Analyzer in Lucene.

thanks,
chad.

-Original Message-
From: Ernesto De Santis [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 8:30 AM
To: Lucene Users List
Subject: spanish stemmer


Hello

I use the Snowball jar to implement my SpanishAnalyzer. I found that words
ending in 'bol' are not stripped.
For example:

In Spanish, for basketball you can say 'basquet' or 'basquetbol', but to the
SpanishStemmer they are different words.
The same goes for 'voley' and 'voleybol'.

Not so with 'futbol' (football): we don't say 'fut' for 'futbol'. But 'fut'
doesn't exist in Spanish anyway.

Do you think I am correct?

Can this be changed?

Ernesto.





RE: spanish stemmer

2004-08-23 Thread Chad Small
One more question to the group.  From what I have gathered, my choices for indexing 
and querying Spanish content are:

1.  StandardAnalyzer (I read that this analyzer can be used for European languages)

2.  SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS)  -- custom stop words from 
Ernesto's class below

Can I assume that choice 2 would be the better one for Spanish content?

thanks,
chad.



-Original Message-
From: Ernesto De Santis [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 3:31 PM
To: Lucene Users List
Subject: Re: spanish stemmer 


Because the SnowballAnalyzer and SpanishStemmer don't have a default
stopword set.

SnowballAnalyzer constructor:

  /** Builds the named analyzer with no stop words. */
  public SnowballAnalyzer(String name) {
this.name = name;
  }

Note the comment.

Bye,
Ernesto.

- Original Message - 
From: Chad Small [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, August 23, 2004 4:57 PM
Subject: RE: spanish stemmer


Excellent Ernesto.

Was there a reason you used your own stop word list and not just the default
constructor SnowballAnalyzer("Spanish")?

thanks,
chad.

-Original Message-
From: Ernesto De Santis [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 2:03 PM
To: Lucene Users List
Subject: Re: spanish stemmer


Yes, it is quite easy.

You need to write a wrapper for the Spanish Snowball initialization.

analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);

Below is the complete code.

Bye, Ernesto.


--
public class SpanishAnalyzer extends Analyzer {

    private static SnowballAnalyzer analyzer;

    private String SPANISH_STOP_WORDS[] = {
        "un", "una", "unas", "unos", "uno", "sobre", "todo", "también", "tras",
        "otro", "algún", "alguno", "alguna", "algunos", "algunas", "ser", "es",
        "soy", "eres", "somos", "sois", "estoy", "esta", "estamos", "estais",
        "estan", "en", "para", "atras", "porque", "por qué", "estado", "estaba",
        "ante", "antes", "siendo", "ambos", "pero", "por", "poder", "puede",
        "puedo", "podemos", "podeis", "pueden", "fui", "fue", "fuimos", "fueron",
        "hacer", "hago", "hace", "hacemos", "haceis", "hacen", "cada", "fin",
        "incluso", "primero", "desde", "conseguir", "consigo", "consigue",
        "consigues", "conseguimos", "consiguen", "ir", "voy", "va", "vamos",
        "vais", "van", "vaya", "bueno", "ha", "tener", "tengo", "tiene",
        "tenemos", "teneis", "tienen", "el", "la", "lo", "las", "los", "su",
        "aqui", "mio", "tuyo", "ellos", "ellas", "nos", "nosotros", "vosotros",
        "vosotras", "si", "dentro", "solo", "solamente", "saber", "sabes",
        "sabe", "sabemos", "sabeis", "saben", "ultimo", "largo", "bastante",
        "haces", "muchos", "aquellos", "aquellas", "sus", "entonces", "tiempo",
        "verdad", "verdadero", "verdadera", "cierto", "ciertos", "cierta",
        "ciertas", "intentar", "intento", "intenta", "intentas", "intentamos",
        "intentais", "intentan", "dos", "bajo", "arriba", "encima", "usar",
        "uso", "usas", "usa", "usamos", "usais", "usan", "emplear", "empleo",
        "empleas", "emplean", "empleamos", "empleais", "valor", "muy", "era",
        "eras", "eramos", "eran", "modo", "bien", "cual", "cuando", "donde",
        "mientras", "quien", "con", "entre", "sin", "trabajo", "trabajar",
        "trabajas", "trabaja", "trabajamos", "trabajais", "trabajan", "podria",
        "podrias", "podriamos", "podrian", "podriais", "yo", "aquel", "mi",
        "de", "a", "e", "i", "o", "u"};

    public SpanishAnalyzer() {
        analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);
    }

    public SpanishAnalyzer(String stopWords[]) {
        analyzer = new SnowballAnalyzer("Spanish", stopWords);
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return analyzer.tokenStream(fieldName, reader);
    }
}








Lucene with English and Spanish Best Practice?

2004-08-20 Thread Chad Small
Hello,

I'm interested in any feedback from anyone who has worked through implementing 
Internationalization (I18N) search with Lucene or has ideas for this requirement.  
Currently, we're using Lucene with straight English and are looking to add Spanish to 
the mix (with maybe more languages to follow).  

This is our current IndexWriter setup utilizing the PerFieldAnalyzerWrapper:

   PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new 
StandardAnalyzer());
   analyzer.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
   analyzer.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
   IndexWriter writer = new IndexWriter(indexDir, analyzer, create);

Would people suggest we switch this over to Snowball so there are English and Spanish 
Analyzers and IndexWriters?  Something like this:

PerFieldAnalyzerWrapper analyzerEnglish = new PerFieldAnalyzerWrapper(new 
SnowballAnalyzer("English"));
analyzerEnglish.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
analyzerEnglish.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
IndexWriter writerEnglish = new IndexWriter(indexDir, analyzerEnglish, create);

PerFieldAnalyzerWrapper analyzerSpanish = new PerFieldAnalyzerWrapper(new 
SnowballAnalyzer("Spanish"));
analyzerSpanish.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
analyzerSpanish.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
IndexWriter writerSpanish = new IndexWriter(indexDir, analyzerSpanish, create);


Are multiple indexes or mirrors of each index then usually created for every language? 
 We currently have 4 indexes that are all English.  Would we then create 4 more that 
are Spanish?  Then at search time we would determine the language and which set of 
indexes to search against, English or Spanish.

Or another approach could be to add a Spanish field to the existing 4 indexes since 
most of the indexes have only one field that will be translated from English to 
Spanish.


thanks a bunch,
chad.





RE: Searching in all

2004-04-01 Thread Chad Small
See MultiFieldQueryParser, like this:
 
String[] fields = getFieldsArray();
Query multiFieldQuery = MultiFieldQueryParser.parse(this.queryString,
                                                    fields,
                                                    new StandardAnalyzer());
System.out.println("multiFieldQuery: " + multiFieldQuery.toString());

-Original Message- 
From: Tate Avery [mailto:[EMAIL PROTECTED] 
Sent: Thu 4/1/2004 9:30 AM 
To: [EMAIL PROTECTED] 
Cc: 
Subject: Searching in all



Hello,

If I have, for example, 3 fields in my document (title, body, notes)... is 
there some easy way to search 'all'?


Below are the only 2 ideas I currently have/use:

1) If I want to search for 'x' in all, I do something like:
title:x OR body:x OR notes:x

... but this does not really work if you are searching for (a AND b) and a is in 
the title and b is in the notes, etc., leading to an explosion of boolean 
combinations, it seems.


2) Actually index an 'all' field for my document by just concatenating the 
content from the title, body, and notes fields.
... but this doubles my index size.  :(
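Option 2 can be sketched as a plain string concatenation at index time (`buildAll` is an invented helper here, not a Lucene API): you store the combined text in one extra field and query only that field.

```java
// Sketch of option 2: build an "all" field by concatenating the other
// fields at index time, then query just that one field.  The trade-off
// is that every term is stored twice, as the poster notes.
public class AllField {
    public static String buildAll(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (String f : fields) {
            if (f == null || f.isEmpty()) continue;  // skip missing fields
            if (sb.length() > 0) sb.append(' ');
            sb.append(f);
        }
        return sb.toString();
    }
}
```

The analyzer then tokenizes the combined string exactly as it would any single field, so (a AND b) works even when a came from the title and b from the notes.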


So, is there a better way out there?

Thanks,
Tate


RE: too many files open error

2004-03-26 Thread Chad Small
Is this :) serious?  
 
Because we have a need/interest in the new field sorting capabilities and QueryParser 
keyword handling of dashes (-) that would be in 1.4, I believe.  It's so much easier 
to explain that we'll use a final release of Lucene instead of a dev build of Lucene.
 
 
If so, what would an expected release date be?
 
thanks,
chad.

-Original Message- 
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Fri 3/26/2004 12:23 PM 
To: Lucene Users List 
Cc: 
Subject: Re: too many files open error



The compound format was added to Lucene 1.3 and was not part of 1.2. 
I'd definitely recommend upgrading.  Heck, Lucene 1.4 could be released
any day now :)

Erik


On Mar 26, 2004, at 12:25 PM, Charlie Smith wrote:

 I'm using lucene-1.2.jar as part of the build for this docSearcher
 application.
 Would these recommendations work for this or should I upgrade to
 lucene 1.3.

 In doing so, I'm not sure if a rewrite of the docSearcher will be
 necessary or
 not.


 Daniel Naber wrote on 3/26/04:
 Try IndexWriter.setUseCompoundFile(true) to limit the number of files.

 Erik Hatcher 3/26/2004 2:32:16 AM 
 If you are using Lucene 1.3, try using the index in compound format.
 You will have to rebuild (or convert) your index to this format.  The
 handy utility Luke will convert an index easily.

   Erik


 On Mar 25, 2004, at 9:34 PM, Charlie Smith wrote:

 I need to get solution to following error ASAP.  Please help me with
 this.
 I'm getting following error returned from call to

 snip

try {
    searcher = new IndexSearcher(
        IndexReader.open(indexName)   // create an IndexSearcher for our page
    );
} catch (Exception e) {               // any error that happens is probably due
                                      // to a permission problem or non-existent
                                      // or otherwise corrupt index
%>
<p>ERROR opening the Index - contact sysadmin!</p>
<p>While parsing query: <%=e.getMessage()%></p>
<%  error = true;                     // don't do anything up to the footer
}



 Output:
 ERROR opening the Index - contact sysadmin!

 While parsing query:
 /opt/famhistdev/fhstage/jbin/.docSearcher/indexes/fhstage_update/
 _3ff.f6 (Too
 many open files)

 /snip

 Charlie
 3/25/04




Lucene 1.4 - lobby for final release

2004-03-26 Thread Chad Small
thanks Erik.  Ok this is my official lobby effort for the release of 1.4 to final 
status.  Anyone else need/want a 1.4 release?
 
Does anyone have any information on 1.4 release plans?
 
thanks,
chad.

-Original Message- 
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Fri 3/26/2004 1:25 PM 
To: Lucene Users List 
Cc: 
Subject: Re: too many files open error



On Mar 26, 2004, at 1:33 PM, Chad Small wrote:
 Is this :) serious?

This is open-source.   I'm only as serious as it would take for someone
to push it through.  I don't know what the timeline is, although lots
of new features are available.

 Because we have a need/interest in the new field sorting capabilities
 and QueryParser keyword handling of dashes (-) that would be in 1.4,
 I believe.  It's so much easier to explain that we'll use a final
 release of Lucene instead of a dev build Lucene.

Why explain it?!  Just show great results and let that be the
explanation :)


 If so, what would an expected release date be?

*shrug* - feel free to lobby for it.  I don't know what else is planned
before a release.

Erik



RE: Query syntax on Keyword field question

2004-03-24 Thread Chad Small
Great info Morus,
 
After making the escape the dash change to the QueryParser:
 
Query query = QueryParser.parse("+category:HW\\-NCI_TOPICS AND SPACE",
                                "description",
                                analyzer);
Hits hits = searcher.search(query);
System.out.println("query.ToString = " + query.toString("description"));
assertEquals("HW-NCI_TOPICS kept as-is",
             "+category:HW\\-NCI_TOPICS +space", query.toString("description"));
 --note that this passes with the escape put in, so not as-is.
assertEquals("doc found!", 1, hits.length());
 
I'm still getting this output:
 
 domain.lucenesearch.KeywordAnalyzer:
  [HW-NCI_TOPICS] 
 
query.ToString = +category:HW\-NCI_TOPICS +space
 
junit.framework.AssertionFailedError: doc found! expected:1 but was:0
 
It looks like bug http://issues.apache.org/bugzilla/show_bug.cgi?id=27491 
was fixed today:
 
--- Additional Comments From Otis Gospodnetic mailto:[EMAIL PROTECTED]  
2004-03-24 10:10 ---

Although tft-monitor should not really result in a phrase query "tft monitor", I
agree that this is better than converting it to tft AND NOT monitor (tft -monitor).
Moreover, I have seen query syntax where '-' characters are used for phrase
queries instead of or in addition to quotes, so one could use either morus-walter
or "morus walter".

I applied your change, as it doesn't look like it breaks anything, and I hope
nobody relied on the ill behaviour where tft-monitor would result in an AND NOT query.
---
But I assume this fix won't come out for some time.  Is there a way I can get this fix 
sooner?  
I'm up against a deadline and would very much like this functionality. 
 
And to go one more step with the KeywordAnalyzer that I wrote, changing this method to 
skip the escape:
protected boolean isTokenChar(char c)
{
    if (c == '\\')
    {
        return false;
    }
    else
    {
        return true;
    }
}
The test then returns with a space:
 healthecare.domain.lucenesearch.KeywordAnalyzer:
  [HW-NCI_TOPICS] 
query.ToString = +category:HW -NCI_TOPICS +space
junit.framework.ComparisonFailure: HW-NCI_TOPICS kept as-is 
Expected:+category:HW\-NCI_TOPICS +space
Actual  :+category:HW -NCI_TOPICS +space   note space where escape was.
thanks,
chad.

-Original Message- 
From: Morus Walter [mailto:[EMAIL PROTECTED] 
Sent: Wed 3/24/2004 1:43 AM 
To: Lucene Users List 
Cc: 
Subject: RE: Query syntax on Keyword field question



Chad Small writes:
 Here is my attempt at a KeywordAnalyzer - although it is not working.  Excuse 
the length of the message, but I wanted to give actual code.
 
 With this output:
 
 Analzying HW-NCI_TOPICS
  org.apache.lucene.analysis.WhitespaceAnalyzer:
   [HW-NCI_TOPICS]
  org.apache.lucene.analysis.SimpleAnalyzer:
   [hw] [nci] [topics]
  org.apache.lucene.analysis.StopAnalyzer:
   [hw] [nci] [topics]
  org.apache.lucene.analysis.standard.StandardAnalyzer:
   [hw] [nci] [topics]
  healthecare.domain.lucenesearch.KeywordAnalyzer:
   [HW-NCI_TOPICS]
 
 query.ToString = category:HW -nci topics +space

 junit.framework.ComparisonFailure: HW-NCI_TOPICS kept as-is
 Expected:+category:HW-NCI_TOPICS +space
 Actual  :category:HW -nci topics +space
 

Well query parser does not allow `-' within words currently.
So before your analyzer is called, query parser reads one word HW, a `-'
operator, one word NCI_TOPICS.
The latter is analyzed as nci topics because it's not in field category
anymore, I guess.

I suggested to change this. See
http://issues.apache.org/bugzilla/show_bug.cgi?id=27491

Either you escape the - using category:HW\-NCI_TOPICS in your query
(untested. and I don't know where the escape character will be removed)
or you apply my suggested change.

Another option for using keywords with query parser might be adding a
keyword syntax to the query parser.
Something like category:key(HW-NCI_TOPICS) or category=HW-NCI_TOPICS.
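The escaping approach can also be done programmatically before the string reaches QueryParser. The helper and character list below are illustrative, not an official Lucene API; the set of characters QueryParser treats specially depends on the version:

```java
// Sketch: backslash-escape characters that the query parser treats as
// operators, so a keyword like "HW-NCI_TOPICS" survives parsing intact.
// The character list here is illustrative, not an authoritative set.
public class QueryEscape {
    public static String escape(String term) {
        StringBuilder sb = new StringBuilder();
        for (char c : term.toCharArray()) {
            if ("-+!(){}[]^\"~*?:\\".indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }
}
```

So escape("HW-NCI_TOPICS") yields the backslash-escaped form to embed in the query string, while plain terms pass through unchanged.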

HTH
Morus


RE: Query syntax on Keyword field question

2004-03-24 Thread Chad Small
thanks.  I was in the process of getting javacc3.2 setup.  I'll have to hunt for 2.x.
 
chad.

-Original Message- 
From: Morus Walter [mailto:[EMAIL PROTECTED] 
Sent: Wed 3/24/2004 8:00 AM 
To: Lucene Users List 
Cc: 
Subject: RE: Query syntax on Keyword field question



Hi Chad,

 But I assume this fix won't come out for some time.  Is there a way I can 
get this fix sooner? 
 I'm up against a deadline and would very much like this functionality.

Just get Lucene's sources, change the line and recompile.
The difficult part is to get a copy of JavaCC 2 (3 won't do), but I think
this can be found in the archives.

 
 And to go one more step with the KeywordAnalyzer that I wrote, changing this 
method to skip the escape:
 protected boolean isTokenChar(char c)
 {
     if (c == '\\')
     {
         return false;
     }
     else
     {
         return true;
     }
 }
 The test then returns with a space:
  healthecare.domain.lucenesearch.KeywordAnalyzer:
   [HW-NCI_TOPICS]
 query.ToString = +category:HW -NCI_TOPICS +space
 junit.framework.ComparisonFailure: HW-NCI_TOPICS kept as-is
 Expected:+category:HW\-NCI_TOPICS +space
 Actual  :+category:HW -NCI_TOPICS +space   note space where escape 
was.

Sure. If \ isn't a token char, it ends the token.
So you will have to look for a different way of implementing the
analyzer. Shouldn't be that difficult since you have only one token.
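One way to sketch that different implementation: consume the whole field value as a single token and strip the escape character there. `singleToken` is an invented helper; a real Lucene analyzer would read from a Reader and emit a Token, but the string handling is the same:

```java
// Sketch of a one-token "keyword" analysis: the entire field value
// becomes a single token, and the backslash escape character is
// removed there instead of terminating the token.  A real analyzer
// would consume a Reader and emit a Token object.
public class OneTokenSketch {
    public static String singleToken(String fieldValue) {
        StringBuilder sb = new StringBuilder();
        for (char c : fieldValue.toCharArray()) {
            if (c != '\\') sb.append(c);  // drop escape characters
        }
        return sb.toString().trim();
    }
}
```

With this, the escaped keyword comes back as one intact token, dash and all, which is what the KeywordAnalyzer was after.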

Maybe it should be the job of the query parser to remove the escape character
(would make more sense to me at least) but that would be another change
of the query parser...

Morus


RE: Query syntax on Keyword field question

2004-03-24 Thread Chad Small
For others reference - here is the old version url:
 
https://javacc.dev.java.net/servlets/ProjectDocumentList?folderID=212

-Original Message- 
From: Chad Small 
Sent: Wed 3/24/2004 8:07 AM 
To: Lucene Users List 
Cc: 
Subject: RE: Query syntax on Keyword field question



thanks.  I was in the process of getting javacc3.2 setup.  I'll have to hunt 
for 2.x.

chad.

-Original Message-
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: Wed 3/24/2004 8:00 AM
To: Lucene Users List
Cc:
Subject: RE: Query syntax on Keyword field question
   
   

Hi Chad,
   
 But I assume this fix won't come out for some time.  Is there a way 
I can get this fix sooner?
 I'm up against a deadline and would very much like this 
functionality.
   
Just get lucenes sources, change the line and recompile.
The difficult part is to get a copy of JavaCC 2 (3 won't do), but I 
think
this can be found in the archives.
   

 And to go one more step with the KeywordAnalyzer that I wrote, 
changing this method to skip the escape:
 protected boolean isTokenChar(char c)
 {
  if (c == '\\')
  {
 return false;
  }
  else
  {
 return true;
  }
   }
 The test then returns with a space:
  healthecare.domain.lucenesearch.KeywordAnalyzer:
   [HW-NCI_TOPICS]
 query.ToString = +category:HW -NCI_TOPICS +space
 junit.framework.ComparisonFailure: HW-NCI_TOPICS kept as-is
 Expected:+category:HW\-NCI_TOPICS +space
 Actual  :+category:HW -NCI_TOPICS +space   note space where 
escape was.
   
Sure. If \ isn't a token char, it ends the token.
So you will have to look for a different way of implementing the
analyzer. Shouldn't be that difficult since you have only one token.
   
Maybe it should be the job of the query parser to remove the escape 
character
(would make more sense to me at least) but that would be another change
of the query parser...
   
Morus
   

RE: Query syntax on Keyword field question

2004-03-24 Thread Chad Small
I'm getting this with 3.2:
 
javacc-check:
BUILD FAILED
file:D:/applications/lucene-1.3-final/build.xml:97:
  ##
  JavaCC not found.
  JavaCC Home: /applications/javacc-3.2/bin
  JavaCC JAR: D:\applications\javacc-3.2\bin\bin\lib\javacc.jar
  Please download and install JavaCC from:
  http://javacc.dev.java.net
  Then, create a build.properties file either in your home
  directory, or within the Lucene directory and set the javacc.home
  property to the path where JavaCC is installed. For example,
  if you installed JavaCC in /usr/local/java/javacc-3.2, then set the
  javacc.home property to:
  javacc.home=/usr/local/java/javacc-3.2
  If you get an error like the one below, then you have not installed
  things correctly. Please check all your paths and try again.
  java.lang.NoClassDefFoundError: org.javacc.parser.Main
  ##
 
even though I put a build.properties file in my root lucene directory with this in it:
javacc.home=/applications/javacc-3.2/bin
 
hmm?

-Original Message- 
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Wed 3/24/2004 8:29 AM 
To: Lucene Users List 
Cc: 
Subject: RE: Query syntax on Keyword field question



JavaCC 3.2 works for me.

Otis

--- Chad Small [EMAIL PROTECTED] wrote:
 thanks.  I was in the process of getting javacc3.2 setup.  I'll have
 to hunt for 2.x.
 
 chad.

   -Original Message-
   From: Morus Walter [mailto:[EMAIL PROTECTED]
   Sent: Wed 3/24/2004 8:00 AM
   To: Lucene Users List
   Cc:
   Subject: RE: Query syntax on Keyword field question
  
  

   Hi Chad,
  
But I assume this fix won't come out for some time.  Is there a
 way I can get this fix sooner?
I'm up against a deadline and would very much like this
 functionality.
  
   Just get lucenes sources, change the line and recompile.
   The difficult part is to get a copy of JavaCC 2 (3 won't do), but I
 think
   this can be found in the archives.
  
   
And to go one more step with the KeywordAnalyzer that I wrote,
 changing this method to skip the escape:
protected boolean isTokenChar(char c)
{
 if (c == '\\')
 {
return false;
 }
 else
 {
return true;
 }
  }
The test then returns with a space:
 healthecare.domain.lucenesearch.KeywordAnalyzer:
  [HW-NCI_TOPICS]
query.ToString = +category:HW -NCI_TOPICS +space
junit.framework.ComparisonFailure: HW-NCI_TOPICS kept as-is
Expected:+category:HW\-NCI_TOPICS +space
Actual  :+category:HW -NCI_TOPICS +space   note space where
 escape was.
  
   Sure. If \ isn't a token char, it end's the token.
   So you will have to look for a different way of implementing the
   analyzer. Shouldn't be that difficult since you have only one token.
  
   Maybe it should be the job of the query parser to remove the escape
 character
   (would make more sense to me at least) but that would be another
 change
   of the query parser...
  
   Morus
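The one-token approach Morus describes can be sketched outside Lucene.  This is a hypothetical stand-in, not the real Analyzer/TokenStream API: the whole field value is read and emitted as a single token, so a backslash (or a hyphen) never acts as a token boundary.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Stand-in for the one-token analyzer idea: instead of a CharTokenizer
// (which has to treat some character as a separator), read the entire
// field value and emit it as a single token.  In real Lucene 1.x code
// this logic would live inside an Analyzer/TokenStream pair.
public class OneTokenDemo
{
   // Reads the whole input and returns it as one "token".
   static String singleToken(Reader reader)
   {
      try
      {
         StringBuffer sb = new StringBuffer();
         char[] buf = new char[256];
         int n;
         while ((n = reader.read(buf)) != -1)
         {
            sb.append(buf, 0, n);
         }
         return sb.toString();
      }
      catch (IOException e)
      {
         throw new RuntimeException(e);
      }
   }

   public static void main(String[] args)
   {
      // The escaped keyword survives intact instead of being split
      // at the backslash.
      System.out.println(singleToken(new StringReader("HW\\-NCI_TOPICS")));
   }
}
```

With this shape there is nothing for the query parser's escape character to collide with, because the analyzer never inspects individual characters at all.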
  

 -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
  
  

 






RE: Query syntax on Keyword field question

2004-03-24 Thread Chad Small
Ahh, without the bin on the javacc.home - 3.2 seems to work for me too.
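That is, pointing javacc.home at the JavaCC install root rather than its bin directory (assuming the same /applications/javacc-3.2 layout as in the error output below):

javacc.home=/applications/javacc-3.2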

-Original Message- 
From: Chad Small 
Sent: Wed 3/24/2004 8:34 AM 
To: Lucene Users List 
Cc: 
Subject: RE: Query syntax on Keyword field question



I'm getting this with 3.2:

javacc-check:
BUILD FAILED
file:D:/applications/lucene-1.3-final/build.xml:97:
  ##
  JavaCC not found.
  JavaCC Home: /applications/javacc-3.2/bin
  JavaCC JAR: D:\applications\javacc-3.2\bin\bin\lib\javacc.jar
  Please download and install JavaCC from:
  http://javacc.dev.java.net
  Then, create a build.properties file either in your home
  directory, or within the Lucene directory and set the javacc.home
  property to the path where JavaCC is installed. For example,
  if you installed JavaCC in /usr/local/java/javacc-3.2, then set the
  javacc.home property to:
  javacc.home=/usr/local/java/javacc-3.2
  If you get an error like the one below, then you have not installed
  things correctly. Please check all your paths and try again.
  java.lang.NoClassDefFoundError: org.javacc.parser.Main
  ##

even though I put a build.properties file in my root lucene directory with 
this in it:
javacc.home=/applications/javacc-3.2/bin

hmm?

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wed 3/24/2004 8:29 AM
To: Lucene Users List
Cc:
Subject: RE: Query syntax on Keyword field question
   
   

How to order search results by Field value?

2004-03-24 Thread Chad Small
Was there any conclusion to message:
 
http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=6762
 
Regarding ordering by a field?  I have a similar need and didn't see the resolution 
in that thread.  Is there a current patch against 1.3-final that I could use?  
 
My other option, I guess, is just to code a comparator on a collection built off of 
the Hits.
 
thanks,
chad.


Query syntax on Keyword field question

2004-03-23 Thread Chad Small
Hello,
 
How can I format a query to get a hit?
 
I'm using the StandardAnalyzer() at both index and search time.
 
If I'm indexing a field like this:
 
luceneDocument.add(Field.Keyword("category", "HW-NCI_TOPICS"));

I've tried the following with no success:
 
//  String searchArgs = "HW\\-NCI_TOPICS";
//  String searchArgs = "HW\\-NCI_TOPICS".toLowerCase();
//  String searchArgs = "+HW+NCI+TOPICS";
  //this works with .Text field
//  String searchArgs = "+hw+nci+topics";
//  String searchArgs = "hw nci topics";
 
thanks,
chad.


RE: Query syntax on Keyword field question

2004-03-23 Thread Chad Small
I have since learned that using the TermQuery instead of the MultiFieldQueryParser 
works for the keyword field in question below (HW-NCI_TOPICS).
 
apiQuery = new BooleanQuery();
apiQuery.add(new TermQuery(new Term("category", "HW-NCI_TOPICS")), true, false);
 
This finds a match.
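A rough sketch (plain Java, not the real Lucene classes) of why the two behave differently: Field.Keyword stores the value as one exact, untokenized term, and a TermQuery compares against that term verbatim, while QueryParser first runs the query text through an analyzer that lowercases and splits it.  The hypothetical analyze() below only approximates what a StandardAnalyzer-style analyzer does:

```java
import java.util.Arrays;
import java.util.List;

// Why TermQuery hits but the parsed query misses: Field.Keyword indexes
// the value untokenized, while QueryParser analyzes the query text first.
public class KeywordMatchDemo
{
   // Hypothetical stand-in for a lowercasing, punctuation-splitting
   // analyzer (only approximates StandardAnalyzer's behavior).
   static List analyze(String text)
   {
      return Arrays.asList(text.toLowerCase().split("[-_]"));
   }

   // Field.Keyword stores the value as-is, as a single term.
   static String storedKeywordTerm(String value)
   {
      return value;
   }

   public static void main(String[] args)
   {
      String stored = storedKeywordTerm("HW-NCI_TOPICS");
      // The analyzed query terms no longer match the stored term:
      System.out.println(analyze("HW-NCI_TOPICS"));   // [hw, nci, topics]
      // A TermQuery-style verbatim comparison does:
      System.out.println(stored.equals("HW-NCI_TOPICS"));
   }
}
```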
 
I found a message that talked about having to use the Query API when searching 
Keyword fields in the index.  Is this true?
 
Is there not a way to get the MultiFieldQueryParser to find a match on this keyword?
 
thanks,
chad.

-Original Message- 
From: Chad Small 
Sent: Tue 3/23/2004 10:57 AM 
To: [EMAIL PROTECTED] 
Cc: 
Subject: Query syntax on Keyword field question





RE: Query syntax on Keyword field question

2004-03-23 Thread Chad Small
Thank you, Erik and Incze.  I now understand the issue and I'm trying to create a 
KeywordAnalyzer as suggested by your book excerpt, Erik:
 
http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=6727
 
However, not being all that familiar with the Analyzer framework, I'm not sure how to 
implement the KeywordAnalyzer even though it might be trivial :)  Any hints, code, 
or messages to look at?
 
from message link above
Ok, here is the section from Lucene in Action.  I'll leave the 
development of KeywordAnalyzer as an exercise for the reader (although 
its implementation is trivial, one of the simplest analyzers possible - 
only emit one token of the entire contents).  I hope this helps.

Erik


thanks again,
chad.

-Original Message- 
From: Incze Lajos [mailto:[EMAIL PROTECTED] 
Sent: Tue 3/23/2004 8:08 PM 
To: Lucene Users List 
Cc: 
Subject: Re: Query syntax on Keyword field question



On Tue, Mar 23, 2004 at 08:10:15PM -0500, Erik Hatcher wrote:
 QueryParser and Field.Keyword fields are a strange mix.  For some
 background, check the archives as this has been covered pretty
 extensively.

 A quick answer is yes you can use MFQP and QP with keyword fields,
 however you need to be careful which analyzer you use. 
 PerFieldAnalyzerWrapper is a good solution - you'll just need to use an
 analyzer for your keyword field which simply tokenizes the whole string
 as one chunk.  Perhaps such an analyzer should be made part of the
 core?

   Erik

I've implemented such an analyzer, but it's only a partial solution:
if your keyword field contains spaces, the QP would split
the query, e.g.:

NOTTOKENIZED:(term with spaces*)

would give you no hit even with a not-tokenized field
"term with spaces and other useful things".  The full solution
would be to be able to tell the QP not to split at spaces,
either by a 'do not split till apos' syntax, or by the good ol'
backslash: do\ not\ notice\ these\ spaces.

incze





RE: Query syntax on Keyword field question

2004-03-23 Thread Chad Small
Here is my attempt at a KeywordAnalyzer - although it is not working.  Excuse the length 
of the message, but I wanted to give actual code.
 
package domain.lucenesearch;
 
import java.io.*;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;
 
public class KeywordAnalyzer extends Analyzer
{
   public TokenStream tokenStream(String s, Reader reader)
   {
  return new KeywordTokenizer(reader);
   }
 
   private class KeywordTokenizer extends CharTokenizer
   {
  public KeywordTokenizer(Reader in)
  {
 super(in);
  }
  /**
   * Collects all characters.
   */
  protected boolean isTokenChar(char c)
  {
 return true;
  }
   }
}

However, this test fails:
 
public class KeywordAnalyzerTest extends TestCase
{
   RAMDirectory directory;
   private IndexSearcher searcher;
 
   public void setUp() throws Exception
   {
  directory = new RAMDirectory();
  IndexWriter writer = new IndexWriter(directory,
   new StandardAnalyzer(),
   true);
  Document doc = new Document();
  doc.add(Field.Keyword("category", "HW-NCI_TOPICS"));
  doc.add(Field.Text("description", "Illidium Space Modulator"));
  writer.addDocument(doc);
  writer.close();
  searcher = new IndexSearcher(directory);
   }
 
   public void testPerFieldAnalyzer() throws Exception
   {
  analyze("HW-NCI_TOPICS");
 
  PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
  analyzer.addAnalyzer("category", new KeywordAnalyzer());   //|#1
  Query query = QueryParser.parse("category:HW-NCI_TOPICS AND SPACE",
  "description",
  analyzer);
  Hits hits = searcher.search(query);
  System.out.println("query.ToString = " + query.toString("description"));
  assertEquals("HW-NCI_TOPICS kept as-is",
   "+category:HW-NCI_TOPICS +space", query.toString("description"));
  assertEquals("doc found!", 1, hits.length());
   }
 
   private void analyze(String text) throws Exception
   {
  Analyzer[] analyzers = new Analyzer[]{
 new WhitespaceAnalyzer(),
 new SimpleAnalyzer(),
 new StopAnalyzer(),
 new StandardAnalyzer(),
 new KeywordAnalyzer(),
 //new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS)
  };
  System.out.println("Analyzing \"" + text + "\"");
  for (int i = 0; i < analyzers.length; i++)
  {
 Analyzer analyzer = analyzers[i];
 System.out.println("\t" + analyzer.getClass().getName() + ":");
 System.out.print("\t\t");
 TokenStream stream = analyzer.tokenStream("category", new StringReader(text));
 while (true)
 {
Token token = stream.next();
if (token == null) break;
System.out.print("[" + token.termText() + "] ");
 }
 System.out.println("\n");
  }
   }
}
 
With this output:
 
Analyzing "HW-NCI_TOPICS"
 org.apache.lucene.analysis.WhitespaceAnalyzer:
  [HW-NCI_TOPICS] 
 org.apache.lucene.analysis.SimpleAnalyzer:
  [hw] [nci] [topics] 
 org.apache.lucene.analysis.StopAnalyzer:
  [hw] [nci] [topics] 
 org.apache.lucene.analysis.standard.StandardAnalyzer:
  [hw] [nci] [topics] 
 healthecare.domain.lucenesearch.KeywordAnalyzer:
  [HW-NCI_TOPICS] 
 
query.ToString = category:HW -nci topics +space

junit.framework.ComparisonFailure: HW-NCI_TOPICS kept as-is 
Expected:+category:HW-NCI_TOPICS +space
Actual  :category:HW -nci topics +space
 
See anything?
thanks,
chad.

-Original Message- 
From: Chad Small 
Sent: Tue 3/23/2004 8:48 PM 
To: Lucene Users List 
Cc: 
Subject: RE: Query syntax on Keyword field question


