Re: Zero hits for queries ending with a number

2004-04-03 Thread lucene
On Friday 02 April 2004 23:48, Erik Hatcher wrote:
> On Apr 2, 2004, at 10:00 AM, [EMAIL PROTECTED] wrote:
> > On Saturday 13 March 2004 11:06, Otis Gospodnetic wrote:
> > > Field.Keyword is suitable for storing data like Url.  Give that a try.
> >
> > I just tried this a minute ago and found that I cannot use wildcards
> > with Keywords: url:www.yahoo.*
>
> You *can* use wildcards with keywords (in fact, a keyword really has no
> meaning once indexed - everything is a term at that point).

Well, I just tried. I also was surprised actually - but it just didn't work.

I can use wildcards for

  doc.add(Field.Text("url", row.getString("url")));

but I cannot for

  doc.add(Field.Keyword("url", row.getString("url")));

> - create a utility (I've posted one on the list in the past) that
> shows what your analyzer is doing graphically.

Interesting. Can you give me subject/date of that posting?

Timo

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Zero hits for queries ending with a number

2004-04-03 Thread Erik Hatcher
On Apr 3, 2004, at 3:19 AM, [EMAIL PROTECTED] wrote:
> > You *can* use wildcards with keywords (in fact, a keyword really has
> > no meaning once indexed - everything is a term at that point).
>
> Well, I just tried. I also was surprised actually - but it just
> didn't work.
>
> I can use wildcards for
>
>   doc.add(Field.Text("url", row.getString("url")));
>
> but I cannot for
>
>   doc.add(Field.Keyword("url", row.getString("url")));
>
> > - create a utility (I've posted one on the list in the past) that
> > shows what your analyzer is doing graphically.
>
> Interesting. Can you give me subject/date of that posting?

AnalysisDemo in this article: 
http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html

Provide us the results of running your url through that, using the same 
analyzer you are using, and also do the same on .toString of the query 
you parsed.  Those two pieces of info will tell all.

	Erik
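
For readers without Lucene at hand, here is a rough sketch of the kind of output such a utility prints. It does not call Lucene at all: TokenDump and its methods are hypothetical stand-ins that only approximate splitting on whitespace (as WhitespaceAnalyzer does) and splitting on non-letters with lower-casing (as SimpleAnalyzer does). The AnalysisDemo from the article remains the tool to use against a real analyzer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Lucene-free stand-in for an analyzer dump utility.
final class TokenDump {

    // Approximates whitespace tokenization: split on runs of whitespace only.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String part : text.trim().split("\\s+")) {
            if (part.length() > 0) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Approximates letter tokenization: maximal runs of letters, lower-cased.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Formats tokens in the bracketed style used in this thread: [a] [b] [c]
    static String format(List<String> tokens) {
        StringBuilder out = new StringBuilder();
        for (String token : tokens) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append('[').append(token).append(']');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String url = "http://www.yahoo.com/foo/bar.html";
        // prints [http://www.yahoo.com/foo/bar.html]
        System.out.println(format(whitespaceTokens(url)));
        // prints [http] [www] [yahoo] [com] [foo] [bar] [html]
        System.out.println(format(letterTokens(url)));
    }
}
```

The point of such a dump is to compare the terms that end up in the index with the terms the parsed query asks for; if the two lists differ, no hits come back.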



Re: Zero hits for queries ending with a number

2004-04-03 Thread lucene
On Saturday 03 April 2004 11:48, Erik Hatcher wrote:
> Provide us the results of running your url through that, using the same

SnowballAnalyzer(German2):

Analzying http://www.yahoo.com/foo/bar.html;
org.apache.lucene.analysis.WhitespaceAnalyzer:
[http://www.yahoo.com/foo/bar.html] 

org.apache.lucene.analysis.SimpleAnalyzer:
[http] [www] [yahoo] [com] [foo] [bar] [html] 

org.apache.lucene.analysis.StopAnalyzer:
[http] [www] [yahoo] [com] [foo] [bar] [html] 

org.apache.lucene.analysis.standard.StandardAnalyzer:
[http] [www.yahoo.com] [foo] [bar.html] 

org.apache.lucene.analysis.snowball.SnowballAnalyzer:
[http] [www.yahoo.com] [foo] [bar.html] 

> analyzer you are using, and also do the same on .toString of the query
> you parsed.  Those two pieces of info will tell all.

url:www.yahoo* url:www.yahoo* url:www.yahoo* url:www.yahoo* url:www.yahoo* 
url:www.yahoo*

Well, I actually use a MultiFieldQueryParser, that's probably why the term 
does appear so often. Strange parser, it should be clear that an explicit 
url:xyz should only look in the url field, shouldn't it?

Timo




Re: Zero hits for queries ending with a number

2004-04-03 Thread Erik Hatcher
Ok, we're getting somewhere now.

So, where is the exception you encountered when using this utility 
code?!  (i.e. it didn't throw an exception, so something is different 
in your usage in your code).

I tried this:

Query query = MultiFieldQueryParser.parse("date:[20030101 TO 20030202]",
    new String[] { "id", "title", "summary", "contents", "date" },
    new GermanAnalyzer());

System.out.println("query = " + query.toString());

And it worked fine (only duplicated the query for each field).  No 
exception at all.  Of course I'm guessing on your analyzer since you 
didn't provide that detail (although it shouldn't matter in the 
exception you experienced).

On Apr 3, 2004, at 6:06 AM, [EMAIL PROTECTED] wrote:
> SnowballAnalyzer(German2):
>
> Analzying http://www.yahoo.com/foo/bar.html;
> org.apache.lucene.analysis.snowball.SnowballAnalyzer:
> [http] [www.yahoo.com] [foo] [bar.html]

So this is the analyzer you want to use, right?

Wildcards should work on www.yahoo.*

What is the German2 stemmer for Snowball?

You've introduced a lot of variables to your equation here: 
MultiFieldQueryParser and a non-standard Snowball stemmer.  I had to 
pull each of these details out of you, and each one is critical to 
understanding the problem.

> > analyzer you are using, and also do the same on .toString of the query
> > you parsed.  Those two pieces of info will tell all.
>
> url:www.yahoo* url:www.yahoo* url:www.yahoo* url:www.yahoo*
> url:www.yahoo* url:www.yahoo*
>
> Well, I actually use a MultiFieldQueryParser, that's probably why the
> term does appear so often. Strange parser, it should be clear that an
> explicit url:xyz should only look in the url field, shouldn't it?

Do you really need to query on multiple fields?  Why not just use the 
plain QueryParser?  If you need an aggregate field, create one at index 
time.  QueryParsing is problematic enough, but adding in MFQP makes it 
even more complicated.

Which Analyzer are you using for indexing?  This same SnowballAnalyzer 
with German2 stemmer?

	Erik



Re: Zero hits for queries ending with a number

2004-04-03 Thread lucene
On Saturday 03 April 2004 15:19, Erik Hatcher wrote:
> date:[20030101 TO 20030202]

I found the/my bug. 

Since Lucene is case-sensitive, I lower-case all queries for the user's 
convenience. The ParseException is thrown because the "TO" becomes "to".

Well, I really think Lucene needs to daff such stumbling blocks aside...
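
If you do lower-case user input yourself, one way around this particular trap is to leave the upper-case keywords of the query syntax alone. The following is only a sketch under stated assumptions: QueryLowercaser and its keyword list (TO, AND, OR, NOT) are made up for illustration and are not part of Lucene, and the method splits on whitespace only, so it knows nothing about quoting or escaping.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: lower-cases a query string but keeps syntax keywords.
final class QueryLowercaser {

    // Assumed list of keywords that must stay upper-case.
    private static final Set<String> KEYWORDS =
        new HashSet<String>(Arrays.asList("TO", "AND", "OR", "NOT"));

    // Lower-cases every whitespace-separated word except the keywords above.
    // Runs of whitespace are collapsed to a single space.
    static String lowercaseTerms(String query) {
        StringBuilder out = new StringBuilder();
        for (String word : query.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(KEYWORDS.contains(word)
                ? word
                : word.toLowerCase(Locale.ROOT));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints date:[20030101 TO 20030202]
        System.out.println(lowercaseTerms("date:[20030101 TO 20030202]"));
        // prints foo AND url:www.yahoo*
        System.out.println(lowercaseTerms("Foo AND url:WWW.Yahoo*"));
    }
}
```

Note that this still lower-cases field names along with the terms, exactly as a blanket toLowerCase() would, so it only helps when all field names are lower-case to begin with.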




Re: Zero hits for queries ending with a number

2004-04-03 Thread Erik Hatcher
On Apr 3, 2004, at 9:59 AM, [EMAIL PROTECTED] wrote:
> On Saturday 03 April 2004 15:19, Erik Hatcher wrote:
> > date:[20030101 TO 20030202]
>
> I found the/my bug.
>
> Since Lucene is case-sensitive, I do lower-case all queries for user's
> convenience. The ParseException is thrown because the "TO" becomes
> "to".
>
> Well, I really think Lucene needs to daff such stumbling blocks
> aside...

No objections that error messages and such could be made clearer.  
Patches welcome!  Care to submit better error message handling in this 
case?  Or perhaps allow lower-case "to"?

But, also, folks need to really step back and practice basic 
troubleshooting skills.  I asked you if that string was what you passed 
to the QueryParser and you said yes, when in fact it was not.  And you 
slowly fed more details of your scenario (MFQP, some German 
SnowballAnalyzer variant).  Reduce the variables in the equation and 
narrow things down until it works and then incrementally add 
complexity.  I cannot encourage folks enough to try some JUnit 
test-driven *learning* by exploring various scenarios.

	Erik



Re: Zero hits for queries ending with a number

2004-04-03 Thread lucene
On Saturday 03 April 2004 17:11, Erik Hatcher wrote:
> No objections that error messages and such could be made clearer.
> Patches welcome!  Care to submit better error message handling in this
> case?  Or perhaps allow lower-case "to"?

I think the best would be if Lucene would simply have a 
setCaseSensitive(boolean).

IMHO it's in any case a bad idea to make searches case-sensitive (per 
default).

> But, also, folks need to really step back and practice basic
> troubleshooting skills.  I asked you if that string was what you passed
> to the QueryParser and you said yes, when in fact it was not.  And you

I forgot that I did lower-case it. In fact I even output it in its original 
state but lower-case it just before I pass it to lucene. That lower-casing is 
what I would call a hack and hence it's no surprise that I forgot it :-)

Timo




Re: Zero hits for queries ending with a number

2004-04-03 Thread Erik Hatcher
On Apr 3, 2004, at 10:34 AM, [EMAIL PROTECTED] wrote:
> I forgot that I did lower-case it. In fact I even output it in its
> original state but lower-case it just before I pass it to lucene. That
> lower-casing is what I would call a hack and hence it's no surprise
> that I forgot it :-)

But why even lowercase?  That is what an analyzer typically does anyway 
(look at the output from AnalysisDemo to see).

Note that there is a switch on QueryParser (MultiFieldQueryParser 
is lacking in this respect, another reason not to use it) that 
lowercases wildcard terms automatically: 
setLowercaseWildcardTerms(true).  Wildcard terms are not analyzed by 
QueryParser, so this was added to account for it.

	Erik
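
To see why such a switch matters, here is a small Lucene-free sketch. It assumes the analyzer lower-cased the terms at index time and models a trailing-wildcard query as a plain, case-sensitive prefix comparison. PrefixMatchDemo and matchPrefix are made-up names, and the boolean flag merely plays the role that setLowercaseWildcardTerms(true) plays in QueryParser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical model of a trailing-wildcard lookup against indexed terms.
final class PrefixMatchDemo {

    // Returns the indexed terms that start with the prefix (exact comparison).
    // When lowercasePrefix is true, the prefix is lower-cased first.
    static List<String> matchPrefix(List<String> indexedTerms,
                                    String prefix,
                                    boolean lowercasePrefix) {
        String effective = lowercasePrefix
            ? prefix.toLowerCase(Locale.ROOT)
            : prefix;
        List<String> hits = new ArrayList<String>();
        for (String term : indexedTerms) {
            if (term.startsWith(effective)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Terms as an analyzer with a lower-casing step would have indexed them.
        List<String> terms = Arrays.asList("www.yahoo.com", "www.example.org");
        // prints []
        System.out.println(matchPrefix(terms, "WWW.Yahoo", false));
        // prints [www.yahoo.com]
        System.out.println(matchPrefix(terms, "WWW.Yahoo", true));
    }
}
```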



Re: Zero hits for queries ending with a number

2004-04-03 Thread Tatu Saloranta
On Saturday 03 April 2004 08:34, [EMAIL PROTECTED] wrote:
> On Saturday 03 April 2004 17:11, Erik Hatcher wrote:
> > No objections that error messages and such could be made clearer.
> > Patches welcome!  Care to submit better error message handling in this
> > case?  Or perhaps allow lower-case "to"?
>
> I think the best would be if Lucene would simply have a
> setCaseSensitive(boolean).
>
> IMHO it's in any case a bad idea to make searches case-sensitive (per
> default).

I'd have to disagree. I think that the search engine core should not have to 
bother with details of character sets, such as lower-casing. Rules for 
lower/upper/initial/mixed case for all Unicode languages are rather 
involved... and if you tried to do that, the next thing would be whether 
accentuation and umlaut marks should matter or not (which is language 
dependent). That's why to me the natural way to go is for the core to do a 
direct comparison and leave case out of it when executing queries. This does 
not prevent anyone from implementing such functionality (see below).

I think the architecture and design of the Lucene core is delightfully simple. 
One can easily create case-independent functionality by using proper analyzers 
and (for the most part) configuring QueryParser. I would agree, however, that 
QueryParser is a victim of its own success; it's too often used in situations 
where one really should create a proper GUI that builds the query. Backend code 
can then mangle input as it sees fit, and build query objects.

QueryParser is more natural for quick-n-dirty scenarios, where one just has to 
slap something together quickly, or if one only has a textual interface to deal 
with. It's a nice thing to have, but it has its limitations; there's no way to 
create one parser that's perfect for every use(r).

What could be done would be to make sure all examples / demo web apps 
implement case-insensitive indexing and searching, since that is often what 
is needed?

-+ Tatu +-





Re: Zero hits for queries ending with a number

2004-04-03 Thread Erik Hatcher
Extremely well said, Tatu!





Re: Zero hits for queries ending with a number

2004-04-02 Thread lucene
On Saturday 13 March 2004 11:06, Otis Gospodnetic wrote:
> Field.Keyword is suitable for storing data like Url.  Give that a try.

I just tried this a minute ago and found that I cannot use wildcards with 
Keywords: url:www.yahoo.*




Re: Zero hits for queries ending with a number

2004-04-02 Thread Erik Hatcher
On Apr 2, 2004, at 10:00 AM, [EMAIL PROTECTED] wrote:
> On Saturday 13 March 2004 11:06, Otis Gospodnetic wrote:
> > Field.Keyword is suitable for storing data like Url.  Give that a try.
>
> I just tried this a minute ago and found that I cannot use wildcards
> with Keywords: url:www.yahoo.*

You *can* use wildcards with keywords (in fact, a keyword really has no 
meaning once indexed - everything is a term at that point).

99% of the issues people have with things like this end up being 
Analyzer/QueryParser related.

A few quick pieces of advice:

 - use Luke to see what is inside your index and understand what it 
looks like from the inside.
 - create a utility (I've posted one on the list in the past) that 
shows what your analyzer is doing graphically.
 - use Query.toString to output what QueryParser did to your query 
expression.

Armed with the above bits of trivia, you have the information to 
troubleshoot the situation first-hand.

	Erik



RE: Zero hits for queries ending with a number

2004-03-24 Thread Morris Mizrahi
Thanks to Otis, Morus, and Erik for their responses to my question.

I see that my question is also related to the posting "Query syntax on
Keyword field question".

I tried all of your suggestions.
When using
a) the tokens generated by the analyzer and
b) the parsed query (using the toString method)
to debug StandardAnalyzer, I saw that it does properly pass in the
string with the number attached to it. I don't understand why Field.Text
did not work with StandardAnalyzer.

I tried WhitespaceAnalyzer and that did not work.

I have tried implementing a custom analyzer like KeywordAnalyzer, and
using PerFieldAnalyzerWrapper.

I think the custom analyzer I created is not properly doing what a
KeywordAnalyzer would do.

Erik, could you please post what KeywordAnalyzer should look like?

I can't wait until the book you guys are developing comes out.

Thanks very much.

   Morris





Re: Zero hits for queries ending with a number

2004-03-24 Thread Erik Hatcher
On Mar 24, 2004, at 5:58 PM, Morris Mizrahi wrote:
> I think the custom analyzer I created is not properly doing what a
> KeywordAnalyzer would do.
>
> Erik, could you please post what KeywordAnalyzer should look like?

It should simply tokenize the entire input as a single token.  Incze 
Lajos posted a NonTokenizingTokenizer early today, in fact, that does 
the trick.

	Erik
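
The core of such a tokenizer is simply "read everything, emit it once". The following Lucene-free sketch shows only that step. SingleTokenReader is a hypothetical name, it is not the class Incze posted, and a real tokenizer would wrap the resulting text in Lucene's token type instead of returning a String.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration of "the entire input is a single token".
final class SingleTokenReader {

    // Drains the reader and returns its whole content as one token,
    // without splitting, lower-casing or filtering anything.
    static String readAsSingleToken(Reader reader) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[256];
        try {
            int length;
            while ((length = reader.read(buffer)) != -1) {
                text.append(buffer, 0, length);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // prints [e:\k2_beta1]
        System.out.println(
            "[" + readAsSingleToken(new StringReader("e:\\k2_beta1")) + "]");
    }
}
```

Because nothing is split or altered, the token that comes out is the same string that went in, which is what a field indexed without analysis would need at query time.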



RE: Zero hits for queries ending with a number

2004-03-24 Thread Morris Mizrahi
Thanks Erik and Incze.
Sorry for this lengthy post.

Here is the class:
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardFilter;

import java.io.Reader;
import java.util.Hashtable;

public class KeywordAnalyzer extends Analyzer {
    public static final String[] STOP_WORDS =
        StopAnalyzer.ENGLISH_STOP_WORDS;
    private Hashtable stopTable;

    public KeywordAnalyzer() {
        this(STOP_WORDS);
    }

    public KeywordAnalyzer(String[] stopWords) {
        stopTable = StopFilter.makeStopTable(stopWords);
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Emit the whole input as one token, then run the filters over it.
        TokenStream result = new NotTokenizingTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, stopTable);

        return result;
    }
}


I have retried everything with the new KeywordAnalyzer class,
PerFieldAnalyzerWrapper, and with Field.Keyword. I don't get results for
any searches, it doesn't even matter whether there is a number at the
end or not.

Using query.toString("url"):

Query query = QueryParser.parse(terms, "contents", analyzer);
logger.info("search method: query.toString for url= " +
    query.toString("url"));

I can see what the analyzer is searching for.

How do I determine what is the value stored in the index by
Field.Keyword?

I've tried:

doc.add(Field.Keyword("url", url));
System.out.println("url: doc toString method= " +
    doc.toString());

But I don't know if this is the correct value that is compared with what
the analyzer sends in.

Thanks for the help.

Morris







Re: Zero hits for queries ending with a number

2004-03-13 Thread Erik Hatcher
On Mar 13, 2004, at 6:02 AM, Morus Walter wrote:
> Otis Gospodnetic writes:
> > Field.Keyword is suitable for storing data like Url.  Give that a try.
>
> Hmm. I don't think keyword fields can be used with query parser,
> which is probably one of the problems here.
> He did try keyword fields.

Look in the archives for KeywordAnalyzer (custom) and 
PerFieldAnalyzerWrapper (built-in).  Using a combination of these, you 
can use keyword fields.  Or, first try just using WhitespaceAnalyzer.

It is almost always the analyzer that is the cause of confusion - folks 
just get lulled into forgetting about its role because Lucene is so 
easy to use... until this type of issue bites you.

It is a wacky combination though - and notorious for causing confusion.

Perhaps someone could create a wiki page for this scenario where we can 
flesh out examples/solutions?

	Erik



Zero hits for queries ending with a number

2004-03-12 Thread Morris Mizrahi
Hey everyone.

 

My document object for my lucene index has a url field.

I have created url as a Text field.  

The problem I am having is that searches for a url that ends with a
number, e.g. e:\k2_beta1, don't return any hits even though there is
data that should match this search criterion. If the url
ends with a letter, e.g. e:\k2_alpha, the search works fine and
returns the correct hits.

 

Here are some code snippets of my work:

Index creation:

writer = new IndexWriter(index, new StandardAnalyzer(), true);

Create the url as a Text field:

doc.add(Field.Text("url", url));

Search code:

Analyzer analyzer = new StandardAnalyzer();
DateFilter filter = ((SearchForm) form).getDateFilter();
Searcher searcher = new IndexSearcher(IndexReader.open(indexPath));
Query query = QueryParser.parse(terms, "contents", analyzer);
Hits hits = searcher.search(query, filter);

 

I have tried changing the url field from Text to Keyword. 

This didn't work and also caused my searches for any url to fail. 

I am using Lucene 1.2.

 

I know I need the proper combinations of Analyzer and Field type. 

 

Any help would be appreciated.

 

Thanks.

 

Morris