Re: DateFilter on UnStored field

2005-02-14 Thread Sanyi
> Following up on PA's reply.  Yes, DateFilter works on *indexed* values,
> so whether a field is stored or not is irrelevant.

Great news, thanx!

> However, DateFilter will not work on fields indexed as 2004-11-05.
> DateFilter only works on fields that were indexed using the DateField.

Well, could you post a short example here?
Where I currently write xxx.UnStored(.. can I simply write xxx.DateField(.. ?
Does it take strings like 2004-11-05?

> One option is to use a QueryFilter instead, filtering with a
> RangeQuery.

I've read somewhere that classic range filtering can easily exceed the maximum
number of boolean query clauses. I need to filter a very large range of dates
with day accuracy, and I don't want to raise the max. clause count to a very
high value. So I decided to use DateFilter, which has no such problem AFAIK.

How much impact does DateFilter have on search times?

Regards,
Sanyi




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: DateFilter on UnStored field

2005-02-14 Thread Erik Hatcher
On Feb 14, 2005, at 6:27 AM, Sanyi wrote:
>> However, DateFilter will not work on fields indexed as 2004-11-05.
>> DateFilter only works on fields that were indexed using the DateField.
>
> Well, can you post here a short example?
> When I currently type xxx.UnStored(.. I can simply type
> xxx.DateField(.. ?
> Does it take strings like 2004-11-05?

DateField has a utility method to return a String:

    DateField.timeToString(file.lastModified())

You'd use that String to pass to Field.UnStored.

I recommend, though, that you use a different format, such as the
YYYY-MM-DD format you're using.
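
To illustrate why a zero-padded YYYY-MM-DD string is a workable index format, here is a minimal sketch that uses JDK classes only and makes no Lucene calls. The class and method names are made up for illustration. It checks that plain string comparison, which is how index terms are ordered, agrees with date order as long as every part is zero-padded.

```java
import java.util.Arrays;
import java.util.Locale;

final class SortableDates {
    // Formats a date as zero-padded YYYY-MM-DD so that string order equals date order.
    static String toKey(int year, int month, int day) {
        return String.format(Locale.ROOT, "%04d-%02d-%02d", year, month, day);
    }

    // True if key lies in [from, to], using plain lexicographic comparison.
    static boolean inRange(String key, String from, String to) {
        return key.compareTo(from) >= 0 && key.compareTo(to) <= 0;
    }

    public static void main(String[] args) {
        String[] keys = { toKey(2005, 2, 14), toKey(2004, 11, 5), toKey(2004, 2, 29) };
        Arrays.sort(keys); // sorts as strings, which here is also chronological
        System.out.println(Arrays.toString(keys));
        System.out.println(inRange("2004-11-05", "2004-01-01", "2004-12-31"));
    }
}
```

Without the padding the property is lost: the string "2004-2-9" compares greater than "2004-11-05", although February comes before November.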

>> One option is to use a QueryFilter instead, filtering with a
>> RangeQuery.
>
> I've read somewhere that classic range filtering can easily exceed the
> maximum number of boolean query clauses. I need to filter a very large
> range of dates with day accuracy and I don't want to increase the max.
> clause count to very high values. So, I decided to use DateFilter which
> has no such problems AFAIK.

Right!

Lucene's latest codebase (though not 1.4.x) includes RangeFilter,
which would do the trick for you.  If you want to stick with Lucene
1.4.x, that's fine... just grab the code for that filter and use it as
a custom filter - it's compatible with 1.4.x.

> How much impact does DateFilter have on search times?

It depends on whether you instantiate a new filter for each search.
Building a filter requires scanning through the terms in the index to
build a BitSet for the documents that fall in that range.  Filters are
best reused across multiple searches.
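
The cost model described above can be sketched with JDK classes only. This is not Lucene's filter code: the class name, the per-document key array and the build counter are assumptions made for illustration, and the sketch scans documents where a real filter scans index terms. The point is only that the expensive scan runs once per distinct range when the resulting BitSet is kept and reused.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

final class RangeBitsCache {
    private final String[] docKeys; // one sortable date key per document id
    private final Map<String, BitSet> cache = new HashMap<String, BitSet>();
    int builds = 0;                 // how many full scans were needed so far

    RangeBitsCache(String[] docKeys) {
        this.docKeys = docKeys;
    }

    // Returns the bits for [from, to]; scans all keys only on a cache miss.
    BitSet bitsFor(String from, String to) {
        String cacheKey = from + ".." + to;
        BitSet bits = cache.get(cacheKey);
        if (bits == null) {
            bits = new BitSet(docKeys.length);
            for (int doc = 0; doc < docKeys.length; doc++) {
                String k = docKeys[doc];
                if (k.compareTo(from) >= 0 && k.compareTo(to) <= 0) {
                    bits.set(doc);
                }
            }
            cache.put(cacheKey, bits);
            builds++;
        }
        return bits;
    }

    public static void main(String[] args) {
        RangeBitsCache c = new RangeBitsCache(
                new String[] { "2004-11-05", "2005-02-14", "2003-01-01" });
        c.bitsFor("2004-01-01", "2004-12-31");
        c.bitsFor("2004-01-01", "2004-12-31"); // second call reuses the cached bits
        System.out.println("scans: " + c.builds);
    }
}
```

A cache like this only pays off inside a long-running process; if every search starts a fresh JVM, each search pays for the scan again.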

Erik


tf-idf showing the scores beside each hit

2005-02-14 Thread *Clodagh*

Hi,

Is it possible to show a tf-idf score beside each hit?

E.g. I type in a word as a query, for example the word "free", and each
file containing the word "free" is listed, but I would like the tf-idf
score to appear beside it, like this:

0. file1.txt tf idf score = 2.16543

Is it possible?







chained restrictive queries

2005-02-14 Thread oquinton
Hi,

I'm currently working on an application using Lucene 1.3, and I have to
improve the current indexing/search methods with version 1.4.3.

I was thinking of using the FilteredQuery object to refine my chained
queries but, after some tests, performance is worse :(.

The chained queries were like this:
- a first boolean query to retrieve a set of doc ids matching some criteria
- a second query applying a fuzzy criterion to refine it further.

My index contains about 7 million documents in total, and the first query
should retrieve at most about 50,000 documents.

I'm currently working with crossed indexes while doing searches, but I
want to remove the extra indexes and do everything with only one.

So, is it possible to use the FilteredQuery object, or another one, to chain
queries from the most restrictive to the most open one?

Thx for your help

Sincerely,

Olivier

PS : sorry for all mistakes :o)







Re: DateFilter on UnStored field

2005-02-14 Thread Sanyi
> DateField has a utility method to return a String:
>
>   DateField.timeToString(file.lastModified())
>
> You'd use that String to pass to Field.UnStored.
>
> I recommend, though, that you use a different format, such as the
> YYYY-MM-DD format you're using.

Well, I read the YYYY-MM-DD format string from a database.
So I need to know how to convert YYYY-MM-DD to the format that
DateField.timeToString() returns.
Or I have to convert YYYY-MM-DD to the format of file.lastModified(),
which I can pass to DateField.timeToString().
What is the easiest solution?
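
One possible route for the second option, sketched with JDK classes only: file.lastModified() returns milliseconds since the epoch, so a YYYY-MM-DD string can be parsed into the same kind of long value. The class name is hypothetical, and treating the database dates as midnight UTC is an assumption that would need to match however the dates are meant to be interpreted.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

final class DbDateParser {
    // Parses a YYYY-MM-DD string as midnight UTC and returns epoch milliseconds.
    static long toMillis(String yyyyMmDd) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: dates are UTC
        fmt.setLenient(false);                        // reject values like 2004-13-40
        try {
            return fmt.parse(yyyyMmDd).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a YYYY-MM-DD date: " + yyyyMmDd, e);
        }
    }

    public static void main(String[] args) {
        // The resulting long is the kind of value the thread passes to
        // DateField.timeToString(...).
        System.out.println(toMillis("2004-11-05"));
    }
}
```

A new SimpleDateFormat is created per call because the class is not thread-safe; for bulk indexing from a single thread, one instance could be reused.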

> Lucene's latest codebase (though not 1.4.x) includes RangeFilter,
> which would do the trick for you.  If you want to stick with Lucene
> 1.4.x, that's fine... just grab the code for that filter and use it as
> a custom filter - it's compatible with 1.4.x.

So, why do you recommend RangeFilter over DateFilter?
Does it require less index data and/or perform better?
(I'm using 1.4.2)

> It depends on whether you instantiate a new filter for each search.
> Building a filter requires scanning through the terms in the index to
> build BitSet for the documents that fall in that range.  Filters are
> best used over multiple searches.

Simply put:
I let the user enter the search string in an HTML form, then I call my custom
Lucene-based Java class through the command line (the calling method may
change to the PHP-to-Java bridge if it turns out to suit my needs).
So every search is a whole new round: new HTML form post - new command line
JVM call - new index searcher, etc.

The OS caches the index file pretty well (limited only by memory size, of
course).

Will my implementation's performance drop a lot when I implement
DateFilter?

Regards,
Sanyi






What does [] do to a query and what's up with lucene.apache.org?

2005-02-14 Thread Jim Lynch
First, I'm getting:

   The requested URL could not be retrieved

While trying to retrieve the URL:
http://lucene.apache.org/src/test/org/apache/lucene/queryParser/TestQueryParser.java

The following error was encountered:

   Unable to determine IP address from host name for lucene.apache.org

Guess the system is down.

Second, I'm getting this error:

org.apache.lucene.queryParser.ParseException: Encountered "is" at line
1, column 15.
Was expecting:
   "]" ...

when I tried to parse the following string: [this is a test].

I can't find any documentation that tells me what the brackets do to a
query.  I had a user that was used to another search engine that used []
to do proximity or near searches and tried it on this one. Actually I'd
like to see the documentation for what the parser does.  All that is
mentioned in the javadoc is + - and ().  Obviously there are more
special characters.

Thanks,
Jim.


Re: What does [] do to a query and what's up with lucene.apache.org?

2005-02-14 Thread Otis Gospodnetic
Hi,

lucene.apache.org seems to work now.
Here is the query syntax:
  http://lucene.apache.org/queryparsersyntax.html
[] is used as [BEGIN-RANGE-STRING TO END-RANGE-STRING]

Otis






Re: What does [] do to a query and what's up with lucene.apache.org?

2005-02-14 Thread Erik Hatcher
Jim,
The Lucene website is transitioning to the new top-level space.  I have
checked out the current site to the new lucene.apache.org area and set
up redirects from the old Jakarta URLs.  The source code, though, is
not an official part of the website.  Thanks to our conversion to
Subversion, though, the source is browsable starting here:

http://svn.apache.org/repos/asf/lucene/java/trunk

The HTML of the website will need link adjustments to get everything
back in shape.

The brackets are documented here:
http://lucene.apache.org/queryparsersyntax.html

Erik


Re: What does [] do to a query and what's up with lucene.apache.org?

2005-02-14 Thread Jim Lynch
Otis and Erik,
Thanks for the info.  That's a great reference.
Jim.


Are wildcard searches supposed to work with fields that are saved, indexed and not tokenized?

2005-02-14 Thread Jim Lynch
I was trying to write some documentation on how to use the tool and
issued a search for:

contact:DENNIS MORROW

And sure enough I got 647 hits.  Then I changed the search to:

contact:DENNIS MORRO?

And now I get 648 hits, but in some of them the contact doesn't even
remotely resemble the search pattern.  For instance, here is what
the contact fields contain for some of these hits:

Contact: GENERIC CONTACT
Contact: Andre Gardinalli
Contact: Brett Morrow  (that's especially interesting)
Contact: KEN PATTERSON

And of course there are some with Dennis' name too.

Any idea why this is happening?  I'm using the QueryParser.parse method.

Jim.


Re: Are wildcard searches supposed to work with fields that are saved, indexed and not tokenized?

2005-02-14 Thread Erik Hatcher
On Feb 14, 2005, at 12:40 PM, Jim Lynch wrote:
> I was trying to write some documentation on how to use the tool and
> issued a search for:
>
> contact:DENNIS MORROW

Is that literally the QueryParser string you entered?  If so, that
parses to:

contact:DENNIS OR defaultField:MORROW

most likely.

> And now I get 648 hits, but in some of them the contact doesn't even
> remotely resemble the search pattern.  For instance here are the what
> the contact fields contain for some of these hits:
>
> Contact: GENERIC CONTACT
> Contact: Andre Gardinalli
> Contact: Brett Morrow  (that's especially interesting)
> Contact: KEN PATTERSON
>
> And of course there are some with Dennis' name too.
>
> Any idea why this is happening?  I'm using the QueryParser.parse
> method.

I'm not sure you'll be able to do this with QueryParser with spaces in
an untokenized field.  First try it with an API-created WildcardQuery
to be sure it works the way you expect.
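
As a rough picture of what to expect from such a test, here is a JDK-only sketch. It is not Lucene's WildcardQuery implementation, and the class and method names are invented for illustration. It assumes that an untokenized field holds its whole value as a single term, so a wildcard pattern has to match that entire value, spaces included.

```java
import java.util.regex.Pattern;

final class WildcardSketch {
    // Translates a wildcard pattern ('?' = one char, '*' = any run) into a regex.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                sb.append('.');
            } else if (c == '*') {
                sb.append(".*");
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    // An untokenized field contributes one term: the entire field value.
    static boolean matches(String wildcard, String wholeTerm) {
        return toRegex(wildcard).matcher(wholeTerm).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("DENNIS MORRO?", "DENNIS MORROW"));
        System.out.println(matches("DENNIS MORRO?", "Brett Morrow"));
    }
}
```

Under that assumption the pattern "DENNIS MORRO?" matches the term "DENNIS MORROW" but not "Brett Morrow", and "MORRO?" on its own matches neither, because it is compared against the whole value rather than a single word.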

Erik


Re: Are wildcard searches supposed to work with fields that are saved, indexed and not tokenized?

2005-02-14 Thread Jim Lynch

Erik Hatcher wrote:

> On Feb 14, 2005, at 12:40 PM, Jim Lynch wrote:
>> I was trying to write some documentation on how to use the tool and
>> issued a search for:
>>
>> contact:DENNIS MORROW
>
> Is that literally the QueryParser string you entered?  If so, that
> parses to:
>
> contact:DENNIS OR defaultField:MORROW
>
> most likely.

Ah! Good point.

>> And now I get 648 hits, but in some of them the contact doesn't even
>> remotely resemble the search pattern.  For instance here are the what
>> the contact fields contain for some of these hits:
>>
>> Contact: GENERIC CONTACT
>> Contact: Andre Gardinalli
>> Contact: Brett Morrow  (that's especially interesting)
>> Contact: KEN PATTERSON
>>
>> And of course there are some with Dennis' name too.
>>
>> Any idea why this is happening?  I'm using the QueryParser.parse method.
>
> I'm not sure you'll be able to do this with QueryParser with spaces in
> an untokenized field.  First try it with an API created WildcardQuery
> to be sure it works the way you expect.

I didn't really have any expectations, other than that what I saw didn't
make sense.  I'll just add to the docs that [this set of fields] can't be
searched with wildcards.

Thanks,
Jim.


Limiting Hits with a score threshold

2005-02-14 Thread Jay Hill
Does anyone have an example of limiting results returned based on a
score threshold? For example if I'm only interested in documents with
a score > 0.05.

Thanks,
-Jay




Re: Newbie questions

2005-02-14 Thread Paul Jans
Hi again,

So is SqlDirectory recommended for use in a cluster to
work around the accessibility problem, or are people
using NFS or a standalone server instead?

Thanks in advance,
PJ

--- Paul Jans [EMAIL PROTECTED] wrote:

> I've already ordered Lucene in Action :)
>
>> There is a LuceneRAR project that is still in its infancy here:
>> https://lucenerar.dev.java.net/
>
> I will keep an eye on that for sure.
>
>> You can also store a Lucene index in Berkeley DB (look at the
>> /contrib/db area of the source code repository)
>
> We're already using Oracle, so would it be possible to store the index
> there, thus giving each cluster node easy access to it? I read about
> SqlDirectory in the archives but it looks like it didn't make it to the
> API and I don't see it on the contrib page.
>
> I'm more concerned about making the index accessible than about
> transactional consistency, so NFS may be another option like you
> mention. I'm curious to hear about other systems which are clustered
> and how others are doing this; lessons learnt and best practices etc.
>
> Thanks again for the help. Lucene looks like a first class tool.
>
> PJ
>
> --- Erik Hatcher [EMAIL PROTECTED] wrote:
>
>> On Feb 10, 2005, at 5:00 PM, Paul Jans wrote:
>>> A couple of newbie questions. I've searched the archives and read
>>> the Javadoc but I'm still having trouble figuring these out.
>>
>> Don't forget to get your copy of Lucene in Action too :)
>>
>>> 1. What's the best way to index and handle queries like the
>>> following:
>>>
>>> Find me all users with (a CS degree and a GPA > 3.0)
>>> or (a Math degree and a GPA > 3.5).
>>
>> Some suggestions:  index degree as a Keyword field.  Pad GPA, so that
>> all of them are the form #.# (or #.## maybe).  Numerics need to be
>> lexicographically ordered, and thus padded.
>>
>> With the right analyzer (see the AnalysisParalysis page on the wiki)
>> you could use this type of query with QueryParser:
>>
>> degree:cs AND gpa:[3.0 TO 9.9]
>>
>>> 2. What are the best practices for using Lucene in a clustered J2EE
>>> environment? A standalone index/search server or storing the index
>>> in the database or something else ?
>>
>> There is a LuceneRAR project that is still in its infancy here:
>> https://lucenerar.dev.java.net/
>>
>> You can also store a Lucene index in Berkeley DB (look at the
>> /contrib/db area of the source code repository)
>>
>> However, most projects do fine with cruder techniques such as sharing
>> the Lucene index on a common drive and ensuring that locking is
>> configured to use the common drive also.
>>
>> Erik
  
 




Re: chained restrictive queries

2005-02-14 Thread Paul Elschot
On Monday 14 February 2005 15:14, [EMAIL PROTECTED] wrote:
> Hi,
>
> I'm currently working on an application using Lucene 1.3, and I have to
> improve the current indexing/search methods with version 1.4.3.
>
> I was thinking of using the FilteredQuery object to refine my chained
> queries but, after some tests, performance is worse :(.
>
> The chained queries were like this:
> - a first boolean query to retrieve a set of doc ids matching some criteria

A FilteredQuery works best when the filter built from the criteria can be
reused, e.g. by keeping it in a cache, possibly with CachingWrapperFilter.

> - a second query applying a fuzzy criterion to refine it further.
>
> My index contains about 7 million documents in total, and the first query
> should retrieve at most about 50,000 documents.
>
> I'm currently working with crossed indexes while doing searches, but I
> want to remove the extra indexes and do everything with only one.
>
> So, is it possible to use the FilteredQuery object, or another one, to chain
> queries from the most restrictive to the most open one?

It is possible, but whether it helps performance depends on your
circumstances.

The 1.4.3 filter implementation executes the most open query almost
completely. It only applies the filter after the score computations for the
query being filtered, just before deciding whether to keep the document
in the query results. This is done in IndexSearcher.search().
A profiler might tell you whether that is a bottleneck for your queries.
If it is, there is some code in development that might help.

In case it turns out that the memory occupied by the BitSet of the filter
is a bottleneck, please check the (very) recent archives of lucene-dev
on BitSet implementation.
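
The ordering described above can be pictured with a small JDK-only simulation. This is not the IndexSearcher code; the class, the stand-in score function and the call counter are assumptions made for illustration. The first method scores every candidate and consults the filter afterwards; the second is a hypothetical alternative order, shown only for contrast, that checks the filter before doing any scoring work.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class PostFilterSketch {
    static int scoreCalls = 0; // counts how often the (pretend) expensive scoring ran

    // Stand-in for an expensive per-document score computation.
    static float score(int doc) {
        scoreCalls++;
        return 1.0f / (doc + 1);
    }

    // Scores every candidate first and consults the filter only afterwards.
    static List<Integer> searchFilterLast(int[] candidates, BitSet allowed) {
        List<Integer> kept = new ArrayList<Integer>();
        for (int doc : candidates) {
            float s = score(doc);
            if (s > 0.0f && allowed.get(doc)) {
                kept.add(doc);
            }
        }
        return kept;
    }

    // Alternative order: check the filter first, score only allowed documents.
    static List<Integer> searchFilterFirst(int[] candidates, BitSet allowed) {
        List<Integer> kept = new ArrayList<Integer>();
        for (int doc : candidates) {
            if (allowed.get(doc) && score(doc) > 0.0f) {
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        BitSet allowed = new BitSet();
        allowed.set(2);
        allowed.set(5);
        int[] candidates = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
        System.out.println(searchFilterLast(candidates, allowed) + " after " + scoreCalls + " score calls");
    }
}
```

Both orders return the same documents; what differs is how many score computations are spent on documents the filter rejects, which is the cost a profiler would show.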

Regards,
Paul Elschot





RE: Limiting Hits with a score threshold

2005-02-14 Thread Chuck Williams
I would not recommend doing this because absolute score values in Lucene
are not meaningful (e.g., scores are not directly comparable across
searches).  The ratio of a score to the highest score returned is
meaningful, but there is no absolute calibration for the highest score
returned, at least at present, so there is not a way to determine from
the scores what the quality of the result set is overall.  There are
various approaches to improving this that have been discussed (making
the scores more directly comparable by encoding additional information
into the score and using that for normalization, or probably better,
generalizing the score to an object that contains multiple pieces of
information; e.g. the total number of query terms matched by the top
result if you are using default OR would be quite useful).  None of
these ideas are implemented yet as far as I know.
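
If a cutoff is wanted anyway, one that follows the point about ratios would be relative to the top score rather than absolute. The sketch below uses JDK classes only; the class name, the plain float array standing in for a ranked hit list and the ratio parameter are assumptions made for illustration. It carries the same caveat as above: the ratio says nothing about how good the top hit itself is.

```java
import java.util.ArrayList;
import java.util.List;

final class RelativeThreshold {
    // Keeps the indexes of hits whose score is at least minRatio of the top score.
    // Assumes scores are sorted in descending order, as in a ranked result list.
    static List<Integer> keep(float[] scoresDescending, float minRatio) {
        List<Integer> kept = new ArrayList<Integer>();
        if (scoresDescending.length == 0) {
            return kept;
        }
        float top = scoresDescending[0];
        if (top <= 0.0f) {
            return kept; // no meaningful top score to compare against
        }
        for (int i = 0; i < scoresDescending.length; i++) {
            if (scoresDescending[i] / top < minRatio) {
                break; // sorted input: every later hit is below the cutoff too
            }
            kept.add(i);
        }
        return kept;
    }

    public static void main(String[] args) {
        float[] scores = { 0.8f, 0.6f, 0.1f };
        System.out.println(keep(scores, 0.5f));
    }
}
```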

Chuck




Re: Newbie questions

2005-02-14 Thread Erik Hatcher
On Feb 14, 2005, at 2:40 PM, Paul Jans wrote:
> Hi again,
>
> So is SqlDirectory recommended for use in a cluster to
> workaround the accessibility problem, or are people
> using NFS or a standalone server instead?

Neither.  As far as I know, Berkeley DB is the only viable DB
implementation currently.

NFS has notoriously had issues with Lucene and file locking.  Search
the archives for more details on this.

Erik


Numbers in Index

2005-02-14 Thread Miro Max
Hi,

I'm currently using StandardAnalyzer during my indexing
process, but when I browse the index with Luke there
are also numbers in it.

Which analyzer should I use to eliminate these from my
index, or should I specify them in my stopword list?

thx

miro









Re: Numbers in Index

2005-02-14 Thread Erik Hatcher
On Feb 14, 2005, at 4:32 PM, Miro Max wrote:
> actually i'm using standard analyzer during my index
> process. but when i browse the index with luke there
> also numbers inside.
>
> which analyzer should i use to eliminate this from my
> index or should i specify this in my stopword list?

Don't use a stop word list to remove numbers.  You could do one of a
couple of things: use SimpleAnalyzer, or write a custom analyzer which
uses the parts of StandardAnalyzer and applies a number removal filter
at the end.
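
The core of such a number removal step is just a per-token test. Here is a JDK-only sketch of that logic; the class and method names are hypothetical, and the rule that '.' and ',' count as part of a number is an assumption. In Lucene the same check would sit inside a custom TokenFilter at the end of the analyzer chain, which is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NumberRemoval {
    // True if the token consists only of digits, optionally with '.' or ',' separators.
    static boolean isNumeric(String token) {
        boolean sawDigit = false;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isDigit(c)) {
                sawDigit = true;
            } else if (c != '.' && c != ',') {
                return false;
            }
        }
        return sawDigit;
    }

    // Returns the tokens in order, with purely numeric ones dropped.
    static List<String> removeNumbers(List<String> tokens) {
        List<String> kept = new ArrayList<String>();
        for (String t : tokens) {
            if (!isNumeric(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeNumbers(Arrays.asList("lucene", "1.4", "2005", "b2b")));
    }
}
```

Mixed tokens such as "b2b" are kept on purpose; whether those should go too depends on the content being indexed.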

Erik