Re: DateFilter on UnStored field
Following up on PA's reply.

> Yes, DateFilter works on *indexed* values, so whether a field is stored or not is irrelevant.

Great news, thanx!

> However, DateFilter will not work on fields indexed as 2004-11-05. DateFilter only works on fields that were indexed using the DateField.

Well, can you post here a short example? When I currently type xxx.UnStored(.. I can simply type xxx.DateField(.. ? Does it take strings like 2004-11-05?

> One option is to use a QueryFilter instead, filtering with a RangeQuery.

I've read somewhere that classic range filtering can easily exceed the maximum number of boolean query clauses. I need to filter a very large range of dates with day accuracy and I don't want to increase the max. clause count to very high values. So, I decided to use DateFilter which has no such problems AFAIK. How much impact does DateFilter have on search times?

Regards,
Sanyi

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: DateFilter on UnStored field
On Feb 14, 2005, at 6:27 AM, Sanyi wrote:

>> However, DateFilter will not work on fields indexed as 2004-11-05. DateFilter only works on fields that were indexed using the DateField.
>
> Well, can you post here a short example? When I currently type xxx.UnStored(.. I can simply type xxx.DateField(.. ? Does it take strings like 2004-11-05?

DateField has a utility method to return a String: DateField.timeToString(file.lastModified()). You'd use that String to pass to Field.UnStored. I recommend, though, that you use a different format, such as the YYYY-MM-DD format you're using.

>> One option is to use a QueryFilter instead, filtering with a RangeQuery.
>
> I've read somewhere that classic range filtering can easily exceed the maximum number of boolean query clauses. I need to filter a very large range of dates with day accuracy and I don't want to increase the max. clause count to very high values. So, I decided to use DateFilter which has no such problems AFAIK.

Right! Lucene's latest codebase (though not 1.4.x) includes RangeFilter, which would do the trick for you. If you want to stick with Lucene 1.4.x, that's fine... just grab the code for that filter and use it as a custom filter - it's compatible with 1.4.x.

> How much impact does DateFilter have on search times?

It depends on whether you instantiate a new filter for each search. Building a filter requires scanning through the terms in the index to build a BitSet for the documents that fall in that range. Filters are best used over multiple searches.

Erik
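To make the cost Erik describes more concrete, here is a minimal JDK-only sketch of building a BitSet by scanning sorted terms that fall in a range. It is not Lucene's DateFilter or RangeFilter code; the class name, the sample data and the use of a TreeMap as a stand-in for the index's term dictionary are all assumptions made for illustration.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** JDK-only sketch of a range filter; it does not use or reproduce Lucene classes. */
final class RangeBitsSketch {

    /** Hypothetical stand-in for an index: each term maps to the ids of documents containing it. */
    static SortedMap<String, int[]> sampleTerms() {
        SortedMap<String, int[]> terms = new TreeMap<>();
        terms.put("2004-10-31", new int[] {0});
        terms.put("2004-11-05", new int[] {1, 3});
        terms.put("2004-12-24", new int[] {2});
        terms.put("2005-01-02", new int[] {4});
        return terms;
    }

    /** Scans the sorted terms in [lower, upper] once and sets one bit per matching document. */
    static BitSet bitsForRange(SortedMap<String, int[]> terms, String lower, String upper, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        // subMap excludes its upper key; appending "\0" makes the given upper bound inclusive.
        for (Map.Entry<String, int[]> entry : terms.subMap(lower, upper + "\0").entrySet()) {
            for (int doc : entry.getValue()) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        // Documents dated in November or December 2004.
        System.out.println(bitsForRange(sampleTerms(), "2004-11-01", "2004-12-31", 5));
    }
}
```

The range is never expanded into one clause per term, which is why this style of filtering does not run into a clause limit. The price is the term scan, so the resulting bits are worth keeping around when the same range is queried again.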
tf-idf: showing the scores beside each hit
Hi,

is it possible to show a tf-idf score beside each hit? E.g. I type in a word as a query, for example the word free, and each file containing the word free is listed, but I would like the tf-idf score to appear beside it, like this:

0. file1.txt tf idf score = 2.16543

Is it possible?
chained restrictive queries
Hi,

I'm currently working on an application using Lucene 1.3, and have to improve the current indexing/search methods with version 1.4.3. I was thinking of using the FilteredQuery object to refine my chained queries but, after some tests, performance is worse :(.

The chained queries were like:
- a first boolean query to retrieve a set of doc ids matching some criteria
- a second query applying a fuzzy criterion to refine it further.

My index contains about 7 million documents in total, and the first query should retrieve, at most, about 50,000 documents. I'm currently working with crossed indexes while doing searches, but I want to remove the extra indexes and do everything with only one.

So, is it possible to use the FilteredQuery object or another one to chain queries from the most restrictive to the most open one?

Thx for your help

Sincerely,
Olivier

PS: sorry for all mistakes :o)
Re: DateFilter on UnStored field
> DateField has a utility method to return a String: DateField.timeToString(file.lastModified()) You'd use that String to pass to Field.UnStored. I recommend, though, that you use a different format, such as the YYYY-MM-DD format you're using.

Well, I read the YYYY-MM-DD format string from a database. So, I need to know how to convert YYYY-MM-DD to DateField.timeToString()'s result format. Or I have to convert YYYY-MM-DD to file.lastModified()'s format, which I can pass to DateField.timeToString(). What is the easiest solution?

> Lucene's latest codebase (though not 1.4.x) includes RangeFilter, which would do the trick for you. If you want to stick with Lucene 1.4.x, that's fine... just grab the code for that filter and use it as a custom filter - it's compatible with 1.4.x.

So, why do you recommend RangeFilter over DateFilter? Does it require less index data and/or does it have better performance? (I'm using 1.4.2)

> It depends on whether you instantiate a new filter for each search. Building a filter requires scanning through the terms in the index to build a BitSet for the documents that fall in that range. Filters are best used over multiple searches.

Simply put: I let the user enter the search string in an HTML form, then I call my custom Lucene-based Java class through the command line (the calling method may change to the PHP-to-Java bridge if it turns out to suit my needs). So, every search is a whole new round: new HTML form post, new command line JVM call, new index searcher, etc. The OS caches the index file pretty well (limited only by the memory size, of course). Will my implementation's performance drop a lot when I implement DateFilter?

Regards,
Sanyi
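For the conversion asked about here, a small JDK-only sketch may help. It turns a YYYY-MM-DD string into epoch milliseconds, which is the same kind of long value that file.lastModified() returns and that DateField.timeToString() is shown receiving above. Treating each date as midnight UTC is an assumption, as are the class and method names; the sketch does not call Lucene.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

/** JDK-only sketch: convert between YYYY-MM-DD strings and epoch milliseconds. */
final class DayStringSketch {

    /** Parses an ISO day such as 2004-11-05 and returns midnight UTC of that day in milliseconds. */
    static long toEpochMillis(String day) {
        // Assumption: days are interpreted in UTC; a local time zone would shift the value.
        return LocalDate.parse(day).atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }

    /** Converts epoch milliseconds back to the YYYY-MM-DD day they fall on in UTC. */
    static String toDayString(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC).toLocalDate().toString();
    }

    public static void main(String[] args) {
        long millis = toEpochMillis("2004-11-05");
        System.out.println(millis + " -> " + toDayString(millis));
    }
}
```

If the field is instead indexed with the YYYY-MM-DD string itself, as recommended in the quoted reply, no conversion is needed, because that format already sorts in date order as plain text.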
What does [] do to a query and what's up with lucene.apache.org?
First, I'm getting the following:

The requested URL could not be retrieved
While trying to retrieve the URL: http://lucene.apache.org/src/test/org/apache/lucene/queryParser/TestQueryParser.java
The following error was encountered: Unable to determine IP address from host name for lucene.apache.org

Guess the system is down.

I'm getting this error:

org.apache.lucene.queryParser.ParseException: Encountered is at line 1, column 15. Was expecting: ] ...

when I tried to parse the following string: [this is a test]. I can't find any documentation that tells me what the brackets do to a query. I had a user that was used to another search engine that used [] to do proximity or near searches and tried it on this one.

Actually I'd like to see the documentation for what the parser does. All that is mentioned in the javadoc is + - and (). Obviously there are more special characters.

Thanks,
Jim.
Re: What does [] do to a query and what's up with lucene.apache.org?
Hi,

lucene.apache.org seems to work now. Here is the query syntax: http://lucene.apache.org/queryparsersyntax.html

[] is used as [BEGIN-RANGE-STRING TO END-RANGE-STRING]

Otis
Re: What does [] do to a query and what's up with lucene.apache.org?
Jim,

The Lucene website is transitioning to the new top-level space. I have checked out the current site to the new lucene.apache.org area and set up redirects from the old Jakarta URLs. The source code, though, is not an official part of the website. Thanks to our conversion to Subversion, though, the source is browsable starting here: http://svn.apache.org/repos/asf/lucene/java/trunk

The HTML of the website will need link adjustments to get everything back in shape.

The brackets are documented here: http://lucene.apache.org/queryparsersyntax.html

Erik

On Feb 14, 2005, at 10:31 AM, Jim Lynch wrote:

> First I'm getting a The requested URL could not be retrieved. While trying to retrieve the URL: http://lucene.apache.org/src/test/org/apache/lucene/queryParser/TestQueryParser.java The following error was encountered: Unable to determine IP address from host name for lucene.apache.org
>
> Guess the system is down.
>
> I'm getting this error: org.apache.lucene.queryParser.ParseException: Encountered is at line 1, column 15. Was expecting: ] ... when I tried to parse the following string [this is a test]. I can't find any documentation that tells me what the brackets do to a query. I had a user that was used to another search engine that used [] to do proximity or near searches and tried it on this one.
>
> Actually I'd like to see the documentation for what the parser does. All that is mentioned in the javadoc is + - and (). Obviously there are more special characters.
>
> Thanks, Jim.
Re: What does [] do to a query and what's up with lucene.apache.org?
Otis and Erik,

Thanks for the info. That's a great reference.

Jim.
Are wildcard searches supposed to work with fields that are saved, indexed and not tokenized?
I was trying to write some documentation on how to use the tool and issued a search for:

contact:DENNIS MORROW

And sure enough I got 647 hits. Then I changed the search to:

contact:DENNIS MORRO?

And now I get 648 hits, but in some of them the contact doesn't even remotely resemble the search pattern. For instance, here is what the contact fields contain for some of these hits:

Contact: GENERIC CONTACT
Contact: Andre Gardinalli
Contact: Brett Morrow (that's especially interesting)
Contact: KEN PATTERSON

And of course there are some with Dennis' name too. Any idea why this is happening? I'm using the QueryParser.parse method.

Jim.
Re: Are wildcard searches supposed to work with fields that are saved, indexed and not tokenized?
On Feb 14, 2005, at 12:40 PM, Jim Lynch wrote:

> I was trying to write some documentation on how to use the tool and issued a search for: contact:DENNIS MORROW

Is that literally the QueryParser string you entered? If so, that parses to: contact:DENNIS OR defaultField:MORROW most likely.

> And now I get 648 hits, but in some of them the contact doesn't even remotely resemble the search pattern. For instance here are the what the contact fields contain for some of these hits: Contact: GENERIC CONTACT Contact: Andre Gardinalli Contact: Brett Morrow (that's especially interesting) Contact: KEN PATTERSON And of course there are some with Dennis' name too. Any idea why this is happening? I'm using the QueryParser.parse method.

I'm not sure you'll be able to do this with QueryParser with spaces in an untokenized field. First try it with an API created WildcardQuery to be sure it works the way you expect.

Erik
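As a reference point for the WildcardQuery test Erik suggests, the JDK-only sketch below shows what a wildcard match against a single untokenized term should accept and reject, with ? standing for exactly one character and * for any run of characters. It is not Lucene's matching code, and the class and method names are hypothetical. On the Lucene side, the API route would presumably be a WildcardQuery built from a Term holding the whole pattern, spaces included.

```java
import java.util.regex.Pattern;

/** JDK-only sketch of wildcard matching against one whole, untokenized term. */
final class WildcardSketch {

    /** Translates a wildcard pattern into a regex: '?' is one character, '*' is any run. */
    static Pattern compile(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                // Quote everything else so spaces and punctuation are matched literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** Returns true only if the whole term matches the pattern. */
    static boolean matches(String wildcard, String term) {
        return compile(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        String pattern = "DENNIS MORRO?";
        for (String contact : new String[] {"DENNIS MORROW", "Brett Morrow", "GENERIC CONTACT"}) {
            System.out.println(contact + " : " + matches(pattern, contact));
        }
    }
}
```

Under this definition only DENNIS MORROW matches. That is consistent with the explanation above: the unrelated contacts come from the query string being split into separate clauses rather than from wildcard matching itself.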
Re: Are wildcard searches supposed to work with fields that are saved, indexed and not tokenized?
Erik Hatcher wrote:

> On Feb 14, 2005, at 12:40 PM, Jim Lynch wrote:
>
>> I was trying to write some documentation on how to use the tool and issued a search for: contact:DENNIS MORROW
>
> Is that literally the QueryParser string you entered? If so, that parses to: contact:DENNIS OR defaultField:MORROW most likely.

Ah! Good point.

>> And now I get 648 hits, but in some of them the contact doesn't even remotely resemble the search pattern. For instance here are the what the contact fields contain for some of these hits: Contact: GENERIC CONTACT Contact: Andre Gardinalli Contact: Brett Morrow (that's especially interesting) Contact: KEN PATTERSON And of course there are some with Dennis' name too. Any idea why this is happening? I'm using the QueryParser.parse method.
>
> I'm not sure you'll be able to do this with QueryParser with spaces in an untokenized field. First try it with an API created WildcardQuery to be sure it works the way you expect.

I didn't really have any expectations other than what I saw didn't make sense. I'll just add to the docs that [this set of fields] can't be searched with wildcards.

Thanks,
Jim.

> Erik
Limiting Hits with a score threshold
Does anyone have an example of limiting results returned based on a score threshold? For example, if I'm only interested in documents with a score above 0.05.

Thanks,
-Jay
Re: Newbie questions
Hi again,

So is SqlDirectory recommended for use in a cluster to work around the accessibility problem, or are people using NFS or a standalone server instead?

Thanks in advance,
PJ

--- Paul Jans [EMAIL PROTECTED] wrote:

> I've already ordered Lucene in Action :)
>
>> There is a LuceneRAR project that is still in its infancy here: https://lucenerar.dev.java.net/
>
> I will keep an eye on that for sure.
>
>> You can also store a Lucene index in Berkeley DB (look at the /contrib/db area of the source code repository)
>
> We're already using Oracle, so would it be possible to store the index there, thus giving each cluster node easy access to it? I read about SqlDirectory in the archives but it looks like it didn't make it to the API and I don't see it on the contrib page.
>
> I'm more concerned about making the index accessible rather than transactional consistency, so NFS may be another option like you mention. I'm curious to hear about other systems which are clustered and how others are doing this; lessons learnt and best practices etc.
>
> Thanks again for the help. Lucene looks like a first class tool.
>
> PJ
>
> --- Erik Hatcher [EMAIL PROTECTED] wrote:
>
>> On Feb 10, 2005, at 5:00 PM, Paul Jans wrote:
>>
>>> A couple of newbie questions. I've searched the archives and read the Javadoc but I'm still having trouble figuring these out.
>>
>> Don't forget to get your copy of Lucene in Action too :)
>>
>>> 1. What's the best way to index and handle queries like the following: Find me all users with (a CS degree and a GPA above 3.0) or (a Math degree and a GPA above 3.5).
>>
>> Some suggestions: index degree as a Keyword field. Pad GPA, so that all of them are of the form #.# (or #.## maybe). Numerics need to be lexicographically ordered, and thus padded. With the right analyzer (see the AnalysisParalysis page on the wiki) you could use this type of query with QueryParser:
>>
>> degree:cs AND gpa:[3.0 TO 9.9]
>>
>>> 2. What are the best practices for using Lucene in a clustered J2EE environment? A standalone index/search server or storing the index in the database or something else?
>>
>> There is a LuceneRAR project that is still in its infancy here: https://lucenerar.dev.java.net/
>>
>> You can also store a Lucene index in Berkeley DB (look at the /contrib/db area of the source code repository)
>>
>> However, most projects do fine with cruder techniques such as sharing the Lucene index on a common drive and ensuring that locking is configured to use the common drive also.
>>
>> Erik
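The padding advice in the quoted reply can be illustrated with a short JDK-only sketch. Index terms are compared as strings, so numbers only sort correctly when every value has the same width. The class and method names below are hypothetical, and the GPA helper assumes values below 10.

```java
import java.util.Locale;

/** JDK-only sketch of why numeric terms are padded before indexing. */
final class PaddedNumberSketch {

    /** Formats a GPA as a fixed-width #.## string; assumes 0 <= gpa < 10. */
    static String padGpa(double gpa) {
        return String.format(Locale.ROOT, "%.2f", gpa);
    }

    /** Left-pads a non-negative integer with zeros to five digits. */
    static String padInt(int value) {
        return String.format(Locale.ROOT, "%05d", value);
    }

    /** Inclusive range check on strings, the way a term range compares values. */
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        // Unpadded, "10" sorts before "9" because '1' < '9'.
        System.out.println("10".compareTo("9") < 0);
        // Padded, the order matches the numeric order.
        System.out.println(padInt(9).compareTo(padInt(10)) < 0);
        System.out.println(inRange(padGpa(3.5), "3.00", "9.99"));
    }
}
```

Whatever width is chosen at indexing time has to be used for the range bounds in the query as well, otherwise the string comparison is made between values of different shapes.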
Re: chained restrictive queries
On Monday 14 February 2005 15:14, [EMAIL PROTECTED] wrote:

> Hi, I'm currently working on an application using Lucene 1.3, and have to improve the current indexing/search methods with version 1.4.3. I was thinking of using the FilteredQuery object to refine my chained queries but, after some tests, performance is worse :(. The chained queries were like:
> - a first boolean query to retrieve a set of doc ids matching some criteria

A FilteredQuery works best when the filter built from the criteria can be reused, e.g. by keeping it in a cache, possibly with CachingWrapperFilter.

> - a second query applying a fuzzy criterion to refine it further.
>
> My index contains about 7 million documents in total, and the first query should retrieve, at most, about 50,000 documents. I'm currently working with crossed indexes while doing searches, but I want to remove the extra indexes and do everything with only one. So, is it possible to use the FilteredQuery object or another one to chain queries from the most restrictive to the most open one?

It is possible, but whether it helps performance depends on your circumstances. The 1.4.3 filter implementation executes the most open query almost completely. It only applies the filter after the score computations for the query being filtered, just before deciding whether to keep the document in the query results. This is done in IndexSearcher.search(). A profiler might tell you whether that is a bottleneck for your queries. If it is, there is some code in development that might help.

In case it turns out that the memory occupied by the BitSet of the filter is a bottleneck, please check the (very) recent archives of lucene-dev on BitSet implementation.

Regards,
Paul Elschot
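The reuse idea mentioned above can be sketched without Lucene as a small cache that builds the bits for a given set of criteria once and hands back the same BitSet afterwards. This is not how CachingWrapperFilter is implemented; the string cache key, the class name and the build counter are assumptions made for the example.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** JDK-only sketch of reusing filter bits across searches; not Lucene's CachingWrapperFilter. */
final class FilterCacheSketch {
    private final Map<String, BitSet> cache = new HashMap<>();

    /** Counts how often the expensive build step actually ran. */
    int builds = 0;

    /** Returns the cached bits for the key, building them only on the first request. */
    BitSet bits(String key, Supplier<BitSet> builder) {
        return cache.computeIfAbsent(key, k -> {
            builds++;
            return builder.get();
        });
    }

    public static void main(String[] args) {
        FilterCacheSketch filters = new FilterCacheSketch();
        Supplier<BitSet> expensiveBuild = () -> {
            BitSet b = new BitSet();
            b.set(1);
            b.set(3);
            return b;
        };
        filters.bits("criteria-A", expensiveBuild);
        filters.bits("criteria-A", expensiveBuild);
        System.out.println("builds = " + filters.builds);
    }
}
```

A cache like this only pays off inside a long-running process. Cached bits also have to be discarded whenever the index they were computed from changes.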
RE: Limiting Hits with a score threshold
I would not recommend doing this because absolute score values in Lucene are not meaningful (e.g., scores are not directly comparable across searches). The ratio of a score to the highest score returned is meaningful, but there is no absolute calibration for the highest score returned, at least at present, so there is not a way to determine from the scores what the quality of the result set is overall.

There are various approaches to improving this that have been discussed (making the scores more directly comparable by encoding additional information into the score and using that for normalization, or probably better, generalizing the score to an object that contains multiple pieces of information; e.g. the total number of query terms matched by the top result if you are using default OR would be quite useful). None of these ideas are implemented yet as far as I know.

Chuck

-----Original Message-----
From: Jay Hill [mailto:[EMAIL PROTECTED]
Sent: Monday, February 14, 2005 11:08 AM
To: lucene-user@jakarta.apache.org
Subject: Limiting Hits with a score threshold
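If a cutoff is wanted anyway, the one quantity described above as meaningful is the ratio of each score to the top score of the same search. The JDK-only sketch below applies a cutoff on that ratio to a plain array of scores. The class name, the sample scores and the 0.25 ratio are assumptions for illustration, and nothing here says anything about the overall quality of a result set.

```java
import java.util.ArrayList;
import java.util.List;

/** JDK-only sketch of a relative (ratio-to-top) score cutoff; it does not call Lucene. */
final class RelativeScoreSketch {

    /** Keeps the positions of hits whose score is at least minRatio times the top score. */
    static List<Integer> keep(float[] scoresDescending, float minRatio) {
        List<Integer> kept = new ArrayList<>();
        if (scoresDescending.length == 0 || scoresDescending[0] <= 0f) {
            return kept;
        }
        float top = scoresDescending[0];
        for (int i = 0; i < scoresDescending.length; i++) {
            if (scoresDescending[i] / top < minRatio) {
                break; // scores are sorted, so every later hit is below the cutoff too
            }
            kept.add(i);
        }
        return kept;
    }

    public static void main(String[] args) {
        float[] scores = {0.8f, 0.4f, 0.1f, 0.02f}; // hypothetical scores from one search
        System.out.println(keep(scores, 0.25f));
    }
}
```

Because the ratio is computed per search, the same minRatio can drop a hit in one search and keep an equally scored hit in another.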
Re: Newbie questions
On Feb 14, 2005, at 2:40 PM, Paul Jans wrote:

> Hi again, So is SqlDirectory recommended for use in a cluster to workaround the accessibility problem, or are people using NFS or a standalone server instead?

Neither. As far as I know, Berkeley DB is the only viable DB implementation currently. NFS has notoriously had issues with Lucene and file locking. Search the archives for more details on this.

Erik
Numbers in Index
Hi,

currently I'm using the standard analyzer during my indexing process, but when I browse the index with Luke there are also numbers in it. Which analyzer should I use to eliminate these from my index, or should I specify them in my stopword list?

thx
miro
Re: Numbers in Index
On Feb 14, 2005, at 4:32 PM, Miro Max wrote:

> actually i'm using standard analyzer during my index process. but when i browse the index with luke there also numbers inside. which analyzer should i use to eliminate this from my index or should i specify this in my stopword list?

Don't use a stop word list to remove numbers. You could do one of a couple of things: use SimpleAnalyzer, or write a custom analyzer which uses the parts of StandardAnalyzer and applies a number removal filter at the end.

Erik
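The core of such a number removal filter is a decision about which tokens count as numbers. The JDK-only sketch below shows one possible rule applied to a plain list of tokens. In Lucene the same check would presumably sit inside a custom TokenFilter at the end of the analyzer chain. The class name and the exact definition of a number are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** JDK-only sketch of the token check a number removal filter could apply. */
final class NumberRemovalSketch {

    /** Assumption: a number is digits, optionally grouped by '.' or ',' as in 3.14 or 1,000. */
    static boolean isNumber(String token) {
        return token.matches("[0-9]+([.,][0-9]+)*");
    }

    /** Returns the tokens in their original order with purely numeric ones dropped. */
    static List<String> removeNumbers(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!isNumber(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeNumbers(Arrays.asList("lucene", "1.4.3", "released", "2004", "x86")));
    }
}
```

Mixed tokens such as x86 are kept under this rule. A rule like this also explains the advice against stop words: a stop word list would have to enumerate every number in advance, while a pattern check handles any value.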