Re: Dates and others
On Monday, December 1, 2003, at 11:55 PM, Tatu Saloranta wrote:
> On a related note, it would also be nice if there was a way to start categorizing general hot topics for Lucene developers; it seems like there are about half a dozen areas where there's lots of interest for improvements (most of them related to ranking). If so, perhaps there could be more specific discussion groups, and also perhaps web pages summarizing some of discussions, consensus achieved, even if there's no code to show for it?

I agree. Sounds like the perfect solution is a wiki! Just happens we have one: http://nagoya.apache.org/wiki/apachewiki.cgi?LuceneProjectPages

Have at it!

Erik

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Dates and others
Dion Almaer wrote:
> The only real item that I still want to tweak more is getting recent results higher in the list. I was wondering if something like this could work (or if there is a better solution):
>
> At index time, I have the date of the content. I could do some math where the higher the date (based on the time_t version or whatever) the more of a setBoost(metric). Or, for every month in the past, create a larger negative number to setBoost()... or something like that.
>
> Would something like this make sense?

The problem with this approach is that eventually you'll exhaust the range of the boost. So this will only work if you re-index things from scratch periodically, with a boost of something like 1/days-ago.

If you're adding documents to the index in date order, then you could use a HitCollector which adjusts scores according to the document number, since document numbers increase as you add to the index.

If you're not adding things in date order, then you can, when you open the index, build an array mapping document numbers to integer dates. Then your hit collector can use this to either boost or sort hits by date.

Or you could add a month or week field to documents, then add it as a clause to your queries with a boost. Then documents matching the most recent week(s) and/or month(s) would get the boost.

Doug
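The third option above (an array mapping document numbers to integer dates, consulted while hits are collected) can be sketched in plain Java without touching the Lucene API. Everything here is illustrative: the class name `RecencyRescorer`, the days-since-epoch encoding, and the `weight` knob are assumptions, and the decay uses 1 / (1 + daysAgo) rather than a bare 1/days-ago so that a document dated today does not divide by zero. In a real collector, `adjust` would presumably be called with the document number and raw score that the collector receives.

```java
/** Sketch: blend a relevance score with a recency factor looked up by document number. */
class RecencyRescorer {
    private final int[] docDay;  // docDay[doc] = document date as days since epoch (assumed encoding)
    private final int today;     // "now" in the same unit
    private final float weight;  // how strongly recency influences the result (assumed tunable)

    RecencyRescorer(int[] docDay, int today, float weight) {
        this.docDay = docDay.clone();
        this.today = today;
        this.weight = weight;
    }

    /** Recency factor in (0, 1]: 1 for today's documents, decaying as 1 / (1 + daysAgo). */
    float recency(int doc) {
        int daysAgo = Math.max(0, today - docDay[doc]);
        return 1.0f / (1.0f + daysAgo);
    }

    /** Adjusted score: the original score scaled by a factor between 1 and (1 + weight). */
    float adjust(int doc, float score) {
        return score * (1.0f + weight * recency(doc));
    }
}
```

Because the factor is bounded, a weak match can never be lifted by more than (1 + weight) times its raw score, which keeps recency from swamping relevance entirely.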
RE: Dates and others
Ad hoc techniques run into lots of trouble because the requirement on Lucene isn't well specified. Is a document with one of the search terms that is a week newer enough to move it ahead of a document that has all of the search terms?

The boost mechanism is a way to move documents around in the ranking list, but it clearly is a way to reweight the importance of the query terms and not to impose external constraints that properly should be handled outside the search engine.

Herb...

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, December 01, 2003 1:11 PM
To: Lucene Users List
Subject: Re: Dates and others

The problem with this approach is that eventually you'll exhaust the range of the boost. So this will only work if you re-index things from scratch periodically, with a boost of something like 1/days-ago.

If you're adding documents to the index in date order, then you could use a HitCollector which adjusts scores according to the document number, since document numbers increase as you add to the index.

If you're not adding things in date order, then you can, when you open the index, build an array mapping document numbers to integer dates. Then your hit collector can use this to either boost or sort hits by date.

Or you could add a month or week field to documents, then add it as a clause to your queries with a boost. Then documents matching the most recent week(s) and/or month(s) would get the boost.

Doug
Re: Dates and others
On Monday 01 December 2003 15:13, Dion Almaer wrote: ...
> Interesting. I implemented an approach which boosted based on the number of months in the past, and after tweaking the boost amounts, it seems to do the job. I do a fresh reindex every night (since the indexing process takes no time at all... unlike our old search solution!)

This sounds interesting, as I have been thinking of what's the best way to boost newer documents. Can you share some of your experience regarding boost values that seemed to make sense?

In my case, the CMS I'm working on stores support documentation for software/hardware, meaning that content is highly time-sensitive (i.e. documents decay pretty quickly). Since the system is already doing both incremental reindexing and nightly full reindexing (the latter to make sure that even if some changed content was temporarily not [fully] reindexed, it eventually gets indexed properly), I can fairly easily add boosting, I think.

On a related note, it would also be nice if there was a way to start categorizing general hot topics for Lucene developers; it seems like there are about half a dozen areas where there's lots of interest for improvements (most of them related to ranking). If so, perhaps there could be more specific discussion groups, and also perhaps web pages summarizing some of discussions, consensus achieved, even if there's no code to show for it?

-+ Tatu +-

> I read content for the index from different sources. Sometimes the source gives me documents loosely in date order, but not all of them. So, it seems that one of the other approaches should be taken (adding a month/week field etc). I should look more into the HitCollector and see how it can help me.
>
> The other issue I have is that I would like to prioritize the title field. At the moment I am lazy and add the title to the body (contents = title + body) which seems to be OK... however sometimes something that mentions the search term in the title should appear higher up in the pecking order.
>
> I am using the QueryParser (subclassed to disallow wildcards etc) to do the dirty work for me. Should I get away from this and manage the queries myself (and run a Multi against the title field as well as the contents)?
>
> Thanks for the great feedback,
>
> Dion
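One way to prioritize title matches while still using a query parser is to expand each user term into a boosted title clause plus a plain contents clause before parsing. The sketch below only builds the query string; the class name `TitleBoostQuery` is hypothetical, the field names `title` and `contents` are taken from the message, and the `term^boost` form follows the query-parser boost syntax. It assumes the input has already been stripped of wildcards and other special characters, as the subclassed parser in the message does.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: expand plain user terms into per-term clauses that favor title matches. */
class TitleBoostQuery {
    /** Each whitespace-separated term becomes "(title:term^boost contents:term)". */
    static String expand(String userInput, float titleBoost) {
        List<String> clauses = new ArrayList<>();
        for (String term : userInput.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // blank input yields no clauses
            }
            clauses.add("(title:" + term + "^" + titleBoost + " contents:" + term + ")");
        }
        return String.join(" ", clauses);
    }
}
```

For example, `TitleBoostQuery.expand("lucene boost", 4.0f)` yields `(title:lucene^4.0 contents:lucene) (title:boost^4.0 contents:boost)`, which can then be handed to the parser. Suitable boost values would need to be tuned against real queries.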
RE: Dates and others
Hi guys -

So I am getting happier with search, and just pushed the Lucene version live at: http://www.theserverside.com (on the leftbar) and: http://www.theserverside.com/home/search/index.jsp

The only real item that I still want to tweak more is getting recent results higher in the list. I was wondering if something like this could work (or if there is a better solution):

At index time, I have the date of the content. I could do some math where the higher the date (based on the time_t version or whatever) the more of a setBoost(metric). Or, for every month in the past, create a larger negative number to setBoost()... or something like that.

Would something like this make sense?

Dion

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Sunday, November 23, 2003 3:52 PM
To: Lucene Users List
Subject: Re: Dates and others

On Saturday, November 22, 2003, at 06:33 PM, Dion Almaer wrote:
> 3. I have some fields such as title, owner, etc as well as the content blob which I index and use as the default search field. Is there an easy way to extend the QueryParser to merge it with a MultiTermQuery which can also search this meta data and give them certain weights? Or, if you go down this path do you have to leave the QueryParser behind and build your own queries? Any best practices would be great.

And Ype said:
> You can provide field weights at document indexing time (norms) and use a MultiTermQuery for searching multiple fields. At query time you can again use field weights. I don't know how the scoring of the MultiTermQuery is done, it might use the max. score over the fields of a document, or combine the scores in the fields of a document.
end Ype's reply cut and paste

I'm a little confused with this question and Ype's reply. MultiTermQuery is an abstract base class under Query, which is the parent for WildcardQuery and FuzzyQuery.

What I think you're after is using MultiFieldQueryParser, but you want to weight the fields differently. You can add the boosts at indexing time using Field.setBoost. Unfortunately at the moment MultiFieldQueryParser is not very extensible - there are some open issues with its subclassability, but subclassing MFQP and overriding getFieldQuery will do the trick when the subclassing issues are resolved, allowing you to boost at query time.

Making an educated guess at what you're doing with Lucene, Dion, I'd venture to say that boosting at indexing time is sufficient for your needs.

Erik
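The month-based idea in the message above can be made concrete as a multiplicative decay computed at index time. This is only a sketch under stated assumptions: the class name `MonthlyDecayBoost`, the decay rate, and the floor are all hypothetical, and it uses positive multipliers below 1 instead of the "larger negative number" floated in the message. The floor is there because, as noted elsewhere in the thread, the usable range of a boost is limited. The returned value would presumably be handed to setBoost, which this sketch does not call.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

/** Sketch: derive an index-time boost from the age of the content in whole months. */
class MonthlyDecayBoost {
    /**
     * Returns decayPerMonth raised to the number of complete months between
     * contentDate and indexDate, but never less than floor.
     */
    static float boostFor(LocalDate contentDate, LocalDate indexDate,
                          float decayPerMonth, float floor) {
        long months = Math.max(0, ChronoUnit.MONTHS.between(contentDate, indexDate));
        float boost = (float) Math.pow(decayPerMonth, months);
        return Math.max(floor, boost);
    }
}
```

Since the boost depends on the indexing date, the values only stay meaningful if the index is rebuilt regularly, which matches the nightly full reindex described in the thread.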
RE: Dates and others
Erik -

Spot on. I should have listened to your advice from the talk and just used MMDD :) Everything works nicely now that I do the conversion.

Thanks for the great ideas.

Dion

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Sunday, November 23, 2003 11:41 PM
To: Lucene Users List
Subject: Re: Dates and others

On Sunday, November 23, 2003, at 03:33 PM, Dion Almaer wrote:
> This leads me to another issue actually. On certain range queries I get exceptions:
>
> Query: modifieddate:[1/1/03 TO 12/31/03]
> org.apache.lucene.search.BooleanQuery$TooManyClauses

I'm guessing you're using Field.Keyword(String, Date) for modifieddate? The date field stuff in Lucene is really a timestamp, and doing a range query enumerates all the terms for that field in that range, making a big ol' boolean OR query of all the individual ones.

Since you want this to be just a date, use Field.Keyword(String, MMDD) instead. But you'll want to subclass QueryParser and override getRangeQuery to do the right date format parsing from MM/DD/ into MMDD rather than the internal Date representation Lucene uses for date fields.

Erik
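The quoted explanation (a range query expands into one OR clause per distinct term inside the range) suggests a quick back-of-the-envelope check: count how many distinct terms a set of timestamps produces at full resolution versus day resolution. The sketch below is plain Java with no Lucene calls; the class name `RangeTermCount` is hypothetical and the 1024 clause limit is stated as an assumption about the default, not read from the library.

```java
import java.util.Set;
import java.util.TreeSet;

/** Sketch: count the distinct index terms a date field yields at a given granularity. */
class RangeTermCount {
    /** Assumed default cap on clauses in a single boolean query. */
    static final int ASSUMED_MAX_CLAUSES = 1024;

    /** Buckets non-negative timestamps by granularity and returns the number of distinct buckets. */
    static int distinctTerms(long[] timestampsMillis, long granularityMillis) {
        Set<Long> terms = new TreeSet<>();
        for (long t : timestampsMillis) {
            terms.add(t / granularityMillis);
        }
        return terms.size();
    }

    /** True if a range covering all of these terms would exceed the assumed clause limit. */
    static boolean wouldOverflow(long[] timestampsMillis, long granularityMillis) {
        return distinctTerms(timestampsMillis, granularityMillis) > ASSUMED_MAX_CLAUSES;
    }
}
```

With 2,000 documents stamped one hour apart, millisecond-resolution terms give 2,000 distinct values, while day-resolution terms give 84, which is why truncating the field to a calendar date keeps a year-long range query small.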
Re: Dates and others
Erik,

On Sunday 23 November 2003 12:51, Erik Hatcher wrote:
> On Saturday, November 22, 2003, at 06:33 PM, Dion Almaer wrote:
>> 3. I have some fields such as title, owner, etc as well as the content blob which I index and use as the default search field. Is there an easy way to extend the QueryParser to merge it with a MultiTermQuery which can also search this meta data and give them certain weights? Or, if you go down this path do you have to leave the QueryParser behind and build your own queries? Any best practices would be great.
>
> And Ype said:
>> You can provide field weights at document indexing time (norms) and use a MultiTermQuery for searching multiple fields. At query time you can again use field weights. I don't know how the scoring of the MultiTermQuery is done, it might use the max. score over the fields of a document, or combine the scores in the fields of a document.
> end Ype's reply cut and paste
>
> I'm a little confused with this question and Ype's reply. MultiTermQuery is an abstract base class under Query, which is the parent for WildcardQuery and FuzzyQuery. What I think you're after is using MultiFieldQueryParser, but you want

Thanks for the correction,

> to weight the fields differently. You can add the boosts at indexing time using Field.setBoost. Unfortunately at the moment

and thanks for explaining how to provide field weights.

Ype
Re: Dates and others
On Saturday 22 November 2003 18:33, Dion Almaer wrote: ...
> 1. The power of dates: I am fairly happy with the results of queries on my index. The only issue I have is that at the moment the date of the content isn't considered (since Lucene doesn't know about it). Is there a good way in which the date of the content could be used to help with the scoring? So more recent content shows up higher in the stack. I have a date keyword field, but it isn't part of the query itself. Are there any patterns to help with this?

You can use the Lucene date field, or use a keyword field, e.g. in mmdd format. However, Lucene's scoring is not based on the value of a matching term; it's based on term frequencies in documents, on the number of documents in the index containing the term, and on the distance between terms (for proximity queries). You cannot make the document score depend directly on the value of a (date) field in the document.

Btw, how big would you want the date influence to be in the score?

Sorting results by date has been discussed in the past, see the archives. You lose the document scores in this case.

> 2. +field:foo and the QueryParser: I ran into some problems where using +field:foo was giving strange results. When I changed the queries to ... AND field:foo everything was fine. Am I missing something there?

Which version of Lucene are you using? There have been some fixes in the query parser of Lucene 1.2, but I don't know precisely which.

> 3. I have some fields such as title, owner, etc as well as the content blob which I index and use as the default search field. Is there an easy way to extend the QueryParser to merge it with a MultiTermQuery which can also search this meta data and give them certain weights? Or, if you go down

You can provide field weights at document indexing time (norms) and use a MultiTermQuery for searching multiple fields. At query time you can again use field weights. I don't know how the scoring of the MultiTermQuery is done; it might use the max. score over the fields of a document, or combine the scores in the fields of a document.

> this path do you have to leave the QueryParser behind and build your own queries? Any best practices would be great.

You have some options:
- create the MultiTermQuery from the query text, or
- index the default search field as a single field, e.g. by concatenation, and possibly by inserting empty tokens in between to avoid proximity matches. This has also been discussed recently, see e.g. the discussion on indexing of sentences.

Searching multiple fields is normally a little slower than searching a concatenated field. The actual difference depends on your data, so you might experiment a bit. You might e.g. index all fields separately, and also index a default concatenated field.

Kind regards,

Ype Kingma
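The second option above (concatenate fields into one default field, with empty tokens in between so phrases cannot match across a field boundary) can be illustrated with token positions alone. The sketch below is not an analyzer and uses no Lucene classes; `GappedConcatenation`, its `Token` type, and the whitespace tokenization are all assumptions made for illustration. Skipped positions play the role of the empty tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Sketch: concatenate field values into one token stream with a position gap between fields. */
class GappedConcatenation {
    /** A lowercased word together with its position in the combined stream. */
    static final class Token {
        final String text;
        final int position;

        Token(String text, int position) {
            this.text = text;
            this.position = position;
        }
    }

    /** Tokenizes each value on whitespace and leaves `gap` unused positions after each field. */
    static List<Token> concatenate(int gap, String... fieldValues) {
        List<Token> out = new ArrayList<>();
        int position = 0;
        for (String value : fieldValues) {
            for (String word : value.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                out.add(new Token(word.toLowerCase(Locale.ROOT), position++));
            }
            position += gap; // skipped positions stand in for the "empty tokens"
        }
        return out;
    }
}
```

With a gap of 10, the last word of a title and the first word of the body end up 11 positions apart, so a phrase or proximity match with a smaller allowed distance cannot straddle the two fields.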
RE: Dates and others
Thanks for responding Ype.

-----Original Message-----
From: Ype Kingma [mailto:[EMAIL PROTECTED]
Sent: Sunday, November 23, 2003 2:03 PM
To: Lucene Users List
Subject: Re: Dates and others

> On Saturday 22 November 2003 18:33, Dion Almaer wrote: ...
>> 1. The power of dates: I am fairly happy with the results of queries on my index. The only issue I have is that at the moment the date of the content isn't considered (since Lucene doesn't know about it). Is there a good way in which the date of the content could be used to help with the scoring? So more recent content shows up higher in the stack. I have a date keyword field, but it isn't part of the query itself. Are there any patterns to help with this?
>
> You can use the Lucene date field, or use a keyword field, e.g. in mmdd format. However, Lucene's scoring is not based on the value of a matching term; it's based on term frequencies in documents, on the number of documents in the index containing the term, and on the distance between terms (for proximity queries). You cannot make the document score depend directly on the value of a (date) field in the document.
>
> Btw, how big would you want the date influence to be in the score?
>
> Sorting results by date has been discussed in the past, see the archives. You lose the document scores in this case.

Yeah this is tough. I don't want to sort by date as then something that was a really low score but was recent would show up at the top. I think I will stick with giving the user the ability to say between Jan 1 2003 and ... instead.

This leads me to another issue actually. On certain range queries I get exceptions:

Query: modifieddate:[1/1/03 TO 12/31/03]

org.apache.lucene.search.BooleanQuery$TooManyClauses
    at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:109)
    at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:101)
    at org.apache.lucene.search.RangeQuery.rewrite(RangeQuery.java:137)
    at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:188)
    at org.apache.lucene.search.Query.weight(Query.java:120)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:128)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:93)
    at org.apache.lucene.search.Hits.init(Hits.java:80)
    at org.apache.lucene.search.Searcher.search(Searcher.java:71)
    at org.apache.lucene.search.Searcher.search(Searcher.java:65)
    at com.portal.util.search.IndexSearch.search(Unknown Source)
    at com.portal.util.search.IndexSearch.main(Unknown Source)

Has anyone run into this problem?

>> 2. +field:foo and the QueryParser: I ran into some problems where using +field:foo was giving strange results. When I changed the queries to ... AND field:foo everything was fine. Am I missing something there?
>
> Which version of Lucene are you using? There have been some fixes in the query parser of Lucene 1.2, but I don't know precisely which.

I am using 1.3 RC 2. The AND workaround is fine... just caught me by surprise.

>> 3. I have some fields such as title, owner, etc as well as the content blob which I index and use as the default search field. Is there an easy way to extend the QueryParser to merge it with a MultiTermQuery which can also search this meta data and give them certain weights? Or, if you go down
>
> You can provide field weights at document indexing time (norms) and use a MultiTermQuery for searching multiple fields. At query time you can again use field weights. I don't know how the scoring of the MultiTermQuery is done; it might use the max. score over the fields of a document, or combine the scores in the fields of a document.

Yeah I will play with this. For now adding the title to the main body does seem to work pretty well, so it may be good enough!

>> this path do you have to leave the QueryParser behind and build your own queries? Any best practices would be great.
>
> You have some options:
> - create the MultiTermQuery from the query text, or
> - index the default search field as a single field, e.g. by concatenation, and possibly by inserting empty tokens in between to avoid proximity matches. This has also been discussed recently, see e.g. the discussion on indexing of sentences.
>
> Searching multiple fields is normally a little slower than searching a concatenated field. The actual difference depends on your data, so you might experiment a bit. You might e.g. index all fields separately, and also index a default concatenated field.
>
> Kind regards,
>
> Ype Kingma

Thanks a lot for the great ideas.

Dion
Re: Dates and others
On Saturday, November 22, 2003, at 06:33 PM, Dion Almaer wrote:
> 3. I have some fields such as title, owner, etc as well as the content blob which I index and use as the default search field. Is there an easy way to extend the QueryParser to merge it with a MultiTermQuery which can also search this meta data and give them certain weights? Or, if you go down this path do you have to leave the QueryParser behind and build your own queries? Any best practices would be great.

And Ype said:
> You can provide field weights at document indexing time (norms) and use a MultiTermQuery for searching multiple fields. At query time you can again use field weights. I don't know how the scoring of the MultiTermQuery is done, it might use the max. score over the fields of a document, or combine the scores in the fields of a document.
end Ype's reply cut and paste

I'm a little confused with this question and Ype's reply. MultiTermQuery is an abstract base class under Query, which is the parent for WildcardQuery and FuzzyQuery.

What I think you're after is using MultiFieldQueryParser, but you want to weight the fields differently. You can add the boosts at indexing time using Field.setBoost. Unfortunately at the moment MultiFieldQueryParser is not very extensible - there are some open issues with its subclassability, but subclassing MFQP and overriding getFieldQuery will do the trick when the subclassing issues are resolved, allowing you to boost at query time.

Making an educated guess at what you're doing with Lucene, Dion, I'd venture to say that boosting at indexing time is sufficient for your needs.

Erik
Re: Dates and others
On Sunday, November 23, 2003, at 03:33 PM, Dion Almaer wrote:
> This leads me to another issue actually. On certain range queries I get exceptions:
>
> Query: modifieddate:[1/1/03 TO 12/31/03]
> org.apache.lucene.search.BooleanQuery$TooManyClauses

I'm guessing you're using Field.Keyword(String, Date) for modifieddate? The date field stuff in Lucene is really a timestamp, and doing a range query enumerates all the terms for that field in that range, making a big ol' boolean OR query of all the individual ones.

Since you want this to be just a date, use Field.Keyword(String, MMDD) instead. But you'll want to subclass QueryParser and override getRangeQuery to do the right date format parsing from MM/DD/ into MMDD rather than the internal Date representation Lucene uses for date fields.

Erik
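The conversion step described above (turning the user-facing date in a range query into the plain date string stored in the index) might look like the following. The date patterns in the archived message appear truncated, so the output format here is an assumption: a year-first, eight-digit `yyyyMMdd` string, chosen because it sorts lexicographically in date order. The class name `DateTerm` and the `M/d/yy` input pattern (matching the `1/1/03` style in the query) are likewise illustrative; this would be the kind of helper an overridden getRangeQuery could call on each endpoint.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Sketch: convert user-entered dates such as 1/1/03 into sortable index terms. */
class DateTerm {
    // Assumed user input style, taken from the example query in the thread.
    private static final DateTimeFormatter USER_INPUT = DateTimeFormatter.ofPattern("M/d/yy");
    // Assumed index format: year first so that string order equals date order.
    private static final DateTimeFormatter INDEX_TERM = DateTimeFormatter.BASIC_ISO_DATE;

    /** Parses a month/day/two-digit-year string and renders it as an eight-digit term. */
    static String toIndexTerm(String userDate) {
        return LocalDate.parse(userDate.trim(), USER_INPUT).format(INDEX_TERM);
    }
}
```

Two-digit years are resolved into the 2000s by this pattern, so `03` becomes 2003; input that does not match the pattern raises a `DateTimeParseException`, which the caller would need to turn into a parse error for the user.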
Re: Dates and others
On Sunday, November 23, 2003, at 03:33 PM, Dion Almaer wrote:
>>> 2. +field:foo and the QueryParser: I ran into some problems where using +field:foo was giving strange results. When I changed the queries to ... AND field:foo everything was fine. Am I missing something there?
>>
>> Which version of Lucene are you using? There have been some fixes in the query parser of Lucene 1.2, but I don't know precisely which.
>
> I am using 1.3 RC 2. The AND workaround is fine... just caught me by surprise.

If you run into the strange results again, try doing a Query.toString(field) to see what QueryParser thought of the query expression. I'd be curious to know more about this issue.
Dates and others
Hi guys -

First off I want to just give the Lucene project credit for producing an API like this. Truly great stuff.

I was just wondering if anyone could share some wisdom on a couple of issues:

1. The power of dates: I am fairly happy with the results of queries on my index. The only issue I have is that at the moment the date of the content isn't considered (since Lucene doesn't know about it). Is there a good way in which the date of the content could be used to help with the scoring? So more recent content shows up higher in the stack. I have a date keyword field, but it isn't part of the query itself. Are there any patterns to help with this?

2. +field:foo and the QueryParser: I ran into some problems where using +field:foo was giving strange results. When I changed the queries to ... AND field:foo everything was fine. Am I missing something there?

3. I have some fields such as title, owner, etc as well as the content blob which I index and use as the default search field. Is there an easy way to extend the QueryParser to merge it with a MultiTermQuery which can also search this meta data and give them certain weights? Or, if you go down this path do you have to leave the QueryParser behind and build your own queries? Any best practices would be great.

Sorry for bugging the list.

Cheers,

Dion