Re: Scoring documents by Click Count
On Thu, 2004-05-06 at 13:58, Ype Kingma wrote:
> Changing the click count this way is ok, but along with that you could change the (field) norm for the document to increase its score in subsequent queries. You can use Document.setBoost() and/or Field.setBoost() just before IndexWriter.addDocument() to do this.

There may be workable ways to do this, but the one time I tried adjusting boosts of already-indexed documents I found it didn't work quite as I expected. The documentation has a warning which explains why:

    getBoost
    Returns the boost factor for hits on any field of this document. [...]
    Note: This value is not stored directly with the document in the index. Documents returned from IndexReader.document(int) and Hits.doc(int) may thus not have the same value present as when this document was indexed.

So be cautious and test carefully if you try this -- and let us on the list know how it goes!

Boris

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Adding duplicate Fields to Documents
On Thu, 2004-04-22 at 17:31, Gerard Sychay wrote:
> - Adding two fields with same name that are indexed, not tokenized (keywords)? E.g. given (field_name, keyword1) and (field_name, keyword2), would the final keyword field be (field_name, keyword1keyword2)? Seems weird..

They don't get concatenated this way - they each end up as separate terms in the index. A TermQuery for keyword1 or keyword2 will retrieve this document.

> - Adding two fields with same name that are stored, but not indexed and not tokenized (e.g. database keys)? Are they appended (which would mess up the database key when retrieved from the Hit)?

They are stored separately - you can retrieve them as separate Field values.

Boris
Re: Result scoring question
On Wednesday 14 April 2004 20:55, Armbrust, Daniel C. wrote:
> Is there anything that I can do in my query construction, to ensure that if a query exactly matches a document, it will be the top result?

I know of two methods (and would be happy to hear comments or additions):

1) Index the field as a Keyword. The only result of querying this will be exact (character-by-character identical) matches. You can index the field both as Keyword and as Text if you wish, and construct a query that attempts both the exact and inexact match, with appropriate weights.

2) A bit of a hack perhaps, but effective: index the field as "zgzgl text of field zgzgl", and query for the phrase "zgzgl text of query zgzgl". Here "zgzgl" stands for some token that doesn't otherwise occur in your data. Any matches to this phrase, then, are guaranteed to be matches to complete document fields, but with accommodation for stopwords, stemming, or whatever your Analyzer does. Add slop to the phrase query if you wish, and again, you can attach appropriate weights to this and combine with other techniques.

Boris
--
Boris Goldowsky [EMAIL PROTECTED] www.goldowsky.com/consulting
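The second method above can be illustrated without Lucene itself. What follows is only a toy sketch in plain Java, under stated assumptions: lowercasing plus a tiny stopword list stands in for whatever your Analyzer does, and a contiguous-sublist check stands in for a phrase query with no slop. The class and method names (SentinelPhraseDemo, analyze, wrap, phraseMatches) are hypothetical and are not part of the Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class SentinelPhraseDemo {
    // A token assumed never to occur in the real data.
    static final String SENTINEL = "zgzgl";

    // Toy stand-in for an Analyzer: lowercase, split on whitespace, drop stopwords.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!raw.isEmpty() && !STOPWORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // Surround the analyzed tokens with the sentinel, for both field and query.
    static List<String> wrap(String text) {
        List<String> tokens = new ArrayList<>();
        tokens.add(SENTINEL);
        tokens.addAll(analyze(text));
        tokens.add(SENTINEL);
        return tokens;
    }

    // Toy stand-in for a zero-slop phrase match: the phrase must appear contiguously.
    static boolean phraseMatches(List<String> fieldTokens, List<String> phraseTokens) {
        return Collections.indexOfSubList(fieldTokens, phraseTokens) >= 0;
    }

    public static void main(String[] args) {
        List<String> query = wrap("history of rome");
        System.out.println(phraseMatches(wrap("The History of Rome"), query));
        System.out.println(phraseMatches(wrap("A Short History of Rome"), query));
    }
}
```

Because the sentinel brackets both ends, the wrapped phrase can only match a field whose analyzed content is the whole query, while an unwrapped phrase would also match inside longer fields.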
Stemming options
Has anyone on the list implemented a dictionary-based English stemmer with Lucene? Perhaps based on the freely-available ispell dictionaries or something like that? The Porter and Snowball stemmers have not worked that well for our application, but it is a bit daunting to start from scratch in developing an alternate stemmer.

Alternatively, is there an algorithmic stemmer that anyone has used which is a little less aggressive than the Porter algorithm? We've been having problems with searches for "conversion" returning "converse" and "conversational", and "animal" returning "animate". Yes, these are morphologically related, but in our particular application it would be better to stick with removing simple inflections.

Thanks for any pointers --
Boris
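To make "removing simple inflections" concrete, here is a deliberately conservative sketch in plain Java that strips only common plural endings and leaves derivational suffixes alone. It is an illustration of the idea, not a stemmer anyone on the list has recommended; the class name PluralStripper and its rule set are assumptions, and it is not wired into any Lucene Analyzer.

```java
import java.util.Locale;

final class PluralStripper {
    // Strip only plural-style endings; anything else is returned unchanged.
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.length() <= 3) {
            return w; // too short to strip safely
        }
        if (w.endsWith("ies") && !w.endsWith("eies") && !w.endsWith("aies")) {
            return w.substring(0, w.length() - 3) + "y"; // ponies -> pony
        }
        if (w.endsWith("es") && !w.endsWith("aes") && !w.endsWith("ees") && !w.endsWith("oes")) {
            return w.substring(0, w.length() - 1); // houses -> house
        }
        if (w.endsWith("s") && !w.endsWith("us") && !w.endsWith("ss")) {
            return w.substring(0, w.length() - 1); // animals -> animal
        }
        return w;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"conversions", "converse", "conversational", "animals", "animate"}) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

With rules this narrow, "conversion", "converse" and "conversational" stay distinct, as do "animal" and "animate". The price is that purely suffix-based rules still mangle some words that merely end in "s" (for example "analysis"), which is one argument for the dictionary-based approach asked about above.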
Overriding coordination
I have a situation where I'm querying for something in several fields, with a clause similar to this:

    (title:(two words)^20 keywords:(two words)^10 body:(two words))

Some good documents are being scored too low if the query terms do not occur in the body field. I naively thought that would only make a few percent difference, because of the large boosts on the title and keywords fields, but in fact the document loses 1/3 of its score because of the coordination term (2/3 rather than 1, because only 2 out of the 3 clauses matched).

Now, I love the coordination term for the multiple-word queries (including the ones embedded in the query above), but for the conjunction of the different fields I'd like to remove it, and just have each clause add its score. I feel like there's a way to do this, perhaps with a custom Similarity subclass, but I can't quite see how to set it up.

Can anyone point me in the right direction, or perhaps suggest a different pathway that I'm missing?

Thanks a lot,
Boris
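The arithmetic behind that 1/3 loss can be sketched in a few lines of plain Java. This is not Lucene code: the clause scores (20, 10, 0) are made-up stand-ins that only mirror the boosts in the query above, and CoordDemo and its methods are hypothetical names. The coord formula itself, matched clauses divided by total clauses, is the one the message describes.

```java
final class CoordDemo {
    // Coordination factor as described above: matched clauses / total clauses.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Sum the scores of the matching clauses (score > 0), optionally scaled by coord.
    static float score(float[] clauseScores, boolean useCoord) {
        float sum = 0f;
        int matched = 0;
        for (float s : clauseScores) {
            if (s > 0f) {
                sum += s;
                matched++;
            }
        }
        return useCoord ? sum * coord(matched, clauseScores.length) : sum;
    }

    public static void main(String[] args) {
        // Illustrative clause scores: title and keywords match, body does not.
        float[] noBodyMatch = {20f, 10f, 0f};
        System.out.println("with coord:    " + score(noBodyMatch, true));
        System.out.println("without coord: " + score(noBodyMatch, false));
    }
}
```

With coord applied, the document that misses only the low-weight body clause keeps 2/3 of its summed score; with coord disabled for the top-level clauses, each clause simply adds its score, which is the behaviour asked for.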
Cover density ranking?
Since there have been a few discussions recently of overriding various aspects of Lucene's ranking formula, I got to wondering how difficult it might be to implement something quite different from the base tf/idf ranking system that Lucene has built in.

How difficult would it be to implement something like Cover Density ranking for Lucene? Has anyone tried it?

Cover density is described at http://citeseer.ist.psu.edu/558750.html , and is supposed to be particularly good for short queries of the type that you get in many web applications.

Boris
Re: Demoting results
On Fri, 2004-03-19 at 11:58, Doug Cutting wrote:
> Doug Cutting wrote:
>> On Thu, 2004-03-18 at 13:32, Doug Cutting wrote:
>>> Have you tried assigning these very small boosts (0 < boost < 1) and assigning other query clauses relatively large boosts (boost > 1)?
>>
>> I don't think you understood my proposal. You should try boosting the documents when you add them. Instead of adding a doctype field with good and bad values, use Document.setBoost(0.01) at index time.
>
> Sorry. My mistake. You did understand my proposal, it was just a bad proposal. Boosting documents is a better approach, but is less flexible. I think the final proposal in my previous message might be the best approach (defining a custom coordination function for these query clauses).

Thanks for the ideas - I love the flexibility of Lucene, that there are so many ways to accomplish what at first seemed so difficult.

Boris
Re: Demoting results
I asked:
> Is there any way to build a query where the occurrence of a particular Term (in a Keyword field) causes the rank of the document to be decreased?

On Thu, 2004-03-18 at 13:32, Doug Cutting wrote:
> Have you tried assigning these very small boosts (0 < boost < 1) and assigning other query clauses relatively large boosts (boost > 1)?

Thanks for the suggestion! Unfortunately it doesn't have the desired effect. I wanted

    title: asparagus
    various fields...
    doctype: bad

to score lower than

    title: asparagus
    various similar fields...
    doctype: good

I was trying to formulate a query like, say

    +(title:asparagus) (doctype:bad)^-3

which would make sure the bad document was ranked lower than documents with any other value for doctype. But negative boosts are illegal.

I tried your suggestion of putting a large boost on the first clause and a small one (0.01) on the second, but the bad document is still ranked higher than the good one -- it gets a slight improvement from the doctype:bad match, times 0.01, which is a very slight improvement but still positive. Then it gets a big boost because it has a 1.0 rather than a 0.5 coordination factor, so the bad item gets top billing.

I think I've identified a few ways to solve the puzzle, though:

(a) Enumerate all the possible good types of documents and search for them, rather than the single bad one. Harder to maintain since new doctypes can be introduced, but possible.

(b) Attach boost values less than one to the bad Documents at indexing time. Not as flexible as modifying the query, but plausible.

(c) A more complex query like this:

    (title:asparagus) OR (title:asparagus -doctype:bad)

so for good documents both clauses will match and the coordination factor will be in their favor. This increases query complexity (they aren't really simple one-term queries like this toy example), but hopefully that will not be a performance issue.

Bng
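The effect of the coordination factor on the small-boost query and on option (c) can be checked with a little arithmetic. The sketch below is plain Java, not Lucene: the per-clause scores of 1.0 are illustrative placeholders, and DemotionDemo and its methods are hypothetical names used only for this example.

```java
final class DemotionDemo {
    // Illustrative raw clause scores; real Lucene scores would differ.
    static final float TITLE_SCORE = 1.0f;
    static final float DOCTYPE_SCORE = 1.0f;

    // Coordination factor: matched clauses / total clauses.
    static float coord(int matched, int total) {
        return matched / (float) total;
    }

    // Models: +(title:asparagus) (doctype:bad)^0.01
    static float smallBoostQuery(boolean docIsBad) {
        float sum = TITLE_SCORE;
        int matched = 1;
        if (docIsBad) {
            sum += 0.01f * DOCTYPE_SCORE; // tiny, but still positive
            matched++;
        }
        return sum * coord(matched, 2);
    }

    // Models option (c): (title:asparagus) OR (title:asparagus -doctype:bad)
    static float doubledClauseQuery(boolean docIsBad) {
        float sum = TITLE_SCORE;
        int matched = 1;
        if (!docIsBad) {
            sum += TITLE_SCORE; // second clause matches only good documents
            matched++;
        }
        return sum * coord(matched, 2);
    }

    public static void main(String[] args) {
        System.out.println("small boost, bad:  " + smallBoostQuery(true));
        System.out.println("small boost, good: " + smallBoostQuery(false));
        System.out.println("option (c), bad:   " + doubledClauseQuery(true));
        System.out.println("option (c), good:  " + doubledClauseQuery(false));
    }
}
```

Under these assumptions the small-boost query ranks the bad document first, because its 1.0 coordination factor outweighs everything else, while option (c) reverses the order by giving the good document the extra matching clause.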
Demoting results
Is there any way to build a query where the occurrence of a particular Term (in a Keyword field) causes the rank of the document to be decreased?

I have various types of documents, and some of them are less interesting than others, so I want them to be pushed towards the bottom of the results ranking. However, I do not want to eliminate them entirely, so I can't use a boolean NOT.

Using negative weights would seem logical here, but apparently has no effect on rankings - negative weights appear to be treated as zeros.

Any ideas would be appreciated.

Thanks,
Boris
Paid support for Lucene
Strangely, the web site does not seem to list any vendors who provide incident support for Lucene. That can't be right, can it?

Can anyone point me to organizations that would be willing to provide support for Lucene issues?

Thanks,
Boris
--
Boris Goldowsky [EMAIL PROTECTED] www.goldowsky.com/consulting