Re: Scoring documents by Click Count

2004-05-06 Thread Boris Goldowsky
On Thu, 2004-05-06 at 13:58, Ype Kingma wrote:

 Changing the click count this way is ok, but along with that you could
 change the (field) norm for the document to increase its score
 in subsequent queries.
 You can use Document.setBoost() and/or Field.setBoost() just before
 IndexWriter.addDocument() to do this.

There may be workable ways to do this, but the one time I tried
adjusting boosts of already-indexed documents I found it didn't work
quite as I expected.  The documentation has a warning which explains
why:

getBoost
Returns the boost factor for hits on any field of this document.
[...]
Note: This value is not stored directly with the document in the
index. Documents returned from IndexReader.document(int) and
Hits.doc(int) may thus not have the same value present as when
this document was indexed.

So be cautious and test carefully if you try this -- and let us on the
list know how it goes!

Boris



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Adding duplicate Fields to Documents

2004-04-23 Thread Boris Goldowsky
On Thu, 2004-04-22 at 17:31, Gerard Sychay wrote:

 - Adding two fields with same name that are indexed, not tokenized
 (keywords)?  E.g. given (field_name, keyword1) and (field_name,
 keyword2), would the final keyword field be (field_name,
 keyword1keyword2)?  Seems weird..

They don't get concatenated this way - they each end up as separate
terms in the index.  A TermQuery for keyword1 or keyword2 will
retrieve this document.

 - Adding two fields with same name that are stored, but not indexed and
 not tokenized (e.g. database keys)?  Are they appended (which would mess
 up the database key when retrieved from the Hit)?

They are stored separately - you can retrieve them as separate Field
values.
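
To make the two answers above concrete, here is a toy model in plain Python. It is not Lucene code: ToyIndex and its methods are made-up names, and for brevity it treats every field as both indexed and stored. It only illustrates that same-named fields stay separate, both as index terms and as stored values.

```python
from collections import defaultdict

class ToyIndex:
    """Hypothetical stand-in for an index; not the Lucene API."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field name, term) -> set of doc ids
        self.stored = []                  # doc id -> list of (field name, value)

    def add_document(self, fields):
        doc_id = len(self.stored)
        self.stored.append(list(fields))
        # Each untokenized value becomes its own term; nothing is concatenated.
        for name, value in fields:
            self.postings[(name, value)].add(doc_id)
        return doc_id

    def term_query(self, name, value):
        return sorted(self.postings.get((name, value), set()))

    def get_values(self, doc_id, name):
        # Same-named stored values come back as separate entries.
        return [v for n, v in self.stored[doc_id] if n == name]

index = ToyIndex()
doc = index.add_document([("field_name", "keyword1"), ("field_name", "keyword2")])
```

In this model a term query for either keyword1 or keyword2 finds the document, a query for keyword1keyword2 finds nothing, and get_values returns the two values as separate items.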

Boris






Re: Result scoring question

2004-04-15 Thread Boris Goldowsky
On Wednesday 14 April 2004 20:55, Armbrust, Daniel C. wrote:

  Is there anything that I can do in my query construction, to ensure that if
  a query exactly matches a document, it will be the top result?

I know of two methods (and would be happy to hear comments or
additions):

1) index the field as a Keyword.  The only result of querying this will
be exact (character-by-character identical) matches.  You can index the
field both as Keyword and as Text if you wish, and construct a query
that attempts both the exact and inexact match, with appropriate
weights.

2) A bit of a hack perhaps, but effective: index the field as "zgzgl
text of field zgzgl", and query for the phrase "zgzgl text of query
zgzgl".  Here zgzgl stands for some token that doesn't otherwise occur
in your data.  Any matches to this phrase, then, are guaranteed to be
matches to complete document fields, but with accommodation for
stopwords, stemming, or whatever your Analyzer does.  Add slop to the
phrase query if you wish, and again, you can attach appropriate weights
to this and combine with other techniques.
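
A rough sketch of method 2 in plain Python, to show why the sentinel forces a whole-field match. None of this is Lucene code: the analyzer, the stopword list and the function names are invented for illustration, and the phrase match is a plain contiguous-token comparison with no slop.

```python
STOPWORDS = {"a", "an", "the", "of"}   # assumed stopword list
SENTINEL = "zgzgl"                     # token assumed never to occur in the data

def analyze(text):
    """Toy analyzer: lowercase, split on whitespace, drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

def wrap(text):
    """Analyze the text with the sentinel token added at both ends."""
    return analyze(f"{SENTINEL} {text} {SENTINEL}")

def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside doc_tokens."""
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))

def is_whole_field_match(field_text, query_text):
    """Phrase match on the sentinel-wrapped forms of field and query."""
    return contains_phrase(wrap(field_text), wrap(query_text))
```

With these definitions, the query "history rome" is a whole-field match for "The History of Rome" (the stopwords drop out on both sides), but not for "The History of Rome and Greece", even though an ordinary phrase match on the unwrapped text would succeed there.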

Boris
-- 
Boris Goldowsky
[EMAIL PROTECTED]
www.goldowsky.com/consulting





Stemming options

2004-04-11 Thread Boris Goldowsky
Has anyone on the list implemented a dictionary-based English stemmer
with Lucene?  Perhaps based on the freely-available ispell dictionaries
or something like that?  The Porter and Snowball stemmers have not
worked that well for our application, but it is a bit daunting to start
from scratch in developing an alternate stemmer.

Alternatively, is there an algorithmic stemmer that anyone has used
which is a little less aggressive than the Porter algorithm?  We've been
having problems with searches for conversion returning converse and
conversational; and animal returning animate.  Yes, these are
morphologically related, but in our particular application it would be
better to stick with removing simple inflections.
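
For illustration only, here is the kind of "simple inflections only" behavior meant above, as a small Python sketch. It is not taken from Lucene, Porter or Snowball; the rules are loosely modeled on the well-known plural-stripping "S-stemmer" idea, and light_stem is a made-up name. It handles only plural endings and would need more rules (for -ed and -ing, say) to be useful in practice.

```python
def light_stem(word):
    """Strip common English plural endings and nothing else."""
    w = word.lower()
    if len(w) <= 3:
        return w
    if w.endswith("ies"):
        # "queries" -> "query", but leave words ending in "eies" or "aies" alone
        return w if w.endswith(("eies", "aies")) else w[:-3] + "y"
    if w.endswith("es"):
        # "horses" -> "horse", but leave words ending in "aes", "ees", "oes" alone
        return w if w.endswith(("aes", "ees", "oes")) else w[:-1]
    if w.endswith("s"):
        # "animals" -> "animal", but leave "class" and "virus" alone
        return w if w.endswith(("us", "ss")) else w[:-1]
    return w
```

Under these rules conversion, converse and conversational all stay distinct, as do animal and animate, while conversions and animals still reduce to their singular forms.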

Thanks for any pointers --

Boris






Overriding coordination

2004-03-29 Thread Boris Goldowsky
I have a situation where I'm querying for something in several fields,
with a clause similar to this:
  (title:(two words)^20  keywords:(two words)^10  body:(two words))

Some good documents are being scored too low if the query terms do not
occur in the body field.  I naively thought that would make only a few
percent difference, because of the large boosts on the title and
keywords fields, but in fact the document loses 1/3 of its score because
of the coordination term (2/3 rather than 1, because only 2 of the 3
clauses matched).
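
The arithmetic can be sketched in a few lines of Python. This is a simplified model, not Lucene's actual scoring code: it assumes the score of the boolean query is the sum of the matching clause scores multiplied by a coordination factor of matched/total, and the per-clause scores 20, 10 and 1 are made-up numbers standing in for the boosts.

```python
def boolean_score(clause_scores, matched, use_coord=True):
    """Sum the scores of matching clauses, optionally times matched/total."""
    overlap = sum(1 for m in matched if m)
    raw = sum(s for s, m in zip(clause_scores, matched) if m)
    coord = overlap / len(clause_scores) if use_coord else 1.0
    return raw * coord

# Hypothetical clause scores for title^20, keywords^10, body
clause_scores = [20.0, 10.0, 1.0]

all_three = boolean_score(clause_scores, [True, True, True])
no_body = boolean_score(clause_scores, [True, True, False])
no_body_no_coord = boolean_score(clause_scores, [True, True, False],
                                 use_coord=False)

print(no_body / all_three)           # share of the score kept with coord
print(no_body_no_coord / all_three)  # share kept if clauses simply add
```

In this model the document without a body match keeps roughly 65% of the full score when coordination applies, versus roughly 97% when the clauses simply add, which is the behavior I am after for the top-level clauses.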

Now, I love the coordination term for the multiple-word queries
(including the ones embedded in the query above), but for the
conjunction of the different fields I'd like to remove it, and just have
each clause add its score.  I feel like there's a way to do this,
perhaps with a custom Similarity subclass, but I can't quite see how to
set it up.

Can anyone point me in the right direction, or perhaps suggest a
different pathway that I'm missing?

Thanks a lot,

Boris






Cover density ranking?

2004-03-23 Thread Boris Goldowsky
Since there have been a few discussions recently of overriding various
aspects of Lucene's ranking formula, I got to wondering how difficult it
might be to implement something substantially different from the base tf/idf
ranking system that Lucene has built in.

How difficult would it be to implement something like Cover Density
ranking for Lucene?  Has anyone tried it?  

Cover density is described at http://citeseer.ist.psu.edu/558750.html ,
and is supposed to be particularly good for short queries of the type
that you get in many web applications.

Boris






Re: Demoting results

2004-03-22 Thread Boris Goldowsky
On Fri, 2004-03-19 at 11:58, Doug Cutting wrote:
 Doug Cutting wrote:
  On Thu, 2004-03-18 at 13:32, Doug Cutting wrote:
 
  Have you tried assigning these very small boosts (0 < boost < 1) and
  assigning other query clauses relatively large boosts (boost > 1)?
  
  I don't think you understood my proposal.  You should try boosting the 
  documents when you add them.  Instead of adding a doctype field with 
  good and bad values, use Document.setBoost(0.01) at index time.
 
 Sorry.  My mistake.  You did understand my proposal, it was just a bad 
 proposal.  Boosting documents is a better approach, but is less 
 flexible.  I think the final proposal in my previous message might be 
 the best approach (defining a custom coordination function for these 
 query clauses).

Thanks for the ideas - I love that Lucene is flexible enough to offer
so many ways to accomplish what at first seemed so difficult.

Boris






Re: Demoting results

2004-03-19 Thread Boris Goldowsky
I asked:
  Is there any way to build a query where the occurrence of a particular
  Term (in a Keyword field) causes the rank of the document to be
  decreased?

On Thu, 2004-03-18 at 13:32, Doug Cutting wrote:
 Have you tried assigning these very small boosts (0 < boost < 1) and
 assigning other query clauses relatively large boosts (boost > 1)?

Thanks for the suggestion!  Unfortunately it doesn't have the desired
effect.  I wanted 
  title: asparagus
  various fields...
  doctype: bad

to score lower than 
  title: asparagus
  various similar fields...
  doctype: good

I was trying to formulate a query like, say 
  +(title: asparagus) (doctype:bad)^-3

which would make sure the bad document was ranked lower than documents
with any other value for doctype.  But negative boosts are illegal.

I tried your suggestion of putting large boost on the first clause and a
small one (0.01) on the second, but the bad document is still ranked 
higher than the good one -- it gets a slight improvement from the
doctype:bad match, times 0.01, which is a very slight improvement but
still positive.  Then it gets a big boost because it has a 1.0 rather
than a 0.5 coordination factor, so the bad item gets top billing.

I think I've identified a few ways to solve the puzzle, though:

(a) enumerate all the possible good types of documents and search for
them, rather than the single bad one.  Harder to maintain since doctypes
can be introduced, but possible.

(b) attach boost values less than one to the bad Documents at indexing
time.  Not as flexible as modifying the query, but plausible.

(c) a more complex query like this:
 (title:asparagus) OR (title:asparagus -doctype:bad)
 so for good documents both clauses will match and the coordination
factor will be in their favor.  This increases query complexity (they
aren't really simple one-term queries like this toy example), but
hopefully that will not be a performance issue.
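
Here is the coordination effect, and why option (c) reverses it, as a toy Python model. It is not Lucene's real scoring: it assumes a query score is the sum of the matching top-level clause scores times matched/total, and the clause scores (1.0 each, with a boost of 10 on the title clause) are invented numbers.

```python
def score(clause_scores):
    """One entry per top-level clause; None means the clause did not match."""
    matched = [s for s in clause_scores if s is not None]
    coord = len(matched) / len(clause_scores)
    return sum(matched) * coord

TITLE = 1.0    # hypothetical score for title:asparagus
DOCTYPE = 1.0  # hypothetical score for doctype:bad before boosting

# Boosted query: +(title:asparagus)^10 (doctype:bad)^0.01
good_boosted = score([TITLE * 10, None])            # 1 of 2 clauses match
bad_boosted = score([TITLE * 10, DOCTYPE * 0.01])   # 2 of 2 clauses match

# Option (c): (title:asparagus) OR (title:asparagus -doctype:bad)
good_c = score([TITLE, TITLE])  # both clauses match the good document
bad_c = score([TITLE, None])    # second clause excludes the bad document
```

In this model the bad document beats the good one under the boosted query (its coordination factor is 1.0 against 0.5), while under option (c) the good document comes out ahead because it now matches both clauses.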

Bng








Demoting results

2004-03-17 Thread Boris Goldowsky
Is there any way to build a query where the occurrence of a particular
Term (in a Keyword field) causes the rank of the document to be
decreased?  I have various types of documents, and some of them are less
interesting than others, so I want them to be pushed towards the bottom
of the results ranking.  However, I do not want to eliminate them
entirely, so I can't use a boolean not.

Using negative weights would seem logical here, but apparently has no
effect on rankings - negative weights appear to be treated as zeros.

Any ideas would be appreciated.

Thanks,
Boris





Paid support for Lucene

2004-01-29 Thread Boris Goldowsky
Strangely, the web site does not seem to list any vendors who provide
incident support for Lucene.  That can't be right, can it?

Can anyone point me to organizations that would be willing to provide
support for Lucene issues?

Thanks,
Boris
-- 
Boris Goldowsky
[EMAIL PROTECTED]
www.goldowsky.com/consulting

