Suggestions for documentation or LIA

2005-01-26 Thread Ian Soboroff
Erik Hatcher <[EMAIL PROTECTED]> writes:

> By all means, if you have other suggestions for our site, let us know 
> at [EMAIL PROTECTED]

One of the things I would like to see, but which isn't in the Lucene
site, the documentation, or "Lucene in Action", is a complete
description of how the retrieval algorithm works.  That is, how the
HitCollector, Scorers, Similarity, etc. all fit together.
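
(For concreteness, my current mental model is roughly the sketch below:
the searcher walks a Scorer over the index, the Scorer consults the
Similarity for weights, and each matching (doc, score) pair is pushed into
a HitCollector.  This is only a rough sketch against what I believe is the
1.4-era API, with a placeholder index path and query term; an authoritative
description is exactly what I'm asking for.)

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class CollectorSketch {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/path/to/index");
            // Internally the query becomes a Weight and then a Scorer; the
            // Scorer asks the Similarity for tf/idf/norm values and reports
            // each matching document to the HitCollector below.
            TermQuery query = new TermQuery(new Term("contents", "lucene"));
            searcher.search(query, new HitCollector() {
                public void collect(int doc, float score) {
                    System.out.println("doc=" + doc + " score=" + score);
                }
            });
            searcher.close();
        }
    }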

I'm involved in a project which, to some degree, involves poking
deeply into this part of the Lucene code.  We have a nice (non-Lucene)
framework for working with a wider range of similarity functions
(beyond tf-idf), which should also be expandable to include query
expansion, relevance feedback, and the like.

I used to think that integrating it would be as simple as hacking in
Similarity, but I'm beginning to think it might need broader changes.
I could obviously hook in our whole retrieval setup by just diving for
an IndexReader and doing it all by hand, but then I would have to redo
the incremental search and possibly the rich query structure, which
would be a loss.

So anyway, I got LIA hoping for a good explanation (not a good
Explanation) on this bit, but it wasn't there.  There are some hints
on the Lucene site, but nothing complete.  If I muddle it out before
anything gets contributed, I'll try to write something up, but don't
expect anything too soon...

Ian






Re: Suggestions for documentation or LIA

2005-01-26 Thread jian chen
Hi,

Just to continue this discussion: I think right now Lucene's retrieval
algorithm is based purely on the Vector Space Model, which is simple
and efficient.

However, there may be cases where folks like me want to use a
completely different set of ranking algorithms, ones which do not even
use tf/idf.

For example, I am thinking about adding a Cover Density ranking
algorithm to Lucene, which is based purely on proximity information
and does not require any global ranking variables.  But looking into
the Lucene code, it does not seem very easy to hack that in, at least
for me, a novice Lucene user.

I read on the Lucene 2.0 whiteboard that Lucene will accommodate more
in terms of what can be indexed and such.  That move might be good for
implementing other or ad hoc ranking algorithms.

Cheers,

Jian


On Wed, 26 Jan 2005 10:25:15 -0500, Ian Soboroff <[EMAIL PROTECTED]> wrote:
> Erik Hatcher <[EMAIL PROTECTED]> writes:
> 
> > By all means, if you have other suggestions for our site, let us know
> > at [EMAIL PROTECTED]
> 
> One of the things I would like to see, but which isn't in the Lucene
> site, the documentation, or "Lucene in Action", is a complete
> description of how the retrieval algorithm works.  That is, how the
> HitCollector, Scorers, Similarity, etc. all fit together.
> 
> I'm involved in a project which, to some degree, involves poking
> deeply into this part of the Lucene code.  We have a nice (non-Lucene)
> framework for working with a wider range of similarity functions
> (beyond tf-idf), which should also be expandable to include query
> expansion, relevance feedback, and the like.
> 
> I used to think that integrating it would be as simple as hacking in
> Similarity, but I'm beginning to think it might need broader changes.
> I could obviously hook in our whole retrieval setup by just diving for
> an IndexReader and doing it all by hand, but then I would have to redo
> the incremental search and possibly the rich query structure, which
> would be a loss.
> 
> So anyway, I got LIA hoping for a good explanation (not a good
> Explanation) on this bit, but it wasn't there.  There are some hints
> on the Lucene site, but nothing complete.  If I muddle it out before
> anything gets contributed, I'll try to write something up, but don't
> expect anything too soon...
> 
> Ian
> 




Re: Suggestions for documentation or LIA

2005-01-26 Thread Ian Soboroff
jian chen <[EMAIL PROTECTED]> writes:

> Just to continue this discussion: I think right now Lucene's retrieval
> algorithm is based purely on the Vector Space Model, which is simple
> and efficient.

As I understand it, it's indeed a tf-idf vector space approach, except
that the queries are structured.  As such, the tf-idf weights are
totaled as a straight cosine among the siblings of a BooleanQuery, but
other query nodes may do things differently; for example, I haven't
read the code, but I assume PhraseQueries require all terms to be
present and adjacent in order to contribute to the score.

There is also a document-specific boost factor in the equation, which
is essentially a hook for document-level signals like recency,
PageRank, and so on.
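
(To make that concrete: the boost is supplied at index time, something like
the sketch below against what I believe is the 1.4-era API; the index path,
field, and the 1.3f value standing in for some recency or PageRank signal
are made up for illustration.)

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class DocBoostSketch {
        public static void main(String[] args) throws Exception {
            IndexWriter writer =
                new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(Field.Text("contents", "lucene scoring example"));
            // Hypothetical document-level signal (recency, PageRank, ...):
            // the boost is folded into the field norms when the doc is written.
            doc.setBoost(1.3f);
            writer.addDocument(doc);
            writer.close();
        }
    }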

You can tweak this by defining a custom Similarity class, which says
what tf, idf, the norm, and the boost mean.  You can also affect the
query-side term normalization (I think through the Weight's
sumOfSquaredWeights method, which feeds Similarity's queryNorm?).
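
(By way of illustration, a custom Similarity is just a subclass that
redefines those quantities.  The sketch below is written against what I
believe is the 1.4-era DefaultSimilarity; the particular formulas are toy
examples, not a recommendation.)

    import org.apache.lucene.search.DefaultSimilarity;

    // Toy weighting: raw term frequency and a flattened idf curve.
    public class SketchSimilarity extends DefaultSimilarity {
        public float tf(float freq) {
            return freq;                      // linear instead of sqrt(freq)
        }
        public float idf(int docFreq, int numDocs) {
            return (float) (1.0 + Math.log(numDocs / (double) (docFreq + 1)));
        }
    }

(I believe it gets installed with IndexSearcher.setSimilarity() at search
time, and with IndexWriter.setSimilarity() if the index-time norms should
change as well.)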

We've implemented something kind of like the Similarity class but
based on a model which describes a larger family of "similarity
functions".  (For the curious or similarly IR-geeky, it's from Justin
Zobel's paper from a few years ago in SIGIR Forum.)  Essentially I
need more general hooks than the Lucene Similarity provides.  I think
those hooks might exist, but I'm not sure I know which classes they're
in.

I'm also interested in things like relevance feedback, which can
affect term weights as well as add terms to the query... just how many
places in the code do I have to subclass or change?

It's clear that if I'm interested in a completely different model,
like language modeling, the IndexReader is the way to go.  In that
case, what parts of the Lucene class structure should I adapt to keep
the incremental return of results, the inverted-list skips, and other
features which make the inverted search fast?

Ian






Re: Suggestions for documentation or LIA

2005-01-26 Thread jian chen
Hi, Ian,

Thanks for the information.  It would be really helpful to have some
documentation, maybe on the wiki, about the retrieval algorithm and
how to hack it.  Even just a few paragraphs to get started would be
something...

Thanks,

Jian

On Wed, 26 Jan 2005 12:40:54 -0500, Ian Soboroff <[EMAIL PROTECTED]> wrote:
> jian chen <[EMAIL PROTECTED]> writes:
> 
> > Just to continue this discussion: I think right now Lucene's retrieval
> > algorithm is based purely on the Vector Space Model, which is simple
> > and efficient.
> 
> As I understand it, it's indeed a tf-idf vector space approach, except
> that the queries are structured.  As such, the tf-idf weights are
> totaled as a straight cosine among the siblings of a BooleanQuery, but
> other query nodes may do things differently; for example, I haven't
> read the code, but I assume PhraseQueries require all terms to be
> present and adjacent in order to contribute to the score.
> 
> There is also a document-specific boost factor in the equation, which
> is essentially a hook for document-level signals like recency,
> PageRank, and so on.
> 
> You can tweak this by defining a custom Similarity class, which says
> what tf, idf, the norm, and the boost mean.  You can also affect the
> query-side term normalization (I think through the Weight's
> sumOfSquaredWeights method, which feeds Similarity's queryNorm?).
> 
> We've implemented something kind of like the Similarity class but
> based on a model which describes a larger family of "similarity
> functions".  (For the curious or similarly IR-geeky, it's from Justin
> Zobel's paper from a few years ago in SIGIR Forum.)  Essentially I
> need more general hooks than the Lucene Similarity provides.  I think
> those hooks might exist, but I'm not sure I know which classes they're
> in.
> 
> I'm also interested in things like relevance feedback, which can
> affect term weights as well as add terms to the query... just how many
> places in the code do I have to subclass or change?
> 
> It's clear that if I'm interested in a completely different model,
> like language modeling, the IndexReader is the way to go.  In that
> case, what parts of the Lucene class structure should I adapt to keep
> the incremental return of results, the inverted-list skips, and other
> features which make the inverted search fast?
> 
> Ian
> 




Re: Suggestions for documentation or LIA

2005-01-26 Thread Erik Hatcher
On Jan 26, 2005, at 10:25 AM, Ian Soboroff wrote:
> Erik Hatcher <[EMAIL PROTECTED]> writes:
>> By all means, if you have other suggestions for our site, let us know
>> at [EMAIL PROTECTED]
>
> One of the things I would like to see, but which isn't in the Lucene
> site, the documentation, or "Lucene in Action", is a complete
> description of how the retrieval algorithm works.  That is, how the
> HitCollector, Scorers, Similarity, etc. all fit together.
>
> I'm involved in a project which, to some degree, involves poking
> deeply into this part of the Lucene code.  We have a nice (non-Lucene)
> framework for working with a wider range of similarity functions
> (beyond tf-idf), which should also be expandable to include query
> expansion, relevance feedback, and the like.
>
> I used to think that integrating it would be as simple as hacking in
> Similarity, but I'm beginning to think it might need broader changes.
> I could obviously hook in our whole retrieval setup by just diving for
> an IndexReader and doing it all by hand, but then I would have to redo
> the incremental search and possibly the rich query structure, which
> would be a loss.
>
> So anyway, I got LIA hoping for a good explanation (not a good
> Explanation) on this bit, but it wasn't there.

Hacking Similarity wasn't covered in LIA for one simple reason - 
Lucene's built-in scoring mechanism really is good enough for almost 
all projects.  The book was written for developers of those projects.

Personally, I've not had to hack Similarity, though I've toyed with it 
in prototypes and am using a minor tweak (turning off length 
normalization for the "title" field) for the lucenebook.com book 
indexing.
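
For reference, that tweak is tiny; from memory (so treat the exact
signatures as approximate, 1.4-era API), it's just a lengthNorm override
for that one field:

    import org.apache.lucene.search.DefaultSimilarity;

    // Leave every field alone except "title", whose length normalization is
    // disabled so that short and long titles score on an equal footing.
    public class TitleNormSimilarity extends DefaultSimilarity {
        public float lengthNorm(String fieldName, int numTokens) {
            if ("title".equals(fieldName)) {
                return 1.0f;
            }
            return super.lengthNorm(fieldName, numTokens);
        }
    }

It has to be in place at indexing time, since the norms are baked into the
index.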

> There are some hints
> on the Lucene site, but nothing complete.  If I muddle it out before
> anything gets contributed, I'll try to write something up, but don't
> expect anything too soon...

And maybe you'd contribute what you write to LIA 2nd edition :)
Erik


Re: Suggestions for documentation or LIA

2005-01-26 Thread Paul Elschot
On Wednesday 26 January 2005 18:40, Ian Soboroff wrote:
> jian chen <[EMAIL PROTECTED]> writes:
> 
> > Just to continue this discussion: I think right now Lucene's retrieval
> > algorithm is based purely on the Vector Space Model, which is simple
> > and efficient.
> 
> As I understand it, it's indeed a tf-idf vector space approach, except
> that the queries are structured.  As such, the tf-idf weights are
> totaled as a straight cosine among the siblings of a BooleanQuery, but
> other query nodes may do things differently; for example, I haven't
> read the code, but I assume PhraseQueries require all terms to be
> present and adjacent in order to contribute to the score.
> 
> There is also a document-specific boost factor in the equation, which
> is essentially a hook for document-level signals like recency,
> PageRank, and so on.
> 
> You can tweak this by defining a custom Similarity class, which says
> what tf, idf, the norm, and the boost mean.  You can also affect the
> query-side term normalization (I think through the Weight's
> sumOfSquaredWeights method, which feeds Similarity's queryNorm?).
> 
> We've implemented something kind of like the Similarity class but
> based on a model which describes a larger family of "similarity
> functions".  (For the curious or similarly IR-geeky, it's from Justin
> Zobel's paper from a few years ago in SIGIR Forum.)  Essentially I
> need more general hooks than the Lucene Similarity provides.  I think
> those hooks might exist, but I'm not sure I know which classes they're
> in.
> 
> I'm also interested in things like relevance feedback, which can
> affect term weights as well as add terms to the query... just how many
> places in the code do I have to subclass or change?

None. Create your own TermQuery instances, set their boosts,
and add them to a BooleanQuery.
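
For example (a rough sketch against the current 1.4 API; the field, terms,
and boost values are made up, standing in for whatever your feedback step
produces):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class FeedbackQuerySketch {
        // The per-term boosts carry the (hypothetical) relevance feedback
        // weights; no Similarity changes are needed for this.
        public static BooleanQuery build() {
            BooleanQuery query = new BooleanQuery();

            TermQuery original = new TermQuery(new Term("contents", "lucene"));
            original.setBoost(1.0f);
            query.add(original, false, false);             // optional clause

            TermQuery expanded = new TermQuery(new Term("contents", "search"));
            expanded.setBoost(0.4f);                       // feedback weight
            query.add(expanded, false, false);             // optional clause

            return query;
        }
    }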
 
> It's clear that if I'm interested in a completely different model,
> like language modeling, the IndexReader is the way to go.  In that
> case, what parts of the Lucene class structure should I adapt to keep
> the incremental return of results, the inverted-list skips, and other
> features which make the inverted search fast?

To keep the speed, the one thing you should preserve is the performance
of TermQuery.  In case you're interested in changing proximity scores,
the same holds for SpanTermQuery.
For a variation on TermQuery that scores query terms by their density
in a document field, you can have a look here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=31784

On top of these you can implement your own Scorers, but for Zobel's
similarities you probably won't need much more than what BooleanQuery
provides.
To use the inverted list skips, make sure to implement and use skipTo()
on your scorers.
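
A custom Scorer that honours that contract can be as small as a wrapper
around a TermDocs enumeration; this is only a sketch of the shape of it,
against the 1.4-era classes and with a made-up constant weighting:

    import java.io.IOException;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.Scorer;
    import org.apache.lucene.search.Similarity;

    // Minimal term scorer: a toy weight * freq score, but it forwards
    // skipTo() to TermDocs so that enclosing conjunctions can use the skip
    // pointers in the inverted list instead of scanning it.
    class SimpleTermScorer extends Scorer {
        private final TermDocs termDocs;
        private final float weight;

        SimpleTermScorer(Similarity similarity, TermDocs termDocs, float weight) {
            super(similarity);
            this.termDocs = termDocs;
            this.weight = weight;
        }
        public boolean next() throws IOException { return termDocs.next(); }
        public int doc() { return termDocs.doc(); }
        public float score() { return weight * termDocs.freq(); }
        public boolean skipTo(int target) throws IOException {
            return termDocs.skipTo(target);
        }
        public Explanation explain(int doc) {
            return new Explanation(score(), "toy term score (sketch)");
        }
    }
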
In case you need larger queries in conjunctive normal form:
+(synA1 synA2 ...) +(synB1 synB2 ...) +(synC1 synC2 ...)
the development version of BooleanQuery might be a bit faster
than the current one.
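
In the query API, that is just nested BooleanQueries: each +(...) group is
a BooleanQuery of optional TermQuery clauses, added to the outer query as
a required clause.  Roughly (1.4-era API, made-up field and terms):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class CnfQuerySketch {
        // One +(...) group: any of the synonyms may match.
        static BooleanQuery group(String field, String[] synonyms) {
            BooleanQuery group = new BooleanQuery();
            for (int i = 0; i < synonyms.length; i++) {
                group.add(new TermQuery(new Term(field, synonyms[i])),
                          false, false);                   // optional
            }
            return group;
        }

        public static BooleanQuery build() {
            BooleanQuery query = new BooleanQuery();
            query.add(group("contents", new String[] {"car", "auto"}),
                      true, false);                        // required group
            query.add(group("contents", new String[] {"fast", "quick"}),
                      true, false);                        // required group
            return query;
        }
    }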

For an interesting twist in the use of idf, please search
for "fuzzy scoring changes" on lucene-dev at the end of 2004.

Regards,
Paul Elschot





Re: Suggestions for documentation or LIA

2005-01-26 Thread Ian Soboroff
Erik Hatcher <[EMAIL PROTECTED]> writes:

> Hacking Similarity wasn't covered in LIA for one simple reason - 
> Lucene's built-in scoring mechanism really is good enough for almost 
> all projects.  The book was written for developers of those projects.
>
> Personally, I've not had to hack Similarity, though I've toyed with it 
> in prototypes and am using a minor tweak (turning off length 
> normalization for the "title" field) for the lucenebook.com book 
> indexing.
>
>>   There are some hints
>> on the Lucene site, but nothing complete.  If I muddle it out before
>> anything gets contributed, I'll try to write something up, but don't
>> expect anything too soon...
>
> And maybe you'd contribute what you write to LIA 2nd edition :)

Maybe that too.  ;-)  What we're working on isn't aimed at the site
admin who wants to tweak site search; it's aimed more at the IR
researcher.  Among other things, it handles Cranfield-style batch
experiments and many standard IR test collections, for example.

Ian


