Re: Dates and others

2003-12-02 Thread Erik Hatcher
On Monday, December 1, 2003, at 11:55  PM, Tatu Saloranta wrote:
On a related note, it would also be nice if there was a way to start
categorizing general hot topics for Lucene developers; it seems like there
are about half a dozen areas where there's lots of interest for improvements
(most of them related to ranking). If so, perhaps there could be more
specific discussion groups, and also perhaps web pages summarizing some of
discussions, consensus achieved, even if there's no code to show for it?

I agree.

Sounds like the perfect solution is a wiki!  Just happens we have one:

	http://nagoya.apache.org/wiki/apachewiki.cgi?LuceneProjectPages

Have at it!

	Erik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Dates and others

2003-12-01 Thread Doug Cutting
Dion Almaer wrote:
The only real item that I still want to tweak more is getting recent results higher in the list.

I was wondering if something like this could work (or if there is a better solution)

At index time, I have the date of the content.  I could do some math where the higher the date
(based on the time_t version or whatever) the more of a setBoost(metric). Or, for every month in the
past, create a larger negative number to setBoost()... or something like that.

Would something like this make sense?

The problem with this approach is that eventually you'll exhaust the 
range of the boost.  So this will only work if you re-index things from 
scratch periodically, with a boost of something like 1/days-ago.
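
A minimal sketch of the 1/days-ago idea, assuming the index is rebuilt from scratch so that "today" is the same for every document. The class and method names are hypothetical, and the clamp to one day is an added assumption to avoid dividing by zero; the result is the value that would be handed to setBoost() at index time.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper: maps a document's age to an index-time boost. */
final class RecencyBoost {
    /**
     * Returns 1 / daysAgo. Documents dated today (or in the future) are
     * clamped to an age of one day, so the maximum boost is 1.0.
     */
    static float boostFor(LocalDate docDate, LocalDate today) {
        long daysAgo = ChronoUnit.DAYS.between(docDate, today);
        return 1.0f / Math.max(1L, daysAgo);
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2003, 12, 1);
        // A document from yesterday versus one from a month ago.
        System.out.println(boostFor(LocalDate.of(2003, 11, 30), today));
        System.out.println(boostFor(LocalDate.of(2003, 11, 1), today));
    }
}
```

Because the boost depends on the indexing date, values computed on different days are not comparable, which is why this only works with periodic full re-indexing.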

If you're adding documents to the index in date order, then you could 
use a HitCollector which adjusts scores according to the document 
number, since document numbers increase as you add to the index.

If you're not adding things in date order, then you can, when you open 
the index, build an array mapping document numbers to integer dates. 
Then your hit collector can use this to either boost or sort hits by date.
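
The array-based approach might look roughly like the sketch below. It is plain Java rather than Lucene's HitCollector API: the class name, the days-since-epoch encoding of the "integer dates", and the blending formula are all assumptions made for illustration. A hit collector would call adjust() for each (document number, score) pair it receives.

```java
import java.util.Arrays;

/** Hypothetical sketch: rescoring hits by document date (not a Lucene class). */
final class DateRescorer {
    private final int[] dayByDoc;  // dayByDoc[docNumber] = document date in days since some epoch
    private final int newestDay;   // most recent date found in the array
    private final float weight;    // how strongly recency influences the score (assumed tunable)

    DateRescorer(int[] dayByDoc, float weight) {
        this.dayByDoc = dayByDoc.clone();
        this.newestDay = Arrays.stream(dayByDoc).max().orElse(0);
        this.weight = weight;
    }

    /** Multiplies the raw score by 1 + weight / (1 + ageInDays). */
    float adjust(int doc, float rawScore) {
        int ageDays = newestDay - dayByDoc[doc];
        return rawScore * (1.0f + weight / (1.0f + ageDays));
    }

    public static void main(String[] args) {
        // Three documents with equal raw scores but different dates.
        DateRescorer rescorer = new DateRescorer(new int[] {100, 130, 129}, 1.0f);
        for (int doc = 0; doc < 3; doc++) {
            System.out.println("doc " + doc + " -> " + rescorer.adjust(doc, 0.5f));
        }
    }
}
```

The array is built once when the index is opened, so the per-hit cost is a single array lookup.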

Or you could add a month or week field to documents, then add it as 
a clause to your queries with a boost.  Then documents matching the most 
recent week(s) and/or month(s) would get the boost.
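
As a rough illustration of the boosted-clause idea, the sketch below wraps a user query in a required group and appends optional clauses on a month field, written in QueryParser syntax (+ for required, ^ for boost). The field name "month", its yyyyMM format, and the halving of the boost per month are assumptions, not something specified in this thread.

```java
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper: appends boosted recent-month clauses to a query string. */
final class RecentMonthClause {
    private static final DateTimeFormatter YYYYMM = DateTimeFormatter.ofPattern("uuuuMM");

    /**
     * Makes the user's query required and adds one optional clause per recent
     * month; the i-th month back gets topBoost / (i + 1).
     */
    static String withRecencyClauses(String userQuery, YearMonth current, int months, float topBoost) {
        StringBuilder sb = new StringBuilder("+(").append(userQuery).append(')');
        for (int i = 0; i < months; i++) {
            float boost = topBoost / (i + 1);
            sb.append(" month:").append(current.minusMonths(i).format(YYYYMM))
              .append('^').append(boost);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(withRecencyClauses("lucene boost", YearMonth.of(2003, 12), 2, 4.0f));
    }
}
```

Documents that match only the month clauses are excluded because the user's query is required; the optional clauses merely raise the score of recent matches.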

Doug



RE: Dates and others

2003-12-01 Thread Chong, Herb
Ad hoc techniques run into lots of trouble because the requirement on Lucene isn't 
well specified. Is being a week newer enough to move a document that has one of the 
search terms ahead of a document that has all of them? The boost mechanism 
is a way to move documents around in the ranking list, but it clearly is a way to 
reweight the importance of the query terms and not to impose external constraints that 
properly should be handled outside the search engine.

Herb...

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, December 01, 2003 1:11 PM
To: Lucene Users List
Subject: Re: Dates and others




Re: Dates and others

2003-12-01 Thread Tatu Saloranta
On Monday 01 December 2003 15:13, Dion Almaer wrote:
...
 Interesting.  I implemented an approach which boosted based on the number
 of months in the past, and after tweaking the boost amounts, it seems to do
 the job. I do a fresh reindex every night (since the indexing process takes
 no time at all... unlike our old search solution!)

This sounds interesting, as I have been thinking about the best way
to boost newer documents. Can you share some of your experience regarding 
boost values that seemed to make sense? In my case, the CMS I'm working on stores 
support documentation for software/hardware, meaning that content is highly 
time-sensitive (i.e. documents decay pretty quickly).

Since the system is already doing both incremental reindexing, and nightly 
full reindexing (latter to make sure that even if temporarily some changed 
content was not [fully] reindexed, it eventually gets indexed properly), I 
can fairly easily add boosting I think.

On a related note, it would also be nice if there was a way to start 
categorizing general hot topics for Lucene developers; it seems like there 
are about half a dozen areas where there's lots of interest for improvements 
(most of them related to ranking). If so, perhaps there could be more 
specific discussion groups, and also perhaps web pages summarizing some of 
discussions, consensus achieved, even if there's no code to show for it?

-+ Tatu +-


 I read content for the index from different sources. Sometimes the source
 gives me documents loosely in date order, but not all of them. So, it seems
 that one of the other approaches should be taken (adding a month/week field
 etc).  I should look more into the HitCollector and see how it can help me.

 The other issue I have is that I would like to prioritize the title field. 
 At the moment I am lazy and add the title to the body (contents = title +
 body) which seems to be OK... however sometimes something that mentions the
 search term in the title should appear higher up in the pecking order.

 I am using the QueryParser (subclassed to disallow wildcards etc) to do the
 dirty work for me. Should I get away from this and manage the queries
 myself (and run a Multi against the title field as well as the contents)?

 Thanks for the great feedback,

 Dion





RE: Dates and others

2003-11-26 Thread Dion Almaer
Hi guys -

So I am getting happier with search, and just pushed the lucene version live at:

http://www.theserverside.com (on the leftbar) and:
http://www.theserverside.com/home/search/index.jsp

The only real item that I still want to tweak more is getting recent results higher in the list.

I was wondering if something like this could work (or if there is a better solution)

At index time, I have the date of the content.  I could do some math where the higher the date
(based on the time_t version or whatever) the more of a setBoost(metric). Or, for every month in the
past, create a larger negative number to setBoost()... or something like that.

Would something like this make sense?

Dion


 -Original Message-
 From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
 Sent: Sunday, November 23, 2003 3:52 PM
 To: Lucene Users List
 Subject: Re: Dates and others
 



RE: Dates and others

2003-11-24 Thread Dion Almaer
Erik -

Spot on. I should have listened to your advice from the talk and just used MMDD :)

Everything works nicely now that I do the conversion.

Thanks for the great ideas.

Dion 

 -Original Message-
 From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
 Sent: Sunday, November 23, 2003 11:41 PM
 To: Lucene Users List
 Subject: Re: Dates and others
 



Re: Dates and others

2003-11-24 Thread Ype Kingma
Erik,

On Sunday 23 November 2003 12:51, Erik Hatcher wrote:
 On Saturday, November 22, 2003, at 06:33  PM, Dion Almaer wrote:
  3. I have some fields such as title, owner, etc as well as the content
  blob which I index and use as the default search field.  Is there an
  easy way to extend the QueryParser to merge it with a MultiTermQuery
  which can also search this meta data and give them certain weights?
  Or, if you go down this path do you have to leave the QueryParser
  behind and build your own queries?  Any best practices would be great.

 And Ype said:
 You can provide field weights at document indexing time (norms) and use
 a MultiTermQuery for searching multiple fields. At query time you can
 again use field weights.
 I don't know how the scoring of the MultiTermQuery is done,
 it might use the max. score over the fields of a document, or combine
 the scores in the fields of a document.
  end Ype's reply cut and paste

 I'm a little confused with this question and Ype's reply.
 MultiTermQuery is an abstract base class under Query, which is the
 parent for WildcardQuery and FuzzyQuery.

 What I think you're after is using MultiFieldQueryParser, but you want

Thanks for the correction,

 to weight the fields differently.  You can add the boosts at indexing
 time using Field.setBoost.  Unfortunately at the moment

and thanks for explaining how to provide field weights.

Ype





Re: Dates and others

2003-11-23 Thread Ype Kingma
On Saturday 22 November 2003 18:33, Dion Almaer wrote:
...

 1. The power of dates:

I am fairly happy with the results of queries on my index.  The only
 issue I have is that at the moment the date of the content isn't considered
 (since lucene doesn't know about it).  Is there a good way in which the
 date of the content could be used to help with the scoring?  So more recent
 content shows up higher in the stack.  I have a date keyword field, but it
 isn't part of the query itself.  Are there any patterns to help with this?

You can use the Lucene date field, or use a keyword field eg. in mmdd
format. However, Lucene's scoring is not based on the value of
a matching term, it's based on term frequencies in documents, on
the number of documents in the index containing the term, and
on the distance between terms (for proximity queries.)
You cannot make the document score depend directly on the value of 
a (date) field in the document.
Btw, how big would you want the date influence to be in the score?

Sorting results by date has been discussed in the past,  see the archives.
You lose the document scores in this case.

 2. +field:foo and the QueryParser:

I ran into some problems where using +field:foo was giving strange
 results.  When I changed the queries to ... AND field:foo everything was
 fine.
Am I missing something there?

Which version of Lucene are you using? There have been
some fixes in the query parser of Lucene 1.2, but I don't know 
precisely which.

 3. I have some fields such as title, owner, etc as well as the content blob
 which I index and use as the default search field.  Is there an easy way to
 extend the QueryParser to merge it with a MultiTermQuery which can also
 search this meta data and give them certain weights?  Or, if you go down

You can provide field weights at document indexing time (norms) and use a
MultiTermQuery for searching multiple fields. At query time you can
again use field weights.
I don't know how the scoring of the MultiTermQuery is done,
it might use the max. score over the fields of a document, or combine the
scores in the fields of a document.

 this path do you have to leave the QueryParser behind and build your own
 queries?  Any best practices would be great.

You have some options:
- create the MultiTermQuery from the query text, or
- index the default search field as a single field, e.g. by concatenation,
  possibly inserting empty tokens in between to avoid proximity matches.
  This has also been discussed recently, see e.g. the discussion on
  indexing of sentences.

Searching multiple fields is normally a little slower than searching a
concatenated field. The actual difference depends on your data, so
you might experiment a bit. You might e.g. index all fields
separately, and also index a default concatenated field.

Kind regards,
Ype Kingma





RE: Dates and others

2003-11-23 Thread Dion Almaer
Thanks for responding Ype.

 -Original Message-
 From: Ype Kingma [mailto:[EMAIL PROTECTED] 
 Sent: Sunday, November 23, 2003 2:03 PM
 To: Lucene Users List
 Subject: Re: Dates and others
 
 On Saturday 22 November 2003 18:33, Dion Almaer wrote:
 ...
 
  1. The power of dates:
 
  I am fairly happy with the results of queries on my index.  The only
  issue I have is that at the moment the date of the content isn't
  considered (since lucene doesn't know about it).  Is there a good way
  in which the date of the content could be used to help with the
  scoring?  So more recent content shows up higher in the stack.  I have
  a date keyword field, but it isn't part of the query itself.  Are
  there any patterns to help with this?
 
 You can use the Lucene date field, or use a keyword field eg. in mmdd
 format. However, Lucene's scoring is not based on the value of a
 matching term, it's based on term frequencies in documents, on the
 number of documents in the index containing the term, and on the
 distance between terms (for proximity queries.) You cannot make the
 document score depend directly on the value of a (date) field in the
 document.
 Btw, how big would you want the date influence to be in the score?
 
 Sorting results by date has been discussed in the past,  see the archives.
 You lose the document scores in this case.


Yeah this is tough.  I don't want to sort by date as then something that was a really
low score but was recent would show up at the top.
I think I will stick with giving the user the ability to say between Jan 1 2003 and
... instead.

This leads me to another issue actually.  On certain range queries I get exceptions:

Query: modifieddate:[1/1/03 TO 12/31/03]

org.apache.lucene.search.BooleanQuery$TooManyClauses
at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:109)
at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:101)
at org.apache.lucene.search.RangeQuery.rewrite(RangeQuery.java:137)
at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:188)
at org.apache.lucene.search.Query.weight(Query.java:120)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:128)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:93)
at org.apache.lucene.search.Hits.init(Hits.java:80)
at org.apache.lucene.search.Searcher.search(Searcher.java:71)
at org.apache.lucene.search.Searcher.search(Searcher.java:65)
at com.portal.util.search.IndexSearch.search(Unknown Source)
at com.portal.util.search.IndexSearch.main(Unknown Source)

Has anyone run into this problem?


  2. +field:foo and the QueryParser:
 
  I ran into some problems where using +field:foo was giving strange
  results.  When I changed the queries to ... AND field:foo everything
  was fine.
  Am I missing something there?
 
 Which version of Lucene are you using? There have been some
 fixes in the query parser of Lucene 1.2, but I don't know
 precisely which.


I am using 1.3 RC 2.  The AND workaround is fine... just caught me
by surprise.


  3. I have some fields such as title, owner, etc as well as the content
  blob which I index and use as the default search field.  Is there an
  easy way to extend the QueryParser to merge it with a MultiTermQuery
  which can also search this meta data and give them certain weights?
  Or, if you go down
 
 You can provide field weights at document indexing time 
 (norms) and use a MultiTermQuery for searching multiple 
 fields. At query time you can again use field weights.
 I don't know how the scoring of the MultiTermQuery is done, 
 it might use the max. score over the fields of a document, or 
 combine the scores in the fields of a document.


Yeah I will play with this.  For now adding the title to the main body does seem to
work pretty well, so it may be good enough!

 
  this path do you have to leave the QueryParser behind and build your
  own queries?  Any best practices would be great.
 
 You have some options:
 - create the MultiTermQuery from the query text, or
 - index the default search field as a single field, e.g. by
   concatenation, possibly inserting empty tokens in between
   to avoid proximity matches.
   This has also been discussed recently, see e.g. the discussion
   on indexing of sentences.
 
 Searching multiple fields is normally a little slower than 
 searching a concatenated field. The actual difference depends 
 on your data, so you might experiment a bit. You might e.g. 
 index all fields separately, and also index a default 
 concatenated field.
 
 Kind regards,
 Ype Kingma


Thanks a lot for the great ideas.

Dion





Re: Dates and others

2003-11-23 Thread Erik Hatcher
On Saturday, November 22, 2003, at 06:33  PM, Dion Almaer wrote:
3. I have some fields such as title, owner, etc as well as the content
blob which I index and use as the default search field.  Is there an
easy way to extend the QueryParser to merge it with a MultiTermQuery
which can also search this meta data and give them certain weights?
Or, if you go down this path do you have to leave the QueryParser
behind and build your own queries?  Any best practices would be great.

And Ype said:
You can provide field weights at document indexing time (norms) and use
a MultiTermQuery for searching multiple fields. At query time you can
again use field weights.
I don't know how the scoring of the MultiTermQuery is done,
it might use the max. score over the fields of a document, or combine
the scores in the fields of a document.
 end Ype's reply cut and paste

I'm a little confused with this question and Ype's reply.  
MultiTermQuery is an abstract base class under Query, which is the 
parent for WildcardQuery and FuzzyQuery.

What I think you're after is using MultiFieldQueryParser, but you want 
to weight the fields differently.  You can add the boosts at indexing 
time using Field.setBoost.  Unfortunately, at the moment 
MultiFieldQueryParser is not very extensible: there are some open 
issues with its subclassability. Once those are resolved, subclassing 
MFQP and overriding getFieldQuery will do the trick, allowing you to 
boost at query time.
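
Until query-time boosting through a parser subclass is practical, one workaround is to expand each search term across the fields yourself and attach a boost per field using the ^ syntax. The sketch below only builds such a query string; it is not MultiFieldQueryParser. The class name, field names and boost values are hypothetical, and terms are assumed to be plain words with no special query characters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: expands plain terms into per-field boosted clauses. */
final class FieldBoostExpander {
    /**
     * For each whitespace-separated term, emits a group such as
     * (title:term^4.0 contents:term^1.0). Terms are not escaped.
     */
    static String expand(String terms, Map<String, Float> fieldBoosts) {
        StringBuilder sb = new StringBuilder();
        for (String term : terms.trim().split("\\s+")) {
            if (term.isEmpty()) continue;          // empty input yields an empty query
            if (sb.length() > 0) sb.append(' ');
            sb.append('(');
            boolean first = true;
            for (Map.Entry<String, Float> e : fieldBoosts.entrySet()) {
                if (!first) sb.append(' ');
                sb.append(e.getKey()).append(':').append(term).append('^').append(e.getValue());
                first = false;
            }
            sb.append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Float> boosts = new LinkedHashMap<>();  // insertion order is kept
        boosts.put("title", 4.0f);
        boosts.put("contents", 1.0f);
        System.out.println(expand("lucene boost", boosts));
    }
}
```

With this shape a title match contributes more to the score than a body match, without concatenating the title into the contents field.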

Making an educated guess at what you're doing with Lucene, Dion, I'd 
venture to say that boosting at indexing time is sufficient for your 
needs.

	Erik





Re: Dates and others

2003-11-23 Thread Erik Hatcher
On Sunday, November 23, 2003, at 03:33  PM, Dion Almaer wrote:
This leads me to another issue actually.  On certain range queries I 
get exceptions:

Query: modifieddate:[1/1/03 TO 12/31/03]

org.apache.lucene.search.BooleanQuery$TooManyClauses

I'm guessing you're using Field.Keyword(String, Date) for modifieddate?
The date field stuff in Lucene is really a timestamp, and doing a
range query enumerates all the terms for that field in that range,
making a big ol' boolean OR query of all the individual ones.  Since
you want this to be just a date, use Field.Keyword(String, MMDD) 
instead.  But you'll want to subclass QueryParser and override 
getRangeQuery to do the right date format parsing from MM/DD/ 
into MMDD rather than the internal Date representation Lucene 
uses for date fields.
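
The conversion step could be sketched as below. The format strings in the message above appear truncated in this archive, so the sketch assumes the input looks like the query in the report (1/1/03, i.e. month/day/two-digit year) and that the indexed term is a fixed-width, year-first string such as yyyyMMdd, which sorts lexicographically in date order. The class and method names are hypothetical; in practice this logic would sit inside the overridden getRangeQuery.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper: turns a user-entered date into a day-granularity index term. */
final class DayTerm {
    private static final DateTimeFormatter INPUT = DateTimeFormatter.ofPattern("M/d/yy");
    private static final DateTimeFormatter TERM = DateTimeFormatter.ofPattern("yyyyMMdd");

    /** Parses e.g. "1/1/03" and returns the sortable term "20030101". */
    static String toTerm(String userDate) {
        return LocalDate.parse(userDate, INPUT).format(TERM);
    }

    public static void main(String[] args) {
        // Range endpoints from the failing query in this thread.
        System.out.println(toTerm("1/1/03") + " TO " + toTerm("12/31/03"));
    }
}
```

With one term per day, a one-year range expands to at most 366 clauses, whereas timestamp terms can produce one clause per distinct indexing time.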

	Erik



Re: Dates and others

2003-11-23 Thread Erik Hatcher
On Sunday, November 23, 2003, at 03:33  PM, Dion Almaer wrote:
2. +field:foo and the QueryParser:

   I ran into some problems where using +field:foo was giving strange
results.  When I changed the queries to ... AND field:foo everything
was fine.
   Am I missing something there?

Which version of Lucene are you using? There have been some
fixes in the query parser of Lucene 1.2, but I don't know
precisely which.

I am using 1.3 RC 2.  The AND workaround is fine... just caught me
by surprise.

If you run into the strange results again, try doing a 
Query.toString(field) to see what QueryParser thought of the query 
expression.  I'd be curious to know more about this issue.



Dates and others

2003-11-22 Thread Dion Almaer

Hi guys -

First off I want to just give the Lucene project credit for producing an API like this.  Truly great
stuff.

I was just wondering if anyone could share some wisdom on a couple of issues:

1. The power of dates:

   I am fairly happy with the results of queries on my index.  The only issue I have is that at the
moment the date of the content isn't considered (since lucene doesn't know about it).  Is there a
good way in which the date of the content could be used to help with the scoring?  So more recent
content shows up higher in the stack.  I have a date keyword field, but it isn't part of the query
itself.  Are there any patterns to help with this?

2. +field:foo and the QueryParser:

   I ran into some problems where using +field:foo was giving strange results.  When I changed the
queries to ... AND field:foo everything was fine.
   Am I missing something there?

3. I have some fields such as title, owner, etc as well as the content blob which I index and use as
the default search field.  Is there an easy way to extend the QueryParser to merge it with a
MultiTermQuery which can also search this meta data and give them certain weights?  Or, if you go
down this path do you have to leave the QueryParser behind and build your own queries?  Any best
practices would be great.

Sorry for bugging the list.

Cheers,

Dion


