Re: Lucene Scoring Behavior

2003-09-17 Thread Doug Cutting
Hmm.  This makes no sense to me.  Can you supply a reproducible 
standalone test case?

Doug

Terry Steichen wrote:
Doug,

(1) No, I did *not* boost the pub_date field, either in the indexing process
or in the query itself.
(2) And, each pub_date field of each document (which is in XML format)
contains only one instance of the date string.
(3) And only the pub_date field itself is indexed.  There are other
attributes of this field that may contain the date string, but they aren't
indexed - that is, they are not included in the instantiated Document class.
Regards,

Terry

- Original Message -
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 5:51 PM
Subject: Re: Lucene Scoring Behavior


Terry Steichen wrote:

 0.03125 = fieldNorm(field=pub_date, doc=90992)
 1.0 = fieldNorm(field=pub_date, doc=90970)
It looks like the fieldNorm's are what differ, not the IDFs.  These are
the product of the document and/or field boost, and 1/sqrt(numTerms)
where numTerms is the number of terms in the "pub_date" field of the
document.  Thus if each document is only assigned one date, and you
didn't boost the field or the document when you indexed it, this should
be 1.0.  But if the document has two dates, then this would be
1/sqrt(2).  Or if you boosted this document pub_date field, then this
will have whatever boost you provided.
So, did you boost anything when indexing?  Or could a single document
have two or more different values for pub_date?  Either would explain
this.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Lucene demo ideas?

2003-09-17 Thread Tatu Saloranta
On Wednesday 17 September 2003 07:07, Erik Hatcher wrote:
> On Wednesday, September 17, 2003, at 08:43  AM, Killeen, Tom wrote:
> > I would suggest XML as well.
>
> Again, I'd like to hear more about how you'd do this generically.  Tell
> me what the field names and values would correspond to when presented
> with an XML file.

Perhaps just one generic "content" field, which would contain tokenized
content from all XML segments. That could be done easily & efficiently
with just SAX event handling. Since it's a simple demo, you can't get much
simpler than that, but it should still be fairly useful.
Attributes could/should be ignored by default; common practice for XML markup
seems to be for attributes not to contain any content that would make sense to
index.

So I'd think just stripping out all tags (and comments, PIs etc) might be a
reasonably plain, simple approach for a demo app.
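
A minimal sketch of the single "content" field idea, assuming the JDK's built-in SAX parser. The class and method names (XmlContentExtractor, extractText) are hypothetical and not part of Lucene or its demo; the returned string is what one might hand to a single tokenized field. Attributes, comments and processing instructions are simply never appended.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Lucene): flattens an XML document into
// one string suitable for a single tokenized "content" field.
final class XmlContentExtractor {

    static String extractText(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Element text only; attributes arrive in startElement and are ignored.
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Separate adjacent elements so their words do not run together.
                text.append(' ');
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return text.toString().trim();
    }
}
```

Calling XmlContentExtractor.extractText on a small document returns only the element text, with the attribute value and the comment dropped.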

-+ Tatu +-





Re: Lucene Scoring Behavior

2003-09-17 Thread Terry Steichen
Doug,

(1) No, I did *not* boost the pub_date field, either in the indexing process
or in the query itself.

(2) And, each pub_date field of each document (which is in XML format)
contains only one instance of the date string.

(3) And only the pub_date field itself is indexed.  There are other
attributes of this field that may contain the date string, but they aren't
indexed - that is, they are not included in the instantiated Document class.

Regards,

Terry

- Original Message -
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 5:51 PM
Subject: Re: Lucene Scoring Behavior


> Terry Steichen wrote:
> >   0.03125 = fieldNorm(field=pub_date, doc=90992)
> >   1.0 = fieldNorm(field=pub_date, doc=90970)
>
> It looks like the fieldNorm's are what differ, not the IDFs.  These are
> the product of the document and/or field boost, and 1/sqrt(numTerms)
> where numTerms is the number of terms in the "pub_date" field of the
> document.  Thus if each document is only assigned one date, and you
> didn't boost the field or the document when you indexed it, this should
> be 1.0.  But if the document has two dates, then this would be
> 1/sqrt(2).  Or if you boosted this document pub_date field, then this
> will have whatever boost you provided.
>
> So, did you boost anything when indexing?  Or could a single document
> have two or more different values for pub_date?  Either would explain
> this.
>
> Doug
>





Re: Lucene demo ideas?

2003-09-17 Thread Marco Tedone
Yeah, that would be great!
- Original Message - 
From: "Jeff Linwood" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 5:15 PM
Subject: Re: Lucene demo ideas?


> Paging would be great for the results.
>
> Jeff
> - Original Message - 
> From: "Erik Hatcher" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Wednesday, September 17, 2003 7:00 AM
> Subject: Lucene demo ideas?
>
>
> > I'm about to start some refactorings on the web application demo that
> > ships with Lucene to show off its features and be usable more easily
> > and cleanly out of the box - i.e. just drop into Tomcat's webapps
> > directory and go.
> >
> > Does anyone have any suggestions on what they'd like to see in the demo
> > app?  Some of my ideas are:
> >
> > - Eliminate the need to do a command-line indexing, let the web app do
> > this upon command, allowing you to specify where the index lives (there
> > will be a reasonable default like ~/lucenedemo/index perhaps) and what
> > directory tree to index (perhaps defaulting to the root directory or
> > c:\, or where instead?)
> >
> > - Spin off a background indexing thread so the web app searching is
> > immediately useful after kicking off the indexing process, and allow a
> > status view of the indexing progress.
> >
> > - Index text and HTML files.  Any others?  I don't want to get into
> > putting too many dependencies in though - let's keep it relatively
> > simple, although still demonstrative.  Allow search filtering by last
> > modified date range and document type (extension).
> >
> > - Perhaps allow you to specify the analyzer to use when indexing.
> >
> > - Show the explanation of how scores are computed in the search results
> > as an option.
> >
> > I'm all ears to possibilities of improvements!  Send your wishlist.
> >
> > Erik







Re: Lucene demo ideas?

2003-09-17 Thread Marco Tedone
I would have the code ready if wanted...
- Original Message - 
From: "Pitre, Russell" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 2:21 PM
Subject: RE: Lucene demo ideas?


I know this may be far-fetched, but how about being able to index
.jsp's?  I know this is a spindle thing, but it seems a lot of people
need this functionality.


My suggestion

Russ

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 17, 2003 8:01 AM
To: [EMAIL PROTECTED]
Subject: Lucene demo ideas?

I'm about to start some refactorings on the web application demo that 
ships with Lucene to show off its features and be usable more easily 
and cleanly out of the box - i.e. just drop into Tomcat's webapps 
directory and go.

Does anyone have any suggestions on what they'd like to see in the demo 
app?  Some of my ideas are:

- Eliminate the need to do a command-line indexing, let the web app do 
this upon command, allowing you to specify where the index lives (there 
will be a reasonable default like ~/lucenedemo/index perhaps) and what 
directory tree to index (perhaps defaulting to the root directory or 
c:\, or where instead?)

- Spin off a background indexing thread so the web app searching is 
immediately useful after kicking off the indexing process, and allow a 
status view of the indexing progress.

- Index text and HTML files.  Any others?  I don't want to get into 
putting too many dependencies in though - let's keep it relatively 
simple, although still demonstrative.  Allow search filtering by last 
modified date range and document type (extension).

- Perhaps allow you to specify the analyzer to use when indexing.

- Show the explanation of how scores are computed in the search results 
as an option.

I'm all ears to possibilities of improvements!  Send your wishlist.

Erik





Re: Lucene Scoring Behavior

2003-09-17 Thread Doug Cutting
Terry Steichen wrote:
  0.03125 = fieldNorm(field=pub_date, doc=90992)
  1.0 = fieldNorm(field=pub_date, doc=90970)
It looks like the fieldNorm's are what differ, not the IDFs.  These are 
the product of the document and/or field boost, and 1/sqrt(numTerms) 
where numTerms is the number of terms in the "pub_date" field of the 
document.  Thus if each document is only assigned one date, and you 
didn't boost the field or the document when you indexed it, this should 
be 1.0.  But if the document has two dates, then this would be 
1/sqrt(2).  Or if you boosted this document pub_date field, then this 
will have whatever boost you provided.

So, did you boost anything when indexing?  Or could a single document 
have two or more different values for pub_date?  Either would explain this.

Doug
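
The formula in the message above can be sketched as a stand-alone calculation. This is a model of the arithmetic only, not a call into Lucene; the class and method names are hypothetical, and "boost" stands for the combined document and field boost.

```java
// Hypothetical stand-alone model of the formula described above; it does not call Lucene.
final class FieldNormSketch {
    // boost: product of document and field boosts (1.0 when nothing was boosted).
    // numTerms: number of terms indexed in the field for this document.
    static float fieldNorm(float boost, int numTerms) {
        return boost * (float) (1.0 / Math.sqrt(numTerms));
    }
}
```

With no boost, one term gives 1.0 and two terms give 1/sqrt(2), roughly 0.707. Neither case produces 0.03125, which is why the question about boosts and repeated values matters.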



Re: Lucene Scoring Behavior

2003-09-17 Thread Terry Steichen
Doug/Erik,

I do use RangeQuery to get a range of dates, but in this case I'm just
getting a single date (string), so I believe it's just a regular query I'm
using.

Per Erik's suggestion, I checked out the Explanation for some of these
anomalies.  I've included a condensation of the data it generated below
(which I frankly don't understand).  Perhaps that will give you or
Erik some insight into what's happening?

Regards,

Terry

PS: I note that the 'docFreq' parameters displayed below correspond exactly
to the number of hits for the query.  Also, here's the Similarity class I'm
using (per an earlier suggestion of Doug):

public class WESimilarity2 extends org.apache.lucene.search.DefaultSimilarity {

  // Treat every field as if it had at least 750 terms, and weight the
  // headline and summary fields four times as heavily as the others.
  public float lengthNorm(String fieldName, int numTerms) {
    float norm = super.lengthNorm(fieldName, Math.max(numTerms, 750));
    if (fieldName.equals("headline") || fieldName.equals("summary")
        || fieldName.equals("ssummary")) {
      return 4.0f * norm;
    }
    return norm;
  }
}
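
For what it's worth, here is a stand-alone sketch of what the 750-term floor does to a short field, assuming the default length norm is 1/sqrt(numTerms) as described elsewhere in this thread. The class below is hypothetical and does not touch Lucene.

```java
// Hypothetical model of the lengthNorm override above, without Lucene.
final class LengthNormFloorSketch {
    static final int MIN_TERMS = 750; // floor taken from the class above

    // Assumed default length norm: 1/sqrt(numTerms).
    static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    static float flooredLengthNorm(String fieldName, int numTerms) {
        float norm = defaultLengthNorm(Math.max(numTerms, MIN_TERMS));
        boolean boosted = fieldName.equals("headline")
                || fieldName.equals("summary")
                || fieldName.equals("ssummary");
        return boosted ? 4.0f * norm : norm;
    }
}
```

Under that assumption a one-term pub_date field gets about 0.0365 instead of 1.0. Lucene stores each norm in a single byte, so the stored value is only approximate; whether that accounts for the 0.03125 values is a guess that this thread does not confirm.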




Query #1: pub_date:20030917
All items: Score: .23000652
0.23000652 = weight(pub_date:20030917 in 91197), product of:
  0.99999994 = queryWeight(pub_date:20030917), product of:
    7.360209 = idf(docFreq=157)
    0.1358657 = queryNorm
  0.23000653 = fieldWeight(pub_date:20030917 in 91197), product of:
    1.0 = tf(termFreq(pub_date:20030917)=1)
    7.360209 = idf(docFreq=157)
    0.03125 = fieldNorm(field=pub_date, doc=91197)

Query #2: pub_date:20030916
All items: Score: .22295427
0.22295427 = fieldWeight(pub_date:20030916 in 90992), product of:
  1.0 = tf(termFreq(pub_date:20030916)=1)
  7.1345367 = idf(docFreq=197)
  0.03125 = fieldNorm(field=pub_date, doc=90992)


Query #3: pub_date:20030915
Items 1&2: Score: 1.0
7.2580175 = weight(pub_date:20030915 in 90970), product of:
  0.99999994 = queryWeight(pub_date:20030915), product of:
    7.258018 = idf(docFreq=174)
    0.13777865 = queryNorm
  7.258018 = fieldWeight(pub_date:20030915 in 90970), product of:
    1.0 = tf(termFreq(pub_date:20030915)=1)
    7.258018 = idf(docFreq=174)
    1.0 = fieldNorm(field=pub_date, doc=90970)

Query #3 (same as above): pub_date:20030915
Other items: Score: .03125
0.22681305 = weight(pub_date:20030915 in 90826), product of:
  0.99999994 = queryWeight(pub_date:20030915), product of:
    7.258018 = idf(docFreq=174)
    0.13777865 = queryNorm
  0.22681306 = fieldWeight(pub_date:20030915 in 90826), product of:
    1.0 = tf(termFreq(pub_date:20030915)=1)
    7.258018 = idf(docFreq=174)
    0.03125 = fieldNorm(field=pub_date, doc=90826)

Query #4: pub_date:20030914
0.21384604 = weight(pub_date:20030914 in 90417), product of:
  0.99999994 = queryWeight(pub_date:20030914), product of:
    6.843074 = idf(docFreq=264)
    0.14613315 = queryNorm
  0.21384606 = fieldWeight(pub_date:20030914 in 90417), product of:
    1.0 = tf(termFreq(pub_date:20030914)=1)
    6.843074 = idf(docFreq=264)
    0.03125 = fieldNorm(field=pub_date, doc=90417)

Query #5: pub_date:20030913
Items 1&2: Score: 1.0
7.366558 = fieldWeight(pub_date:20030913 in 90591), product of:
  1.0 = tf(termFreq(pub_date:20030913)=1)
  7.366558 = idf(docFreq=156)
  1.0 = fieldNorm(field=pub_date, doc=90591)

Query #5 (same as above): pub_date:20030913
Other items: Score: .03125
0.23020494 = fieldWeight(pub_date:20030913 in 90383), product of:
  1.0 = tf(termFreq(pub_date:20030913)=1)
  7.366558 = idf(docFreq=156)
  0.03125 = fieldNorm(field=pub_date, doc=90383)
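
The displayed scores of 1.0 and .03125 are consistent with the raw weights above if the hit list rescales scores only when the top raw score exceeds 1.0, as guessed in the original post. A minimal sketch of that assumption follows; the class name is hypothetical and nothing here calls Lucene.

```java
// Hypothetical model of hit-score normalization; the rule itself is an assumption.
final class ScoreNormalizationSketch {
    // Scale scores so the top hit is 1.0, but only when the top raw score exceeds 1.0.
    static float[] normalize(float[] rawScores) {
        float max = 0.0f;
        for (float s : rawScores) {
            max = Math.max(max, s);
        }
        float scale = max > 1.0f ? 1.0f / max : 1.0f;
        float[] out = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            out[i] = rawScores[i] * scale;
        }
        return out;
    }
}
```

Feeding in the two raw weights from Query #3 (7.2580175 and 0.22681305) gives about 1.0 and 0.03125, since 0.22681305 / 7.2580175 is almost exactly 1/32. A list whose top raw score is below 1.0, such as Query #1, comes back unchanged.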


- Original Message -
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 4:55 PM
Subject: Re: Lucene Scoring Behavior


> If you're using RangeQuery to do date searching, then you'll likely see
> unusual scoring.  The IDF of a date, like any other term, is inversely
> related to the number of documents with that date.  So documents whose
> dates are rare will score higher, which is probably not what you intend.
>
> Using a Filter for date searching is one way to remove dates from the
> scoring calculation.  Another is to provide a Similarity implementation
> that gives an IDF of 1.0 for terms from your date field, e.g., something
> like:
>
> public class MySimilarity extends DefaultSimilarity {
>   public float idf(Term term, Searcher searcher) throws IOException {
>     if ("date".equals(term.field())) {
>       return 1.0f;
>     } else {
>       return super.idf(term, searcher);
>     }
>   }
> }
>
> Or you could just give date clauses of your query a very small boost
> (e.g., .0001) so that other clauses dominate the scoring.
>
> Doug
>
> Terry Steichen wrote:
> > I've run across some puzzling behavior regarding scoring.  I have a set
> > of documents which contain, among others, a date field

Re: Lucene Scoring Behavior

2003-09-17 Thread Doug Cutting
If you're using RangeQuery to do date searching, then you'll likely see 
unusual scoring.  The IDF of a date, like any other term, is inversely 
related to the number of documents with that date.  So documents whose 
dates are rare will score higher, which is probably not what you intend.
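
To make the point about rare dates concrete, here is a small sketch. It assumes the default IDF is ln(numDocs / (docFreq + 1)) + 1, which is the formula commonly cited for DefaultSimilarity. The index size used in the note below (91,382 documents) is not stated anywhere in this thread; it is a back-calculated guess, and the class name is hypothetical.

```java
// Hypothetical model of the default IDF; the formula is an assumption stated in the lead-in.
final class IdfSketch {
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }
}
```

With the guessed index size, a date found in 157 documents gets an IDF of about 7.36, while one found in 264 documents gets about 6.84, so the rarer date scores higher.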

Using a Filter for date searching is one way to remove dates from the 
scoring calculation.  Another is to provide a Similarity implementation 
that gives an IDF of 1.0 for terms from your date field, e.g., something 
like:

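import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Searcher;

public class MySimilarity extends DefaultSimilarity {
  public float idf(Term term, Searcher searcher) throws IOException {
    // Give every term in the date field the same IDF so dates do not affect ranking.
    if ("date".equals(term.field())) {
      return 1.0f;
    } else {
      return super.idf(term, searcher);
    }
  }
}

Or you could just give date clauses of your query a very small boost
(e.g., .0001) so that other clauses dominate the scoring.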

Doug

Terry Steichen wrote:
I've run across some puzzling behavior regarding scoring.  I have a set of documents which contain, among others, a date field (whose content is a string in the YYYYMMDD format).  When I query on the date 20030917 (that is, today), I get 157 hits, all of which have a score of .23000652.  If I use 20030916 (yesterday), I get 197 hits, each of which has a score of .22295427.

So far, all seems logical.  However, when I search for all records for the date 20030915, the first two (of 174 hits) have a score of 1.0, while all the rest of the hits have a score of .03125.  Here is a tabulation of these and a few more queries:

Query Date  Result (hit count)
==========  ==================
20030917    all have a score of .23000652 (157)
20030916    all have a score of .22295427 (197)
20030915    first 2 have a 1.0 score, all rest are .03125 (174)
20030914    all have a score of .21384604 (264)
20030913    first 2 have a 1.0 score, all rest are .03125 (156)
20030912    all have a score of .2166833 (241)
20030911    first 3 have a 1.0 score, all rest are .03125 (244)
20030910    all have a score of .2208193 (211)

I would expect that all the hits would have the same score, and I would expect it to be normalized to 1 (unless, I guess, the top score was less than 1, in which case normalization presumably doesn't occur).

Does anyone have any ideas as to what might be going on here?  (I'm using the latest CVS sources, obtained this afternoon.)

Regards,

Terry





Re: Lucene Scoring Behavior

2003-09-17 Thread Erik Hatcher
Try using IndexSearcher.explain and dump out the contents of what it 
returns either as toString or toHtml (whichever format suits your 
environment best) and see what it has to say.  It'll give you the 
low-down on the factors involved in the score calculation.  I'm 
interested to see what you come up with.

	Erik

On Wednesday, September 17, 2003, at 03:33  PM, Terry Steichen wrote:

I've run across some puzzling behavior regarding scoring.  I have a
set of documents which contain, among others, a date field (whose
content is a string in the YYYYMMDD format).  When I query on the
date 20030917 (that is, today), I get 157 hits, all of which have a
score of .23000652.  If I use 20030916 (yesterday), I get 197 hits,
each of which has a score of .22295427.

So far, all seems logical.  However, when I search for all records for 
the date 20030915, the first two (of 174 hits) have a score of 1.0, 
while all the rest of the hits have a score of .03125.  Here is a 
tabulation of these and a few more queries:

Query Date  Result (hit count)
==========  ==================
20030917    all have a score of .23000652 (157)
20030916    all have a score of .22295427 (197)
20030915    first 2 have a 1.0 score, all rest are .03125 (174)
20030914    all have a score of .21384604 (264)
20030913    first 2 have a 1.0 score, all rest are .03125 (156)
20030912    all have a score of .2166833 (241)
20030911    first 3 have a 1.0 score, all rest are .03125 (244)
20030910    all have a score of .2208193 (211)

I would expect that all the hits would have the same score, and I
would expect it to be normalized to 1 (unless, I guess, the top score
was less than 1, in which case normalization presumably doesn't occur).

Does anyone have any ideas as to what might be going on here?  (I'm 
using the latest CVS sources, obtained this afternoon.)

Regards,

Terry




Lucene Scoring Behavior

2003-09-17 Thread Terry Steichen
I've run across some puzzling behavior regarding scoring.  I have a set of documents
which contain, among others, a date field (whose content is a string in the YYYYMMDD
format).  When I query on the date 20030917 (that is, today), I get 157 hits, all of
which have a score of .23000652.  If I use 20030916 (yesterday), I get 197 hits, each
of which has a score of .22295427.

So far, all seems logical.  However, when I search for all records for the date 
20030915, the first two (of 174 hits) have a score of 1.0, while all the rest of the 
hits have a score of .03125.  Here is a tabulation of these and a few more queries:

Query Date  Result (hit count)
==========  ==================
20030917    all have a score of .23000652 (157)
20030916    all have a score of .22295427 (197)
20030915    first 2 have a 1.0 score, all rest are .03125 (174)
20030914    all have a score of .21384604 (264)
20030913    first 2 have a 1.0 score, all rest are .03125 (156)
20030912    all have a score of .2166833 (241)
20030911    first 3 have a 1.0 score, all rest are .03125 (244)
20030910    all have a score of .2208193 (211)

I would expect that all the hits would have the same score, and I would expect it to 
be normalized to 1 (unless, I guess, the top score was less than 1, in which case 
normalization presumably doesn't occur).  

Does anyone have any ideas as to what might be going on here?  (I'm using the latest 
CVS sources, obtained this afternoon.)

Regards,

Terry


Re: slow performance with Date Range Searching

2003-09-17 Thread Erik Hatcher
And with the latest Lucene codebase in CVS, you could also use a 
DateFilter wrapped inside a CachingWrapperFilter instead of a 
QueryFilter.  Just wanted to mention what is now available.

But I'll reiterate what Doug says... be sure to save off the filter 
instance so you don't take the filtering performance hit for the same 
date ranges on later searches.

	Erik

On Wednesday, September 17, 2003, at 11:57  AM, Doug Cutting wrote:

Killeen, Tom wrote:
My query would look something like this: LongTitle:killeen AND
LongTitle:state AND StateDistrict:id AND FiledDate:["1997-01-01" TO
"2002-04-04"] and it returned in 5.7 seconds
Does anyone have any suggestions for searching date ranges.  Our 
ranges will
generally be between a 3 - 7 year period.
If you use the same date range repeatedly then you can make things 
fast by replacing it with a filter that you re-use.  Try using a 
QueryFilter with your date range query as the query.  Save the 
QueryFilter object and use it again with future queries.  The first 
query will be slow, but subsequent queries will use the cached results 
of the first.

This is the recommended way to implement a "within the last month", 
"within the last year", etc. feature.  You could even pre-fetch the 
filter whenever you update the index, by evaluating a query before you 
put a new IndexReader in production, so that even the first real user 
query is fast.

Doug



Re: Lucene demo ideas?

2003-09-17 Thread Jeff Linwood
Paging would be great for the results.

Jeff
- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 7:00 AM
Subject: Lucene demo ideas?


> I'm about to start some refactorings on the web application demo that 
> ships with Lucene to show off its features and be usable more easily 
> and cleanly out of the box - i.e. just drop into Tomcat's webapps 
> directory and go.
> 
> Does anyone have any suggestions on what they'd like to see in the demo 
> app?  Some of my ideas are:
> 
> - Eliminate the need to do a command-line indexing, let the web app do 
> this upon command, allowing you to specify where the index lives (there 
> will be a reasonable default like ~/lucenedemo/index perhaps) and what 
> directory tree to index (perhaps defaulting to the root directory or 
> c:\, or where instead?)
> 
> - Spin off a background indexing thread so the web app searching is 
> immediately useful after kicking off the indexing process, and allow a 
> status view of the indexing progress.
> 
> - Index text and HTML files.  Any others?  I don't want to get into 
> putting too many dependencies in though - let's keep it relatively 
> simple, although still demonstrative.  Allow search filtering by last 
> modified date range and document type (extension).
> 
> - Perhaps allow you to specify the analyzer to use when indexing.
> 
> - Show the explanation of how scores are computed in the search results 
> as an option.
> 
> I'm all ears to possibilities of improvements!  Send your wishlist.
> 
> Erik



Re: slow performance with Date Range Searching

2003-09-17 Thread Doug Cutting
Killeen, Tom wrote:
> My query would look something like this: LongTitle:killeen AND
> LongTitle:state AND StateDistrict:id AND FiledDate:["1997-01-01" TO
> "2002-04-04"] and it returned in 5.7 seconds
>
> Does anyone have any suggestions for searching date ranges.  Our ranges will
> generally be between a 3 - 7 year period.

If you use the same date range repeatedly then you can make things fast
by replacing it with a filter that you re-use.  Try using a QueryFilter 
with your date range query as the query.  Save the QueryFilter object 
and use it again with future queries.  The first query will be slow, but 
subsequent queries will use the cached results of the first.

This is the recommended way to implement a "within the last month", 
"within the last year", etc. feature.  You could even pre-fetch the 
filter whenever you update the index, by evaluating a query before you 
put a new IndexReader in production, so that even the first real user 
query is fast.
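
The "compute once, re-use afterwards" idea can be sketched without Lucene as a small cache around a bit set. Everything named here (CachedBitsFilter, bits, apply) is hypothetical; the Supplier stands in for whatever expensive date-range evaluation produces the set of matching document ids.

```java
import java.util.BitSet;
import java.util.function.Supplier;

// Hypothetical model of "compute the date-range bits once, then re-use them".
final class CachedBitsFilter {
    private final Supplier<BitSet> expensiveRangeQuery;
    private BitSet cached; // filled by the first call to bits()

    CachedBitsFilter(Supplier<BitSet> expensiveRangeQuery) {
        this.expensiveRangeQuery = expensiveRangeQuery;
    }

    synchronized BitSet bits() {
        if (cached == null) {
            cached = expensiveRangeQuery.get(); // slow path, taken once
        }
        return cached;
    }

    // Keep only the hits that fall inside the cached range.
    BitSet apply(BitSet queryHits) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(bits());
        return result;
    }
}
```

Calling bits() right after an index update would play the role of the pre-fetch described above. In this sketch a cached instance is only valid for the index state it was built against, so it would have to be discarded when the index changes.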

Doug



Re: Lucene demo ideas?

2003-09-17 Thread Erik Hatcher
On Wednesday, September 17, 2003, at 09:21  AM, Pitre, Russell wrote:
> I know this may be far-fetched, but how about being able to index
> .jsp's?  I know this is a spindle thing, but it seems a lot of people
> need this functionality.

Like I communicated in a previous thread, indexing JSP's just has a
"smell" to it for me.  I can't argue with the pragmatic way others have 
done it by crawling, but I don't think of JSP's as "content" and I'd 
rather index actual content, that may or may not be later presented 
within a JSP.

	Erik



Re: Lucene demo ideas?

2003-09-17 Thread Erik Hatcher
On Wednesday, September 17, 2003, at 09:31  AM, Bryan LaPlante wrote:
> I would like to see the taglib for searching the index in the demo.  There is
> an html form page and result page already built for the taglib that allows
> you to change search params and demonstrates a fair amount of the search
> capability of Lucene.

Bryan, no offense... but I won't be using the taglib in the demo.  I
just don't feel accessing a Lucene index via a taglib is the right way 
to do things.  Coupling an index to JSP in that manner is too tight for 
my tastes.  What happens if you want to use Velocity for presentation?  
Or a Swing app?  See what I mean?

	Erik



Re: Lucene demo ideas?

2003-09-17 Thread Andrzej Bialecki
Erik Hatcher wrote:
On Wednesday, September 17, 2003, at 08:42  AM, Ben Litchfield wrote:

What, no PDF files!!


Haha!

http://www.pdfbox.org


And I've used pdfbox before - it's cool.

And I'm cool with adding PDF and Word indexing to the demo personally, 
but I didn't want to increase the "weight" of the demo application.  If 
folks feel strongly about it then I'll incorporate it.
A word of warning: PDFBox is fantastic, I agree - but some PDFs are not 
so... In my application I experienced numerous hangs when PDFBox would 
start parsing some PDFs (I can send the files to Ben if required), and 
then got stuck in an infinite wait somewhere... So I came up with a 
workaround: I run the parser in a separate thread, while waiting in the 
main thread, and then after a certain timeout I kill the processing 
thread and return.
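
A rough sketch of that kind of workaround using java.util.concurrent. The parser is passed in as a generic Callable, so no PDFBox API is assumed, and the names TimedParse and parseWithTimeout are hypothetical. Note that this version interrupts the worker rather than forcibly killing it, which is a different technique from the one described above; a parser that never checks for interruption would keep running on its daemon thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: run a possibly hanging parser with an upper time limit.
final class TimedParse {
    // Runs the parser on a worker thread; returns null if it fails or exceeds the timeout.
    static String parseWithTimeout(Callable<String> parser, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parser");
            t.setDaemon(true); // a stuck parser must not keep the JVM alive
            return t;
        });
        Future<String> result = worker.submit(parser);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } finally {
            result.cancel(true); // interrupts the worker if it is still running
            worker.shutdownNow();
        }
    }
}
```

The caller treats a null result as "skip this file", which keeps one bad PDF from stalling a whole indexing run.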

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)




Re: Lucene demo ideas?

2003-09-17 Thread hui
I think all the attribute values together with element text values should be
indexed in the "content" part. Also, an XML map file could be used to pick out
the nodes that need to be indexed separately, so we do not create too many
fields by indexing non-critical nodes separately. Simple XPath could be used
for the map source; the field name and index type should be the map target.

Regards,
Hui

- Original Message - 
From: "Robert Koberg" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 10:09 AM
Subject: RE: Lucene demo ideas?


> Hi,
>
> Here are a couple of ideas for XML demos:
>
> 1. simply index the content into one 'content' field. Don't worry about
> attributes.
>
> 2. index a linked Dublin core meta data file:
> 
> And add fields for every element after rdf:Description
>
> Best,
> -Rob
>
>
>
> > -Original Message-
> > From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, September 17, 2003 6:08 AM
> > To: Lucene Users List
> >
> > On Wednesday, September 17, 2003, at 08:43  AM, Killeen, Tom wrote:
> > > I would suggest XML as well.
> >
> > Again, I'd like to hear more about how you'd do this generically.  Tell
> > me what the field names and values would correspond to when presented
> > with an XML file.
> >
> > Erik



Re: Lucene demo ideas?

2003-09-17 Thread Eric Jain
> Does anyone have any suggestions on what they'd like to see in the
> demo app?

Show how Lucene 1) can do incremental indexing, 2) isn't restricted to
indexing file system resources and 3) can store and query arbitrary
fields. These are in my opinion the features where most other search
engines fall flat.

--
Eric Jain





Re: slow performance with Date Range Searching

2003-09-17 Thread Eric Jain
> Does anyone have any suggestions for searching date ranges.  Our
> ranges will generally be between a 3 - 7 year period.

Apparently Lucene expands ranges to boolean 'or' queries. So if you have
a thousand distinct dates within a range, Lucene will build a query with
a thousand terms...

One workaround is to first run the query, but without ranges. Store the
results into a bit vector with the position corresponding to the
document id. Then create a TermEnum that starts with the lower range
value. For each term, get the document ids, and set the corresponding
values in a second bit vector. Break the loop as soon as the TermEnum
has reached or passed the upper limit of the range. Finally, 'and' or
'or' the second bit vector with the first one. It's as simple as that
:-)
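
The workaround above can be sketched with plain JDK collections. This is an in-memory model, not Lucene code: a sorted TreeMap stands in for the term dictionary of the date field, tailMap stands in for a term enumeration positioned at the lower bound, and all names are hypothetical. The upper bound is treated as inclusive here.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model: a sorted term dictionary for one field,
// mapping each term to the ids of the documents that contain it.
final class RangeBitsSketch {

    static BitSet rangeBits(TreeMap<String, int[]> termDocs, String lower, String upper) {
        BitSet bits = new BitSet();
        // tailMap plays the role of a term enumeration positioned at the lower bound.
        for (Map.Entry<String, int[]> e : termDocs.tailMap(lower, true).entrySet()) {
            if (e.getKey().compareTo(upper) > 0) {
                break; // passed the upper limit of the range
            }
            for (int docId : e.getValue()) {
                bits.set(docId);
            }
        }
        return bits;
    }

    // Intersect the hits of the range-free query with the range bits.
    static BitSet restrict(BitSet queryHits, BitSet rangeBits) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(rangeBits);
        return result;
    }
}
```

The cost of the loop grows with the number of distinct terms in the range, but the query itself never has to hold a thousand clauses.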

I wonder why Lucene doesn't use this strategy by default. I realize it
is less efficient when the range includes few terms, but it seems to
scale far better.

--
Eric Jain





wildcard search and german umlauts

2003-09-17 Thread Hackl, Rene
Hi All,

Has anyone ever written an extension of QueryParser providing the
possibility to let wildcard search terms be run through an analyzer (as
suggested by Tatu Saloranta a while ago)? I want to reduce German umlauts to
their base letters (e.g. 'ä' to 'a'), and for non-wildcard queries
my UmlautFilter does its job. Or is there any other option?
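
As an illustration of the folding step only, here is a small sketch that maps umlauts to base letters while leaving wildcard characters alone. The class UmlautFolder is hypothetical and unrelated to the UmlautFilter mentioned above; how to hook such a step into QueryParser's wildcard handling is not shown, since that depends on the parser version.

```java
// Hypothetical helper: folds German umlauts in a (possibly wildcarded) term.
final class UmlautFolder {
    // Replace umlauts with their base letters; every other character,
    // including the wildcards '*' and '?', passes through unchanged.
    static String fold(String term) {
        StringBuilder out = new StringBuilder(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            switch (c) {
                case '\u00e4': out.append('a'); break; // a-umlaut
                case '\u00f6': out.append('o'); break; // o-umlaut
                case '\u00fc': out.append('u'); break; // u-umlaut
                case '\u00c4': out.append('A'); break; // A-umlaut
                case '\u00d6': out.append('O'); break; // O-umlaut
                case '\u00dc': out.append('U'); break; // U-umlaut
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```

Applied to a wildcard term before it is turned into a query, this would let the term match what the indexing-side filter produced, assuming both sides fold the same characters.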

Many thanks,
René




RE: slow performance with Date Range Searching

2003-09-17 Thread Dan Quaroni
I don't know how Lucene handles date ranges, but I was having very poor
results using booleans between different fields because of the way Lucene
handles them.  What Lucene does is evaluate each field in the query
separately and retrieve all of the results, then evaluate the boolean
joins between the different fields.

So I believe the way lucene is handling the query is:

Get all the documents whose LongTitle has killeen in them
Get all the documents whose LongTitle has state in them
Get all the documents whose StateDistrict has id in them
Get all the documents filed between 1997-01-01 and 2002-04-04

(This, incidently, takes up a huge amount of memory)

Finally, it evaluates the booleans figures out which documents satisfy all
of your criteria and returns that to you.


I'm working on a matching engine that takes company information like their
name and address and finds our record for that company.  I ended up making a
separate index for every state and country because it was running too slow
and running out of memory when I was using booleans between fields. 

Maybe you could do something similar with your dates.  (i.e. one index per
year)
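A rough sketch of how a query front end could pick per-year indices for a date range, assuming dates are ISO strings (yyyy-mm-dd) and indices are named index-YYYY. Both the naming scheme and the function are hypothetical. Years that lie entirely inside the range need no date clause at all; only the boundary years do.

```python
def plan_yearly_search(start_date, end_date, available_years):
    """Return (index_name, needs_date_clause) for each year the range touches."""
    first, last = int(start_date[:4]), int(end_date[:4])
    plan = []
    for year in sorted(available_years):
        if first <= year <= last:
            # ISO date strings compare correctly as plain strings.
            fully_covered = (start_date <= "%04d-01-01" % year
                             and end_date >= "%04d-12-31" % year)
            plan.append(("index-%d" % year, not fully_covered))
    return plan


plan = plan_yearly_search("1997-01-01", "2002-04-04", range(1995, 2004))
for name, needs_date_clause in plan:
    print(name, needs_date_clause)
```

For the range in the original query, only the 2002 index would still need a date restriction; 1997 through 2001 could be searched on the title and state terms alone.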


-Original Message-
From: Killeen, Tom [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 17, 2003 10:01 AM
To: 'Lucene Users List'
Subject: slow performance with Date Range Searching


Hello all, 

I have recently indexed approx 15.8 million XML documents in which I index
the contents of certain elements (titles, states, dates to name a few).  I have
27 separate indices and use a MultiSearcher to search these indices.

When I search on the title and state fields with multiple terms searching is
very fast.  For example I get a hit count of 227,000 in 0.4 seconds.  But
when I throw a date range in the search, performance suffers significantly.

My query would look something like this: LongTitle:killeen AND
LongTitle:state AND StateDistrict:id AND FiledDate:["1997-01-01" TO
"2002-04-04"] and it returned in 5.7 seconds.

Does anyone have any suggestions for searching date ranges?  Our ranges will
generally span a 3 to 7 year period.

thanks, 
Tom




RE: Lucene demo ideas?

2003-09-17 Thread Robert Koberg
Hi,

Here are a couple of ideas for XML demos:

1. simply index the content into one 'content' field. Don't worry about
attributes.

2. index a linked Dublin Core metadata file:

And add fields for every element after rdf:Description
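Both ideas can be sketched with a small parser, shown here in Python with the standard library only. The sample RDF document is invented for illustration, and the returned dictionary merely stands in for the fields of a Lucene Document; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample; a real Dublin Core file would come from disk.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/doc1">
    <dc:title>Sample title</dc:title>
    <dc:creator>A. Author</dc:creator>
    <dc:date>2003-09-17</dc:date>
  </rdf:Description>
</rdf:RDF>"""

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def content_field(xml_text):
    """Idea 1: all element text in one 'content' value, attributes ignored."""
    root = ET.fromstring(xml_text)
    return " ".join(t.strip() for t in root.itertext() if t.strip())


def dublin_core_fields(xml_text):
    """Idea 2: one field per child element of rdf:Description."""
    root = ET.fromstring(xml_text)
    fields = {}
    for description in root.iter("{%s}Description" % RDF_NS):
        for child in description:
            name = child.tag.split("}", 1)[-1]  # drop the namespace part
            fields.setdefault(name, []).append((child.text or "").strip())
    return fields


print(content_field(SAMPLE))
print(dublin_core_fields(SAMPLE))
```

Here the field name is the element's local name and the field value is its text, which is one possible answer to the question of what names and values would correspond to.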

Best,
-Rob



> -Original Message-
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 17, 2003 6:08 AM
> To: Lucene Users List
> 
> On Wednesday, September 17, 2003, at 08:43  AM, Killeen, Tom wrote:
> > I would suggest XML as well.
> 
> Again, I'd like to hear more about how you'd do this generically.  Tell
> me what the field names and values would correspond to when presented
> with an XML file.
> 
>   Erik
> 
> 





slow performance with Date Range Searching

2003-09-17 Thread Killeen, Tom
Hello all, 

I have recently indexed approx 15.8 million XML documents in which I index
the contents of certain elements (titles, states, dates to name a few).  I have
27 separate indices and use a MultiSearcher to search these indices.

When I search on the title and state fields with multiple terms searching is
very fast.  For example I get a hit count of 227,000 in 0.4 seconds.  But
when I throw a date range in the search, performance suffers significantly.

My query would look something like this: LongTitle:killeen AND
LongTitle:state AND StateDistrict:id AND FiledDate:["1997-01-01" TO
"2002-04-04"] and it returned in 5.7 seconds.

Does anyone have any suggestions for searching date ranges?  Our ranges will
generally span a 3 to 7 year period.

thanks, 
Tom




RE: Lucene demo ideas?

2003-09-17 Thread Pitre, Russell
I know this may be far-fetched, but how about being able to index
.jsp's?  I know this is a Spindle thing, but it seems a lot of people
need this functionality.


My suggestion

Russ

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 17, 2003 8:01 AM
To: [EMAIL PROTECTED]
Subject: Lucene demo ideas?

I'm about to start some refactorings on the web application demo that 
ships with Lucene to show off its features and be usable more easily 
and cleanly out of the box - i.e. just drop into Tomcat's webapps 
directory and go.

Does anyone have any suggestions on what they'd like to see in the demo 
app?  Some of my ideas are:

- Eliminate the need to do a command-line indexing, let the web app do 
this upon command, allowing you to specify where the index lives (there 
will be a reasonable default like ~/lucenedemo/index perhaps) and what 
directory tree to index (perhaps defaulting to the root directory or 
c:\, or where instead?)

- Spin off a background indexing thread so the web app searching is 
immediately useful after kicking off the indexing process, and allow a 
status view of the indexing progress.

- Index text and HTML files.  Any others?  I don't want to get into 
putting too many dependencies in though - let's keep it relatively 
simple, although still demonstrative.  Allow search filtering by last 
modified date range and document type (extension).

- Perhaps allow you to specify the analyzer to use when indexing.

- Show the explanation of how scores are computed in the search results 
as an option.

I'm all ears to possibilities of improvements!  Send your wishlist.

Erik





Re: Lucene demo ideas?

2003-09-17 Thread Bryan LaPlante
I would like to see the taglib for searching the index in the demo. There is
an html form page and result page already built for the taglib that allows
you to change search params and demonstrates a fair amount of the search
capability of Lucene.

- Original Message -
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 7:00 AM
Subject: Lucene demo ideas?


> I'm about to start some refactorings on the web application demo that
> ships with Lucene to show off its features and be usable more easily
> and cleanly out of the box - i.e. just drop into Tomcat's webapps
> directory and go.
>
> Does anyone have any suggestions on what they'd like to see in the demo
> app?  Some of my ideas are:
>
> - Eliminate the need to do a command-line indexing, let the web app do
> this upon command, allowing you to specify where the index lives (there
> will be a reasonable default like ~/lucenedemo/index perhaps) and what
> directory tree to index (perhaps defaulting to the root directory or
> c:\, or where instead?)
>
> - Spin off a background indexing thread so the web app searching is
> immediately useful after kicking off the indexing process, and allow a
> status view of the indexing progress.
>
> - Index text and HTML files.  Any others?  I don't want to get into
> putting too many dependencies in though - let's keep it relatively
> simple, although still demonstrative.  Allow search filtering by last
> modified date range and document type (extension).
>
> - Perhaps allow you to specify the analyzer to use when indexing.
>
> - Show the explanation of how scores are computed in the search results
> as an option.
>
> I'm all ears to possibilities of improvements!  Send your wishlist.
>
> Erik
>
>





Re: Lucene demo ideas?

2003-09-17 Thread Pete Lewis
Might want two demos, one for Unix environments and one for Windows.

Most users will want a fast start that they can copy and adapt.  So quick
targets would be:

filesystems - html / text / pdf / office documents for windows.
xml - fairly simple example maybe against news items.
database - again simple maybe a pseudo employee database.
website - accessible from the filesystem.
website - that requires crawling.

Show hit markup.

Pete

- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 1:00 PM
Subject: Lucene demo ideas?


> I'm about to start some refactorings on the web application demo that
> ships with Lucene to show off its features and be usable more easily
> and cleanly out of the box - i.e. just drop into Tomcat's webapps
> directory and go.
>
> Does anyone have any suggestions on what they'd like to see in the demo
> app?  Some of my ideas are:
>
> - Eliminate the need to do a command-line indexing, let the web app do
> this upon command, allowing you to specify where the index lives (there
> will be a reasonable default like ~/lucenedemo/index perhaps) and what
> directory tree to index (perhaps defaulting to the root directory or
> c:\, or where instead?)
>
> - Spin off a background indexing thread so the web app searching is
> immediately useful after kicking off the indexing process, and allow a
> status view of the indexing progress.
>
> - Index text and HTML files.  Any others?  I don't want to get into
> putting too many dependencies in though - let's keep it relatively
> simple, although still demonstrative.  Allow search filtering by last
> modified date range and document type (extension).
>
> - Perhaps allow you to specify the analyzer to use when indexing.
>
> - Show the explanation of how scores are computed in the search results
> as an option.
>
> I'm all ears to possibilities of improvements!  Send your wishlist.
>
> Erik
>
>





Re: Lucene demo ideas?

2003-09-17 Thread Erik Hatcher
On Wednesday, September 17, 2003, at 08:43  AM, Killeen, Tom wrote:
> I would suggest XML as well.

Again, I'd like to hear more about how you'd do this generically.  Tell 
me what the field names and values would correspond to when presented 
with an XML file.

	Erik



Re: Lucene demo ideas?

2003-09-17 Thread Erik Hatcher
On Wednesday, September 17, 2003, at 08:42  AM, Ben Litchfield wrote:
> What, no PDF files!!

Haha!

> http://www.pdfbox.org

And I've used pdfbox before - it's cool.

And I'm cool with adding PDF and Word indexing to the demo personally, 
but I didn't want to increase the "weight" of the demo application.  If 
folks feel strongly about it then I'll incorporate it.



Re: Lucene demo ideas?

2003-09-17 Thread Erik Hatcher
please keep the discussions on the lucene-user e-mail list.

of course the source code will be available... what is there is already 
in lucene's CVS and i will just revamp what is there and commit it.  
and when we make lucene releases it will be bundled and made available 
as a single download too.

as for indexing XML files that is a possibility, but that is a 
broad request.  how would they be indexed?  every element made a field?  
every attribute too?  what are the field names?  is this really 
appropriate for a "demo"?

On Wednesday, September 17, 2003, at 08:42  AM, Senthil Kumar K wrote:
> hi erik,
>
>   Is it possible to send a source code for the lucene demo
> u proposed and i want to index xml files in my application.
> All i have to do from browser. I have to avoid the command
> line indexing.




RE: Lucene demo ideas?

2003-09-17 Thread Killeen, Tom
I would suggest XML as well.


Tom

-Original Message-
From: Ben Litchfield [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 17, 2003 7:42 AM
To: Lucene Users List
Subject: Re: Lucene demo ideas?



> - Index text and HTML files.  Any others?


What, no PDF files!!

Ben

--
http://www.pdfbox.org




Re: Lucene demo ideas?

2003-09-17 Thread Ben Litchfield

> - Index text and HTML files.  Any others?


What, no PDF files!!

Ben

--
http://www.pdfbox.org




Lucene demo ideas?

2003-09-17 Thread Erik Hatcher
I'm about to start some refactorings on the web application demo that 
ships with Lucene to show off its features and be usable more easily 
and cleanly out of the box - i.e. just drop into Tomcat's webapps 
directory and go.

Does anyone have any suggestions on what they'd like to see in the demo 
app?  Some of my ideas are:

- Eliminate the need to do a command-line indexing, let the web app do 
this upon command, allowing you to specify where the index lives (there 
will be a reasonable default like ~/lucenedemo/index perhaps) and what 
directory tree to index (perhaps defaulting to the root directory or 
c:\, or where instead?)

- Spin off a background indexing thread so the web app searching is 
immediately useful after kicking off the indexing process, and allow a 
status view of the indexing progress.

- Index text and HTML files.  Any others?  I don't want to get into 
putting too many dependencies in though - let's keep it relatively 
simple, although still demonstrative.  Allow search filtering by last 
modified date range and document type (extension).

- Perhaps allow you to specify the analyzer to use when indexing.

- Show the explanation of how scores are computed in the search results 
as an option.
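The background-indexing idea with a status view could look roughly like this. It is a language-neutral sketch written in Python, not the demo's code: the class name, the `index_one` callback, and the file names are all hypothetical, and a real web app would call `status()` from a request handler while the worker thread keeps running.

```python
import threading


class BackgroundIndexer:
    """Indexes a list of paths on a worker thread and reports progress."""

    def __init__(self, paths, index_one):
        self._paths = list(paths)
        self._index_one = index_one  # callback that indexes a single path
        self._done = 0
        self._lock = threading.Lock()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        for path in self._paths:
            self._index_one(path)
            with self._lock:
                self._done += 1

    def status(self):
        """Return (documents indexed so far, total documents)."""
        with self._lock:
            return self._done, len(self._paths)

    def wait(self):
        self._thread.join()


indexed = []
indexer = BackgroundIndexer(["a.txt", "b.html", "c.txt"], indexed.append)
indexer.start()   # searching could begin here while indexing continues
indexer.wait()    # only for the example; a web app would poll status() instead
print(indexer.status())
```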

I'm all ears to possibilities of improvements!  Send your wishlist.

	Erik



Re: Lucene features

2003-09-17 Thread Otis Gospodnetic
That would be nice.  Contributions are always welcome.

Otis

--- Chris Sibert <[EMAIL PROTECTED]> wrote:
> Thanks for all the replies. I feel reassured with using Lucene. If I end up
> doing anything with the application that I'm writing, I would like to look
> at contributing some documentation of Lucene's features, and what it has to
> offer.
> 
> - Original Message - 
> From: "Leo Galambos" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Thursday, September 11, 2003 4:57 PM
> Subject: Re: Lucene features
> 
> 
> > Doug Cutting wrote:
> >
> > >
> > > I have some extensions to Lucene that I've not yet committed which make
> > > it possible to easily define synthetic IndexReaders (not currently
> > > supported).  So you could do things that way, once I check these in.
> > > But is this really better than just ANDing the clauses together?  It
> > > would take some big experiments to know, but my guess is that it
> > > doesn't make much difference to compute a "local" IDF for such things.
> >
> >
> > In this case, I think that the operator would be evaluated as "an
> > implication" and not "AND" (=1-(((1-q1)^p+(1-q2)^p )/2 )^(1/p)).
> > Obviously, you have to use a filter to filter out false hits (in case
> > of q1->q2, the formula is true when q1 is false, so it is not what you
> > really need), but it is not an issue with the auxiliary index. On the
> > other hand, it is a feeling and it needs a test, you are right.
> >
> > Leo
> >
> >
> >
> >
> 
> 
> 

