Re: Lucene Scoring Behavior
Hmm. This makes no sense to me. Can you supply a reproducible standalone test case?

Doug

Terry Steichen wrote:
> Doug,
>
> (1) No, I did *not* boost the pub_date field, either in the indexing process or in the query itself.
>
> (2) And, each pub_date field of each document (which is in XML format) contains only one instance of the date string.
>
> (3) And only the pub_date field itself is indexed. There are other attributes of this field that may contain the date string, but they aren't indexed - that is, they are not included in the instantiated Document class.
>
> Regards,
> Terry

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene demo ideas?
On Wednesday 17 September 2003 07:07, Erik Hatcher wrote:
> On Wednesday, September 17, 2003, at 08:43 AM, Killeen, Tom wrote:
> > I would suggest XML as well.
>
> Again, I'd like to hear more about how you'd do this generically. Tell
> me what the field names and values would correspond to when presented
> with an XML file.

Perhaps just one generic "content" field, which would contain tokenized content from all XML segments. That could be done easily & efficiently with just SAX event handling? Since it's a simple demo, you can't get much simpler than that, but it should still be fairly useful?

Attributes could/should be ignored by default; common practice for XML markup seems to be for attributes not to contain any content that would make sense to index. So I'd think just stripping out all tags (and comments, PIs etc.) might be a reasonable, plain and simple approach for a demo app.

-+ Tatu +-
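Tatu's single "content" field idea can be sketched with the JDK's SAX parser alone. This is only an illustration, not code from the demo: the class name `XmlContentExtractor` and the method `extract` are made up here, and the returned string is simply what one would hand to a tokenized field.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: flattens an XML document into one "content" string. */
final class XmlContentExtractor extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Attributes are deliberately not read. The space keeps words in
        // adjacent elements from running together.
        text.append(' ');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        text.append(' ');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    /** Parses the XML and returns its character data with whitespace collapsed. */
    static String extract(String xml) {
        XmlContentExtractor handler = new XmlContentExtractor();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return handler.text.toString().trim().replaceAll("\\s+", " ");
    }
}
```

Comments never reach a plain content handler, and processing instructions are dropped because `processingInstruction` is not overridden, so only element text survives.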
Re: Lucene Scoring Behavior
Doug,

(1) No, I did *not* boost the pub_date field, either in the indexing process or in the query itself.

(2) And, each pub_date field of each document (which is in XML format) contains only one instance of the date string.

(3) And only the pub_date field itself is indexed. There are other attributes of this field that may contain the date string, but they aren't indexed - that is, they are not included in the instantiated Document class.

Regards,
Terry

----- Original Message -----
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 5:51 PM
Subject: Re: Lucene Scoring Behavior

> Terry Steichen wrote:
> > 0.03125 = fieldNorm(field=pub_date, doc=90992)
> > 1.0 = fieldNorm(field=pub_date, doc=90970)
>
> It looks like the fieldNorm's are what differ, not the IDFs. These are
> the product of the document and/or field boost, and 1/sqrt(numTerms)
> where numTerms is the number of terms in the "pub_date" field of the
> document. Thus if each document is only assigned one date, and you
> didn't boost the field or the document when you indexed it, this should
> be 1.0. But if the document has two dates, then this would be
> 1/sqrt(2). Or if you boosted this document pub_date field, then this
> will have whatever boost you provided.
>
> So, did you boost anything when indexing? Or could a single document
> have two or more different values for pub_date? Either would explain this.
>
> Doug
Re: Lucene demo ideas?
Yeah, that would be great!

----- Original Message -----
From: "Jeff Linwood" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 5:15 PM
Subject: Re: Lucene demo ideas?

> Paging would be great for the results.
>
> Jeff
Re: Lucene demo ideas?
I would have the code ready if wanted...

----- Original Message -----
From: "Pitre, Russell" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 2:21 PM
Subject: RE: Lucene demo ideas?

> I know this may be far fetched, but how about being able to index .jsp's? I know this is a spindle thing, but it seems a lot of people need this functionality.
>
> My suggestion
>
> Russ
Re: Lucene Scoring Behavior
Terry Steichen wrote:
> 0.03125 = fieldNorm(field=pub_date, doc=90992)
> 1.0 = fieldNorm(field=pub_date, doc=90970)

It looks like the fieldNorm's are what differ, not the IDFs. These are the product of the document and/or field boost, and 1/sqrt(numTerms) where numTerms is the number of terms in the "pub_date" field of the document. Thus if each document is only assigned one date, and you didn't boost the field or the document when you indexed it, this should be 1.0. But if the document has two dates, then this would be 1/sqrt(2). Or if you boosted this document pub_date field, then this will have whatever boost you provided.

So, did you boost anything when indexing? Or could a single document have two or more different values for pub_date? Either would explain this.

Doug
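The formula Doug describes is easy to try with plain arithmetic. The sketch below is not Lucene code; `FieldNormSketch` and `fieldNorm` are illustrative names, and the value is computed before any rounding the index may apply when it stores the norm.

```java
/** Sketch of the formula in Doug's message; not Lucene's own implementation. */
final class FieldNormSketch {
    /** fieldNorm = boost * 1/sqrt(numTerms), before any rounding for storage. */
    static double fieldNorm(double boost, int numTerms) {
        return boost / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        System.out.println(fieldNorm(1.0, 1));    // one date, no boost
        System.out.println(fieldNorm(1.0, 2));    // two dates, no boost
        System.out.println(fieldNorm(1.0, 1024)); // term count that yields 1/32
    }
}
```

Read backwards, an unboosted fieldNorm of exactly 0.03125 (1/32) under this formula would correspond to a field of 1024 terms, or to a single-term field with a boost of 0.03125.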
Re: Lucene Scoring Behavior
Doug/Erik,

I do use RangeQuery to get a range of dates, but in this case I'm just getting a single date (string), so I believe it's just a regular query I'm using.

Per Erik's suggestion, I checked out the Explanation for some of these anomalies. I've included a condensation of the data it generated below (which I frankly don't understand). Perhaps that will give you or Erik some insight into what's happening?

Regards,
Terry

PS: I note that the 'docFreq' parameters displayed below correspond exactly to the number of hits for the query. Also, here's the Similarity class I'm using (per an earlier suggestion of Doug):

    public class WESimilarity2 extends org.apache.lucene.search.DefaultSimilarity {
        public float lengthNorm(String fieldName, int numTerms) {
            if (fieldName.equals("headline") || fieldName.equals("summary")
                    || fieldName.equals("ssummary")) {
                return 4.0f * super.lengthNorm(fieldName, Math.max(numTerms, 750));
            } else {
                return super.lengthNorm(fieldName, Math.max(numTerms, 750));
            }
        }
    }

Query #1: pub_date:20030917
All items: Score: .23000652

    0.23000652 = weight(pub_date:20030917 in 91197), product of:
      0.9994 = queryWeight(pub_date:20030917), product of:
        7.360209 = idf(docFreq=157)
        0.1358657 = queryNorm
      0.23000653 = fieldWeight(pub_date:20030917 in 91197), product of:
        1.0 = tf(termFreq(pub_date:20030917)=1)
        7.360209 = idf(docFreq=157)
        0.03125 = fieldNorm(field=pub_date, doc=91197)

Query #2: pub_date:20030916
All items: Score: .22295427

    0.22295427 = fieldWeight(pub_date:20030916 in 90992), product of:
      1.0 = tf(termFreq(pub_date:20030916)=1)
      7.1345367 = idf(docFreq=197)
      0.03125 = fieldNorm(field=pub_date, doc=90992)

Query #3: pub_date:20030915
Items 1&2: Score: 1.0

    7.2580175 = weight(pub_date:20030915 in 90970), product of:
      0.9994 = queryWeight(pub_date:20030915), product of:
        7.258018 = idf(docFreq=174)
        0.13777865 = queryNorm
      7.258018 = fieldWeight(pub_date:20030915 in 90970), product of:
        1.0 = tf(termFreq(pub_date:20030915)=1)
        7.258018 = idf(docFreq=174)
        1.0 = fieldNorm(field=pub_date, doc=90970)

Query #3 (same as above): pub_date:20030915
Other items: Score: .03125

    0.22681305 = weight(pub_date:20030915 in 90826), product of:
      0.9994 = queryWeight(pub_date:20030915), product of:
        7.258018 = idf(docFreq=174)
        0.13777865 = queryNorm
      0.22681306 = fieldWeight(pub_date:20030915 in 90826), product of:
        1.0 = tf(termFreq(pub_date:20030915)=1)
        7.258018 = idf(docFreq=174)
        0.03125 = fieldNorm(field=pub_date, doc=90826)

Query #4: pub_date:20030914

    0.21384604 = weight(pub_date:20030914 in 90417), product of:
      0.9994 = queryWeight(pub_date:20030914), product of:
        6.843074 = idf(docFreq=264)
        0.14613315 = queryNorm
      0.21384606 = fieldWeight(pub_date:20030914 in 90417), product of:
        1.0 = tf(termFreq(pub_date:20030914)=1)
        6.843074 = idf(docFreq=264)
        0.03125 = fieldNorm(field=pub_date, doc=90417)

Query #5: pub_date:20030913
Items 1&2: Score: 1.0

    7.366558 = fieldWeight(pub_date:20030913 in 90591), product of:
      1.0 = tf(termFreq(pub_date:20030913)=1)
      7.366558 = idf(docFreq=156)
      1.0 = fieldNorm(field=pub_date, doc=90591)

Query #5 (same as above): pub_date:20030913
Other items: Score: .03125

    0.23020494 = fieldWeight(pub_date:20030913 in 90383), product of:
      1.0 = tf(termFreq(pub_date:20030913)=1)
      7.366558 = idf(docFreq=156)
      0.03125 = fieldNorm(field=pub_date, doc=90383)
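The numbers in Terry's explain output can be checked with plain arithmetic. The sketch below is one reading of them, not something confirmed anywhere in the thread, and it rests on two loudly labeled assumptions: that a stored norm keeps only the sign, exponent and top two mantissa bits of the float, and that displayed scores are divided by the top raw score only when that score exceeds 1 (which is what Terry guesses in his original post). All names are illustrative.

```java
/** Arithmetic check on the explain output above; names are illustrative only. */
final class ExplainCheck {
    /** lengthNorm as in WESimilarity2 for a non-headline field: 1/sqrt(max(numTerms, 750)). */
    static float flooredLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(Math.max(numTerms, 750)));
    }

    /** ASSUMPTION: storage keeps only sign, exponent and the top two mantissa bits. */
    static float truncateToTwoMantissaBits(float norm) {
        int bits = Float.floatToIntBits(norm);
        return Float.intBitsToFloat(bits & ~((1 << 21) - 1));
    }

    /** fieldWeight = tf * idf * fieldNorm, as listed in each explain block. */
    static float fieldWeight(float tf, float idf, float fieldNorm) {
        return tf * idf * fieldNorm;
    }

    /** ASSUMPTION: scores are divided by the top raw score only when it exceeds 1. */
    static float displayedScore(float raw, float topRaw) {
        return topRaw > 1.0f ? raw / topRaw : raw;
    }

    public static void main(String[] args) {
        float norm = flooredLengthNorm(1); // a single date term, floored to 750
        System.out.println(norm);
        System.out.println(truncateToTwoMantissaBits(norm));
        System.out.println(fieldWeight(1.0f, 7.360209f, 0.03125f));
        System.out.println(displayedScore(0.22681305f, 7.2580175f));
    }
}
```

With the 750-term floor, a one-term field gets a lengthNorm near 0.0365, which truncates to 0.03125 under the storage assumption. The fieldWeight product reproduces the 0.23000653 of Query #1, and dividing 0.22681305 by the top raw score 7.2580175 gives the .03125 shown for the "other items" of Query #3. Which documents were indexed with which Similarity is not stated in the thread.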
Re: Lucene Scoring Behavior
If you're using RangeQuery to do date searching, then you'll likely see unusual scoring. The IDF of a date, like any other term, is inversely related to the number of documents with that date. So documents whose dates are rare will score higher, which is probably not what you intend.

Using a Filter for date searching is one way to remove dates from the scoring calculation. Another is to provide a Similarity implementation that gives an IDF of 1.0 for terms from your date field, e.g., something like:

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.DefaultSimilarity;
    import org.apache.lucene.search.Searcher;

    public class MySimilarity extends DefaultSimilarity {
        public float idf(Term term, Searcher searcher) throws IOException {
            // equals() rather than == so the test does not depend on string identity
            if ("date".equals(term.field())) {
                return 1.0f;
            } else {
                return super.idf(term, searcher);
            }
        }
    }

Or you could just give date clauses of your query a very small boost (e.g., .0001) so that other clauses dominate the scoring.

Doug
Re: Lucene Scoring Behavior
Try using IndexSearcher.explain and dump out the contents of what it returns either as toString or toHtml (whichever format suits your environment best) and see what it has to say. It'll give you the low-down on the factors involved in the score calculation.

I'm interested to see what you come up with.

Erik
Lucene Scoring Behavior
I've run across some puzzling behavior regarding scoring. I have a set of documents which contain, among others, a date field (whose contents is a string in the YYYYMMDD format). When I query on the date 20030917 (that is, today), I get 157 hits, all of which have a score of .23000652. If I use 20030916 (yesterday), I get 197 hits, each of which has a score of .22295427.

So far, all seems logical. However, when I search for all records for the date 20030915, the first two (of 174 hits) have a score of 1.0, while all the rest of the hits have a score of .03125.

Here is a tabulation of these and a few more queries:

    Query Date    Result
    ==========    ======
    20030917      all have a score of .23000652 (157)
    20030916      all have a score of .22295427 (197)
    20030915      first 2 have a 1.0 score, all rest are .03125 (174)
    20030914      all have a score of .21384604 (264)
    20030913      first 2 have a 1.0 score, all rest are .03125 (156)
    20030912      all have a score of .2166833 (241)
    20030911      first 3 have a 1.0 score, all rest are .03125 (244)
    20030910      all have a score of .2208193 (211)

I would expect that all the hits would have the same score, and I would expect it to be normalized to 1 (unless, I guess, the top score was less than 1, in which case normalization presumably doesn't occur).

Does anyone have any ideas as to what might be going on here? (I'm using the latest CVS sources, obtained this afternoon.)

Regards,
Terry
Re: slow performance with Date Range Searching
And with the latest Lucene codebase in CVS, you could also use a DateFilter wrapped inside a CachingWrapperFilter instead of a QueryFilter. Just wanted to mention what is now available.

But I'll reiterate what Doug says... be sure to save off the filter instance so you don't take the filtering performance hit for the same date ranges on later searches.

Erik
Re: Lucene demo ideas?
Paging would be great for the results.

Jeff

----- Original Message -----
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 7:00 AM
Subject: Lucene demo ideas?

> I'm about to start some refactorings on the web application demo that
> ships with Lucene to show off its features and be usable more easily
> and cleanly out of the box - i.e. just drop into Tomcat's webapps
> directory and go.
>
> Does anyone have any suggestions on what they'd like to see in the demo
> app? Some of my ideas are:
>
> - Eliminate the need to do a command-line indexing, let the web app do
> this upon command, allowing you to specify where the index lives (there
> will be a reasonable default like ~/lucenedemo/index perhaps) and what
> directory tree to index (perhaps defaulting to the root directory or
> c:\, or where instead?)
>
> - Spin off a background indexing thread so the web app searching is
> immediately useful after kicking off the indexing process, and allow a
> status view of the indexing progress.
>
> - Index text and HTML files. Any others? I don't want to get into
> putting too many dependencies in though - let's keep it relatively
> simple, although still demonstrative. Allow search filtering by last
> modified date range and document type (extension).
>
> - Perhaps allow you to specify the analyzer to use when indexing.
>
> - Show the explanation of how scores are computed in the search results
> as an option.
>
> I'm all ears to possibilities of improvements! Send your wishlist.
>
> Erik
Re: slow performance with Date Range Searching
Killeen, Tom wrote:
> My query would look something like this:
>
>   LongTitle:killeen AND LongTitle:state AND StateDistrict:id AND FiledDate:["1997-01-01" TO "2002-04-04"]
>
> and it returned in 5.7 seconds
>
> Does anyone have any suggestions for searching date ranges? Our ranges will generally be between a 3 - 7 year period.

If you use the same date range repeatedly then you can make things fast by replacing it with a filter that you re-use. Try using a QueryFilter with your date range query as the query. Save the QueryFilter object and use it again with future queries. The first query will be slow, but subsequent queries will use the cached results of the first. This is the recommended way to implement a "within the last month", "within the last year", etc. feature.

You could even pre-fetch the filter whenever you update the index, by evaluating a query before you put a new IndexReader in production, so that even the first real user query is fast.

Doug
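The reuse pattern Doug recommends comes down to "compute the set of matching documents once per range, then keep it". The sketch below shows only that caching idea with JDK classes; `RangeFilterCache` is a made-up stand-in, not QueryFilter or any other Lucene class, and the expensive step is passed in as a function.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical cache of per-range document bit sets; not a Lucene class. */
final class RangeFilterCache {
    private final Map<String, BitSet> cache = new HashMap<>();
    private final Function<String, BitSet> compute;
    int computations = 0; // how many times the expensive step actually ran

    RangeFilterCache(Function<String, BitSet> compute) {
        this.compute = compute;
    }

    /** Returns the bit set for a range key, computing it only on first use. */
    BitSet bitsFor(String rangeKey) {
        return cache.computeIfAbsent(rangeKey, key -> {
            computations++;
            return compute.apply(key);
        });
    }

    /** Call after the index changes: cached bits no longer match document ids. */
    void clear() {
        cache.clear();
    }
}
```

Pre-fetching, as Doug suggests, would amount to calling `bitsFor` for the common ranges right after `clear()`, before the first user query arrives.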
Re: Lucene demo ideas?
On Wednesday, September 17, 2003, at 09:21 AM, Pitre, Russell wrote:
> I know this may be far fetched, but how about being able to index .jsp's? I know this is a spindle thing, but it seems a lot of people need this functionality.

Like I communicated in a previous thread, indexing JSP's just has a "smell" to it for me. I can't argue with the pragmatic way others have done it by crawling, but I don't think of JSP's as "content" and I'd rather index actual content, that may or may not be later presented within a JSP.

Erik
Re: Lucene demo ideas?
On Wednesday, September 17, 2003, at 09:31 AM, Bryan LaPlante wrote:
> I would like to see the taglib for searching the index in the demo. There is an html form page and result page already built for the taglib that allows you to change search params and demonstrates a fair amount of the search capability of Lucene.

Bryan, no offense... but I won't be using the taglib in the demo. I just don't feel accessing a Lucene index via a taglib is the right way to do things. Coupling an index to JSP in that manner is too tight for my tastes. What happens if you want to use Velocity for presentation? Or a Swing app? See what I mean?

Erik
Re: Lucene demo ideas?
Erik Hatcher wrote:
> On Wednesday, September 17, 2003, at 08:42 AM, Ben Litchfield wrote:
> > What, no PDF files!! Haha!
> >
> > http://www.pdfbox.org
>
> And I've used pdfbox before - it's cool. And I'm cool with adding PDF and Word indexing to the demo personally, but I didn't want to increase the "weight" of the demo application. If folks feel strongly about it then I'll incorporate it.

A word of warning: PDFBox is fantastic, I agree - but some PDFs are not so... In my application I experienced numerous hangs when PDFBox would start parsing some PDFs (I can send the files to Ben if required), and then got stuck in an infinite wait somewhere...

So I came up with a workaround: I run the parser in a separate thread, while waiting in the main thread, and then after a certain timeout I kill the processing thread and return.

--
Best regards,
Andrzej Bialecki

- Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
- FreeBSD developer (http://www.freebsd.org)
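Andrzej does not show his code, so the following is only a sketch of the same idea with `java.util.concurrent`. It swaps in a different mechanism from the one he describes: instead of killing the thread outright, it interrupts a daemon worker and gives up on it. `TimedParse` and `parseWithTimeout` are made-up names, and the `Callable` stands in for whatever parser call might hang.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical wrapper that bounds how long a parse is allowed to run. */
final class TimedParse {
    /** Returns the parser's text, or null if it fails or exceeds the timeout. */
    static String parseWithTimeout(Callable<String> parser, long timeoutMillis) {
        // Daemon thread: a parser that ignores the interrupt cannot keep the JVM alive.
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parser");
            t.setDaemon(true);
            return t;
        });
        Future<String> result = pool.submit(parser);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the worker and stop waiting for it
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            return null; // the parser itself threw
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A parser stuck in a loop that never checks for interruption will keep consuming CPU on its daemon thread; the caller simply stops waiting for it.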
Re: Lucene demo ideas?
I think all the attribute values together with element text values should be indexed in the "content" part. Also an XML map file could be used to pick out the nodes that need to be indexed separately, so we do not create too many fields by indexing non-critical nodes separately. Simple XPath could be used for the map source; the field name and index type should be the map target.

Regards,
Hui
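Hui's map from XPath expressions to field names can be sketched with the JDK's XPath support. This is a guess at the shape of the idea, not a proposal from the thread: `XPathFieldMapper` and `extractFields` are made-up names, the map is a plain `Map` rather than an XML map file, and the "index type" half of the map target is left out.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical mapper from XPath expressions to field names. */
final class XPathFieldMapper {
    /**
     * @param xml        the document to index
     * @param pathToName map source (XPath expression) to map target (field name)
     * @return field name to string value of the first node each expression matches
     */
    static Map<String, String> extractFields(String xml, Map<String, String> pathToName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> fields = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : pathToName.entrySet()) {
                String value = xpath.evaluate(e.getKey(), doc);
                if (!value.isEmpty()) { // expressions that match nothing yield ""
                    fields.put(e.getValue(), value);
                }
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not map XML to fields", e);
        }
    }
}
```

Nodes not named in the map would still go into the generic "content" field; only the mapped ones become separate fields.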
Re: Lucene demo ideas?
> Does anyone have any suggestions on what they'd like to see in the
> demo app?

Show how Lucene can 1) do incremental indexing, 2) isn't restricted to indexing file system resources and 3) can store and query arbitrary fields.

These are in my opinion the features where most other search engines fall flat.

--
Eric Jain
Re: slow performance with Date Range Searching
> Does anyone have any suggestions for searching date ranges. Our
> ranges will generally be between a 3 - 7 year period.

Apparently Lucene expands ranges to boolean 'or' queries. So if you have a thousand distinct dates within a range, Lucene will build a query with a thousand terms...

One workaround is to first run the query, but without ranges. Store the results into a bit vector with the position corresponding to the document id. Then create a TermEnum that starts with the lower range value. For each term, get the document ids, and set the corresponding values in a second bit vector. Break the loop as soon as the TermEnum has reached or passed the upper limit of the range. Finally, 'and' or 'or' the second bit vector with the first one.

It's as simple as that :-) I wonder why Lucene doesn't use this strategy by default. I realize it is less efficient when the range includes few terms, but it seems to scale far better.

--
Eric Jain
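The steps Eric lists can be sketched without Lucene by letting a sorted map play the role of the term dictionary. Everything here is a stand-in: `BitSetRange` is a made-up name, the `TreeMap` replaces walking a TermEnum and its postings, and the range is treated as inclusive at both ends.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of the bit-vector range strategy over a toy term dictionary. */
final class BitSetRange {
    /**
     * @param termToDocs sorted terms mapped to the ids of documents containing them
     *                   (a stand-in for a TermEnum and its document lists)
     */
    static BitSet docsInRange(TreeMap<String, int[]> termToDocs, String lower, String upper) {
        BitSet inRange = new BitSet();
        // Start at the first term >= lower and stop once a term passes upper.
        for (Map.Entry<String, int[]> e : termToDocs.tailMap(lower, true).entrySet()) {
            if (e.getKey().compareTo(upper) > 0) {
                break;
            }
            for (int docId : e.getValue()) {
                inRange.set(docId);
            }
        }
        return inRange;
    }

    /** ANDs the range bits into the hits of the query that was run without the range. */
    static BitSet restrict(BitSet queryHits, BitSet inRange) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(inRange);
        return result;
    }
}
```

The work is proportional to the number of terms and document ids inside the range, and no thousand-clause boolean query is ever built.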
wildcard search and german umlauts
Hi All,

has someone ever written an extension of QueryParser providing the possibility to let wildcard search terms be run through an analyzer (as suggested by Tatu Saloranta a while ago)? I want to reduce German umlauts to their base letters (e.g. 'ä' to 'a') and for non-wildcard queries my UmlautFilter does its job. Or is there any other option?

Many thanks,
René
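One option short of running wildcard terms through a full analyzer is to fold the umlauts in the term string before the query is built. The sketch below is an assumption about what such a step could look like: `WildcardUmlautFolder` is a made-up name, it is not part of QueryParser, and the mapping would have to match whatever René's UmlautFilter does at index time, which the message does not show.

```java
/** Hypothetical pre-processing step for wildcard terms; not part of QueryParser. */
final class WildcardUmlautFolder {
    /** Replaces German umlauts with their base letters, leaving * and ? untouched. */
    static String fold(String wildcardTerm) {
        StringBuilder out = new StringBuilder(wildcardTerm.length());
        for (char c : wildcardTerm.toCharArray()) {
            switch (c) {
                case '\u00e4': out.append('a'); break; // a-umlaut
                case '\u00f6': out.append('o'); break; // o-umlaut
                case '\u00fc': out.append('u'); break; // u-umlaut
                case '\u00c4': out.append('A'); break; // A-umlaut
                case '\u00d6': out.append('O'); break; // O-umlaut
                case '\u00dc': out.append('U'); break; // U-umlaut
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```

The folded string would then be used wherever the raw wildcard term was used before; any lowercasing or stemming done at index time is outside this sketch.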
RE: slow performance with Date Range Searching
I don't know how Lucene handles date ranges, but I was getting very poor performance using booleans between different fields because of the way Lucene handles them. As far as I can tell, Lucene evaluates each field clause in the query separately and retrieves all of its results, and only then evaluates the boolean joins between the clauses. So I believe Lucene is handling your query like this:
- Get all the documents whose LongTitle has killeen in them
- Get all the documents whose LongTitle has state in them
- Get all the documents whose StateDistrict has id in them
- Get all the documents filed between 1997-01-01 and 2002-04-04 (this, incidentally, takes up a huge amount of memory)
- Finally, evaluate the booleans, figure out which documents satisfy all of your criteria, and return those to you.
I'm working on a matching engine that takes company information, like name and address, and finds our record for that company. I ended up making a separate index for every state and country because it was running too slowly and running out of memory when I was using booleans between fields. Maybe you could do something similar with your dates (i.e. one index per year).
-Original Message-
From: Killeen, Tom [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 17, 2003 10:01 AM
To: 'Lucene Users List'
Subject: slow performance with Date Range Searching
Hello all,
I have recently indexed approx 15.8 million XML documents in which I index the contents of certain elements (titles, states, dates to name a few). I have 27 separate indices and use a MultiSearcher to search these indices. When I search on the title and state fields with multiple terms, searching is very fast. For example, I get a hit count of 227,000 in .4 seconds. But when I throw a date range into the search, performance suffers significantly. My query would look something like this:
LongTitle:killeen AND LongTitle:state AND StateDistrict:id AND FiledDate:["1997-01-01" TO "2002-04-04"]
and it returned in 5.7 seconds.
Does anyone have any suggestions for searching date ranges? Our ranges will generally cover a 3 - 7 year period.
thanks, Tom
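Note: to get a feel for the sizes involved in the reply above, the sketch below counts how many distinct day values the example range spans and lists which per-year indexes a search over that range would need to open. It is plain Python, not Lucene code. The `index-YYYY` naming is hypothetical, and the idea that a range clause has to consider every distinct date value in the range is an assumption about Lucene's behaviour, not something this thread states.

```python
from datetime import date
from typing import List


def day_terms_in_range(start: date, end: date) -> int:
    """Number of distinct day-granularity values in an inclusive date range."""
    return (end - start).days + 1


def year_partitions(start: date, end: date) -> List[str]:
    """Names of the per-year indexes a range would touch (naming is hypothetical)."""
    return [f"index-{year}" for year in range(start.year, end.year + 1)]


start, end = date(1997, 1, 1), date(2002, 4, 4)
print(day_terms_in_range(start, end))
print(year_partitions(start, end))
```

For the range in the question this gives 1920 day values spread over six yearly partitions. With one index per year, only the first and last partitions would need a date restriction at all; the years in between fall entirely inside the range.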
RE: Lucene demo ideas?
Hi,
Here are a couple of ideas for XML demos:
1. Simply index the content into one 'content' field. Don't worry about attributes.
2. Index a linked Dublin Core metadata file, and add fields for every element after rdf:Description.
Best, -Rob
> -Original Message-
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 17, 2003 6:08 AM
> To: Lucene Users List
>
> On Wednesday, September 17, 2003, at 08:43 AM, Killeen, Tom wrote:
> > I would suggest XML as well.
>
> Again, I'd like to hear more about how you'd do this generically. Tell me what the field names and values would correspond to when presented with an XML file.
>
> Erik
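Note: the two ideas above can be sketched without any indexing library. The Python below only shows how field names and values might be derived from an XML file; the function names, the sample document, and the choice to key fields by local element name are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/doc1">
    <dc:title>Lucene demo</dc:title>
    <dc:creator>Jane Doe</dc:creator>
  </rdf:Description>
</rdf:RDF>"""


def content_field(xml_text: str) -> str:
    """Idea 1: join all element text into one 'content' value; attributes are ignored."""
    root = ET.fromstring(xml_text)
    # itertext() yields text nodes only; split/join collapses layout whitespace.
    return " ".join(" ".join(root.itertext()).split())


def description_fields(xml_text: str) -> dict:
    """Idea 2: one field per child element of rdf:Description, keyed by local name."""
    root = ET.fromstring(xml_text)
    fields = {}
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        for child in desc:
            local_name = child.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
            fields.setdefault(local_name, []).append((child.text or "").strip())
    return fields


print(content_field(SAMPLE))
print(description_fields(SAMPLE))
```

Each key of the returned dictionary would become a field name and each list entry a field value; repeated elements simply produce several values for the same field.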
slow performance with Date Range Searching
Hello all,
I have recently indexed approx 15.8 million XML documents in which I index the contents of certain elements (titles, states, dates to name a few). I have 27 separate indices and use a MultiSearcher to search these indices. When I search on the title and state fields with multiple terms, searching is very fast. For example, I get a hit count of 227,000 in .4 seconds. But when I throw a date range into the search, performance suffers significantly. My query would look something like this:
LongTitle:killeen AND LongTitle:state AND StateDistrict:id AND FiledDate:["1997-01-01" TO "2002-04-04"]
and it returned in 5.7 seconds.
Does anyone have any suggestions for searching date ranges? Our ranges will generally cover a 3 - 7 year period.
thanks, Tom
RE: Lucene demo ideas?
I know this may be far-fetched, but how about being able to index .jsp's? I know this is a spindle thing, but it seems a lot of people need this functionality. My suggestion.
Russ
-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 17, 2003 8:01 AM
To: [EMAIL PROTECTED]
Subject: Lucene demo ideas?
I'm about to start some refactorings on the web application demo that ships with Lucene to show off its features and be usable more easily and cleanly out of the box - i.e. just drop into Tomcat's webapps directory and go.
Does anyone have any suggestions on what they'd like to see in the demo app? Some of my ideas are:
- Eliminate the need to do command-line indexing; let the web app do this upon command, allowing you to specify where the index lives (there will be a reasonable default like ~/lucenedemo/index perhaps) and what directory tree to index (perhaps defaulting to the root directory or c:\, or where instead?)
- Spin off a background indexing thread so the web app searching is immediately useful after kicking off the indexing process, and allow a status view of the indexing progress.
- Index text and HTML files. Any others? I don't want to get into putting too many dependencies in though - let's keep it relatively simple, although still demonstrative. Allow search filtering by last modified date range and document type (extension).
- Perhaps allow you to specify the analyzer to use when indexing.
- Show the explanation of how scores are computed in the search results as an option.
I'm all ears to possibilities of improvements! Send your wishlist.
Erik
Re: Lucene demo ideas?
I would like to see the taglib for searching the index in the demo. There is an HTML form page and result page already built for the taglib that allows you to change search params and demonstrates a fair amount of the search capability of Lucene.
- Original Message -
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 7:00 AM
Subject: Lucene demo ideas?
> I'm about to start some refactorings on the web application demo that ships with Lucene to show off its features and be usable more easily and cleanly out of the box - i.e. just drop into Tomcat's webapps directory and go.
>
> Does anyone have any suggestions on what they'd like to see in the demo app? Some of my ideas are:
>
> - Eliminate the need to do command-line indexing; let the web app do this upon command, allowing you to specify where the index lives (there will be a reasonable default like ~/lucenedemo/index perhaps) and what directory tree to index (perhaps defaulting to the root directory or c:\, or where instead?)
>
> - Spin off a background indexing thread so the web app searching is immediately useful after kicking off the indexing process, and allow a status view of the indexing progress.
>
> - Index text and HTML files. Any others? I don't want to get into putting too many dependencies in though - let's keep it relatively simple, although still demonstrative. Allow search filtering by last modified date range and document type (extension).
>
> - Perhaps allow you to specify the analyzer to use when indexing.
>
> - Show the explanation of how scores are computed in the search results as an option.
>
> I'm all ears to possibilities of improvements! Send your wishlist.
>
> Erik
Re: Lucene demo ideas?
Might want two demos, one for Unix environments and one for Windows. Most users will want a fast start that they can copy and adapt. So quick targets would be:
- filesystems - html / text / pdf / office documents for Windows.
- xml - fairly simple example, maybe against news items.
- database - again simple, maybe a pseudo employee database.
- website - accessible from the filesystem.
- website - that requires crawling.
Show hit markup.
Pete
- Original Message -
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 1:00 PM
Subject: Lucene demo ideas?
> I'm about to start some refactorings on the web application demo that ships with Lucene to show off its features and be usable more easily and cleanly out of the box - i.e. just drop into Tomcat's webapps directory and go.
>
> Does anyone have any suggestions on what they'd like to see in the demo app? Some of my ideas are:
>
> - Eliminate the need to do command-line indexing; let the web app do this upon command, allowing you to specify where the index lives (there will be a reasonable default like ~/lucenedemo/index perhaps) and what directory tree to index (perhaps defaulting to the root directory or c:\, or where instead?)
>
> - Spin off a background indexing thread so the web app searching is immediately useful after kicking off the indexing process, and allow a status view of the indexing progress.
>
> - Index text and HTML files. Any others? I don't want to get into putting too many dependencies in though - let's keep it relatively simple, although still demonstrative. Allow search filtering by last modified date range and document type (extension).
>
> - Perhaps allow you to specify the analyzer to use when indexing.
>
> - Show the explanation of how scores are computed in the search results as an option.
>
> I'm all ears to possibilities of improvements! Send your wishlist.
>
> Erik
Re: Lucene demo ideas?
On Wednesday, September 17, 2003, at 08:43 AM, Killeen, Tom wrote:
> I would suggest XML as well.
Again, I'd like to hear more about how you'd do this generically. Tell me what the field names and values would correspond to when presented with an XML file.
Erik
Re: Lucene demo ideas?
On Wednesday, September 17, 2003, at 08:42 AM, Ben Litchfield wrote:
> What, no PDF files!!
Haha! http://www.pdfbox.org
And I've used pdfbox before - it's cool. And I'm cool with adding PDF and Word indexing to the demo personally, but I didn't want to increase the "weight" of the demo application. If folks feel strongly about it then I'll incorporate it.
Re: Lucene demo ideas?
please keep the discussions on the lucene-user e-mail list.
of course the source code will be available... what is there is already in lucene's CVS and i will just revamp what is there and commit it. and when we make lucene releases it will be bundled and made available as a single download too.
as for indexing XML files, that is a possibility, but that is a broad request. how would they be indexed? every element made a field? every attribute too? what are the field names? is this really appropriate for a "demo"?
On Wednesday, September 17, 2003, at 08:42 AM, Senthil Kumar K wrote:
> hi erik, Is it possible to send a source code for the lucene demo u proposed and i want to index xml files in my application. All i have to do from browser. I have to avoid the command line indexing.
RE: Lucene demo ideas?
I would suggest XML as well.
Tom
-Original Message-
From: Ben Litchfield [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 17, 2003 7:42 AM
To: Lucene Users List
Subject: Re: Lucene demo ideas?
> - Index text and HTML files. Any others?
What, no PDF files!!
Ben
--
http://www.pdfbox.org
Re: Lucene demo ideas?
> - Index text and HTML files. Any others?
What, no PDF files!!
Ben
--
http://www.pdfbox.org
Lucene demo ideas?
I'm about to start some refactorings on the web application demo that ships with Lucene to show off its features and be usable more easily and cleanly out of the box - i.e. just drop into Tomcat's webapps directory and go.
Does anyone have any suggestions on what they'd like to see in the demo app? Some of my ideas are:
- Eliminate the need to do command-line indexing; let the web app do this upon command, allowing you to specify where the index lives (there will be a reasonable default like ~/lucenedemo/index perhaps) and what directory tree to index (perhaps defaulting to the root directory or c:\, or where instead?)
- Spin off a background indexing thread so the web app searching is immediately useful after kicking off the indexing process, and allow a status view of the indexing progress.
- Index text and HTML files. Any others? I don't want to get into putting too many dependencies in though - let's keep it relatively simple, although still demonstrative. Allow search filtering by last modified date range and document type (extension).
- Perhaps allow you to specify the analyzer to use when indexing.
- Show the explanation of how scores are computed in the search results as an option.
I'm all ears to possibilities of improvements! Send your wishlist.
Erik
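Note: the background-indexing idea in the message above can be illustrated with a small, library-neutral sketch. The Python below is a toy, not the demo's actual design: `BackgroundIndexer` is a hypothetical class whose worker thread merely counts items where a real indexer would add documents, while `status()` is what a status page could poll.

```python
import threading


class BackgroundIndexer:
    """Toy stand-in for an indexer: processes items on a worker thread and reports progress."""

    def __init__(self, items):
        self._items = list(items)
        self._done = 0
        self._lock = threading.Lock()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        """Kick off indexing and return immediately so searching stays available."""
        self._thread.start()

    def _run(self):
        for _item in self._items:
            # A real indexer would add the item to the index here.
            with self._lock:
                self._done += 1

    def status(self):
        """Return (items indexed so far, total items) for a status view."""
        with self._lock:
            return self._done, len(self._items)

    def wait(self):
        """Block until the worker thread has finished."""
        self._thread.join()


indexer = BackgroundIndexer(["a.txt", "b.html", "c.txt"])
indexer.start()
indexer.wait()
print(indexer.status())
```

The lock keeps the progress counter consistent when the status view reads it while the worker is still running.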
Re: Lucene features
That would be nice. Contributions are always welcome.
Otis
--- Chris Sibert <[EMAIL PROTECTED]> wrote:
> Thanks for all the replies. I feel reassured with using Lucene. If I end up doing anything with the application that I'm writing, I would like to look at contributing some documentation of Lucene's features, and what it has to offer.
>
> - Original Message -
> From: "Leo Galambos" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Thursday, September 11, 2003 4:57 PM
> Subject: Re: Lucene features
>
> > Doug Cutting wrote:
> > >
> > > I have some extensions to Lucene that I've not yet committed which make it possible to easily define synthetic IndexReaders (not currently supported). So you could do things that way, once I check these in. But is this really better than just ANDing the clauses together? It would take some big experiments to know, but my guess is that it doesn't make much difference to compute a "local" IDF for such things.
> >
> > In this case, I think that the operator would be evaluated as "an implication" and not "AND" (=1-(((1-q1)^p+(1-q2)^p)/2)^(1/p)). Obviously, you have to use a filter to filter out false hits (in case of q1->q2, the formula is true when q1 is false, so it is not what you really need), but it is not an issue with the auxiliary index. On the other hand, it is a feeling and it needs a test, you are right.
> >
> > Leo
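Note: the expression quoted in Leo Galambos's message, 1-(((1-q1)^p+(1-q2)^p)/2)^(1/p), can be evaluated directly to see how it behaves. The sketch below is plain Python; the function name `pnorm_and` is hypothetical, and describing the expression as the two-clause AND of the p-norm (extended Boolean) retrieval model is an editorial reading, not something the thread states. It assumes q1 and q2 are clause scores in [0, 1] and p >= 1.

```python
def pnorm_and(q1: float, q2: float, p: float) -> float:
    """Evaluate 1 - (((1-q1)^p + (1-q2)^p) / 2)^(1/p) for scores in [0, 1]."""
    return 1.0 - (((1.0 - q1) ** p + (1.0 - q2) ** p) / 2.0) ** (1.0 / p)


for p in (1, 2, 10):
    print(p, round(pnorm_and(0.8, 0.4, p), 3))
```

With p = 1 the expression reduces to the plain average of the two scores (0.6 for 0.8 and 0.4). With p = 2 the same inputs give roughly 0.553, and as p grows the result moves towards the smaller of the two scores, which is the strict Boolean AND behaviour.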