Re: Lucene features
Erik Hatcher wrote:
> Yes, you're right. Getting the scores of a second query based on the scores of the first query is probably not trivial, but probably possible with Lucene. And that combined with a QueryFilter would do the trick I suspect. Somehow the scores of the first query could be remembered and used as a boost (or other type of factor) for the scores of the second query.

Why not just AND together the first and second query? That way they're both incorporated in the ranking. Filters are good when you don't want the first query to affect the ranking, and also when the first query is a criterion that you'll reuse for many queries (e.g., language=french), since the bit vectors can be cached (as by QueryFilter).

Doug

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
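The difference between ANDing two queries and filtering by the first one can be illustrated with a toy model. This is only a sketch and not the Lucene API: scores are plain dictionaries, the numbers are made up, and summing the two scores is an assumed stand-in for whatever a real ranking function would do.

```python
# Toy model, not the Lucene API: scores are dicts of doc id -> score.

def and_scores(first, second):
    """AND the two queries: keep docs matched by both, sum their scores."""
    return {doc: first[doc] + second[doc] for doc in first.keys() & second.keys()}

def filter_scores(allowed_docs, second):
    """Filter: keep docs in the allowed set, ranked by the second query only."""
    return {doc: score for doc, score in second.items() if doc in allowed_docs}

def ranked(scores):
    """Doc ids ordered by descending score, ties broken by doc id."""
    return sorted(scores, key=lambda doc: (-scores[doc], doc))

first = {1: 0.9, 2: 0.1, 3: 0.5}    # hypothetical scores for the first query
second = {1: 0.2, 2: 0.8, 4: 0.7}   # hypothetical scores for the second query

# The set of docs matching the first query is computed once and can be reused
# for many second queries, which is the caching benefit of a filter.
cached_filter = frozenset(first)

print(ranked(and_scores(first, second)))             # both queries shape the order
print(ranked(filter_scores(cached_filter, second)))  # only the second query does
```

With these numbers the same two documents survive in both cases, but their order flips, because the filter discards the first query's scores.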
Re: Lucene features
Doug Cutting wrote:
> Erik Hatcher wrote:
>> Yes, you're right. Getting the scores of a second query based on the scores of the first query is probably not trivial, but probably possible with Lucene. And that combined with a QueryFilter would do the trick I suspect. Somehow the scores of the first query could be remembered and used as a boost (or other type of factor) for the scores of the second query.
>
> Why not just AND together the first and second query? That way they're both incorporated in the ranking. Filters are good when you don't want the first query to affect the ranking, and also when the first query is a criterion that you'll reuse for many queries (e.g., language=french), since the bit vectors can be cached (as by QueryFilter).

You probably missed the start of our discussion - we are talking about this: q1 -> q2, which means NOT q1 OR q2, versus q2 -> q1, which means q1 OR NOT q2. It causes the issue, and it also shows why you cannot use the simple AND, because q1 AND q2 != NOT q1 OR q2 != q1 OR NOT q2.

Leo

BTW: I haven't seen the logic formulas for many years, so it is without any guarantee ;-)
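The inequality in this message can be checked with a small truth table. The sketch below takes the two formulas exactly as written in the post (NOT q1 OR q2, and q1 OR NOT q2) and compares them with q1 AND q2; the helper name `implies` is only illustrative.

```python
from itertools import product

def implies(a, b):
    """Material implication: a -> b is the same as (not a) or b."""
    return (not a) or b

rows = []
for q1, q2 in product([False, True], repeat=2):
    rows.append((q1, q2, q1 and q2, implies(q1, q2), implies(q2, q1)))

print("q1     q2     AND    q1->q2 q2->q1")
for row in rows:
    print(" ".join(f"{str(value):<6}" for value in row))
```

The three result columns agree only when both q1 and q2 are true, which is consistent with the point that a plain AND cannot stand in for either ordered form.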
Re: Lucene features
Doug Cutting wrote:
> I have some extensions to Lucene that I've not yet committed which make it possible to easily define synthetic IndexReaders (not currently supported). So you could do things that way, once I check these in. But is this really better than just ANDing the clauses together? It would take some big experiments to know, but my guess is that it doesn't make much difference to compute a local IDF for such things.

In this case, I think that the operator would be evaluated as an implication and not AND (= 1 - (((1-q1)^p + (1-q2)^p) / 2)^(1/p)). Obviously, you have to use a filter to filter out false hits (in case of q1 -> q2, the formula is true when q1 is false, so it is not what you really need), but it is not an issue with the auxiliary index. On the other hand, it is a feeling and it needs a test, you are right.

Leo
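The expression in parentheses has the form of the equal-weight p-norm AND from the extended Boolean model. As a rough sketch, assuming q1 and q2 are per-document scores in the range [0, 1] and p >= 1 (the function name is made up for illustration), it can be evaluated like this:

```python
def pnorm_and(q1, q2, p):
    """The quoted expression: 1 - (((1-q1)^p + (1-q2)^p) / 2)^(1/p)."""
    return 1.0 - (((1.0 - q1) ** p + (1.0 - q2) ** p) / 2.0) ** (1.0 / p)

# One clause fully satisfied and one not, then both half satisfied.
for p in (1, 2, 10):
    print(p, round(pnorm_and(1.0, 0.0, p), 3), round(pnorm_and(0.5, 0.5, p), 3))
```

At p = 1 the result is the plain average of the two scores; as p grows, a document that misses one clause is penalised more and the operator moves toward a strict AND.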
Re: Lucene features
Erik Hatcher wrote:
> On Friday, September 5, 2003, at 07:45 PM, Leo Galambos wrote:
>>> And for the second time today QueryFilter. It allows narrowing the documents queried to only the documents from a previous Query.
>>
>> I guess, it would not be an ideal solution - the first query does two things a) it selects a subset from the corpus; b) it assigns a relevance to each document of this subset. Your solution omits the second point. It implies, the solution will not return good hit lists, because you will not consider the information value of the first query which was given to you by a user.
>
> Yes, you're right. Getting the scores of a second query based on the scores of the first query is probably not trivial, but probably possible with Lucene. And that combined with a QueryFilter would do the trick I suspect. Somehow the scores of the first query could be remembered and used as a boost (or other type of factor) for the scores of the second query.

Well, I do not want to be a pessimist, but the boost vector is not a good solution due to CWI statistics. On the other hand, it is much better than the simple QueryFilter which, in fact, works as a 0/1 boost.

Example: I use this notation:

inverted_list_term:{list of W values, - denotes W=0, for 12 documents in a collection}

A:{23[16]--27}
B:{--[38]}
C:{18[2-]45239812}

If your first query is B, the subset of documents (denoted by brackets - namely, the 3rd and 4th doc) is selected, and if your second query is A C, then you cannot use global IDFs, because in the subset, the IDF factors are different. Globally, A is a better discriminator, but in the subset, C is better. This fact is then reflected by the hit list you generate, and I guess the quality will also be affected by this.

The example shows that you would rather export the subset to an auxiliary index (RAMDirectory?) and then use this structure instead of the original index. Obviously, it will solve the issue of speed you mentioned. Unfortunately, I am not sure if you can export the inverted lists when you read them. In egothor, I would use a listener in the Rider class; in Lucene, I would have to rewrite some classes and it could be a real problem. Maybe there is a solution I do not see... Your turn ;-)

Cheers,
Leo

> Am I off base here?
>
>> Thus I think, Chris would implement something more complex than QueryFilter. If not, the results will be poorer than with the commercial packages he may get. He could use a different model where AND is not an associative operator (i.e. some modification of the extended Boolean model). It implies, he would implement it in Similarity.java (if I remember that class name correctly).
>
> Right... but you'd still need the filtering capability as well, I would think - at least for performance reasons.
>
> Erik
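Leo's example can be worked through numerically. The sketch below rests on several assumptions that are not stated in the post: each digit is read as one W value, the brackets mark documents 3 and 4, positions missing at the end of a list are treated as W=0, and idf is the textbook log(N/df) rather than Lucene's own formula.

```python
import math

# Weights per document, one character per document (see assumptions above).
LISTS = {"A": "2316--27", "B": "--38", "C": "182-45239812"}
NUM_DOCS = 12

def postings(term):
    """Set of 1-based document ids where the term has a non-zero weight."""
    padded = LISTS[term].ljust(NUM_DOCS, "-")
    return {i + 1 for i, w in enumerate(padded) if w != "-"}

def idf(term, docs):
    """Textbook idf = log(N / df), restricted to the given document set."""
    df = len(postings(term) & docs)
    return math.log(len(docs) / df) if df else float("inf")

all_docs = set(range(1, NUM_DOCS + 1))
subset = postings("B")  # documents selected by the first query, B

for term in ("A", "C"):
    print(term, round(idf(term, all_docs), 3), round(idf(term, subset), 3))
```

Under this reading, A occurs in 6 of the 12 documents and C in 11, so A has the higher global idf. Inside the subset selected by B, A occurs in both documents and C in only one, so the order reverses, which is the effect described in the message.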
Re: Lucene features
On Friday, September 5, 2003, at 07:45 PM, Leo Galambos wrote:
>> And for the second time today QueryFilter. It allows narrowing the documents queried to only the documents from a previous Query.
>
> I guess, it would not be an ideal solution - the first query does two things a) it selects a subset from the corpus; b) it assigns a relevance to each document of this subset. Your solution omits the second point. It implies, the solution will not return good hit lists, because you will not consider the information value of the first query which was given to you by a user.

Yes, you're right. Getting the scores of a second query based on the scores of the first query is probably not trivial, but probably possible with Lucene. And that combined with a QueryFilter would do the trick I suspect. Somehow the scores of the first query could be remembered and used as a boost (or other type of factor) for the scores of the second query.

Am I off base here?

> Thus I think, Chris would implement something more complex than QueryFilter. If not, the results will be poorer than with the commercial packages he may get. He could use a different model where AND is not an associative operator (i.e. some modification of the extended Boolean model). It implies, he would implement it in Similarity.java (if I remember that class name correctly).

Right... but you'd still need the filtering capability as well, I would think - at least for performance reasons.

Erik
Re: Lucene features
I'm not sure what all of the 'advanced features' were also.

Phonetic Searching - probably not important to this application.

Synonym searching might be desirable, but now that I'm thinking about it, also likely not important.

Associated Words - sounds very interesting, like 'gold' might return 'metal' also, etc.

But Drill Down searching is very desirable. It's where you're able to search within the results of a previous search. I'm assuming that I'll have to implement that myself, by keeping a copy of the previous Hits list, and only returning results that are in both lists.

Thanks very much for your reply.

- Original Message -
From: Steven J. Owens [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, September 04, 2003 3:02 AM
Subject: Re: Lucene features
Re: Lucene features
On Friday, September 5, 2003, at 02:36 PM, Chris Sibert wrote:
> Synonym searching might be desirable, but now that I'm thinking about it, also likely not important.

This could be done with a custom Analyzer.

> Associated Words - sounds very interesting, like 'gold' might return 'metal' also, etc.

How is that different from Synonym searching?

> But Drill Down searching is very desirable. It's where you're able to search within the results of a previous search. I'm assuming that I'll have to implement that myself, by keeping a copy of the previous Hits list, and only returning results that are in both lists.

And for the second time today: QueryFilter. It allows narrowing the documents queried to only the documents from a previous Query.

Erik
Re: Lucene features
Chris Sibert wrote:
> I'm not sure what all of the 'advanced features' were also. Phonetic Searching - probably not important to this application.

Phonetic searching may be achieved by writing your own Analyzer, which instead of (or, more probably, along with) the plain tokens provides their phonetic codes, e.g. Double Metaphone for English, or the less useful but more familiar Soundex. Phonetic searching increases recall but lowers precision, especially if you use a stemmer before phonetic encoding...

One trick to consider if using phonetic encoding is to keep around the histogram of the original words that have been mapped to corresponding phonetic codes. Then, if a query fails to provide satisfactory results, you can provide a useful suggestion based on the most frequent term found in the histogram that has the same phonetic code as the term in the query.

> Synonym searching might be desirable, but now that I'm thinking about it, also likely not important. Associated Words - sounds very interesting, like 'gold' might return 'metal' also, etc.

If you plan on using just English text, you may want to check the excellent (and free) WordNet database, which also offers an API - both for query expansion and for finding associated words (synsets?), or hypernyms like in your example.

--
Best regards,
Andrzej Bialecki

- Software Architect, System Integration Specialist
  CEN/ISSS EC Workshop, ECIMF project chair
  EU FP6 E-Commerce Expert/Evaluator
- FreeBSD developer (http://www.freebsd.org)
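The histogram trick can be sketched in a few lines. The code below uses classic Soundex because it is short; Double Metaphone would slot into the same place. The word list and all function names are made up for illustration, and nothing here is tied to Lucene's Analyzer API.

```python
from collections import Counter, defaultdict

# Soundex digit for each consonant; vowels, h, w and y have no code.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)

def soundex(word):
    """Classic four-character Soundex code, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = CODES.get(c, "")
        if digit and digit != previous:
            code += digit
        if c not in "hw":  # h and w do not separate equal codes
            previous = digit
    return (code + "000")[:4]

def build_histogram(words):
    """Map each phonetic code to a Counter of the original words behind it."""
    histogram = defaultdict(Counter)
    for word in words:
        histogram[soundex(word)][word.lower()] += 1
    return histogram

def suggest(term, histogram):
    """Most frequent indexed word sharing the query term's code, or None."""
    candidates = histogram.get(soundex(term))
    return candidates.most_common(1)[0][0] if candidates else None

indexed_words = ["Robert", "Robert", "Rupert", "metal", "gold"]  # made-up corpus
histogram = build_histogram(indexed_words)
print(suggest("Rubart", histogram))
```

A misspelled query term such as "Rubart" shares its code with both "Robert" and "Rupert", and the histogram picks whichever of them occurred more often at indexing time.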
Re: Lucene features
>> But Drill Down searching is very desirable. It's where you're able to search within the results of a previous search. I'm assuming that I'll have to implement that myself, by keeping a copy of the previous Hits list, and only returning results that are in both lists.
>
> And for the second time today QueryFilter. It allows narrowing the documents queried to only the documents from a previous Query.

I guess, it would not be an ideal solution - the first query does two things a) it selects a subset from the corpus; b) it assigns a relevance to each document of this subset. Your solution omits the second point. It implies, the solution will not return good hit lists, because you will not consider the information value of the first query which was given to you by a user.

For instance, neologism George Bush (1st query, then 2nd query) would return a different order of hits than George Bush neologism. Other examples: Prague Berlin flight (I must go there, and I prefer an airplane) versus flight Prague Berlin (I must fly, and I prefer Berlin).

Thus I think, Chris would implement something more complex than QueryFilter. If not, the results will be poorer than with the commercial packages he may get. He could use a different model where AND is not an associative operator (i.e. some modification of the extended Boolean model). It implies, he would implement it in Similarity.java (if I remember that class name correctly).

Leo
Re: Lucene features
On Wed, Sep 03, 2003 at 02:42:48PM -0400, Chris Sibert wrote:
>>> I am wondering if Lucene is the way to go for my project.
>>
>> Probably. Tell us a little about your project.
>
> It's pretty basic. I'm just indexing 4 large text files, ranging up to 100MB in size. They don't ever change, and are on a CD-ROM. Each file contains a bunch of small documents. I just create one index for all 4 of them. These documents are for an association that I belong to - they contain a history of the association's documents - and my application allows you to search them.

Well, aside from your concerns about the second list, Lucene seems perfect for your needs. You'd parse apart the four big files into a bunch of small documents, then parse those small documents and create lucene Documents, containing Fields, and add them to the index.

> They are actually currently indexed by an application called 'Sonar', by Virginia Systems. But I REALLY didn't like using their user interface - blech - so I decided to write a new interface for my own use. But Sonar costs some real bucks to be able to develop against their search API, so I found Lucene, and decided to go with it.
>
> Here are the search features that 'Sonar' has :
>
> Boolean Searching
> Proximity Searching
> Wild Card Searching
> Field/Block Searching

I'm not sure what Field/Block means. Boolean, Proximity and WildCard are pretty typical in Lucene searches. You should probably take a look at the Query Parser syntax docs:

http://jakarta.apache.org/lucene/docs/queryparsersyntax.html

> Relevancy Ranking / Date Ranking

Lucene search results are typically ranked by relevance, and you can tweak the search to adjust this (there's a fair bit of discussion of this in the lucene-user archives; good keywords to look for are slop and boost). Sorting output by date might take some finesse. I haven't played with sorting by date, but I'd expect to handle that by directly instantiating a QueryTerm to indicate the date issues.

> List of Occurrences in Context

I assume here that you mean displaying the results with a little snapshot of the text around it. There have been discussions about how best to do this (often focused around highlighting the search terms in the displayed text) on the lucene-users list. Check the list archive.

> Phonetic Searching

I'd guess you need to build this one yourself, perhaps by using a soundex algorithm when indexing the original data files.

> Synonyms/Concepts

Likewise... you'd need to come up with some sort of ontology of synonyms and concepts, then parse the fields you're indexing and generate a synonym/concept field that you'd add to the lucene Document.

> Relational Searching
> Associated Words
> Drill Down Search Narrowing

I'm not sure what these three mean.

> I think that Lucene has all the features in the first group. How does it stack up against the second group ?

I'm afraid I haven't been too helpful here. Perhaps if you clarify what the above mean, folks can post about how to implement it in Lucene.

> I'm writing the whole thing in Swing, which has been time consuming, and so have invested quite a bit of time into this project. But I'm seeing the end of the tunnel, and want to make sure that I'm going down the right path before I spend too much more time on it.

It sounds like you ought to at least seriously consider using Lucene, if you can find or implement equivalent features, or decide you can live without them.

--
Steven J. Owens
[EMAIL PROTECTED]

"I'm going to make broad, sweeping generalizations and strong, declarative statements, because otherwise I'll be here all night and this document will be four times longer and much less fun to read. Take it all with a grain of salt." - Me at http://darksleep.com
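The synonym/concept field suggested in this message could be generated at indexing time along the following lines. This is a minimal sketch with a hypothetical hand-made concept table and a plain dictionary standing in for an index document; it does not use Lucene's Document or Field classes.

```python
# Hypothetical hand-made synonym/concept table; a real one would come from
# an ontology or thesaurus chosen for the collection.
CONCEPTS = {"gold": ["metal"], "silver": ["metal"], "car": ["automobile", "vehicle"]}

def concept_field(text):
    """Return the sorted, de-duplicated concepts for the words in text."""
    found = set()
    for word in text.lower().split():
        found.update(CONCEPTS.get(word.strip(".,;:!?"), []))
    return sorted(found)

# A plain dict stands in for an index document with named fields.
document = {"body": "Gold and silver prices rose."}
document["concepts"] = " ".join(concept_field(document["body"]))
print(document["concepts"])
```

Searching the extra field for "metal" would then match this document even though the word never appears in its body, which is the 'gold' might return 'metal' behaviour discussed elsewhere in the thread.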
Re: Lucene features
>> I am wondering if Lucene is the way to go for my project.
>
> Probably. Tell us a little about your project.

It's pretty basic. I'm just indexing 4 large text files, ranging up to 100MB in size. They don't ever change, and are on a CD-ROM. Each file contains a bunch of small documents. I just create one index for all 4 of them. These documents are for an association that I belong to - they contain a history of the association's documents - and my application allows you to search them.

They are actually currently indexed by an application called 'Sonar', by Virginia Systems. But I REALLY didn't like using their user interface - blech - so I decided to write a new interface for my own use. But Sonar costs some real bucks to be able to develop against their search API, so I found Lucene, and decided to go with it.

Here are the search features that 'Sonar' has :

Boolean Searching
Proximity Searching
Wild Card Searching
Field/Block Searching
Relevancy Ranking / Date Ranking
List of Occurrences in Context
Phonetic Searching
Synonyms/Concepts
Relational Searching
Associated Words
Drill Down Search Narrowing

I think that Lucene has all the features in the first group. How does it stack up against the second group ?

I'm writing the whole thing in Swing, which has been time consuming, and so have invested quite a bit of time into this project. But I'm seeing the end of the tunnel, and want to make sure that I'm going down the right path before I spend too much more time on it.

- Original Message -
From: Steven J. Owens [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, September 03, 2003 1:34 AM
Subject: Re: Lucene features
Lucene features
I am wondering if Lucene is the way to go for my project. I don't know what other search engines are available out there, and how Lucene stacks up against them. I am wondering if Lucene has a full set of searching features, comparable to what I might find in a reasonably priced commercial package. Anyone with a solid knowledge of Lucene care to make me feel warm and fuzzy about my decision so far to use Lucene ?
Re: Lucene features
On Wed, Sep 03, 2003 at 12:45:25AM -0400, Chris Sibert wrote:
> I am wondering if Lucene is the way to go for my project.

Probably. Tell us a little about your project.

> I don't know what other search engines are available out there,

Lucene isn't a search engine _application_, it's a search engine _API_. Lucene gives you what you need in order to build the search engine you want, instead of spending gobs of time trying to figure out the 10,000 options available for a search engine application, or trying to warp somebody else's ideas of what you need to meet what you really need.

> and how Lucene stacks up against them.

Pretty well, if you're willing to put a (very) little time and energy into building the application you need. I know. I've done it.

> I am wondering if Lucene has a full set of searching features, comparable to what I might find in a reasonably priced commercial package.

There is no comparison :-). Lucene is a fundamentally decent piece of technology. This puts it head and shoulders above most commercial packages.

Specifically, the Lucene search engine API is blindingly fast at searching and at indexing, and comes with several built-in packages to provide several of the commonly needed functions (like a web search engine style query language parser).

Additionally, a wide variety of people have been down this road and done a wide variety of things with Lucene, so you're likely to be able to find examples, in the Lucene sandbox or in the lucene-user archives, of how to do whatever it is you want to do.

> Anyone with a solid knowledge of Lucene care to make me feel warm and fuzzy about my decision so far to use Lucene ?

Tell us a little more about your project requirements and I'll tell you enough specifics to give you a warm and fuzzy feeling. Lucene isn't perfect for _everything_ (and anybody who claims that a given technology *is* perfect for _everything_ is lying). But it's quite good for a number of things.

--
Steven J. Owens
[EMAIL PROTECTED]

"I'm going to make broad, sweeping generalizations and strong, declarative statements, because otherwise I'll be here all night and this document will be four times longer and much less fun to read. Take it all with a grain of salt." - Me at http://darksleep.com