Re: Lucene features

2003-09-11 Thread Doug Cutting
Erik Hatcher wrote:
> Yes, you're right.  Getting the scores of a second query based on the
> scores of the first query is probably not trivial, but probably possible
> with Lucene.  And that combined with a QueryFilter would do the trick I
> suspect.  Somehow the scores of the first query could be remembered and
> used as a boost (or other type of factor) for the scores of the second
> query.

Why not just AND together the first and second query?  That way they're 
both incorporated in the ranking.  Filters are good when you don't want 
it to affect the ranking, and also when the first query is a criterion 
that you'll reuse for many queries (e.g., language=french), since the 
bit vectors can be cached (as by QueryFilter).
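
A minimal sketch of the difference between ANDing two queries and filtering by the first one. It is a toy model in plain Java, not Lucene's API: the score maps, the sample numbers, and the "sum the two scores" combination are assumptions made purely for illustration.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

class FilterVersusAnd {
    /** AND: a document must match both queries, and both scores feed the ranking. */
    static Map<Integer, Double> and(Map<Integer, Double> q1, Map<Integer, Double> q2) {
        Map<Integer, Double> out = new TreeMap<>();
        for (Map.Entry<Integer, Double> hit : q2.entrySet()) {
            Double first = q1.get(hit.getKey());
            if (first != null) {
                out.put(hit.getKey(), first + hit.getValue()); // toy combination: sum
            }
        }
        return out;
    }

    /** The reusable (cacheable) part of a filter: one bit per document matched by q1. */
    static BitSet toBits(Map<Integer, Double> q1) {
        BitSet bits = new BitSet();
        for (int doc : q1.keySet()) {
            bits.set(doc);
        }
        return bits;
    }

    /** Filter: q1 only decides membership; the scores come from q2 alone. */
    static Map<Integer, Double> filter(BitSet allowed, Map<Integer, Double> q2) {
        Map<Integer, Double> out = new TreeMap<>();
        for (Map.Entry<Integer, Double> hit : q2.entrySet()) {
            if (allowed.get(hit.getKey())) {
                out.put(hit.getKey(), hit.getValue());
            }
        }
        return out;
    }

    /** Document id with the highest score, or -1 for an empty hit list. */
    static int best(Map<Integer, Double> hits) {
        int bestDoc = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Integer, Double> hit : hits.entrySet()) {
            if (hit.getValue() > bestScore) {
                bestDoc = hit.getKey();
                bestScore = hit.getValue();
            }
        }
        return bestDoc;
    }

    /** Hypothetical scores of the first query (doc id -> score). */
    static Map<Integer, Double> sampleQ1() {
        Map<Integer, Double> q1 = new TreeMap<>();
        q1.put(1, 0.75);
        q1.put(2, 0.125);
        q1.put(3, 0.5);
        return q1;
    }

    /** Hypothetical scores of the second query (doc id -> score). */
    static Map<Integer, Double> sampleQ2() {
        Map<Integer, Double> q2 = new TreeMap<>();
        q2.put(1, 0.25);
        q2.put(2, 0.5);
        q2.put(4, 0.75);
        return q2;
    }

    public static void main(String[] args) {
        Map<Integer, Double> q1 = sampleQ1();
        Map<Integer, Double> q2 = sampleQ2();
        System.out.println("AND:    " + and(q1, q2));
        System.out.println("filter: " + filter(toBits(q1), q2));
    }
}
```

Both variants return the same two documents here, but the top hit differs: with AND the strong first-query score of document 1 counts, while with the filter only the second query ranks the subset.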

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene features

2003-09-11 Thread Leo Galambos
Doug Cutting wrote:

> Erik Hatcher wrote:
>
>> Yes, you're right.  Getting the scores of a second query based on the
>> scores of the first query is probably not trivial, but probably
>> possible with Lucene.  And that combined with a QueryFilter would do
>> the trick I suspect.  Somehow the scores of the first query could be
>> remembered and used as a boost (or other type of factor) for the
>> scores of the second query.
>
> Why not just AND together the first and second query?  That way
> they're both incorporated in the ranking.  Filters are good when you
> don't want it to affect the ranking, and also when the first query is
> a criterion that you'll reuse for many queries (e.g.,
> language=french), since the bit vectors can be cached (as by
> QueryFilter).


You probably missed the start of our discussion - we are talking about
this: q1 -> q2, which means NOT q1 OR q2, versus q2 -> q1, which
means q1 OR NOT q2. That is what causes the issue, and it also shows why
you cannot use the simple AND, because q1 AND q2 != NOT q1 OR q2 !=
q1 OR NOT q2.
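
The inequality above can be checked with a small truth table. This is only a sketch in plain Java; the class and method names are made up for the illustration.

```java
class ImplicationTable {
    /** Material implication: a -> b is the same as NOT a OR b. */
    static boolean implies(boolean a, boolean b) {
        return !a || b;
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        System.out.println("q1    q2    AND   q1->q2 q2->q1");
        for (boolean q1 : values) {
            for (boolean q2 : values) {
                // One row per combination of the two query outcomes.
                System.out.printf("%-5b %-5b %-5b %-6b %-6b%n",
                        q1, q2, q1 && q2, implies(q1, q2), implies(q2, q1));
            }
        }
    }
}
```

The three columns agree only when both queries match; in every other row at least one of the implications is true where AND is false, and the two implications disagree whenever exactly one query matches.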

Leo

BTW: I didn't see the logic formulas for many years, so it is without 
any guarantee ;-)





Re: Lucene features

2003-09-11 Thread Leo Galambos
Doug Cutting wrote:

> I have some extensions to Lucene that I've not yet committed which make
> it possible to easily define synthetic IndexReaders (not currently
> supported).  So you could do things that way, once I check these in.
> But is this really better than just ANDing the clauses together?  It
> would take some big experiments to know, but my guess is that it
> doesn't make much difference to compute a local IDF for such things.


In this case, I think that the operator would be evaluated as an
implication and not AND (= 1 - (((1-q1)^p + (1-q2)^p) / 2)^(1/p)).
Obviously, you have to use a filter to filter out false hits (in the case
of q1 -> q2, the formula is true when q1 is false, which is not what you
really need), but that is not an issue with the auxiliary index. On the
other hand, this is only a feeling and it needs a test; you are right.
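
For reference, the expression in parentheses has the form of the two-operand, equally weighted AND of the p-norm extended Boolean model. Written out in LaTeX, together with the corresponding OR and with the implication expressed as NOT q1 OR q2 (taking NOT q as 1 - q), the formulas would read as below. Reading the implication this way is an assumption about what is meant here, not something the message spells out.

```latex
\begin{align*}
\mathrm{sim}(q_1 \wedge_p q_2) &= 1 - \left( \frac{(1-q_1)^p + (1-q_2)^p}{2} \right)^{1/p} \\
\mathrm{sim}(q_1 \vee_p q_2)   &= \left( \frac{q_1^p + q_2^p}{2} \right)^{1/p} \\
\mathrm{sim}(q_1 \rightarrow q_2) = \mathrm{sim}(\neg q_1 \vee_p q_2)
                               &= \left( \frac{(1-q_1)^p + q_2^p}{2} \right)^{1/p}
\end{align*}
```

With q1 = 0 the last expression is at least (1/2)^(1/p), so documents that do not match q1 at all still get a non-zero score, which is why a filter on q1 is needed on top of it.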

Leo





Re: Lucene features

2003-09-07 Thread Leo Galambos
Erik Hatcher wrote:

> On Friday, September 5, 2003, at 07:45  PM, Leo Galambos wrote:
>
>>> And for the second time today QueryFilter.  It allows narrowing
>>> the documents queried to only the documents from a previous Query.
>>
>> I guess, it would not be an ideal solution - the first query does two
>> things a) it selects a subset from the corpus; b) it assigns a
>> relevance to each document of this subset. Your solution omits the
>> second point. It implies, the solution will not return good hit
>> lists, because you will not consider the information value of the
>> first query which was given to you by a user.
>
> Yes, you're right.  Getting the scores of a second query based on the
> scores of the first query is probably not trivial, but probably
> possible with Lucene.  And that combined with a QueryFilter would do
> the trick I suspect.  Somehow the scores of the first query could be
> remembered and used as a boost (or other type of factor) for the
> scores of the second query.


Well, I do not want to be a pessimist, but the boost vector is not a 
good solution due to CWI statistics. On the other hand, it is much 
better than the simple QueryFilter which, in fact, works as 0/1 boost.

Example: I use this notation: inverted_list_term:{list of W values, - 
denotes W=0, for 12 documents in a collection}
A:{23[16]--27}
B:{--[38]}
C:{18[2-]45239812}
If your first query is B, the subset of documents (denoted by brackets -
namely, the 3rd and 4th doc) is selected, and if your second query is A
C, then you cannot use global IDFs, because in the subset, the IDF
factors are different. Globally, A is the better discriminator, but in
the subset, C is better. This fact is then reflected in the hit list you
generate, and I guess the quality will also be affected by it.
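
The effect can be reproduced with a few lines of plain Java. This is a sketch under stated assumptions: the lists for A and B are padded with zero weights to 12 documents (the archived example shows fewer positions, so the padding is a guess that matches the description), and IDF is taken as the textbook ln(N/df), not Lucene's own formula.

```java
import java.util.BitSet;

class LocalIdf {
    // Weights per document, 0 stands for "-". A and B are padded to 12 docs (assumption).
    static final int[] A = {2, 3, 1, 6, 0, 0, 0, 0, 0, 0, 2, 7};
    static final int[] B = {0, 0, 3, 8, 0, 0, 0, 0, 0, 0, 0, 0};
    static final int[] C = {1, 8, 2, 0, 4, 5, 2, 3, 9, 8, 1, 2};

    /** Documents (zero-based positions) in which the term has a non-zero weight. */
    static BitSet matching(int[] weights) {
        BitSet docs = new BitSet(weights.length);
        for (int i = 0; i < weights.length; i++) {
            if (weights[i] > 0) {
                docs.set(i);
            }
        }
        return docs;
    }

    /** The whole collection of n documents. */
    static BitSet all(int n) {
        BitSet docs = new BitSet(n);
        docs.set(0, n);
        return docs;
    }

    /** Document frequency of a term, counted only inside the given subset. */
    static int df(int[] weights, BitSet subset) {
        BitSet hits = matching(weights);
        hits.and(subset);
        return hits.cardinality();
    }

    /** Textbook IDF, ln(n / df); defined as 0 when the term does not occur at all. */
    static double idf(int n, int df) {
        return df == 0 ? 0.0 : Math.log((double) n / df);
    }

    public static void main(String[] args) {
        BitSet global = all(12);
        BitSet subset = matching(B); // result of the first query: 3rd and 4th document
        int n = subset.cardinality();
        System.out.println("global IDF  A=" + idf(12, df(A, global)) + "  C=" + idf(12, df(C, global)));
        System.out.println("subset IDF  A=" + idf(n, df(A, subset)) + "  C=" + idf(n, df(C, subset)));
    }
}
```

Globally A occurs in 6 of 12 documents and C in 11, so A gets the higher IDF; inside the two-document subset A occurs in both and C in one, so the order flips.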

The example shows, that you would rather export the subset to an 
auxiliary index (RAMDirectory?) and then use this structure instead of 
the original index. Obviously, it will solve the issue of speed you 
mentioned.

Unfortunately, I am not sure, if you can export the inverted lists when 
you read them. In egothor, I would use a listener in Rider class, in 
Lucene, I would have to rewrite some classes and it could be a real 
problem. Maybe, there is a solution I do not see...

Your turn ;-)
Cheers,
Leo
> Am I off base here?
>
>> Thus I think, Chris would implement something more complex than
>> QueryFilter. If not, the results will be poorer than with the
>> commercial packages he may get. He could use a different model where
>> AND is not an associative operator (i.e. some modification of the
>> extended Boolean model). It implies, he would implement it in
>> Similarity.java (if I remember that class name correctly).
>
> Right... but you'd still need the filtering capability as well, I
> would think - at least for performance reasons.
>
> Erik



Re: Lucene features

2003-09-06 Thread Erik Hatcher
On Friday, September 5, 2003, at 07:45  PM, Leo Galambos wrote:
>> And for the second time today QueryFilter.  It allows narrowing
>> the documents queried to only the documents from a previous Query.
>
> I guess, it would not be an ideal solution - the first query does two
> things a) it selects a subset from the corpus; b) it assigns a
> relevance to each document of this subset. Your solution omits the
> second point. It implies, the solution will not return good hit
> lists, because you will not consider the information value of the
> first query which was given to you by a user.

Yes, you're right.  Getting the scores of a second query based on the
scores of the first query is probably not trivial, but probably
possible with Lucene.  And that combined with a QueryFilter would do
the trick I suspect.  Somehow the scores of the first query could be
remembered and used as a boost (or other type of factor) for the
scores of the second query.

Am I off base here?

> Thus I think, Chris would implement something more complex than
> QueryFilter. If not, the results will be poorer than with the
> commercial packages he may get. He could use a different model where
> AND is not an associative operator (i.e. some modification of the
> extended Boolean model). It implies, he would implement it in
> Similarity.java (if I remember that class name correctly).

Right... but you'd still need the filtering capability as well, I would
think - at least for performance reasons.

	Erik



Re: Lucene features

2003-09-05 Thread Chris Sibert
I'm not sure what all of the 'advanced features' were also.

Phonetic Searching - probably not important to this application.

Synonym searching might be desirable, but now that I'm thinking about it,
also likely not important.

Associated Words - sounds very interesting, like 'gold' might return 'metal'
also, etc.

But Drill Down searching is very desirable. It's where you're able to search
within the results of a previous search. I'm assuming that I'll have to
implement that myself, by keeping a copy of the previous Hits list, and only
returning results that are in both lists.

Thanks very much for your reply.


Re: Lucene features

2003-09-05 Thread Erik Hatcher
On Friday, September 5, 2003, at 02:36  PM, Chris Sibert wrote:
> Synonym searching might be desirable, but now that I'm thinking about
> it, also likely not important.

This could be done with a custom Analyzer.

> Associated Words - sounds very interesting, like 'gold' might return
> 'metal' also, etc.

How is that different from Synonym searching?

> But Drill Down searching is very desirable. It's where you're able to
> search within the results of a previous search. I'm assuming that I'll
> have to implement that myself, by keeping a copy of the previous Hits
> list, and only returning results that are in both lists.

And for the second time today QueryFilter.  It allows narrowing the
documents queried to only the documents from a previous Query.

	Erik



Re: Lucene features

2003-09-05 Thread Andrzej Bialecki
Chris Sibert wrote:
> I'm not sure what all of the 'advanced features' were also.
>
> Phonetic Searching - probably not important to this application.

Phonetic searching may be achieved by writing your own Analyzer, which
instead of (or more probably, along with) the plain tokens provides their
phonetic codes, e.g. Double Metaphone for English, or the less useful
but more familiar Soundex. Phonetic searching increases recall but
lowers precision, especially if you use a stemmer before phonetic
encoding...

One trick to consider if using phonetic encoding is to keep around the
histogram of the original words that have been mapped to corresponding
phonetic codes. Then, if a query fails to provide satisfactory results,
you can provide a useful suggestion based on the most frequent term
found in the histogram with the same phonetic code as the term in the
query.
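
A rough sketch of the histogram trick in plain Java. It uses a hand-written American Soundex as a stand-in for Double Metaphone, and the class and method names (PhoneticSuggest, add, suggest) are hypothetical; in a real setup the codes would be produced by the Analyzer at indexing time.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class PhoneticSuggest {
    // Soundex digit for each letter A..Z; '0' marks vowels and other uncoded letters.
    private static final String CODES = "01230120022455012623010202";

    // phonetic code -> (original word -> number of occurrences)
    private final Map<String, Map<String, Integer>> histogram = new HashMap<>();

    /** American Soundex: first letter plus three digits, padded with zeros. */
    static String soundex(String word) {
        String s = word.toUpperCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        char prev = '0';
        for (int i = 0; i < s.length() && out.length() < 4; i++) {
            char c = s.charAt(i);
            if (c < 'A' || c > 'Z') {
                continue; // ignore anything that is not a plain ASCII letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // keep the first letter itself
                prev = code;
            } else if (c != 'H' && c != 'W') { // H and W do not separate equal codes
                if (code != '0' && code != prev) {
                    out.append(code);
                }
                prev = code;
            }
        }
        while (out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }

    /** Record one occurrence of an indexed word under its phonetic code. */
    void add(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        histogram.computeIfAbsent(soundex(w), k -> new HashMap<>()).merge(w, 1, Integer::sum);
    }

    /** Most frequent indexed word sharing the query term's code, or null if none. */
    String suggest(String queryTerm) {
        Map<String, Integer> counts = histogram.get(soundex(queryTerm));
        if (counts == null) {
            return null;
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) { // ties are resolved arbitrarily
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        PhoneticSuggest index = new PhoneticSuggest();
        for (String w : new String[] {"smith", "smith", "smyth", "robert"}) {
            index.add(w);
        }
        System.out.println("Did you mean: " + index.suggest("smitt"));
    }
}
```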

> Synonym searching might be desirable, but now that I'm thinking about it,
> also likely not important.
> Associated Words - sounds very interesting, like 'gold' might return 'metal'
> also, etc.

If you plan on using just English text, you may want to check the
excellent (and free) WordNet database, which also offers an API - both
for query expansion and for finding associated words (synsets?), or
hypernyms like in your example.

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)




Re: Lucene features

2003-09-05 Thread Leo Galambos

>> But Drill Down searching is very desirable. It's where you're able to
>> search within the results of a previous search. I'm assuming that I'll
>> have to implement that myself, by keeping a copy of the previous Hits
>> list, and only returning results that are in both lists.
>
> And for the second time today QueryFilter.  It allows narrowing
> the documents queried to only the documents from a previous Query.


I guess, it would not be an ideal solution - the first query does two 
things a) it selects a subset from the corpus; b) it assigns a relevance 
to each document of this subset. Your solution omits the second point. 
It implies, the solution will not return good hit lists, because you 
will not consider the information value of the first query which was 
given to you by a user.

For instance, neologism -> George Bush (1st -> 2nd query) would return a
different order of hits than George Bush -> neologism. Other
examples: Prague Berlin -> flight (I must go there, and I prefer an
airplane) versus flight -> Prague Berlin (I must fly, and I prefer
Berlin).

Thus I think, Chris would implement something more complex than 
QueryFilter. If not, the results will be poorer than with the commercial 
packages he may get. He could use a different model where AND is not 
an associative operator (i.e. some modification of the extended Boolean 
model). It implies, he would implement it in Similarity.java (if I 
remember that class name correctly).

Leo





Re: Lucene features

2003-09-04 Thread Steven J. Owens
On Wed, Sep 03, 2003 at 02:42:48PM -0400, Chris Sibert wrote:
> Lucene Users List [EMAIL PROTECTED]
>>> I am wondering if Lucene is the way to go for my project.
>> Probably.  Tell us a little about your project.
>
> It's pretty basic. I'm just indexing 4 large text files, ranging up to 100MB
> in size. They don't ever change, and are on a CD-ROM. Each file contains a
> bunch of small documents. I just create one index for all 4 of them. These
> documents are for an association that I belong to - they contain a history
> of the association's documents - and my application allows you to search
> them.

Well, aside from your concerns about the second list, Lucene
seems perfect for your needs.  You'd parse apart the four big files
into a bunch of small documents, then parse those small documents and
create lucene Documents, containing Fields, and add them to the index.

> They are actually currently indexed by an application called
> 'Sonar', by Virginia Systems. But I REALLY didn't like using their
> user interface - blech - so I decided to write a new interface for
> my own use. But Sonar costs some real bucks to be able to develop
> against their search API, so I found Lucene, and decided to go with
> it.
>
> Here are the search features that 'Sonar' has :
>   Boolean Searching
>   Proximity Searching
>   Wild Card Searching
>   Field/Block Searching

I'm not sure what Field/Block means.  Boolean, Proximity and
WildCard are pretty typical in Lucene searches.  You should probably
take a look at the Query Parser syntax docs:

http://jakarta.apache.org/lucene/docs/queryparsersyntax.html

>   Relevancy Ranking / Date Ranking

Lucene search results are typically ranked by relevance, and you
can tweak the search to adjust this (there's a fair bit of discussion
of this in the lucene-user archives; good keywords to look for are
slop and boost).

Sorting output by date might take some finesse.  I haven't played
with sorting by date, but I'd expect to handle that by directly
instantiating a QueryTerm to indicate the date issues.

>   List of Occurrences in Context

I assume here that you mean displaying the results with a little
snapshot of the text around it.  There have been discussions about how
best to do this (often focused around highlighting the search terms in
the displayed text) on the lucene-users list.  Check the list archive.

>   Phonetic Searching

I'd guess you need to build this one yourself, perhaps by using a
soundex algorithm when indexing the original data files.

>   Synonyms/Concepts

Likewise... you'd need to come up with some sort of ontology of
synonyms and concepts, then parse the fields you're indexing and
generate a synonym/concept field that you'd add to the lucene
Document.

>   Relational Searching
>   Associated Words
>   Drill Down Search Narrowing

I'm not sure what these three mean.

> I think that Lucene has all the features in the first group. How does it
> stack up against the second group ?

I'm afraid I haven't been too helpful here.  Perhaps if you
clarify what the above mean, folks can post about how to implement it
in Lucene.

> I'm writing the whole thing in Swing, which has been time consuming,
> and so have invested quite a bit of time into this project. But I'm
> seeing the end of the tunnel, and want to make sure that I'm going
> down the right path before I spend too much more time on it.

It sounds like you ought to at least seriously consider using
Lucene, if you can find or implement equivalent features, or decide
you can live without them.

-- 
Steven J. Owens
[EMAIL PROTECTED]

I'm going to make broad, sweeping generalizations and strong,
 declarative statements, because otherwise I'll be here all night and
 this document will be four times longer and much less fun to read.
 Take it all with a grain of salt. - Me at http://darksleep.com





Re: Lucene features

2003-09-03 Thread Chris Sibert
Lucene Users List [EMAIL PROTECTED]

>> I am wondering if Lucene is the way to go for my project.
>
> Probably.  Tell us a little about your project.

It's pretty basic. I'm just indexing 4 large text files, ranging up to 100MB
in size. They don't ever change, and are on a CD-ROM. Each file contains a
bunch of small documents. I just create one index for all 4 of them. These
documents are for an association that I belong to - they contain a history
of the association's documents - and my application allows you to search
them.

They are actually currently indexed by an application called 'Sonar', by
Virginia Systems. But I REALLY didn't like using their user interface -
blech - so I decided to
write a new interface for my own use. But Sonar costs some real bucks to be
able to develop against their search API, so I found Lucene, and decided to
go with it.

Here are the search features that 'Sonar' has :
  Boolean Searching
  Proximity Searching
  Wild Card Searching
  Field/Block Searching
  Relevancy Ranking / Date Ranking
  List of Occurrences in Context

  Phonetic Searching
  Synonyms/Concepts
  Relational Searching
  Associated Words
  Drill Down Search Narrowing

I think that Lucene has all the features in the first group. How does it
stack up against the second group ?

I'm writing the whole thing in Swing, which has been time consuming, and so
have invested quite a bit of time into this project. But I'm seeing the end
of the tunnel, and want to make sure that I'm going down the right path
before I spend too much more time on it.





Lucene features

2003-09-02 Thread Chris Sibert
I am wondering if Lucene is the way to go for my project. I don't know what
other search engines are available out there, and how Lucene stacks up
against them. I am wondering if Lucene has a full set of searching features,
comparable to what I might find in a reasonably priced commercial package.
Anyone with a solid knowledge of Lucene care to make me feel warm and fuzzy
about my decision so far to use Lucene ?





Re: Lucene features

2003-09-02 Thread Steven J. Owens
On Wed, Sep 03, 2003 at 12:45:25AM -0400, Chris Sibert wrote:
> I am wondering if Lucene is the way to go for my project.

Probably.  Tell us a little about your project.

> I don't know what other search engines are available out there,

Lucene isn't a search engine _application_, it's a search engine
_API_.  Lucene gives you what you need in order to build the search
engine you want, instead of spending gobs of time trying to figure out
the 10,000 options available for a search engine application, or
trying to warp somebody else's ideas of what you need to meet what you
really need.

> and how Lucene stacks up against them.

Pretty well, if you're willing to put a (very) little time and
energy into building the application you need.  I know.  I've done
it.

> I am wondering if Lucene has a full set of searching features,
> comparable to what I might find in a reasonably priced commercial
> package.

There is no comparison :-).  Lucene is a fundamentally decent
piece of technology.  This puts it head and shoulders above most
commercial packages.

Specifically, the Lucene search engine API is blindingly fast at
searching and at indexing, and comes with several built-in packages to
provide several of the commonly needed functions (like a web search
engine style query language parser).

Additionally, a wide variety of people have been down this road
and done a wide variety of things with Lucene, so you're likely to be
able to find examples, in the Lucene sandbox or in the lucene-user
archives, of how to do whatever it is you want to do.

> Anyone with a solid knowledge of Lucene care to make me feel warm
> and fuzzy about my decision so far to use Lucene ?

Tell us a little more about your project requirements and I'll
tell you enough specifics to give you a warm and fuzzy feeling.
Lucene isn't perfect for _everything_ (and anybody who claims that a
given technology *is* perfect for _everything_ is lying).  But it's
quite good for a number of things.

-- 
Steven J. Owens
[EMAIL PROTECTED]

I'm going to make broad, sweeping generalizations and strong,
 declarative statements, because otherwise I'll be here all night and
 this document will be four times longer and much less fun to read.
 Take it all with a grain of salt. - Me at http://darksleep.com

