Duplicate hits using ParallelMultiSearcher

2005-01-24 Thread Jason Polites
Hello all, I am looking for a strategy to exclude duplicate entries when searching multiple indexes which may contain the same document. I have an email system which archives and indexes emails on a per-recipient basis. So, each email recipient has their own index. In the case where the same

Re: keep indexes as files or save them in database

2005-01-24 Thread Miles Barr
On Sun, 2005-01-23 at 22:09 -0800, Otis Gospodnetic wrote: A number of people have tried putting Lucene indices in RDBMS. As far as I know, all were slower than FSDirectory. Do you know if the Berkeley DB back end also has a performance hit? -- Miles Barr [EMAIL PROTECTED] Runtime

Re: Duplicate hits using ParallelMultiSearcher

2005-01-24 Thread PA
On Jan 24, 2005, at 09:14, Jason Polites wrote: I am aware of the Filter object however the unique identifier of my document is a field within the lucene document itself (messageid); and I am reluctant to access this field using the public API for every Hit as I fear it will have drastic

RE: Stemming

2005-01-24 Thread Kevin L. Cobb
Do stemming algorithms take into consideration abbreviations too? Some examples: mg = milligrams; US = United States; CD = compact disc; vcr = video cassette recorder. And, the next logical question, if stemming does not take care of abbreviations, are there any solutions that include abbreviations

Re: Stemming

2005-01-24 Thread Erik Hatcher
On Jan 24, 2005, at 7:24 AM, Kevin L. Cobb wrote: Do stemming algorithms take into consideration abbreviations too? No, they don't. Adding abbreviations, aliases, synonyms, etc is not stemming. And, the next logical question, if stemming does not take care of abbreviations, are there any
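
Since abbreviations are not stemming, one common way to handle them is a lookup table applied to the token stream, separately from the stemmer. The sketch below is a minimal illustration of that idea in plain Java, not necessarily the solution recommended in this thread; `AbbreviationExpander`, `expand`, and the table contents are hypothetical names and data, and no Lucene API is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: expands known abbreviations in a token list.
final class AbbreviationExpander {
    // Keeps every original token and appends its long form when the table has one.
    static List<String> expand(List<String> tokens, Map<String, String> table) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token);
            String longForm = table.get(token);
            if (longForm != null) {
                out.add(longForm);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        table.put("mg", "milligrams");
        table.put("CD", "compact disc");
        System.out.println(expand(Arrays.asList("500", "mg"), table));
    }
}
```

Keeping the original token alongside its long form means a query for either spelling can match; whether to expand at index time or at query time is a separate design choice.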

RE: Filtering w/ Multiple Terms

2005-01-24 Thread Jerry Jalenak
I spent some time reading the Lucene in Action book this weekend (great job, btw), and came across the section on using custom filters. Since the data that I need to use to filter my hit set with comes from a database, I thought it would be worth my effort this morning to write a custom filter

Re: keep indexes as files or save them in database

2005-01-24 Thread Andi Vajda
On Sun, 2005-01-23 at 22:09 -0800, Otis Gospodnetic wrote: A number of people have tried putting Lucene indices in RDBMS. As far as I know, all were slower than FSDirectory. Do you know if the Berkeley DB back end also has a performance hit? Try it, it all depends on how you configure it. And

Re: Filtering w/ Multiple Terms

2005-01-24 Thread Paul Elschot
Jerry, On Monday 24 January 2005 18:26, Jerry Jalenak wrote: I spent some time reading the Lucene in Action book this weekend (great job, btw), and came across the section on using custom filters. Since the data that I need to use to filter my hit set with comes from a database, I thought it

Re: Filtering w/ Multiple Terms

2005-01-24 Thread Erik Hatcher
On Jan 24, 2005, at 12:26 PM, Jerry Jalenak wrote: I spent some time reading the Lucene in Action book this weekend (great job, btw) Thanks! public class AccountFilter extends Filter I see where the AccountFilter is setting the corresponding 'bits', but I end up without any 'hits': Entering

RE: Filtering w/ Multiple Terms

2005-01-24 Thread Jerry Jalenak
Paul / Erik - I'm using the ParallelMultiSearcher to search three indexes concurrently - hence the three entries into AccountFilter. If I remove the filter from my query, and simply enter the query on the command line, I get two hits back. In other words, I can enter this: smith AND

Re: Filtering w/ Multiple Terms

2005-01-24 Thread Erik Hatcher
As Paul suggested, output the Lucene document numbers from your Hits, and also output which bit you're setting in your filter. Do those sets overlap? Erik On Jan 24, 2005, at 2:13 PM, Jerry Jalenak wrote: Paul / Erik - I'm using the ParallelMultiSearcher to search three indexes

Re: WordNet code updated, now with query expansion -- Re: SYNONYM + GOOGLE

2005-01-24 Thread David Spencer
Pierrick Brihaye wrote: Hi, David Spencer wrote: One example of expansion with the synonym boost set to 0.9 is the query big dog expands to: Interesting. Do you plan to add expansion on other WordNet relationships? Hypernyms and hyponyms would be a good starting point for thesaurus-like

RE: Filtering w/ Multiple Terms

2005-01-24 Thread Jerry Jalenak
<sheepish-look-on-face/> After re-reading the book (again), and the javadocs (again), it dawned on my little brain that I needed to have a doc and freq array *the size of maxDocs* for the index reader. I also needed to iterate through the docs array and call bitSet.set for each entry in docs (that
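
The bit-setting step described above can be sketched without Lucene on the classpath. This is a minimal illustration using only `java.util.BitSet`: `maxDoc` stands in for the index reader's document count and `matchingDocs` for the document numbers found for the account terms. `AccountBits` and `buildBits` are hypothetical names, not part of Jerry's `AccountFilter` or the Lucene API.

```java
import java.util.BitSet;

// Hypothetical stand-in for the bit-setting step of a custom filter.
final class AccountBits {
    // maxDoc: size of the document-number space for one index reader.
    // matchingDocs: document numbers that matched one of the account terms.
    static BitSet buildBits(int maxDoc, int[] matchingDocs) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : matchingDocs) {
            // Ignore anything outside this reader's document-number range.
            if (doc >= 0 && doc < maxDoc) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        System.out.println(buildBits(10, new int[] {2, 5, 7}));
    }
}
```

When several indexes are searched together, document numbers are local to each reader, so a sketch like this would be applied once per index rather than once overall.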

Re: Duplicate hits using ParallelMultiSearcher

2005-01-24 Thread Jason Polites
Agreed on the set of unique messages, however the problem I have is with the count of the Hits. The Hits object may contain 100 results (for example), of which only 90 are unique. Because I am paging through results 10 at a time, I need to know the total count without loading each document.
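
One possible way to get a unique count without loading each stored document is to hold the message id for every document number in memory and count distinct values over the hit list. The sketch below assumes such an array has already been built; `UniqueHitCounter`, `countUnique`, and `messageIdByDoc` are hypothetical names, and nothing here is Lucene API.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: counts distinct message ids among a list of hits.
final class UniqueHitCounter {
    // hitDocs: document numbers of the hits, in any order.
    // messageIdByDoc[doc]: the messageid value for that document (assumed preloaded).
    static int countUnique(int[] hitDocs, String[] messageIdByDoc) {
        Set<String> seen = new HashSet<>();
        for (int doc : hitDocs) {
            seen.add(messageIdByDoc[doc]);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        String[] ids = {"a", "b", "a", "c"};
        System.out.println(countUnique(new int[] {0, 1, 2, 3}, ids));
    }
}
```

The trade-off is memory: the array costs one entry per document in the index, in exchange for a count that never touches stored fields.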

Sort Performance Problems across large dataset

2005-01-24 Thread Peter Hollas
I am working on a publicly accessible Struts-based species database project where the number of species names is currently at 2.3 million, and in the near future will be somewhere nearer 4 million (probably the largest there is). The species names are typically 1 to 7 words in length, and the

Re: Sort Performance Problems across large dataset

2005-01-24 Thread Stefan Groschupf
Hi, do you optimize the index? Have you tried implementing your own hit collector? Stefan On 25.01.2005 at 01:01, Peter Hollas wrote: I am working on a publicly accessible Struts-based species database project where the number of species names is currently at 2.3 million, and in the near future will

Re: Sort Performance Problems across large dataset

2005-01-24 Thread Xiaohong Yang (Sharon)
Hi Peter, I just got on the list a few hours ago. I am still reading the source code. I am not going to send this to the list. I would like to know the .2 sec query time for 2 million fields, should it display only the first page (100 or so), not the whole 3000 found? It is very fast I

Re: Sort Performance Problems across large dataset

2005-01-24 Thread Erik Hatcher
On Jan 24, 2005, at 7:01 PM, Peter Hollas wrote: I am working on a publicly accessible Struts-based Well there's the problem right there :)) (just kidding) To sort the resultset into alphabetical order, we added the species names as a separate keyword field, and sorted using it whilst querying.

Re: Sort Performance Problems across large dataset

2005-01-24 Thread Matt Quail
Peter, Currently we can issue a simple search query and expect a response back in about 0.2 seconds (~3,000 results) You may want to try something like the following (I do this in FishEye, seems to be performant for moderately large field-spaces). Use a custom HitCollector, and store all the

LUCENE + EXCEPTION

2005-01-24 Thread Karthik N S
Hi Guys Apologies.. On STANDALONE Usage of UPDATION/DELETION/ADDITION of Documents into MergerIndex, the Code of mine runs PERFECTLY without any Problems. But When the same Code is plugged into a WEBAPP on TOMCAT with a servlet Running in SINGLE THREAD MODE, Sometimes

Re: LUCENE + EXCEPTION

2005-01-24 Thread Chris Lamprecht
Hi Karthik, If you are talking about SingleThreadModel (i.e. your servlet implements javax.servlet.SingleThreadModel), this does not guarantee that two different instances of your servlet won't be run at the same time. It only guarantees that each instance of your servlet will only be run by one
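
Because separate servlet instances can still run at the same time, one way to serialize index updates is a lock that all instances share. The sketch below illustrates that idea in plain Java and is not a diagnosis of the exception in this thread; `IndexUpdateGuard`, `updateIndex`, and `runConcurrently` are hypothetical names, and the counter merely stands in for real index work.

```java
// Hypothetical guard: every caller goes through one shared lock.
final class IndexUpdateGuard {
    // Static, so it is shared by all instances loaded by the same class loader.
    private static final Object INDEX_LOCK = new Object();

    // Runs the given update while holding the shared lock.
    static void updateIndex(Runnable work) {
        synchronized (INDEX_LOCK) {
            work.run();
        }
    }

    // Demo helper: several threads bump a plain counter through the guard.
    static int runConcurrently(int threads, int updatesPerThread) {
        final int[] counter = {0};
        Thread[] pool = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            pool[i] = new Thread(() -> {
                for (int j = 0; j < updatesPerThread; j++) {
                    updateIndex(() -> counter[0]++);
                }
            });
            pool[i].start();
        }
        for (Thread t : pool) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(runConcurrently(4, 1000));
    }
}
```

A static lock only covers code loaded by one class loader in one JVM; a second web application or process touching the same index would need a different mechanism.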

RE: LUCENE + EXCEPTION

2005-01-24 Thread Karthik N S
Hi Ok Still I have the Exception in process, even If I try to have a Servlet Single Instance [may be by Authentication process], but I made sure that Lucene's MergerIndexing is controlled by single Initiation... But Without any Shared Resources the Exception is popping on Frequently,