RE: LUCENE + EXCEPTION
Hi. I still have the exception. Even if I keep the servlet to a single instance (perhaps via the authentication process), and even though I made sure that Lucene's merge indexing is controlled by a single initiation point, the exception keeps popping up frequently, without any shared resources:

java.io.IOException: read past EOF
        at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:218)
        at org.apache.lucene.store.InputStream.readBytes(InputStream.java:61)
        at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:356)
        at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:323)
        at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:64)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:85)
        at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
        at org.apache.lucene.search.Hits.<init>(Hits.java:43)
        at org.apache.lucene.search.Searcher.search(Searcher.java:33)
        at org.apache.lucene.search.Searcher.search(Searcher.java:27)

Please help me. (I could not find any solution on the Lucene forum for this; maybe I am the only one with the issue.)

Karthik

-Original Message-
From: Chris Lamprecht [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 25, 2005 9:48 AM
To: Lucene Users List
Subject: Re: LUCENE + EXCEPTION

Hi Karthik,

If you are talking about SingleThreadModel (i.e. your servlet implements javax.servlet.SingleThreadModel), this does not guarantee that two different instances of your servlet won't be run at the same time. It only guarantees that each instance of your servlet will only be run by one thread at a time. See:

http://java.sun.com/j2ee/sdk_1.3/techdocs/api/javax/servlet/SingleThreadModel.html

If you are accessing a shared resource (a Lucene index), you'll have to prevent concurrent modifications by some means other than SingleThreadModel. I think they've finally deprecated SingleThreadModel in the latest (maybe not even out yet) servlet spec.
-chris

> On STANDALONE usage of UPDATE/DELETE/ADD of documents into the merge
> index, my code runs PERFECTLY without any problems.
>
> But when the same code is plugged into a WEBAPP on TOMCAT with a servlet
> running in SINGLE THREAD MODE, I frequently get the error below.

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: LUCENE + EXCEPTION
Hi Karthik,

If you are talking about SingleThreadModel (i.e. your servlet implements javax.servlet.SingleThreadModel), this does not guarantee that two different instances of your servlet won't be run at the same time. It only guarantees that each instance of your servlet will only be run by one thread at a time. See:

http://java.sun.com/j2ee/sdk_1.3/techdocs/api/javax/servlet/SingleThreadModel.html

If you are accessing a shared resource (a Lucene index), you'll have to prevent concurrent modifications by some means other than SingleThreadModel. I think they've finally deprecated SingleThreadModel in the latest (maybe not even out yet) servlet spec.

-chris

> On STANDALONE usage of UPDATE/DELETE/ADD of documents into the merge
> index, my code runs PERFECTLY without any problems.
>
> But when the same code is plugged into a WEBAPP on TOMCAT with a servlet
> running in SINGLE THREAD MODE, I frequently get the error below.
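If it helps, the distinction Chris describes can be sketched in plain Java: a single static lock shared by every servlet instance is one way to actually serialize index modifications across instances, which SingleThreadModel does not do. `IndexUpdater` and its counter below are illustrative stand-ins, not Lucene or servlet API:

```java
// Sketch: SingleThreadModel serializes threads only per servlet
// instance. A static lock shared by ALL instances in the JVM is one
// way to serialize index modifications across them. The updateIndex
// body and the counter are hypothetical stand-ins for real index code.
class IndexUpdater {
    // One lock for the whole JVM, no matter how many servlet
    // instances the container creates.
    private static final Object INDEX_LOCK = new Object();

    private static int updates = 0; // stands in for index mutations

    static void updateIndex() {
        synchronized (INDEX_LOCK) {
            // ... open IndexWriter, add/delete documents, close it ...
            updates++;
        }
    }

    static int updateCount() {
        return updates;
    }
}
```

Every caller, from any servlet instance, funnels through the same monitor, so writes to the index never overlap.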
LUCENE + EXCEPTION
Hi guys, apologies...

On STANDALONE usage of UPDATE/DELETE/ADD of documents into the merge index, my code runs PERFECTLY without any problems. But when the same code is plugged into a WEBAPP on TOMCAT with a servlet running in SINGLE THREAD MODE, I frequently get the error below:

java.io.IOException: read past EOF
        at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:218)
        at org.apache.lucene.store.InputStream.readBytes(InputStream.java:61)
        at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:356)
        at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:323)
        at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:64)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:85)
        at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
        at org.apache.lucene.search.Hits.<init>(Hits.java:43)
        at org.apache.lucene.search.Searcher.search(Searcher.java:33)
        at org.apache.lucene.search.Searcher.search(Searcher.java:27)

Somebody please tell me why this is happening.

OS = Gentoo
Java = JDK 1.4.2
Webapp = Tomcat
Lucene = 1.4.3

Thanks in advance,
Karthik

WITH WARM REGARDS
HAVE A NICE DAY
[ N.S.KARTHIK ]
Re: Sort Performance Problems across large dataset
Peter,

> Currently we can issue a simple search query and expect a response back in
> about 0.2 seconds (~3,000 results)

You may want to try something like the following (I do this in FishEye; it seems to be performant for moderately large field spaces). Use a custom HitCollector, and store all the matching doc ids in a java.util.BitSet. This will still give you your 0.2-second performance. Then use a TermDocs iterator to visit each term in your "species name" field, "printing out" (or whatever) each species name if it contains a doc id in your bitset. Something like this pseudocode:

    BitSet docs = doSearch(query); // 0.2 seconds
    TermEnum te = reader.terms(new Term("species-name", ""));
    TermDocs td = reader.termDocs();
    Term t = te.term();
    while (t != null && t.field().equals("species-name")) {
        td.seek(te);
        while (td.next()) {
            int docid = td.doc();
            if (docs.get(docid)) {
                System.out.println("match: " + docid);
                break; // try next term
            }
        }
        if (!te.next()) {
            break;
        }
        t = te.term();
    }
    te.close();
    td.close();

Now, with 2.3 million (or 4 million!) species names, I'm not sure how fast it will be to iterate through all the "species-name" termdocs. But I would be interested to find out; if you give this code a try, could you report back your results?

=Matt
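For anyone who wants to try the shape of this loop without an index handy, here is the same iteration sketched with plain java.util classes. The `postings` TreeMap is a hypothetical stand-in for the TermEnum/TermDocs pair (a TreeMap keeps terms in lexical order, as a TermEnum would), so the output falls out already alphabetized:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the term-scan above with mocked data structures. `postings`
// maps each "species-name" term to its doc ids, standing in for
// TermEnum/TermDocs; `hits` is the BitSet a custom HitCollector filled.
class TermScan {
    static List<String> matchingTerms(TreeMap<String, int[]> postings, BitSet hits) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            for (int docid : e.getValue()) {
                if (hits.get(docid)) {   // this term occurs in a matching doc
                    out.add(e.getKey()); // "print" the species name
                    break;               // try next term
                }
            }
        }
        return out; // already in alphabetical (term) order
    }
}
```

Because the terms are visited in sorted order, no separate sort of the result set is needed at all.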
Re: Sort Performance Problems across large dataset
On Jan 24, 2005, at 7:01 PM, Peter Hollas wrote:

> I am working on a public accessible Struts based

Well, there's the problem right there :)) (just kidding)

> To sort the resultset into alphabetical order, we added the species names
> as a separate keyword field, and sorted using it whilst querying. This
> solution works fine, but is unacceptable since a query that returns
> thousands of results can take upwards of 30 seconds to sort them.

30 seconds... wow.

> My question is whether it is possible to somehow return the names in
> alphabetical order without using a String SortField. My last resort will
> be to perform a monthly index rebuild, and return results by index order
> (about a day to re-index!). But ideally there might be a way to modify the
> Lucene API to incorporate a scoring system in a way that scores by lexical
> order.

What about assigning each document a numeric value field with the number indicating the alphabetical ordering? Off the top of my head, I'm not sure how this could be done, but perhaps some clever hashing algorithm could do it? Or consider each character position one digit in base 26 (or 27 to include a space) and construct a number from that (though that would be an enormous number, probably too large) - sorry, my off-the-cuff estimating skills are not what they should be.

Certainly sorting by a numeric value is far less resource-intensive than sorting by String - so perhaps that is worth a try? At the very least, give each document a random number and try sorting by that field (the value of the field can be Integer.toString()) to see how it compares performance-wise.

Erik
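Erik's numeric-field idea does not even need clever hashing if the full name list is available when the index is built: sort the names once, record each name's rank, and sort hits by that integer instead of the String. A plain java.util sketch of the idea (the ordinal map and method names are illustrative, not Lucene API):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Sketch of the numeric-sort idea: precompute each species name's rank
// in one global alphabetical sort, store that rank as an integer field
// at indexing time, and sort search results by the int.
class OrdinalSort {
    // One-time pass at index-build time: name -> alphabetical rank.
    static Map<String, Integer> buildOrdinals(Collection<String> allNames) {
        List<String> sorted = new ArrayList<>(new TreeSet<>(allNames));
        Map<String, Integer> ord = new HashMap<>();
        for (int i = 0; i < sorted.size(); i++) ord.put(sorted.get(i), i);
        return ord;
    }

    // At query time: sorting hits by their precomputed ordinal yields
    // alphabetical order at integer-comparison cost.
    static List<String> sortHits(List<String> hitNames, Map<String, Integer> ord) {
        List<String> out = new ArrayList<>(hitNames);
        out.sort(Comparator.comparingInt(ord::get));
        return out;
    }
}
```

The ranks would need recomputing when names are added, but for a monthly rebuild cycle that cost is paid once per rebuild rather than once per query.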
Re: Sort Performance Problems across large dataset
Hi Peter,

I just got on the list a few hours ago, and I am still reading the source code. I am not going to send this to the list. I would like to ask about the "0.2 sec" query time for 2 million fields: should it display only the first page (100 or so), not the whole 3,000 found? It is very fast, I agree. If the alphabetic index displays only a link, not the content, then it should not be very slow, since you only need to sort the part a user needs. Maybe display only the first "A" page, as with the regular scored results. Just my thought; it might not work for you. Do you store the Lucene index in the database or in a text file?

Best,
Sharon
LangPower Computing, Inc.
http://www.indexingonline.com

Peter Hollas <[EMAIL PROTECTED]> wrote:

> I am working on a publicly accessible Struts-based species database project
> where the number of species names is currently 2.3 million, and in the near
> future will be somewhere nearer 4 million (probably the largest there is).
> The species names are typically 1 to 7 words in length, and the broad
> requirement is to be able to do a fulltext search across them. It is also
> necessary to sort the results into alphabetical order by species name.
>
> Currently we can issue a simple search query and expect a response back in
> about 0.2 seconds (~3,000 results) with the Lucene index that we have
> built. Lucene gives a much more predictable and faster average query time
> than standard fulltext indexing with MySQL. This, however, returns results
> in score order, not alphabetically.
>
> To sort the resultset into alphabetical order, we added the species names
> as a separate keyword field, and sorted on it whilst querying. This
> solution works fine, but is unacceptable since a query that returns
> thousands of results can take upwards of 30 seconds to sort them.
>
> My question is whether it is possible to somehow return the names in
> alphabetical order without using a String SortField. My last resort will be
> to perform a monthly index rebuild and return results by index order (about
> a day to re-index!). But ideally there might be a way to modify the Lucene
> API to incorporate a scoring system that scores by lexical order.
>
> Any ideas are appreciated! Many thanks, Peter.
Re: Sort Performance Problems across large dataset
Hi,

Do you optimize the index? Did you try to implement your own hit collector?

Stefan

On 25.01.2005 at 01:01, Peter Hollas wrote:

> I am working on a publicly accessible Struts-based species database project
> where the number of species names is currently 2.3 million, and in the near
> future will be somewhere nearer 4 million (probably the largest there is).
> The species names are typically 1 to 7 words in length, and the broad
> requirement is to be able to do a fulltext search across them. It is also
> necessary to sort the results into alphabetical order by species name.
>
> Currently we can issue a simple search query and expect a response back in
> about 0.2 seconds (~3,000 results) with the Lucene index that we have
> built. Lucene gives a much more predictable and faster average query time
> than standard fulltext indexing with MySQL. This, however, returns results
> in score order, not alphabetically.
>
> To sort the resultset into alphabetical order, we added the species names
> as a separate keyword field, and sorted on it whilst querying. This
> solution works fine, but is unacceptable since a query that returns
> thousands of results can take upwards of 30 seconds to sort them.
>
> My question is whether it is possible to somehow return the names in
> alphabetical order without using a String SortField. My last resort will be
> to perform a monthly index rebuild and return results by index order (about
> a day to re-index!). But ideally there might be a way to modify the Lucene
> API to incorporate a scoring system that scores by lexical order.
>
> Any ideas are appreciated! Many thanks, Peter.

---
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
Sort Performance Problems across large dataset
I am working on a publicly accessible Struts-based species database project where the number of species names is currently 2.3 million, and in the near future will be somewhere nearer 4 million (probably the largest there is). The species names are typically 1 to 7 words in length, and the broad requirement is to be able to do a fulltext search across them. It is also necessary to sort the results into alphabetical order by species name.

Currently we can issue a simple search query and expect a response back in about 0.2 seconds (~3,000 results) with the Lucene index that we have built. Lucene gives a much more predictable and faster average query time than standard fulltext indexing with MySQL. This, however, returns results in score order, not alphabetically.

To sort the resultset into alphabetical order, we added the species names as a separate keyword field, and sorted on it whilst querying. This solution works fine, but is unacceptable since a query that returns thousands of results can take upwards of 30 seconds to sort them.

My question is whether it is possible to somehow return the names in alphabetical order without using a String SortField. My last resort will be to perform a monthly index rebuild and return results by index order (about a day to re-index!). But ideally there might be a way to modify the Lucene API to incorporate a scoring system that scores by lexical order.

Any ideas are appreciated! Many thanks, Peter.
Re: Duplicate hits using ParallelMultiSearcher
Agreed on the "set of unique messages"; however, the problem I have is with the "count" of the Hits. The Hits object may contain 100 results (for example), of which only 90 are unique. Because I am paging through results 10 at a time, I need to know the total count without loading each document. If I get a count of 100 but a collection of only 90, my paging breaks.

After careful consideration I have decided that the better approach is to create a separate "global" index in which all messages are stored. This will not only relieve my duplication issue but should also scale better if/when there are several hundred or several thousand distinct indexes.

Thanks,
- JP

- Original Message -
From: "PA" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Monday, January 24, 2005 10:43 PM
Subject: Re: Duplicate hits using ParallelMultiSearcher

On Jan 24, 2005, at 09:14, Jason Polites wrote:

> I am aware of the Filter object; however, the unique identifier of my
> document is a field within the Lucene document itself (messageid), and I am
> reluctant to access this field through the public API for every Hit, as I
> fear it will have drastic performance implications.

Well... I don't see any way around that, as you basically want to uniquely identify your messages based on their Message-ID. That said, you don't need to do it during the search itself. You could simply perform your search as you do now and then create a set of unique messages while preserving Lucene Hits sort ordering for "relevance" purposes.

HTH.

Cheers

--
PA
http://alt.textdrive.com/
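For anyone hitting the same counting problem, the unique total can be computed up front in one pass: a LinkedHashSet keyed on the message id deduplicates while preserving the Hits (relevance) order, and its size is the count the paging needs. A minimal sketch, assuming the message ids have already been extracted from the hits:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Sketch: deduplicate hits by message id while keeping the original
// (relevance) order, so the paging total matches the displayed results.
class DedupeHits {
    static List<String> uniqueInOrder(List<String> messageIds) {
        // LinkedHashSet drops duplicates but remembers insertion order.
        return new ArrayList<>(new LinkedHashSet<>(messageIds));
    }
}
```

The resulting list's size() is the true page-count denominator, and slicing it gives stable pages with no gaps.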
RE: Filtering w/ Multiple Terms
After re-reading the book (again) and the javadocs (again), it dawned on my little brain that I needed a doc and freq array *the size of maxDoc* for the index reader. I also needed to iterate through the docs array and call bitSet.set for each (valid) entry in docs. Everything is good now. Thanks!

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS 66219
(913) 577-1496
[EMAIL PROTECTED]

> -Original Message-
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Monday, January 24, 2005 1:27 PM
> To: Lucene Users List
> Subject: Re: Filtering w/ Multiple Terms
>
> As Paul suggested, output the Lucene document numbers from your Hits,
> and also output which bit you're setting in your filter. Do those sets
> overlap?
>
> Erik
>
> On Jan 24, 2005, at 2:13 PM, Jerry Jalenak wrote:
>
> > Paul / Erik -
> >
> > I'm using the ParallelMultiSearcher to search three indexes
> > concurrently - hence the three entries into AccountFilter. If I remove
> > the filter from my query, and simply enter the query on the command
> > line, I get two hits back. In other words, I can enter this:
> >
> > smith AND (account:0011)
> >
> > and get hits back. When I add the filter back in (which should take
> > care of the account:0011 part of the query), and enter only smith as
> > my query, I get 0 hits.
> >
> > Jerry Jalenak
> > Senior Programmer / Analyst, Web Publishing
> > LabOne, Inc.
> >
> >> -Original Message-
> >> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> >> Sent: Monday, January 24, 2005 1:07 PM
> >> To: Lucene Users List
> >> Subject: Re: Filtering w/ Multiple Terms
> >>
> >> On Jan 24, 2005, at 12:26 PM, Jerry Jalenak wrote:
> >>
> >>> I spent some time reading the Lucene in Action book this weekend
> >>> (great job, btw)
> >>
> >> Thanks!
> >>
> >>> public class AccountFilter extends Filter
> >>>
> >>> I see where the AccountFilter is setting the corresponding 'bits',
> >>> but I end up without any 'hits':
> >>>
> >>> Entering AccountFilter...
> >>> Entering AccountFilter...
> >>> Entering AccountFilter...
> >>> Setting bit on
> >>> Setting bit on
> >>> Setting bit on
> >>> Setting bit on
> >>> Setting bit on
> >>> Leaving AccountFilter...
> >>> Leaving AccountFilter...
> >>> Leaving AccountFilter...
> >>> ... Found 0 matching documents in 1000 ms
> >>>
> >>> Can anyone tell me what I've done wrong?
> >>
> >> A filter constrains which documents will be consulted during a
> >> search, but the Query needs to match some documents that are turned
> >> on by the filter bits. I'm guessing that your Query did not match
> >> any of the documents you turned on.
> >>
> >> Erik

This transmission (and any information attached to it) may be confidential and is intended solely for the use of the individual or entity to which it is addressed. If you are not the intended recipient or the person responsible for delivering the transmission to the intended recipient, be advised that you have received this transmission in error and that any use, dissemination, forwarding, printing, or copying of this information is strictly prohibited. If you have received this transmission in error, please immediately notify LabOne at the following email address: [EMAIL PROTECTED]
Re: WordNet code updated, now with query expansion -- Re: SYNONYM + GOOGLE
Pierrick Brihaye wrote:

> Hi,
>
> David Spencer a écrit :
>
>> One example of expansion with the synonym boost set to 0.9 is the query
>> "big dog" expands to:
>
> Interesting. Do you plan to add expansion on other WordNet relationships?
> Hypernyms and hyponyms would be a good starting point for thesaurus-like
> search, wouldn't they?

Good point, I hadn't considered this - but how would it work? Just treat these two relationships as "synonyms" (thus easier to use), or keep them separate (too academic?)?

> However, I'm afraid that this kind of feature would require refactoring,
> probably based on WordNet-dedicated libraries. JWNL
> (http://jwordnet.sourceforge.net/) may be a good candidate for this.

Good point; we should leverage existing code.

> Thank you for your work.

thx,
Dave
Re: Filtering w/ Multiple Terms
As Paul suggested, output the Lucene document numbers from your Hits, and also output which bit you're setting in your filter. Do those sets overlap?

Erik

On Jan 24, 2005, at 2:13 PM, Jerry Jalenak wrote:

> Paul / Erik -
>
> I'm using the ParallelMultiSearcher to search three indexes concurrently -
> hence the three entries into AccountFilter. If I remove the filter from my
> query, and simply enter the query on the command line, I get two hits back.
> In other words, I can enter this:
>
> smith AND (account:0011)
>
> and get hits back. When I add the filter back in (which should take care of
> the account:0011 part of the query), and enter only smith as my query, I
> get 0 hits.
>
> Jerry Jalenak
> Senior Programmer / Analyst, Web Publishing
> LabOne, Inc.
> 10101 Renner Blvd.
> Lenexa, KS 66219
> (913) 577-1496
> [EMAIL PROTECTED]
>
> > -Original Message-
> > From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> > Sent: Monday, January 24, 2005 1:07 PM
> > To: Lucene Users List
> > Subject: Re: Filtering w/ Multiple Terms
> >
> > On Jan 24, 2005, at 12:26 PM, Jerry Jalenak wrote:
> >
> > > I spent some time reading the Lucene in Action book this weekend
> > > (great job, btw)
> >
> > Thanks!
> >
> > > public class AccountFilter extends Filter
> > >
> > > I see where the AccountFilter is setting the corresponding 'bits',
> > > but I end up without any 'hits':
> > >
> > > Entering AccountFilter...
> > > Entering AccountFilter...
> > > Entering AccountFilter...
> > > Setting bit on
> > > Setting bit on
> > > Setting bit on
> > > Setting bit on
> > > Setting bit on
> > > Leaving AccountFilter...
> > > Leaving AccountFilter...
> > > Leaving AccountFilter...
> > > ... Found 0 matching documents in 1000 ms
> > >
> > > Can anyone tell me what I've done wrong?
> >
> > A filter constrains which documents will be consulted during a search,
> > but the Query needs to match some documents that are turned on by the
> > filter bits. I'm guessing that your Query did not match any of the
> > documents you turned on.
> >
> > Erik
RE: Filtering w/ Multiple Terms
Paul / Erik -

I'm using the ParallelMultiSearcher to search three indexes concurrently - hence the three entries into AccountFilter. If I remove the filter from my query, and simply enter the query on the command line, I get two hits back. In other words, I can enter this:

smith AND (account:0011)

and get hits back. When I add the filter back in (which should take care of the account:0011 part of the query), and enter only smith as my query, I get 0 hits.

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS 66219
(913) 577-1496
[EMAIL PROTECTED]

> -Original Message-
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Monday, January 24, 2005 1:07 PM
> To: Lucene Users List
> Subject: Re: Filtering w/ Multiple Terms
>
> On Jan 24, 2005, at 12:26 PM, Jerry Jalenak wrote:
>
> > I spent some time reading the Lucene in Action book this weekend
> > (great job, btw)
>
> Thanks!
>
> > public class AccountFilter extends Filter
> >
> > I see where the AccountFilter is setting the corresponding 'bits', but
> > I end up without any 'hits':
> >
> > Entering AccountFilter...
> > Entering AccountFilter...
> > Entering AccountFilter...
> > Setting bit on
> > Setting bit on
> > Setting bit on
> > Setting bit on
> > Setting bit on
> > Leaving AccountFilter...
> > Leaving AccountFilter...
> > Leaving AccountFilter...
> > ... Found 0 matching documents in 1000 ms
> >
> > Can anyone tell me what I've done wrong?
>
> A filter constrains which documents will be consulted during a search,
> but the Query needs to match some documents that are turned on by the
> filter bits. I'm guessing that your Query did not match any of the
> documents you turned on.
>
> Erik
Re: Filtering w/ Multiple Terms
On Jan 24, 2005, at 12:26 PM, Jerry Jalenak wrote:

> I spent some time reading the Lucene in Action book this weekend (great
> job, btw)

Thanks!

> public class AccountFilter extends Filter
>
> I see where the AccountFilter is setting the corresponding 'bits', but I
> end up without any 'hits':
>
> Entering AccountFilter...
> Entering AccountFilter...
> Entering AccountFilter...
> Setting bit on
> Setting bit on
> Setting bit on
> Setting bit on
> Setting bit on
> Leaving AccountFilter...
> Leaving AccountFilter...
> Leaving AccountFilter...
> ... Found 0 matching documents in 1000 ms
>
> Can anyone tell me what I've done wrong?

A filter constrains which documents will be consulted during a search, but the Query needs to match some documents that are turned on by the filter bits. I'm guessing that your Query did not match any of the documents you turned on.

Erik
Re: Filtering w/ Multiple Terms
Jerry,

On Monday 24 January 2005 18:26, Jerry Jalenak wrote:

> I spent some time reading the Lucene in Action book this weekend (great
> job, btw), and came across the section on using custom filters. Since the
> data that I need to filter my hit set with comes from a database, I thought
> it would be worth my effort this morning to write a custom filter that
> would handle the filtering for me. So, using the example from the book
> (page 210), I've coded an AccountFilter:
>
> public class AccountFilter extends Filter
> {
>     public AccountFilter()
>     {}
>
>     public BitSet bits(IndexReader indexReader) throws IOException
>     {
>         System.out.println("Entering AccountFilter...");
>         BitSet bitSet = new BitSet(indexReader.maxDoc());
>
>         String[] reportingAccounts = new String[] {"0011", "4kfs"};
>
>         int[] docs = new int[1];
>         int[] freqs = new int[1];
>
>         for (int i = 0; i < reportingAccounts.length; i++)
>         {
>             String reportingAccount = reportingAccounts[i];
>             if (reportingAccount != null)
>             {
>                 TermDocs termDocs = indexReader.termDocs(new Term("account", reportingAccount));
>                 int count = termDocs.read(docs, freqs);
>                 if (count == 1)

Unless "account" is a primary key field, it's better to loop over the termdocs.

>                 {
>                     System.out.println("Setting bit on");
>                     bitSet.set(docs[0]);
>                 }
>             }
>         }
>         System.out.println("Leaving AccountFilter...");
>         return bitSet;
>     }
> }
>
> I see where the AccountFilter is setting the corresponding 'bits', but I
> end up without any 'hits':
>
> Entering AccountFilter...
> Entering AccountFilter...
> Entering AccountFilter...
> Setting bit on
> Setting bit on
> Setting bit on
> Setting bit on
> Setting bit on
> Leaving AccountFilter...
> Leaving AccountFilter...
> Leaving AccountFilter...

I don't see any recursion in your code, but this output suggests nesting three deep. Something does not add up here.

> ... Found 0 matching documents in 1000 ms
>
> Can anyone tell me what I've done wrong?

Maybe all query hits were filtered out? Could you compare the doc numbers in the bits of the filter with the unfiltered query hit doc numbers?

Regards,
Paul Elschot
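Paul's suggestion, looping over all the termdocs rather than doing a single read() into length-1 arrays (which can only ever set the first matching document's bit per account), looks roughly like this. The `postings` map is a mock standing in for IndexReader.termDocs(new Term("account", ...)), so the control flow can be checked without an index:

```java
import java.util.BitSet;
import java.util.Map;

// Sketch of the corrected filter loop: set a bit for EVERY document
// carrying each account term, not just the first one returned by a
// single read() into length-1 arrays. `postings` is a hypothetical
// stand-in for IndexReader.termDocs(...).
class AccountFilterSketch {
    static BitSet bits(Map<String, int[]> postings, String[] accounts, int maxDoc) {
        BitSet bitSet = new BitSet(maxDoc);
        for (String account : accounts) {
            int[] docs = postings.get(account);
            if (docs == null) continue; // term not in the index
            for (int doc : docs) {      // loop over all termdocs
                bitSet.set(doc);
            }
        }
        return bitSet;
    }
}
```

With the real API the inner loop would be `while (termDocs.next()) bitSet.set(termDocs.doc());` in place of iterating the mock array.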
Re: keep indexes as files or save them in database
> On Sun, 2005-01-23 at 22:09 -0800, Otis Gospodnetic wrote:
> > A number of people have tried putting Lucene indices in an RDBMS. As far
> > as I know, all were slower than FSDirectory.
>
> Do you know if the Berkeley DB back end also has a performance hit?

Try it - it all depends on how you configure it, and that depends on your needs. I posted examples to the list last week.

Andi..
RE: Filtering w/ Multiple Terms
I spent some time reading the Lucene in Action book this weekend (great job, btw), and came across the section on using custom filters. Since the data that I need to use to filter my hit set with comes from a database, I thought it would be worth my effort this morning to write a custom filter that would handle the filtering for me. So, using the example from the book (page 210), I've coded an AccountFilter: public class AccountFilter extends Filter { public AccountFilter() {} public BitSet bits(IndexReader indexReader) throws IOException { System.out.println("Entering AccountFilter..."); BitSet bitSet = new BitSet(indexReader.maxDoc()); String[] reportingAccounts = new String[] {"0011", "4kfs"}; int[] docs = new int[1]; int[] freqs = new int[1]; for (int i = 0; i < reportingAccounts.length; i++) { String reportingAccount = reportingAccounts[i]; if (reportingAccount != null) { TermDocs termDocs = indexReader.termDocs(new Term("account", reportingAccount)); int count = termDocs.read(docs, freqs); if (count == 1) { System.out.println("Setting bit on"); bitSet.set(docs[0]); } } } System.out.println("Leaving AccountFilter..."); return bitSet; } } I see where the AccountFilter is setting the cooresponding 'bits', but I end up without any 'hits': Entering AccountFilter... Entering AccountFilter... Entering AccountFilter... Setting bit on Setting bit on Setting bit on Setting bit on Setting bit on Leaving AccountFilter... Leaving AccountFilter... Leaving AccountFilter... ... Found 0 matching documents in 1000 ms Can anyone tell me what I've done wrong? Jerry Jalenak Senior Programmer / Analyst, Web Publishing LabOne, Inc. 10101 Renner Blvd. 
Lenexa, KS 66219 (913) 577-1496 [EMAIL PROTECTED] > -Original Message- > From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] > Sent: Friday, January 21, 2005 8:15 AM > To: Lucene Users List > Subject: RE: Filtering w/ Multiple Terms > > > This: > http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/se > arch/BooleanQuery.TooManyClauses.html > ? > > You can control that limit via > http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/se > arch/BooleanQuery.html#maxClauseCount > > Otis > > > --- Jerry Jalenak <[EMAIL PROTECTED]> wrote: > > > OK. But isn't there a limit on the number of > BooleanQueries that can > > be > > combined with AND / OR / etc? > > > > > > > > Jerry Jalenak > > Senior Programmer / Analyst, Web Publishing > > LabOne, Inc. > > 10101 Renner Blvd. > > Lenexa, KS 66219 > > (913) 577-1496 > > > > [EMAIL PROTECTED] > > > > > > > -Original Message- > > > From: Erik Hatcher [mailto:[EMAIL PROTECTED] > > > Sent: Thursday, January 20, 2005 5:05 PM > > > To: Lucene Users List > > > Subject: Re: Filtering w/ Multiple Terms > > > > > > > > > > > > On Jan 20, 2005, at 5:02 PM, Jerry Jalenak wrote: > > > > > > > In looking at the examples for filtering of hits, it looks > > > like I can > > > > only > > > > specify a single term; i.e. > > > > > > > > Filter f = new QueryFilter(new TermQuery(new > Term("acct", > > > > "acct1"))); > > > > > > > > I need to specify more than one term in my filter. Short of > > using > > > > something > > > > like ChainFilter, how are others handling this? > > > > > > You can make as complex of a Query as you want for > > > QueryFilter. If you > > > want to filter on multiple terms, construct a BooleanQuery > > > with nested > > > TermQuery's, either in an AND or OR fashion. 
> > >
> > > Erik
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> > This transmission (and any information attached to it) may be
> > confidential and is intended solely for the use of the individual or
> > entity to which it is addressed. If you are not the intended recipient
> > or the person responsible for delivering the transmission to the
> > intended recipient, be advised that you have received this transmission
> > in error and that any use, dissemination, forwarding, printing, or
> > copying of this information is strictly prohibited. If you have
> > received this transmission in error, please immediately notify LabOne
> > at the following email address: [EMAIL PROTECTED]
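[Editor's note] Erik's advice is to hand QueryFilter a BooleanQuery of nested TermQuerys. Away from the Lucene API, the set semantics that both that approach and a hand-rolled BitSet filter rely on can be sketched with plain Java (the posting data below is invented for illustration; note as well that reading postings through one-element arrays, as the AccountFilter above does, pulls at most one document per term, whereas a loop over all postings is needed):

```java
import java.util.*;

// Toy model (no Lucene dependency): per-term posting lists combined into
// BitSets, mirroring what QueryFilter does with a BooleanQuery of
// TermQuerys. The account values and doc ids are made up.
public class FilterBitsSketch {
    // Simulated postings: account value -> ids of documents containing it.
    static final Map<String, int[]> POSTINGS = new HashMap<>();
    static {
        POSTINGS.put("0011", new int[] {0, 2, 5});
        POSTINGS.put("4kfs", new int[] {2, 3});
    }

    // Bits for one term: iterate ALL postings, not just the first one.
    static BitSet termBits(int maxDoc, String account) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : POSTINGS.getOrDefault(account, new int[0])) {
            bits.set(doc);
        }
        return bits;
    }

    // OR of several terms (a BooleanQuery of optional clauses).
    static BitSet anyOf(int maxDoc, String... accounts) {
        BitSet bits = new BitSet(maxDoc);
        for (String account : accounts) {
            bits.or(termBits(maxDoc, account));
        }
        return bits;
    }

    // AND of several terms (a BooleanQuery of required clauses).
    static BitSet allOf(int maxDoc, String... accounts) {
        BitSet bits = new BitSet(maxDoc);
        bits.set(0, maxDoc);  // start from "all documents"
        for (String account : accounts) {
            bits.and(termBits(maxDoc, account));
        }
        return bits;
    }

    public static void main(String[] args) {
        System.out.println(anyOf(8, "0011", "4kfs")); // {0, 2, 3, 5}
        System.out.println(allOf(8, "0011", "4kfs")); // {2}
    }
}
```

As Otis's links note, the number of clauses in a BooleanQuery is capped (BooleanQuery.maxClauseCount, 1024 by default) but can be raised if a filter genuinely needs many terms.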
Re: Stemming
On Jan 24, 2005, at 7:24 AM, Kevin L. Cobb wrote:
> Do stemming algorithms take into consideration abbreviations too?

No, they don't. Adding abbreviations, aliases, synonyms, etc. is not stemming.

> And, the next logical question: if stemming does not take care of
> abbreviations, are there any solutions that include abbreviations
> inside or outside of Lucene?

Nothing built into Lucene does this, but the infrastructure allows it to be added in the form of a custom analysis step. There are two basic approaches: adding aliases at indexing time, or adding them at query time by expanding the query. I created some example analyzers in Lucene in Action (grab the source code from the site linked below) that demonstrate how this can be done using WordNet (and mock) synonym lookup. You could extrapolate this into looking up abbreviations and adding them into the token stream.

http://www.lucenebook.com/search?query=synonyms

	Erik
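[Editor's note] The index-time aliasing Erik describes boils down to emitting extra tokens alongside the originals. Stripped of Lucene's TokenFilter plumbing, the expansion step might look like this sketch (the abbreviation map is made up for illustration; in a real analyzer the injected tokens would be given a position increment of zero so they occupy the same position as the abbreviation):

```java
import java.util.*;

// Sketch of index-time abbreviation expansion: each token is emitted,
// and any known expansions are injected right after it.
public class AbbrevExpander {
    // Hypothetical abbreviation table; a real system might load this
    // from a database or WordNet-style resource.
    static final Map<String, List<String>> ABBREVS = new HashMap<>();
    static {
        ABBREVS.put("mg", Arrays.asList("milligrams"));
        ABBREVS.put("us", Arrays.asList("united", "states"));
        ABBREVS.put("vcr", Arrays.asList("video", "cassette", "recorder"));
    }

    static List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token);                      // keep the original token
            List<String> extra = ABBREVS.get(token.toLowerCase());
            if (extra != null) {
                out.addAll(extra);               // inject the expansion
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand(Arrays.asList("take", "5", "mg", "daily")));
        // [take, 5, mg, milligrams, daily]
    }
}
```

Doing the same expansion at query time instead (rewriting the user's query into an OR of the abbreviation and its expansion) keeps the index smaller at the cost of larger queries.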
RE: Stemming
Do stemming algorithms take into consideration abbreviations too? Some examples:

	mg  = milligrams
	US  = United States
	CD  = compact disc
	vcr = video cassette recorder

And, the next logical question: if stemming does not take care of abbreviations, are there any solutions that include abbreviations inside or outside of Lucene?

Thanks,

Kevin

> -----Original Message-----
> From: Chris Lamprecht [mailto:[EMAIL PROTECTED]
> Sent: Friday, January 21, 2005 5:51 PM
> To: Lucene Users List
> Subject: Re: Stemming
>
> Also if you can't wait, see page 2 of
> http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html
> or the LIA e-book ;)
>
> On Fri, 21 Jan 2005 09:27:42 -0500, Kevin L. Cobb <[EMAIL PROTECTED]> wrote:
> > OK, OK ... I'll buy the book. I guess it's about time since I am deeply
> > and forever in love with Lucene. Might as well take the final plunge.
> >
> > -----Original Message-----
> > From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> > Sent: Friday, January 21, 2005 9:12 AM
> > To: Lucene Users List
> > Subject: Re: Stemming
> >
> > Hi Kevin,
> >
> > Stemming is an optional operation and is done in the analysis step.
> > Lucene comes with a Porter stemmer and a Filter that you can use in an
> > Analyzer:
> >
> >     ./src/java/org/apache/lucene/analysis/PorterStemFilter.java
> >     ./src/java/org/apache/lucene/analysis/PorterStemmer.java
> >
> > You can find more about it here:
> > http://www.lucenebook.com/search?query=stemming
> >
> > You can also see mentions of SnowballAnalyzer in those search results,
> > and you can find an adapter for SnowballAnalyzer in Lucene Sandbox.
> >
> > Otis
> >
> > --- "Kevin L. Cobb" <[EMAIL PROTECTED]> wrote:
> > > I want to understand how Lucene uses stemming but can't find any
> > > documentation on the Lucene site. I'll continue to google, but hope
> > > that this list can help narrow my search. I have several questions
> > > on the subject currently, but hesitate to list them here since
> > > finding a good document on the subject may answer most of them.
> > >
> > > Thanks in advance for any pointers,
> > >
> > > Kevin
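[Editor's note] The Porter stemmer referenced above is a five-phase, measure-based suffix-rewriting algorithm; PorterStemFilter wires it into an Analyzer. Just to make the idea concrete, here is a drastically simplified suffix-stripping step (this is NOT the real Porter algorithm, only a toy to show what a stemmer does to tokens):

```java
// Toy suffix stripper illustrating the *concept* of stemming: map
// inflected forms toward a shared root. Lucene's PorterStemmer
// implements the real algorithm with far more careful conditions.
public class ToyStemmer {
    static String stem(String word) {
        String w = word.toLowerCase();
        // "queries" -> "query"
        if (w.endsWith("ies") && w.length() > 4) {
            return w.substring(0, w.length() - 3) + "y";
        }
        // "indexing" -> "index"
        if (w.endsWith("ing") && w.length() > 5) {
            return w.substring(0, w.length() - 3);
        }
        // "indexed" -> "index"
        if (w.endsWith("ed") && w.length() > 4) {
            return w.substring(0, w.length() - 2);
        }
        // "filters" -> "filter", but leave "class" alone
        if (w.endsWith("s") && !w.endsWith("ss")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("indexing")); // index
        System.out.println(stem("queries"));  // query
        System.out.println(stem("filters"));  // filter
    }
}
```

Note that a stemmer must be applied identically at index time and at query time, which is why it lives inside the Analyzer rather than in application code.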
Re: Duplicate hits using ParallelMultiSearcher
On Jan 24, 2005, at 09:14, Jason Polites wrote:
> I am aware of the Filter object, however the unique identifier of my
> document is a field within the lucene document itself (messageid); and
> I am reluctant to access this field using the public API for every Hit
> as I fear it will have drastic performance implications.

Well... I don't see any way around that, as you basically want to uniquely identify your messages based on their Message-ID. That said, you don't need to do it during the search itself. You could simply perform your search as you do now and then create a set of unique messages while preserving Lucene Hits sort ordering for "relevance" purposes.

HTH.

Cheers

--
PA
http://alt.textdrive.com/
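[Editor's note] PA's post-search suggestion (dedupe by message id while keeping the relevance order) amounts to a seen-set walk over the hits. A minimal sketch, using String[] {messageid, subject} pairs as a stand-in for Lucene Hits:

```java
import java.util.*;

// De-duplicate search results by message id while preserving the
// original (relevance-sorted) order. Each String[] is a stand-in for
// a Lucene hit: element 0 is the "messageid" field, element 1 the body.
public class DedupeHits {
    static List<String[]> dedupe(List<String[]> hits) {
        Set<String> seen = new HashSet<>();
        List<String[]> unique = new ArrayList<>();
        for (String[] hit : hits) {
            // Only the first (highest-scoring) occurrence of each
            // message id is kept; later duplicates are skipped.
            if (seen.add(hit[0])) {
                unique.add(hit);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String[]> hits = Arrays.asList(
            new String[] {"m1", "hello"},
            new String[] {"m2", "status"},
            new String[] {"m1", "hello"}); // same mail from another index
        System.out.println(dedupe(hits).size()); // 2
    }
}
```

The cost PA alludes to is real but bounded: only the messageid field of each hit's document needs to be read, and only for the page of results actually displayed.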
Re: keep indexes as files or save them in database
On Sun, 2005-01-23 at 22:09 -0800, Otis Gospodnetic wrote:
> A number of people have tried putting Lucene indices in RDBMS. As far
> as I know, all were slower than FSDirectory.

Do you know if the Berkeley DB back end also has a performance hit?

--
Miles Barr <[EMAIL PROTECTED]>
Runtime Collective Ltd.
Duplicate hits using ParallelMultiSearcher
Hello all,

I am looking for a strategy to exclude duplicate entries when searching multiple indexes which may contain the same document.

I have an email system which archives and indexes emails on a per-recipient basis, so each email recipient has their own index. In the case where the same email is delivered to more than one recipient, each recipient's index stores a record of effectively the same document.

Now, there is a requirement to perform a search across multiple indexes, for which I am using the ParallelMultiSearcher. The problem is that this results in duplicate entries in the Hits returned. I can easily transfer the results into some form of java.util.Set to guarantee uniqueness; however, I then have a problem with the length() of the Hits object returned.

Ideally I need a way of filtering the Hits based on a "no duplicates" rule. I am aware of the Filter object, however the unique identifier of my document is a field within the lucene document itself (messageid), and I am reluctant to access this field using the public API for every Hit as I fear it will have drastic performance implications.

The ideal solution for me would be to specify a field during the search which is guaranteed to be unique across the Hits returned. Anyone know of an elegant way to do this? Alternatively, is there a way I can de-dupe the list myself without loading every document?

Apologies for the length of this question.

P.S. The separation of indexes per-recipient is a mandatory requirement.