RE: [Urgent] deleteDocuments fails after merging ...
>> Is that the same reader that is used in IndexSearcher?

I opened an IndexSearcher on the path (String) to the index. Now I tried to open it on the cloned IndexReader, using the constructor that takes an IndexReader as a parameter, and I got everything working. I just have two IndexSearchers open most of the time now, which is deprecated, but I think that's my only choice! Thank you!

__ Matt

-----Original Message-----
From: Antony Bowesman [mailto:[EMAIL PROTECTED]
Sent: Tuesday, March 13, 2007 10:13 PM
To: java-user@lucene.apache.org
Subject: Re: [Urgent] deleteDocuments fails after merging ...

Erick Erickson wrote:
> The javadocs point out that this line
>
>     int nb = mIndexReaderClone.deleteDocuments(urlTerm)
>
> removes *all* documents for a given term. So of course you'll fail
> to delete any documents the second time you call
> deleteDocuments with the same term.

Isn't the code snippet below doing a search before attempting the deletion, so from the IndexReader's point of view (as used by the IndexSearcher) the item exists?

What is mIndexReaderClone? Is that the same reader that is used in IndexSearcher? I'm not sure, but if you search with one IndexReader, delete the document using another IndexReader, and then repeat the process, I think the search would still return a hit but the deletion would return 0.
> On 3/13/07, DECAFFMEYER MATHIEU <[EMAIL PROTECTED]> wrote:
>>
>> Before I delete a document I search for it in the index to be sure there is a
>> hit (via a Term object).
>> When I find a hit I delete the document (with the same Term object):
>>
>>     Hits hits = search(query);
>>     if (hits.length() > 0) {
>>         if (hits.length() > 1) {
>>             System.out.println("found in the index with duplicates");
>>         }
>>         System.out.println("found in the index");
>>         try {
>>             int nb = mIndexReaderClone.deleteDocuments(urlTerm);
>>             if (nb > 0)
>>                 System.out.println("successfully deleted");
>>             else
>>                 throw new IOException("0 doc deleted");
>>         } catch (IOException e) {
>>             e.printStackTrace();
>>             throw new Exception(
>>                 Thread.currentThread().getName() + " --- Deleting

Antony

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
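For readers hitting the same symptom, here is a minimal sketch of the pattern this thread converges on: share ONE IndexReader between the IndexSearcher and the deletes, so the search and the delete see the same index state. The path, field name, and class name are invented for illustration, and the code assumes the Lucene 1.9/2.x API.

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class DeleteByTerm {
    public static void main(String[] args) throws IOException {
        IndexReader reader = IndexReader.open("/path/to/index");
        // Build the searcher on the SAME reader used for deletes,
        // not on a second reader opened from the path.
        IndexSearcher searcher = new IndexSearcher(reader);

        Term urlTerm = new Term("url", "http://example.com/page");
        Hits hits = searcher.search(new TermQuery(urlTerm));
        if (hits.length() > 0) {
            // deleteDocuments removes ALL documents containing the term,
            // so calling it twice with the same term returns 0 the second time.
            int nb = reader.deleteDocuments(urlTerm);
            System.out.println(nb + " document(s) deleted");
        }
        searcher.close();
        reader.close(); // commits the deletions to the index
    }
}
```

If the search went through one reader and the delete through another, the search would keep hitting stale data until its reader was reopened, which is the mismatch described above.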
Can we extract phrase from lucene index
Hello guys, I am using Lucene 1.9 and I have 3GB of index. I know we can extract tokens from the index easily, but can we extract phrases? Regards. Bhavin Pandya
ways to minimize index size?
Hi, I want to make my index as small as possible. I noticed field.setOmitNorms(true); I read on the list that the difference is 1 byte per field per doc, not huge, but hey... is the only effect the score being different? I hardly care about the score, so that would be OK. And can I add to an index without norms when it has previous docs with norms? Any other way to minimize the size of the index? All of my fields but one are Field.Store.NO, Field.Index.TOKENIZED and Field.TermVector.NO; one is Field.Store.YES, Field.Index.UN_TOKENIZED and Field.TermVector.NO. I tried compressing that one and the size is reduced around 1% (it's a small field), but I guess compression means worse performance so I am not sure about applying that. thanks
RE: IndexReader.GetTermFreqVectors
Yes, but what is a term vector?

-----Original Message-----
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: 13 March 2007 19:28
To: java-user@lucene.apache.org
Subject: Re: IndexReader.GetTermFreqVectors

It means it returns the term vectors for all the fields on that document where you have enabled TermVector when creating the Document, i.e. new Field(, TermVector.YES) (see http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.TermVector.html for the full array of options)

-Grant

On Mar 13, 2007, at 1:24 PM, Kainth, Sachin wrote:
> Hi all,
>
> The documentation for the above method mentions something called a
> vectorized field. Does anyone know what a vectorized field is?

-- Grant Ingersoll http://www.grantingersoll.com/ http://lucene.grantingersoll.com http://www.paperoftheweek.com/
Re: IndexReader.GetTermFreqVectors
From http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.TermVector.html: "A term vector is a list of the document's terms and their number of occurrences in that document."

-- Ian.

On 3/14/07, Kainth, Sachin <[EMAIL PROTECTED]> wrote:
> Yes, but what is a term vector?
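To make that definition concrete: a term vector is just a per-document mapping from each term to how often it occurs in that document. The following stand-alone sketch (plain Java, not the Lucene API) builds that kind of mapping for one document's text:

```java
import java.util.Map;
import java.util.TreeMap;

public class TermVectorSketch {
    // Build a term -> frequency map for one document's text, which is
    // essentially the information a Lucene term vector stores.
    static Map<String, Integer> termVector(String text) {
        Map<String, Integer> freqs = new TreeMap<String, Integer>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.length() == 0) continue;
            Integer n = freqs.get(term);
            freqs.put(term, n == null ? 1 : n + 1);
        }
        return freqs;
    }

    public static void main(String[] args) {
        System.out.println(termVector("the quick fox and the lazy dog"));
        // prints {and=1, dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

In Lucene itself you enable this per field with Field.TermVector.YES at indexing time and read it back with IndexReader.getTermFreqVectors(docNumber).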
Re: Can we extract phrase from lucene index
Your problem statement lends itself to flippant answers like "just use a PhraseQuery". So I clearly don't understand what you're trying to accomplish. Are you trying to find all of the occurrences of a particular phrase? All the phrases (however that's defined) for all the documents? What problem are you trying to solve?

Best
Erick

On 3/14/07, Bhavin Pandya <[EMAIL PROTECTED]> wrote:
> Hello guys, I am using lucene 1.9 and i have 3GB of index. I know we can
> extract tokens from index easily but can we extract phrase? Regards.
> Bhavin pandya
how to get approximate total matching
Hi. I have several index directories (more than 6), each gigabytes in size, and I search my query with a single IndexSearcher against the indexes one after another, i.e. I create one IndexSearcher for index1 and search over that; finally I close it and create a new IndexSearcher for index2, and so on. If I get 200 total results then I don't go on to search the other index directories; I print 200 results and exit from the search. I need to get the approximate total number of matching documents over all the indexes without actually searching the other indexes. Please suggest the easiest way to achieve this. P.S.: To avoid more memory usage and to reduce search time, I don't want to search my query through all indexes if I got 200 results. MultiSearcher creates OOM errors, so I'm using a single IndexSearcher. Thanks in advance, Senthil
Re: ways to minimize index size?
Store as little as possible, index as little as possible.

How big is your index, and how much do you expect it to grow? I ask this because it's probably not worth your time to try to reduce the index size below some threshold... I found that reducing my index from 8G to 4G (through not stemming) gave me about a 10% performance improvement, so at some point it's just not worth the effort. Also, if you posted the index size, it would give folks a chance to say "there's not much you can gain by reducing things more". As it is, I don't have a clue whether your index is 100M or 100T. The former is in the "don't waste your time" class, and the latter is... er... different.

I wouldn't bother compressing for 1%.

Question for "the guys" so I can check an assumption: is there any difference between these two?

Field(Name, Value, Store, Index)
Field(Name, Value, Store, Index, Field.TermVector.NO)

Best
Erick

On 3/14/07, jm <[EMAIL PROTECTED]> wrote:
> Hi, I want to make my index as small as possible. I noticed about
> field.setOmitNorms(true), I read in the list the diff is 1 byte per field
> per doc, not huge but hey... is the only effect the score being different?
> I hardly mind about the score so that would be ok. And can I add to an
> index without norms when it has previous doc with norms? Any other way to
> minimize size of index? Most of my fields but one are Field.Store.NO,
> Field.Index.TOKENIZED and Field.TermVector.NO, one is Field.Store.YES,
> Field.Index.UN_TOKENIZED and Field.TermVector.NO. I tried compressing that
> one and size is reduced around 1% (it's a small field), but I guess
> compression means worse performance so I am not sure about applying that.
> thanks
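For anyone following along, the norms switch discussed in this thread is set per Field before the document is added. A hypothetical sketch against the Lucene 2.x API (the field names and class name are invented), combining the field options from the original question with setOmitNorms(true):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class SmallIndexDoc {
    static Document makeDoc(String body, String id) {
        Document doc = new Document();

        // Indexed-only field: nothing stored, no term vectors.
        Field bodyField = new Field("body", body,
                Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.NO);
        bodyField.setOmitNorms(true); // saves 1 byte per field per doc;
                                      // disables length/boost normalization
        doc.add(bodyField);

        // Stored key field, not analyzed.
        Field idField = new Field("id", id,
                Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.NO);
        idField.setOmitNorms(true);
        doc.add(idField);
        return doc;
    }
}
```

The cost of omitting norms is only that field-length and index-time boosts stop influencing scores, which matches the "I hardly care about the score" requirement above.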
Re: how to get approximate total matching
How much memory are you allocating for your JVM? Because you're paying a huge search time penalty by opening and closing your searchers sequentially, it would be a good thing to not do this. But, as you say, if you're getting OOM errors, that's a problem. What is the total size of all your indexes? That would help folks give you better responses and perhaps suggest other ways of solving your problem.

Erick

On 3/14/07, senthil kumaran <[EMAIL PROTECTED]> wrote:
> Hi. I have more index directories (>6) all in GB, and searching my query
> with single IndexSearcher to all indexes one after another, i.e. I create
> one IndexSearcher for index1 and search over that. Finally I close that and
> create new IndexSearcher for index2 and so on. If i get 200 total results
> then i don't go to search other index directories and i print 200 results
> and exit from search. I need to get approximate total matching documents
> all over the indexes without going to search in other indexes. Please
> suggest me the easiest way to achieve this. P.S: To avoid more memory usage
> and to reduce search time I don't want to search my query through all
> indexes if i got 200 results. MultiSearcher creates OOM error, so that I'm
> using single IndexSearcher. Thanks in Advance Senthil
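Senthil's open/search/close loop with the 200-hit cutoff looks roughly like the sketch below (Lucene 2.x API; the directory list, cutoff constant, and method name are invented). It also shows why the total is only an approximation: once the loop exits early, the remaining indexes are never counted.

```java
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SequentialSearch {
    static final int ENOUGH = 200; // stop once we have this many hits

    static int searchAll(String[] indexDirs, Query query) throws Exception {
        int total = 0;
        for (String dir : indexDirs) {
            // One searcher at a time keeps memory low but costs time:
            // every iteration pays the full open/close overhead.
            IndexSearcher searcher = new IndexSearcher(dir);
            try {
                Hits hits = searcher.search(query);
                total += hits.length();
                if (total >= ENOUGH) break; // early exit: later indexes
                                            // are never searched
            } finally {
                searcher.close();
            }
        }
        return total; // an underestimate whenever we exited early
    }
}
```

A rough extrapolation (e.g. scaling the count from the indexes actually searched by the total index size) is about the best "approximate total" available without searching the rest.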
RE: ways to minimize index size?
> I found that reducing my index from 8G to 4G (through not stemming) gave
> me about a 10% performance improvement.

How did you do this? I don't see this as an option.

Jeff
Re: Can we extract phrase from lucene index
Hi Erick, what I am looking for is a dictionary for the spell checker. I am trying to customise the Lucene spell checker for phrases, so I'm thinking that if I am somehow able to fetch phrases from the index itself, then I can train my spellchecker. I tried with query logs but they have a lot of spelling mistakes... Any suggestions?

Thanks.
Bhavin Pandya

----- Original Message -----
From: "Erick Erickson" <[EMAIL PROTECTED]>
To: ; "Bhavin Pandya" <[EMAIL PROTECTED]>
Sent: Wednesday, March 14, 2007 6:29 PM
Subject: Re: Can we extract phrase from lucene index

> Your problem statement lends itself to flippant answers like "just use a
> PhraseQuery". So I clearly don't understand what you're trying to
> accomplish. Are you trying to find all of the occurrences of a particular
> phrase? All the phrases (however that's defined) for all the documents?
> What problem are you trying to solve?
>
> Best
> Erick
memory consumption on large indices
Do I have to keep something in mind to do searching on large indices? I actually have an index with a size of 1.8gb. I have indexed 1.5 million items from Amazon. How much memory do I have to give to the jvm? As a sidenote I have to tell you that I optimized the index so it's one segment file. Do I need to have 1.8gb memory available for the jvm? regards, -Dennis
Re: memory consumption on large indices
No, you don't need 1.8Gb of memory. Start with default and raise if you need to? Or jump straight in at about 512Mb. -- Ian. On 3/14/07, Dennis Berger <[EMAIL PROTECTED]> wrote: Do I have to keep something in mind to do searching on large indices? I actually have an index with a size of 1.8gb. I have indexed 1.5 million items from Amazon. How much memory do I have to give to the jvm? As a sidenote I have to tell you that I optimized the index so it's one segment file. Do I need to have 1.8gb memory available for the jvm? regards, -Dennis
Re: ways to minimize index size?
hi Erick,

Well, typically my application will start with some hundreds of indexes... and then grow at a rate of several per day, for ever. At some point I know I can do some merging etc if needed. Size is dependent on the customer, could be up to 1G per index. That is why I would like to minimize them. I am not worried about search performance.

I don't understand how not stemming can reduce the size of an index... I would think it happens the other way; doesn't stemming make the words shorter? (I don't stem, so I never looked into it)

thanks

On 3/14/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
> Store as little as possible, index as little as possible.
>
> How big is your index, and how much do you expect it to grow? I ask this
> because it's probably not worth your time to try to reduce the index size
> below some threshold... I found that reducing my index from 8G to 4G
> (through not stemming) gave me about a 10% performance improvement, so at
> some point it's just not worth the effort. Also, if you posted the index
> size, it would give folks a chance to say "there's not much you can gain by
> reducing things more". As it is, I don't have a clue whether your index is
> 100M or 100T. The former is in the "don't waste your time" class, and the
> latter is... er... different.
>
> I wouldn't bother compressing for 1%.
>
> Question for "the guys" so I can check an assumption: is there any
> difference between these two?
> Field(Name, Value, Store, Index)
> Field(Name, Value, Store, Index, Field.TermVector.NO)
>
> Best
> Erick
>
> On 3/14/07, jm <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> >
> > I want to make my index as small as possible. I noticed about
> > field.setOmitNorms(true), I read in the list the diff is 1 byte per
> > field per doc, not huge but hey... is the only effect the score being
> > different? I hardly mind about the score so that would be ok.
> >
> > And can I add to an index without norms when it has previous doc with
> > norms?
> >
> > Any other way to minimize size of index?
> > Most of my fields but one are Field.Store.NO, Field.Index.TOKENIZED and
> > Field.TermVector.NO, one is Field.Store.YES, Field.Index.UN_TOKENIZED and
> > Field.TermVector.NO. I tried compressing that one and size is reduced
> > around 1% (it's a small field), but I guess compression means worse
> > performance so I am not sure about applying that.
> >
> > thanks
Re: memory consumption on large indices
Ian Lea schrieb:
> No, you don't need 1.8Gb of memory. Start with default and raise if
> you need to?

How do I know when I need it?

> Or jump straight in at about 512Mb.
>
> --
> Ian.
>
> On 3/14/07, Dennis Berger <[EMAIL PROTECTED]> wrote:
>> Do I have to keep something in mind to do searching on large indices?
>> I actually have an index with a size of 1.8gb. I have indexed 1.5
>> million items from Amazon. How much memory do I have to give to the jvm?
>> As a sidenote I have to tell you that I optimized the index so it's one
>> segment file. Do I need to have 1.8gb memory available for the jvm?
>>
>> regards,
>> -Dennis

--
Dennis Berger
BSDSystems
Eduardstrasse 43b
20257 Hamburg
Phone: +49 (0)40 54 00 18 17
Mobile: +49 (0)179 123 15 09
E-Mail: [EMAIL PROTECTED]
Re: memory consumption on large indices
When your app gets a java.lang.OutOfMemory exception. -- Ian. On 3/14/07, Dennis Berger <[EMAIL PROTECTED]> wrote: Ian Lea schrieb: > No, you don't need 1.8Gb of memory. Start with default and raise if > you need to? how do I know when I need it? > Or jump straight in at about 512Mb. > > -- > Ian. > > On 3/14/07, Dennis Berger <[EMAIL PROTECTED]> wrote: >> Do I have to keep something in mind to do searching on large indices? >> I actually have an index with a size of 1.8gb. I have indexed 1.5 >> million items from Amazon. >> How much memory do I have to give to the jvm? >> As a sidenote I have to tell you that I optimized the index so it's one >> segment file. >> Do I need to have 1.8gb memory available for the jvm? >> >> regards, >> -Dennis
Re: Can we extract phrase from lucene index
14 mar 2007 kl. 14.51 skrev Bhavin Pandya:
> what i am looking for is dictionary for spell checker. I am trying to
> customise lucene spell checker for phrase. so thinking if anyhow i am able
> to fetch phrases from the index itself then i can train my spellchecker.
> I tried with query logs but it has lot of spell mistakes...

You can try this: https://issues.apache.org/jira/browse/LUCENE-626

-- karl
Re: memory consumption on large indices
I'm searching a 20GB index and my searching JVM is allocated 1 gig. However, my indexing app only had 384MB available to it, which means you can get away with far less. I believe certain index tables will need to be swapped in and out of memory though, so it may not search as quickly. With a 1.8 gig index you could try the JVM default (64MB) and see how it works.

Tim

Dennis Berger wrote:
> Do I have to keep something in mind to do searching on large indices? I
> actually have an index with a size of 1.8gb. I have indexed 1.5 million
> items from Amazon. How much memory do I have to give to the jvm? As a
> sidenote I have to tell you that I optimized the index so it's one segment
> file. Do I need to have 1.8gb memory available for the jvm?
>
> regards,
> -Dennis
Re: Wildcard searches with * or ? as the first character - Thanks
Thanks Steven and Antony. I read the FAQ not very long ago, but that slipped my attention. Or perhaps it's a recent change. - Øystein - -- Øystein Reigem, The department of culture, language and information technology (Aksis), Allegt 27, N-5007 Bergen, Norway. Tel: +47 55 58 32 42. Fax: +47 55 58 94 70. E-mail: <[EMAIL PROTECTED]>. Home tel: +47 56 14 06 11. Mobile: +47 97 16 96 64. Home e-mail: <[EMAIL PROTECTED]>.
Re: ways to minimize index size?
OK, I caused more confusion than I rendered help with my stemming statement. The only reason I mentioned it was to illustrate that performance is not linearly related to size.

It took some effort to put stemming into the index (see PorterStemmer etc.). This is NOT the default. So I took it out to see what the effect would be.

Why not stemming made things shorter: because we also have the requirement that phrases (i.e. words in double quotes) do NOT match the stemmed version. Thus if we index "running watching", the following searches have the indicated results:

run - hits
watch - hits
running - hits
"run watch" - does NOT hit
"running watching" - hits

So I indexed the following terms:

run running$ watch watching$

with the two forms of "run" indexed at the same position (0) and the two forms of "watch" at the same position (1). I agree that if we didn't have the exact-phrase-match requirement, the stemmed version of the index should be smaller.

Sorry for the confusion

Erick

On 3/14/07, jm <[EMAIL PROTECTED]> wrote:
> hi Erick,
>
> Well, typically my application will start with some hundreds of indexes...
> and then grow at a rate of several per day, for ever. At some point I know
> I can do some merging etc if needed. Size is dependent on the customer,
> could be up to 1G per index. That is why I would like to minimize them. I
> am not worried about search performance.
>
> I don't understand how not stemming can reduce the size of an index... I
> would think it happens the other way; doesn't stemming make the words
> shorter? (I don't stem, so I never looked into it)
>
> thanks
>
> On 3/14/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
> > Store as little as possible, index as little as possible.
> >
> > How big is your index, and how much do you expect it to grow?
> > I ask this because it's probably not worth your time to try to
> > reduce the index size below some threshold...
> > I found that reducing my index from 8G to 4G (through not stemming) gave
> > me about a 10% performance improvement, so at some point it's just not
> > worth the effort. Also, if you posted the index size, it would give folks
> > a chance to say "there's not much you can gain by reducing things more".
> > As it is, I don't have a clue whether your index is 100M or 100T. The
> > former is in the "don't waste your time" class, and the latter is... er...
> > different.
> >
> > I wouldn't bother compressing for 1%.
> >
> > Question for "the guys" so I can check an assumption:
> > is there any difference between these two?
> > Field(Name, Value, Store, Index)
> > Field(Name, Value, Store, Index, Field.TermVector.NO)
> >
> > Best
> > Erick
> >
> > On 3/14/07, jm <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi,
> > >
> > > I want to make my index as small as possible. I noticed about
> > > field.setOmitNorms(true), I read in the list the diff is 1 byte per
> > > field per doc, not huge but hey... is the only effect the score being
> > > different? I hardly mind about the score so that would be ok.
> > >
> > > And can I add to an index without norms when it has previous doc with
> > > norms?
> > >
> > > Any other way to minimize size of index? Most of my fields but one are
> > > Field.Store.NO, Field.Index.TOKENIZED and Field.TermVector.NO, one is
> > > Field.Store.YES, Field.Index.UN_TOKENIZED and Field.TermVector.NO. I
> > > tried compressing that one and size is reduced around 1% (it's a small
> > > field), but I guess compression means worse performance so I am not
> > > sure about applying that.
> > > thanks
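Erick's scheme of indexing the original and stemmed forms at the same token position hinges on a zero position increment. Below is a hypothetical TokenFilter sketch against the pre-2.9 Lucene TokenStream API; the "$" marker and the toy stem() method are stand-ins for whatever marking convention and real stemmer (e.g. PorterStemmer) an application actually uses:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Emits each original token and, when the stem differs, a marked stemmed
// form at the SAME position. Exact-phrase queries can then be pointed at
// one set of forms while single-term queries match either.
public class StemAtSamePositionFilter extends TokenFilter {
    private Token pending; // stemmed twin waiting to be emitted

    public StemAtSamePositionFilter(TokenStream in) {
        super(in);
    }

    public Token next() throws IOException {
        if (pending != null) {
            Token t = pending;
            pending = null;
            return t;
        }
        Token token = input.next();
        if (token == null) return null;
        String stem = stem(token.termText());
        if (!stem.equals(token.termText())) {
            pending = new Token(stem + "$", token.startOffset(), token.endOffset());
            pending.setPositionIncrement(0); // same position as the original
        }
        return token;
    }

    // Stand-in for a real stemmer such as PorterStemmer.
    private String stem(String word) {
        return word.endsWith("ing") ? word.substring(0, word.length() - 3) : word;
    }
}
```

With this, "running watching" indexes four terms across two positions, so position-sensitive phrase queries only see the forms you direct them at, which is the behavior the thread describes.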
RE: [Urgent] deleteDocuments fails after merging ...
: I just have two IndexSearchers opened now most of the time, which is
: deprecated,
: But I think that's my only choice !

2 searchers is fine ... it's "N" where N is not bound that you want to avoid.

from what i understand of your requirements, you don't *really* need two searchers open ... open a searcher to do whatever complex queries you need to get the docIds to delete, then delete them all, then close/reopen the searcher (and check that the deletes worked if you don't trust it)

the only real reason you need 2 searchers at a time is if you are searching other queries in parallel threads at the same time ... or if you are warming up one new searcher that's "on deck" while still serving queries with an older searcher.

-Hoss
Re: Performance between Filter and HitCollector?
it's kind of an Apples/Oranges comparison ... in the examples you gave below, one is executing an arbitrary query (which could be anything), the other is doing a simple TermEnumeration. Assuming that Query is a TermQuery, the Filter is theoretically going to be faster because it doesn't have to compute any Scores ... generally speaking a Filter will always be a little faster than a functionally equivalent Query for the purposes of building up a simple BitSet of matching documents, because the Query involves the score calculations ... but the Query is generally more usable.

The Query can also be more efficient in other ways, because the HitCollector doesn't *have* to build a BitSet; it can deal with the results in whatever way it wants (whereas a Filter always generates a BitSet).

Solr goes the HitCollector route for a few reasons:

1) allows us to use the DocSet abstraction which allows other performance benefits over straight BitSets

2) allows us to have simpler code that builds DocSets and DocLists (DocLists know about scores, sorting, and pagination) in a single pass when scores or sorting are requested.

-Hoss
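The distinction can be modeled in a few lines of plain Java (a toy model, not the real Lucene classes): the filter path must materialize a BitSet, while the collector path is just a callback that may do whatever it likes with each matching document.

```java
import java.util.BitSet;

public class FilterVsCollector {
    // Toy stand-in for Lucene's HitCollector: a callback per matching doc.
    interface HitCollector {
        void collect(int doc, float score);
    }

    // Filter-style: always produces a BitSet of matching doc ids.
    static BitSet filterMatches(int maxDoc, int[] matchingDocs) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : matchingDocs) bits.set(doc);
        return bits;
    }

    // Collector-style: pushes each hit to a callback; no BitSet required,
    // so the caller can count, sort, or accumulate in any structure.
    static void collectMatches(int[] matchingDocs, HitCollector hc) {
        for (int doc : matchingDocs) hc.collect(doc, 1.0f);
    }

    public static void main(String[] args) {
        int[] docs = {2, 5, 7};
        BitSet bits = filterMatches(10, docs);
        System.out.println(bits.cardinality()); // prints 3

        final int[] count = {0};
        collectMatches(docs, new HitCollector() {
            public void collect(int doc, float score) { count[0]++; }
        });
        System.out.println(count[0]); // prints 3
    }
}
```

The collector version never allocates storage proportional to the index size, which is the flexibility Hoss points to; the filter version's fixed BitSet is what makes it cacheable and score-free.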
Search vs. Rank
Most search engine technologies return result sets based on some weighted frequency of the search terms found. I've got a new problem: I want to rank by different criteria than I searched for. For example, I might want to return as my result set all documents that contain the word pizza, but rank them according to topping preferences (with garlic at the top and goat cheese at the bottom).

Two questions.

1) Does Lucene allow one to mandatorily search for a term but give it zero weight, while allowing other terms to have zero influence on the result set but affect their order? I'm thinking something like +pizza^0 garlic^1 "goat cheese"^-1. The concern is that I don't want any results that happen to mention garlic or goat cheese except in the context of pizza.

2) Once I have this list of results, can I change their rank order without having to do a full scale search again?

-wls
Re: Search vs. Rank
: I'm thinking something like +pizza^0 garlic^1 "goat cheese"^-1

that does in fact work.

: 2) Once I have this list of results, can I change their rank order without
: having to do a full scale search again?

the frequency of "pizza" won't affect the score at all, so you shouldn't need to do much to change the order ... but you can implement your own custom Sort to get any order you want.

An alternate approach is to use a Filter to define the super set of all things you are interested in (ie: "pizza") and then execute a Query against that Filter that matches all docs, scoring the ones you care the most about higher.

-Hoss
Re: Performance between Filter and HitCollector?
just to complete this fine answer: there is also the Matcher patch (https://issues.apache.org/jira/browse/LUCENE-584) that could bring the best of both worlds via e.g. a ConstantScoringQuery or another abstraction that enables disabling scoring (where appropriate)

----- Original Message -----
From: Chris Hostetter <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, 14 March, 2007 7:15:06 PM
Subject: Re: Performance between Filter and HitCollector?
Re: [Urgent] deleteDocuments fails after merging ...
Chris Hostetter wrote:
> the only real reason you should really need 2 searchers at a time is if you are searching other queries in parallel threads at the same time ... or if you are warming up one new searcher that's "ondeck" while still serving queries with an older searcher.

Hoss, I hope I misunderstood this: are you saying that the same IndexSearcher/IndexReader pair cannot be used concurrently against a single index by different threads executing different queries? The archives have several mentions of sharing IndexSearcher among threads, and Otis says as much at http://www.jguru.com/faq/view.jsp?EID=492393. Can you clarify what you meant, please?

Antony
SpellChecker and Lucene 2.1
Is there a SpellChecker.jar compatible with Lucene 2.1? After updating to Lucene 2.1, I seem to have lost the ability to create a spell index using spellchecker-2.0-rc1-dev.jar. Any help would be greatly appreciated.

Thanks,
Ryan
Re: [Urgent] deleteDocuments fails after merging ...
: > the only real reason you should really need 2 searchers at a time is if
: > you are searching other queries in parallel threads at the same time ...
: > or if you are warming up one new searcher that's "ondeck" while still
: > serving queries with an older searcher.
:
: Hoss, I hope I misunderstood this: are you saying that the same
: IndexSearcher/IndexReader pair can not be used concurrently against a single
: index by different threads executing different queries?

No, I'm saying the only reasons you need two searchers are:

1) if, separate from the searcher you are using to do deletes (you seem to have a use case that involves reopening to check the deletes), you also want a searcher open continuously which you use to serve search clients.

2) if, for performance reasons, when opening a new searcher to expose a new version of the index, you want to open the new one, warm it up with some queries, and only then direct new threads to the new searcher and close the old searcher.

-Hoss
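Hoss's second case, the warm-up/swap pattern, can be sketched generically. The `Searcher` interface here is an invented stand-in, not Lucene's IndexSearcher, and real code would also wait for in-flight queries to finish before closing the old searcher:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

class SearcherSwap {
    // Stand-in for an index searcher; not the Lucene class.
    interface Searcher { String search(String q); void close(); }

    private final AtomicReference<Searcher> current = new AtomicReference<>();

    SearcherSwap(Searcher initial) { current.set(initial); }

    // Serve queries with whichever searcher is current.
    String search(String q) { return current.get().search(q); }

    // Open-new / warm / swap / close-old, as described above.
    void reopen(Searcher fresh, Consumer<Searcher> warmUp) {
        warmUp.accept(fresh);                    // warm the "ondeck" searcher first
        Searcher old = current.getAndSet(fresh); // then direct new threads to it
        old.close();                             // and retire the old one
    }
}
```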
Fast index traversal and update for stored field?
Hi there, I'm using lucene to index and store entries from a database table for ultimate retrieval as search results. This works fine. But I find myself in the position of wanting to occasionally (daily-ish) bulk-update a single, stored, non-indexed field in every document in the index, without changing any indexed value at all.

The obviously documented way to do this would be to remove and then re-add each updated document successively. However, I know from experience that rebuilding our index from scratch in this fashion would take several hours at least, which is too long to delay pending incremental index jobs. It seems to me that at some level it should be possible to iterate over all the document storage on disk and modify only the field I'm interested in (no index modification required, remember, as this is a field that is stored but not indexed). It's plain from the documentation on file formats that it would be potentially possible to do this at a low level; however, before I go possibly re-inventing that wheel, I'm wondering if anyone knows of any existing code out there that would aid in solving this problem.

Thanks in advance,
//Thomas
Thomas K. Burkholder
Code Janitor
Re: Fast index traversal and update for stored field?
If you search the mail archive for "update in place" (no quotes), you'll find extensive discussions of this idea. Although you're raising an interesting variant because you're talking about a non-indexed field, so now I'm not sure those discussions are relevant. I don't know of anyone who has done what you're asking, though...

But if it's just stored data, you could go out to a database and pick it up at search time, although there are sound reasons for not requiring a database connection.

What about having a separate index for just this one field? Make it an indexed value, along with some id (not the Lucene ID, probably) of your original. Something like index fields: ID (a unique ID for each document) and field (the corresponding value). Searching this should be very fast, and if the usual Hits-based search wasn't fast enough, perhaps something with TermEnum/TermDocs would be faster. Or you could just index the unique ID and store (but not index) the field. Hits or variants should work for that too.

So the general algorithm would be:

search main index
for each hit: search second index and fetch that field

I have no idea whether this has any traction for your problem space, but I thought I'd mention it. This assumes that building the mutable index would be acceptably fast... Although conceptually, this is really just a Map of ID/value pairs. I have no idea how much data you're talking about, but if it's not a huge data set, might it be possible just to store it in a simple map and look it up that way?

And if I'm all wet, I'm sure others will chime in...

Best
Erick
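Erick's last suggestion, keeping the mutable field as a plain Map of ID/value pairs beside the main index, is about this simple (all names here are illustrative, and the application-level doc id is assumed to be a string):

```java
import java.util.HashMap;
import java.util.Map;

class SideLookup {
    // The frequently-changing value lives outside the main index,
    // keyed by the application's own unique document id.
    private final Map<String, String> mutableField = new HashMap<>();

    // Daily-ish bulk rebuild: replace the whole map at once.
    void bulkUpdate(Map<String, String> fresh) {
        mutableField.clear();
        mutableField.putAll(fresh);
    }

    // Decorate main-index hits with the side value at result time.
    String fieldFor(String docId) {
        return mutableField.getOrDefault(docId, "");
    }
}
```

The search path then becomes: search the main index, and for each hit call `fieldFor(id)` instead of reading the stored field.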
Re: Fast index traversal and update for stored field?
Hey, thanks for the quick reply. I've considered using a secondary index just for this data but thought I would look at storing the data in lucene first, since ultimately this data gets transported to an outside system, and it's a lot easier if there's only one "thing" to transfer. The destination environment that receives this lucene index doesn't (and shouldn't) have access to the database, which is why we don't simply store it there. Even if it did, we try not to access the database for search results when we don't have to, as this tends to make searching slow (as I think you were alluding to).

Sounds like there's nothing "out of the box" to solve my problem; if I write something to update lucene indexes in place I'll follow up about it here (don't know that I will though; building a new, narrower index is probably more expedient and will probably be fast enough for my purposes in this case).

Thanks again,
//Thomas
Re: how to get approximate total matching
If I remember correctly, I once searched over 40G of indexes using a multi-searcher with a 512M max heap size. How much memory did you give the JVM?

Thanks,
Xiaocheng

senthil kumaran <[EMAIL PROTECTED]> wrote:

Hi. I have several index directories (>6), all in GB, and I search my query with a single IndexSearcher against the indexes one after another, i.e. I create one IndexSearcher for index1 and search over that; finally I close that and create a new IndexSearcher for index2, and so on. If I get 200 total results then I don't go on to search the other index directories; I print the 200 results and exit from the search. I need to get the approximate total of matching documents over all the indexes without searching the remaining indexes. Please suggest the easiest way to achieve this.

P.S: To avoid more memory usage and to reduce search time, I don't want to search my query through all the indexes if I already have 200 results. MultiSearcher creates OOM errors, so I'm using a single IndexSearcher.

Thanks in Advance
Senthil
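One rough way to get an approximate total without searching every index: extrapolate from the indexes searched so far, assuming hits are spread evenly across indexes (an assumption that may not hold for your data). A sketch, where the per-index hit counts would come from the real searches:

```java
class ApproxTotal {
    // Search indexes in order, stop once `enough` hits are collected,
    // and extrapolate the unsearched indexes at the observed average rate.
    static long estimate(int[] hitsPerIndex, int enough) {
        long total = 0;
        int searched = 0;
        for (int h : hitsPerIndex) {
            total += h;
            searched++;
            if (total >= enough) break;
        }
        return total + (total / searched) * (hitsPerIndex.length - searched);
    }
}
```

For example, hitting 250 results after two of four equally-sized indexes would be reported as roughly 500 total.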
Indexing HTML pages and phrases
Hi, I am wondering if we can index a phrase (not a term) in Lucene? Also, I am not sure if it can index HTML pages? I need to have access to the text of some of the tags; I am not sure if this can be done in Lucene. I would be so glad if you help me in this case. Thanks
Is Lucene Java trunk still stable for production code?
Hello Dear Lucene Users! Back in the old days (well, last year) the lucene/java/trunk subversion path was always stable enough for everyone to use in production code. Now, with the 2.0/2.1/2.2 branches, is it still the case? In December, I 'ported' my app to use the lucene 2.0 release. Now, I have another chance to upgrade the production code (this is not happening every month!) so I would like to upgrade the lucene library I'm using to take advantage of performance gains. Should I just update my svn image from lucene/java/trunk or should I take lucene/java/branches/lucene_2_1?

Thanks!
Jp
Re: Performance between Filter and HitCollector?
eks dev and others - have you tried using the code from LUCENE-584? Noticed any performance increase when you disabled scoring? I'd like to look at that patch soon and commit it if everything is in place and makes sense, so I'm curious if you or anyone else has already tried this patch...

Thanks,
Otis

Simpy -- http://www.simpy.com/ - Tag - Search - Share
Re: Performance between Filter and HitCollector?
Thanks for the detailed response Hoss. That's the sort of in-depth golden nugget I'd like to see in a copy of LIA 2 when it becomes available...

I've wanted to use a Filter to cache certain of my Term Queries, as it looked faster for straight Term Query searches, but Solr's DocSet interface abstraction is more useful. HashDocSet will probably satisfy 90% of my cache. Index DBs will typically be in the 1-3 million documents range, but for mail which is spread over 1-6K users, so caching lots of BitSets for that number of users is not practical! I ended up creating a DocSetFilter and creating DocSets (a la Solr) from the BitSet, which is then cached. I then convert it back during Filter.bits(). Not the best solution, but the typical hit size is small, so the iteration is fast.

Thanks eks dev for the info about LUCENE-584 - that looks like an interesting set of patches.

Antony
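Antony's DocSetFilter round trip, a compact sorted-ids form for the cache and a BitSet rebuilt on demand in Filter.bits(), looks roughly like this (a sketch of the idea, not Solr's actual HashDocSet):

```java
import java.util.BitSet;

class DocSetConversion {
    // Cache-friendly form: a sorted array of matching doc ids,
    // small when the hit set is small regardless of index size.
    static int[] toDocSet(BitSet bits) {
        int[] docs = new int[bits.cardinality()];
        int i = 0;
        for (int d = bits.nextSetBit(0); d >= 0; d = bits.nextSetBit(d + 1))
            docs[i++] = d;
        return docs;
    }

    // Expand back to a BitSet inside Filter.bits(); the iteration is
    // fast precisely because the typical hit set is small.
    static BitSet toBitSet(int[] docs) {
        BitSet bits = new BitSet();
        for (int d : docs) bits.set(d);
        return bits;
    }
}
```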
Re: SpellChecker and Lucene 2.1
14 mar 2007 kl. 21.47 skrev Ryan O'Hara:
> Is there a SpellChecker.jar compatible with Lucene 2.1? After updating to Lucene 2.1, I seem to have lost the ability to create a spell index using spellchecker-2.0-rc1-dev.jar. Any help would be greatly appreciated.

Can you explain the problem in more detail? Exceptions? API changes?

-- karl
Re: Performance between Filter and HitCollector?
15 mar 2007 kl. 04.09 skrev Otis Gospodnetic:
> eks dev and others - have you tried using the code from LUCENE-584? Noticed any performance increase when you disabled scoring? I'd like to look at that patch soon and commit it if everything is in place and makes sense, so I'm curious if you or anyone else already tried this patch...

I was trying out Matcher some months ago when fooling around with ways of improving speed in the "active search cache" of LUCENE-550. It worked just fine for me. I made no further investigations, nor do I have any performance details. I plan to implement it in there for real any year now.

So +1 for commit.

-- karl
Re: Indexing HTML pages and phrases
Hi Maryam,

You can index the content of a specific field as UN_TOKENIZED and then you can do phrase search on that field. It will search for only phrases, not tokens... To index HTML pages you can use any HTML parser... this may be useful to you: http://lucene.apache.org/java/docs/api/org/apache/lucene/demo/html/HTMLParser.html

Thanks.
Bhavin pandya
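For reference, phrase search also works on tokenized fields: Lucene records term positions, and a phrase matches when its terms occur at consecutive positions in a document. A stdlib-only toy of that position check (not Lucene code; the positions map stands in for one document's term positions):

```java
import java.util.List;
import java.util.Map;

class PhraseSketch {
    // positions: term -> positions at which it occurs in one document.
    // The phrase matches if term i occurs at (start position + i) for all i.
    static boolean phraseMatch(Map<String, List<Integer>> positions, String... phrase) {
        for (int p0 : positions.getOrDefault(phrase[0], List.of())) {
            boolean ok = true;
            for (int i = 1; i < phrase.length; i++) {
                if (!positions.getOrDefault(phrase[i], List.of()).contains(p0 + i)) {
                    ok = false;
                    break;
                }
            }
            if (ok) return true;
        }
        return false;
    }
}
```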