Re: Benchmarking my indexer
On Nov 1, 2008, at 1:39 AM, Rafael Cunha de Almeida wrote:

> Hello,
>
> I wrote an indexer that parses some files and indexes them using Lucene. I
> want to benchmark the whole thing, so I'd like to count the tokens being
> indexed so I can calculate the average number of indexed tokens per second.
> Is there a way to count the number of tokens in a document?

I think you would have to add a "CountingTokenFilter" that you write and manage as you add documents. Or you could just take the total number of tokens divided by the number of docs and use the average; that can be obtained without writing a new TokenFilter.

> While I'm at it, I will also need to calculate the amount of memory my Java
> program used (peak, avg, etc). What Java tool would you suggest to figure
> that out?

Would JConsole (http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html) help? I'm not sure what people use here.
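For the first option, a minimal sketch of what such a filter might look like against the Lucene 2.x TokenStream API. The class is hypothetical (Lucene does not ship a CountingTokenFilter); it simply counts whatever tokens you route through it:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Counts every token that flows through the analysis chain on its way
// into the index. Wrap it around the TokenStream your Analyzer returns
// and read getTokenCount() when indexing is finished.
public class CountingTokenFilter extends TokenFilter {
    private long tokenCount = 0;

    public CountingTokenFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        Token token = input.next();
        if (token != null) {
            tokenCount++;
        }
        return token;
    }

    public long getTokenCount() {
        return tokenCount;
    }
}

You would wrap this around the TokenStream for the fields you care about (via a small custom Analyzer) and divide the final count by elapsed time to get tokens per second.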
Re: Exact Phrase Query
I was in a hurry when copying and pasting the code. What I've been using is only writer; ramWriter was never used, as it never really worked (thanks to you, I now understand the reason).

The above is not really related to the problem I was facing. I modified my code so that an IndexReader/IndexWriter is opened right before the word comparison takes place and is closed right after (currently not using a RAMDir due to the problems faced earlier).

Considering that the program is basically a loop that does thousands and thousands of comparisons, this is definitely not the most efficient way of handling things. I would appreciate any input on how to improve the efficiency.

--- On Sat, 11/1/08, Erick Erickson <[EMAIL PROTECTED]> wrote:

> Ah, finally. I'm almost completely sure you can't *write* to a RAMDirectory
> and expect the underlying FSDir to be updated. The intent of RAMDirectories
> is to *read* in an index from disk and keep it in memory. Essentially I
> believe that your RAMDirectory constructor is taking a snapshot of the
> underlying disk index, modifying that in-memory copy, and throwing it away
> without ever writing it to disk. I wouldn't expect opening the FSDirectory
> after writing to the RAMDirectory to find anything. Ever.
>
> If you really need the RAMDir, I suspect you'll have to open an FS-based
> writer as well as a RAM-based writer, and write to both when necessary.
> You'll probably also have to open/search your RAM-based index as the faster
> alternative to re-opening the FS-based index. Either way, reopening the
> index is probably expensive; are you sure you need to? Is there a way to
> keep your information in an internal data structure for some period of
> time?
>
> Best
> Erick
>
> On Sat, Nov 1, 2008 at 6:31 PM, semelak ss <[EMAIL PROTECTED]> wrote:
>
> > I am not entirely sure if this can be the cause, but here is something I
> > thought might be related:
> > The idea is to have an index containing documents where each document has
> > a combination of two words, word1 and word2, and a score for these two
> > words. The index would be searched first to see if the two words exist,
> > and if not the score would be computed on the fly and then added to the
> > index. This process would be repeated thousands of times for thousands of
> > words.
> >
> > Hence, I have an IndexWriter and a searcher:
> >
> > RAMDirectory ramDir = new RAMDirectory(INDEX_DIR);
> > IndexWriter ramWriter = new IndexWriter(ramDir, new WhitespaceAnalyzer(),
> >     true, IndexWriter.MaxFieldLength.UNLIMITED);
> > writer = new IndexWriter(INDEX_DIR, new WhitespaceAnalyzer(), true,
> >     IndexWriter.MaxFieldLength.UNLIMITED);
> >
> > FSDirectory fsdir = FSDirectory.getDirectory(INDEX_DIR);
> > IndexReader ir = IndexReader.open(fsdir);
> > _searcher = new IndexSearcher(ir);
> >
> > The IndexWriter is closed near the end of the program (it's open while
> > searching for word combinations).
> >
> > When using Luke, I was able to search successfully for exact phrases. My
> > guess is that the problem I am facing has something to do with the
> > IndexWriter, but I cannot pinpoint the exact cause of the problem.
> >
> > --- On Sat, 11/1/08, semelak ss <[EMAIL PROTECTED]> wrote:
> >
> > > When using Luke, searching for the following gives me hits now:
> > > "insurer storm"
> > > The syntax of the query as parsed by Luke is:
> > > word:"insurer storm"
> > >
> > > The code I am using is as follows:
> > > _searcher = new IndexSearcher(INDEX_DIR);
> > > _parser = new QueryParser("word", new WhitespaceAnalyzer());
> > > Query q = _parser.parse(query);
> > > System.out.println(q.toString()); // this outputs -> word:"insurer storm"
> > > TopDocs vv = _searcher.search(q, 1);
> > > Hits tmph = _searcher.search(q);
> > >
> > > Both vv and tmph give no results (their size is 0).
> > >
> > > --- On Fri, 10/31/08, semelak ss <[EMAIL PROTECTED]> wrote:
> > >
> > > > For indexing, I use the following:
> > > > writer = new IndexWriter(INDEX_DIR, new WhitespaceAnalyzer(), true,
> > > >     IndexWriter.MaxFieldLength.UNLIMITED);
> > > > Document doc = new Document();
> > > > String tmpword = this.getProperForm(word1, word2);
> > > > doc.add(new Field("WORDS", tmpwo
addDocument vs addIndexes
Hi friends,

Is merging N documents into an existing index better than adding N documents to an existing index? In other words, does IndexWriter.addIndexesNoOptimize do less I/O than IndexWriter.addDocument?

Thanks
Re: Exact Phrase Query
Also, is there a way to pass a null or no tokenizer when writing the field "words" to the index? I have no need for tokenizing the words, and the exact query will always be known.

To understand the problem better: we are performing word comparisons across a large number of text documents. Each word in each sentence is compared with the rest of the words in the other sentences. A similarity score is computed for each pair and stored in the index for fast retrieval in the future (computation of the score is resource intensive). What we used to do is construct a matrix and store the words in alphabetical order (for binary search), then load the words when the program is launched. Due to the size of the files generated, updates were a real struggle.

Thus, we decided to use Lucene and store a score for each pair of words. Updates should be much easier and faster; however, improving the search is something we're looking into. We are new to Lucene and would appreciate any input in this regard.

Knowing that each document would contain only two fields, score and words, and that no tokenization is needed, what would be the most efficient way to implement this index using Lucene?
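For the "no tokenizer" part: rather than passing a null analyzer, one option is to index each pair as a single untokenized term and look it up with a TermQuery, which skips the QueryParser and the phrase-query machinery entirely. A rough sketch against the Lucene 2.4 API; the field names follow this thread, pairKey() is a hypothetical helper, and the writer/searcher lifecycle is left out:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class WordPairIndex {
    // Hypothetical helper: one canonical key per word pair, e.g. "insurer storm".
    static String pairKey(String word1, String word2) {
        return word1.compareTo(word2) <= 0 ? word1 + " " + word2 : word2 + " " + word1;
    }

    static void addPair(IndexWriter writer, String w1, String w2, float score) throws Exception {
        Document doc = new Document();
        // NOT_ANALYZED (UN_TOKENIZED in older 2.x releases) stores the whole pair
        // as a single term: no analyzer runs, so no phrase query is needed later.
        doc.add(new Field("WORDS", pairKey(w1, w2), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("SCORE", String.valueOf(score), Field.Store.YES, Field.Index.NO));
        writer.addDocument(doc);
    }

    static Float lookup(IndexSearcher searcher, String w1, String w2) throws Exception {
        // Exact-match lookup on the untokenized term; returns null if the pair is unknown.
        TopDocs hits = searcher.search(new TermQuery(new Term("WORDS", pairKey(w1, w2))), 1);
        if (hits.totalHits == 0) return null;
        return Float.valueOf(searcher.doc(hits.scoreDocs[0].doc).get("SCORE"));
    }
}

Whether the score is best kept as a stored field or encoded some other way depends on how you read it back, but the main point is that an untokenized term needs neither an analyzer nor a phrase query.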
Re: Exact Phrase Query
Sorry, but I've really run out of patience here. You have consistently stated only part of the problem, never posting enough information to allow me to answer helpfully. You haven't even taken the time to proofread your posts, which has wasted my (limited, volunteer) time.

In the future, please consider the fact that people trying to help with your problem are volunteering their time, and respect that fact by making a greater effort to make it easy and efficient for us to help with what is, after all, *your* problem.

Best
Erick
Re: Exact Phrase Query
Hello Erick,

If it weren't for your help and kind response, I would still be struggling with the initial problem I had. The solution to that problem turned out to be the one you mentioned in your response (IndexWriters/IndexReaders both being opened at the same time).

The problem I mentioned in my last message is different from the initial question I posted. It's really a request for thoughts and input on how to improve searching, given the structure of the data described in that message.

Again, I appreciate your help (and I am not saying this because I am looking forward to your response).
Re: Benchmarking my indexer
On Sun, 2 Nov 2008 07:11:20 -0500, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> I think you would have to add a "CountingTokenFilter", that you write
> and manage as you add documents. Or, you could just take the total #
> of tokens / by the number of docs and use the average. That can be
> obtained w/o writing a new TokenFilter.

How would I obtain the total number of tokens in an index? I couldn't find that statistic anywhere. I looked for it in the IndexWriter, IndexReader and IndexSearcher classes. Is there maybe some tool I'd run on an index, or something like that?

> Would JConsole work:
> http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html
> help? I'm not sure what people use here

Will look into it :-)
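For what it's worth, I don't believe Lucene 2.x stores a per-index token count anywhere, but you can derive one after indexing by summing term frequencies across the whole index. A rough sketch (the index path is a placeholder; note that this counts tokens as they were actually indexed, i.e. after any stop-word removal, and it scans the entire index, so it is only suitable for offline benchmarking):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class TokenCounter {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");  // placeholder path
        long totalTokens = 0;
        TermEnum terms = reader.terms();            // iterate every term in the index
        TermDocs termDocs = reader.termDocs();
        while (terms.next()) {
            termDocs.seek(terms.term());
            while (termDocs.next()) {
                totalTokens += termDocs.freq();     // occurrences of this term in this doc
            }
        }
        termDocs.close();
        terms.close();
        reader.close();
        System.out.println("indexed tokens: " + totalTokens);
    }
}

Dividing that total by the wall-clock indexing time gives the average tokens-per-second figure you're after.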
Searching over multiple fields using XML document
Dear fellow Java/Lucene developers:

I am trying to search an XML document over multiple fields. I created the index using the SAX method. I am trying to search Shakespeare's "Hamlet" over the <SPEAKER> and <LINES> tags for words that the user is looking for.

I am thinking of using the MultiFieldQueryParser; however, I also read that a better alternative would be to combine the various fields together. In the book "Lucene in Action", the author writes:

"A synthetic 'contents' field in our test environment uses this scheme to put author and subjects together:
doc.add(Field.UnStored("contents", author + " " + subjects));
We used a space (" ") between author and subjects to separate words for the analyzer."

I am not sure I fully understand what the author is referring to here. In my situation I have the following:

public void endElement(String uri, String localName, String qName) throws SAXException {
    try {
        if (qName.equals("REFERENCE")) {
            Field reference = new Field(qName, elementBuffer.toString(),
                    Field.Store.YES, Field.Index.NO, Field.TermVector.NO);
            doc.add(reference);
        } else if (qName.equals("SPEAKER")) {
            Field speaker = new Field(qName, elementBuffer.toString(),
                    Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES);
            speaker.setBoost(2.0f);
            doc.add(speaker);
        } else if (qName.equals("LINES")) {
            Field lines = new Field(qName, elementBuffer.toString(),
                    Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES);
            lines.setBoost(1.0f);
            doc.add(lines);
            indexWriter.addDocument(doc);
        } else {
            return;
        }
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

How would I combine the fields together into one synthetic field here, so that in my searcher code I would search over one field, yet retrieve the results from the several fields in which the keyword is found and show them to the user?

All I want to do is allow a user to search an XML document over multiple fields and return the results with the keywords they are searching for highlighted in the results list, just as Google does when searching for websites. At this point, I am able to do a simple/fuzzy/wildcard search over one field in the XML document, but would like to extend this functionality over multiple fields.

Any ideas? Thanks in advance to all who reply.

Sincerely,
Fayyaz
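If it helps, the book's suggestion translated to this handler would mean accumulating the speaker and lines text and adding one extra catch-all field right before calling addDocument(). A sketch of that idea; the "contents" field name comes from the book's example, and speakerText/linesText are hypothetical buffers you would fill in the SPEAKER and LINES branches above:

// Hypothetical helper: call this from the "LINES" branch just before
// indexWriter.addDocument(doc). It adds an unstored, tokenized field that
// concatenates the text of the fields you want searched by default.
private void addSyntheticContents(Document doc, String speakerText, String linesText) {
    doc.add(new Field("contents", speakerText + " " + linesText,
            Field.Store.NO, Field.Index.TOKENIZED));
}

At search time you would then query just "contents" while still reading the stored SPEAKER/LINES values for display and highlighting. The alternative is to skip the synthetic field and build the query with MultiFieldQueryParser over new String[] { "SPEAKER", "LINES" }; that keeps your index-time boost on SPEAKER in play, at the cost of slightly more complex queries.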
Performance of never optimizing
Howdy,

I have a couple of questions regarding some Lucene benchmarking and what the results mean[3]. (Skip to the numbered list at the end if you don't want to read the lengthy exegesis. :)

I'm a developer for JIRA[1]. We are currently trying to get a better understanding of Lucene, and our use of it, to cope with the needs of our larger customers. These "large" indexes are only a couple hundred thousand documents, but our problem is compounded by the fact that they have a relatively high rate of modification (= delete + insert of new document) and our users expect these modifications to show up in query results pretty much instantly.

Our current default behaviour is a merge factor of 4. We perform an optimization on the index every 4000 additions. We also perform an optimize at midnight. Our fundamental problem is that these optimizations are locking the index for unacceptably long periods of time, something that we want to resolve for our next major release, hopefully without undermining search performance too badly.

In the Lucene javadoc there is a comment, and a link to a mailing list discussion[2], that suggests applications such as JIRA should never perform optimize but should instead set their merge factor very low.

In an attempt to understand the impact of a) lowering the merge factor from 4 to 2 and b) never, ever optimizing an index (over the course of years and millions of additions/updates), I wanted to try to benchmark Lucene. I used the contrib/benchmark framework and wrote a small algorithm that adds documents to an index (using the Reuters doc generator), does a search, does an optimize, then does another search. All the pretty pictures can be seen at:

http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs

I have several questions; hopefully they aren't overwhelming in their quantity :-/

1. Why does the merge factor of 4 appear to be faster than the merge factor of 2?
2. Why does non-optimized searching appear to be faster than optimized searching once the index hits ~500,000 documents?
3. There appears to be a fairly sizable performance drop across the board around 450,000 documents. Why is that?
4. Searching performance appears to decrease towards a fairly pessimistic 20 searches per second (for a relatively simple search). Is this really what we should expect long-term from Lucene?
5. Does my benchmark even make sense? I am far from an expert on benchmarking, so it is possible I'm not measuring what I think I am measuring.

Thanks in advance for any insight you can provide. This is an area that we very much want to understand better, as Lucene is a key part of JIRA's success.

Cheers,
Justus
JIRA Developer

[1]: http://www.atlassian.com
[2]: http://www.gossamer-threads.com/lists/lucene/java-dev/47895
[3]: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs
Re: Performance of never optimizing
Hello,

Very quick comments.

----- Original Message -----
> From: Justus Pendleton <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Sunday, November 2, 2008 10:42:52 PM
> Subject: Performance of never optimizing

> These "large" indexes are only a couple hundred thousand documents, but our
> problem is compounded by the fact that they have a relatively high rate of
> modification (= delete + insert of new document) and our users expect these
> modifications to show up in query results pretty much instantly.

This will be a tough call with large indices - there is no real-time search in Lucene yet.

> Our current default behaviour is a merge factor of 4. We perform an
> optimization on the index every 4000 additions. We also perform an optimize
> at midnight.

I wouldn't optimize every 4000 additions - you are killing IO, rewriting the whole index, while trying to provide fast searches, plus you are locking the index for other modifications.

> Our fundamental problem is that these optimizations are locking the index
> for unacceptably long periods of time, something that we want to resolve
> for our next major release, hopefully without undermining search
> performance too badly.

Why are you optimizing? Trying to make the search faster? I would try to avoid optimizing during high usage periods.

> In the Lucene javadoc there is a comment, and a link to a mailing list
> discussion[2], that suggests applications such as JIRA should never perform
> optimize but should instead set their merge factor very low.

Right, you can let Lucene merge segments.

> In an attempt to understand the impact of a) lowering the merge factor from
> 4 to 2 and b) never, ever optimizing an index (over the course of years and
> millions of additions/updates), I wanted to try to benchmark Lucene.

One thing that you might not have tried is the constant re-opening of the IndexReader, which you'll need to do if you want to see index changes instantly.

> I used the contrib/benchmark framework and wrote a small algorithm that
> adds documents to an index (using the Reuters doc generator), does a
> search, does an optimize, then does another search.

So you indexed once and then measured search performance? Or did you measure indexing performance? I can't quite tell from your email. And in one case you optimized before searching and in the other you did not optimize?

> 1. Why does the merge factor of 4 appear to be faster than the merge factor
> of 2?

Faster for indexing or searching? If indexing, then it's because 4 means fewer segment merges than 2. If searching, then I don't know, unless you had indexing and searching happening in parallel, which then means less IO for 4. Did your index fit in RAM, by the way?

> 2. Why does non-optimized searching appear to be faster than optimized
> searching once the index hits ~500,000 documents?

Not sure without seeing the index/machine. It sounds like you were measuring search performance while at the same time increasing the index size by incrementally adding more docs?

> 3. There appears to be a fairly sizable performance drop across the board
> around 450,000 documents. Why is that?

Something to do with Lucene merging index segments around that point? At this point I'm assuming you were measuring search speed while indexing.

> 4. Searching performance appears to decrease towards a fairly pessimistic
> 20 searches per second (for a relatively simple search). Is this really
> what we should expect long-term from Lucene?

20 reqs/sec sounds very low. How large is your index, how much RAM, and how about heap size? What were your queries like? Random? From a log?

> 5. Does my benchmark even make sense? I am far from an expert on
> benchmarking, so it is possible I'm not measuring what I think I am
> measuring.

I'm confused by what exactly you did and measured, but it could just be that I'm tired.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
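On the re-opening point: since Lucene 2.3 there is IndexReader.reopen(), which only loads the segments that changed since the reader was opened, so the refresh itself can be cheap. A small sketch of that pattern (the helper name is made up; error handling omitted):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

public final class ReaderRefresher {
    // Swap in a fresh reader only if the index actually changed; callers then
    // build a new IndexSearcher from whatever this returns.
    public static IndexReader refresh(IndexReader current) throws IOException {
        IndexReader latest = current.reopen();   // near no-op when nothing changed
        if (latest != current) {
            current.close();                      // release the superseded reader
        }
        return latest;
    }
}

Calling something like this after each batch of updates, rather than opening a brand-new IndexReader from scratch, is usually what makes "instant" visibility affordable.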
Re: Performance of never optimizing
On 03/11/2008, at 4:27 PM, Otis Gospodnetic wrote:

> Why are you optimizing? Trying to make the search faster? I would try to
> avoid optimizing during high usage periods.

I assume that the original, long-ago decision to optimize was made to improve searching performance.

> One thing that you might not have tried is the constant re-opening of the
> IndexReader, which you'll need to do if you want to see index changes
> instantly.

We do keep track of when the index has been updated and re-open IndexReaders so that they see the updates instantly.

> So you indexed once and then measured search performance? Or did you
> measure indexing performance? I can't quite tell from your email. And in
> one case you optimized before searching and in the other you did not
> optimize?

Yes, I indexed once and then measured search performance. (The actual algorithm used can be seen at http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs) For my current purposes I don't care about indexing performance.

> > 1. Why does the merge factor of 4 appear to be faster than the merge
> > factor of 2?
>
> Faster for indexing or searching? If indexing, then it's because 4 means
> fewer segment merges than 2. If searching, then I don't know, unless you
> had indexing and searching happening in parallel, which then means less IO
> for 4.

For searching. The indexing and searching should not have been happening in parallel. However, multiple searches are occurring in parallel.

> Did your index fit in RAM, by the way?

The machine has, I believe, 4 GB of RAM, and the benchmark suite reports that 700 MB were used, so it does appear to have fit into RAM.

> > 2. Why does non-optimized searching appear to be faster than optimized
> > searching once the index hits ~500,000 documents?
>
> Not sure without seeing the index/machine.

The machine is an 8-core Mac Pro. If you'd like, I can provide the indexes online somewhere. Or if you can provide pointers on what to look for, I'm more than happy to investigate this myself.

> It sounds like you were measuring search performance while at the same time
> increasing the index size by incrementally adding more docs?

No documents were being added to the index while the searching was being performed. I was trying to measure only the search performance.

> 20 reqs/sec sounds very low. How large is your index, how much RAM, and how
> about heap size? What were your queries like? Random? From a log?

The queries were generated by the ReutersQueryMaker. I am not sure what the heap size was at various stages. (I ran the benchmarks over the weekend; they took several days.)

> I'm confused by what exactly you did and measured, but it could just be
> that I'm tired.

My apologies for not being clearer in my initial email. I appreciate the help.

Cheers,
Justus
Re: Performance of never optimizing
Hi Justus,

I have run into very similar problems to the ones JIRA has: a high rate of modification on a large data volume. It's a pretty common use case for Lucene.

The way I dealt with the high rate of modification was to create a secondary in-memory index and only persist documents older than a certain period of time. Searching then needs to combine results from the two indexes. It's a bit more complicated when creating the index, but it is well worth it to avoid the extra IO-heavy merging and to improve response time, especially for the ability to search just-added documents right away.

BTW: JIRA is great!

--
Chris Lu
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
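For completeness, one straightforward way to combine the two indexes at search time in Lucene 2.x is a MultiSearcher over the disk index and the RAM index. A rough sketch of the wiring (the path and analyzer are placeholders, the flush-to-disk policy is up to you, and deletes/updates of documents already on disk still have to be applied to the disk index):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class TwoTierSearchSketch {
    public static void main(String[] args) throws Exception {
        // Long-lived index on disk, plus a small in-memory index for recent changes.
        FSDirectory fsDir = FSDirectory.getDirectory("/path/to/main-index"); // placeholder
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        // ... newly added/updated documents go into ramWriter; periodically flush
        // them into the disk index and clear the RAM index on your own schedule ...
        ramWriter.commit();

        // One searcher over both, so just-added documents are visible immediately.
        MultiSearcher searcher = new MultiSearcher(new Searchable[] {
                new IndexSearcher(fsDir), new IndexSearcher(ramDir) });
        // searcher.search(query, n) now merges hits from disk and RAM.
        searcher.close();
    }
}

The trade-off is the bookkeeping around flushing the RAM index and keeping deletes consistent across the two, which is presumably the "bit complicated" part Chris mentions.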