Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter
I am not disputing that there is a speed improvement. I am arguing that the performance gain of many of these patches is not worth the additional complexity in the code.

Clear code will allow for more radical improvements, as more eyes will be able to easily understand the inner workings and offer better algorithms, not just micro improvements that the JVM (eventually) can probably figure out on its own. It is a value judgement, and regretfully I don't have another 30 years to pass down the full knowledge behind my reasoning. Luckily, however, there are some very good books available on the subject...

It's not the fault of the submitter, but many of these timings are suspect due to the difficulty of measuring the improvements accurately. Here is a simple example: you can configure the JVM to not perform aggressive garbage collection, and write a program that generates a lot of garbage - but it runs very fast (not GCing), until the GC eventually occurs (if the program runs long enough). It may be overall much slower than an alternative that runs slower as it executes, but has code to manage the objects as they are created, and rarely if ever hits a GC cycle. But then the JVM (e.g. generational GC) can implement improvements that make choice A faster (and the better choice)... and the cycle continues...

Without detailed timings and other metrics (GC pauses, IO, memory utilization, native compilation, etc.) most benchmarks are not very accurate or useful. There are a lot of variables to consider - maybe more than can reasonably be considered. That is why a 4% gain is highly suspect. If the gain were 25%, or 50%, or 100%, you would have a better chance of it being an innate improvement, and not just the interaction of some other factors.

On Feb 11, 2008, at 2:32 AM, eks dev wrote:

Robert, you may or may not be right, I do not know. The only way to prove it would be to show you can do it better, no?
If you are so convinced this is wrong, you could, much better than quoting textbooks:
a) write a better patch, get attention with something you think is a "better bottleneck"
b) provide realistic "performance tests", as you dispute the measurement provided here

It has to be that concrete; academic discussions are cool, but at the end of the day, it is the code that executes that counts.

cheers, eks

----- Original Message -----
From: robert engels <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Sunday, 10 February, 2008 9:15:30 PM
Subject: Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter

I am not sure these numbers matter. I think they are skewed because you are probably running too short a test, and the index is in memory (or OS cache). Once you use a real index that needs to read/write from the disk, the percentage change will be negligible. This is the problem with many of these "performance changes" - they just aren't real-world enough. Even if they were, I would argue that code simplicity/maintainability is worth more than 6 seconds on an operation that takes 4 minutes to run...

There are many people who believe micro benchmarks are next to worthless. A good rule of thumb is that if the optimization doesn't result in a 2x speedup, it probably shouldn't be done. In most cases any efficiency gains are later lost to maintainability issues. See http://en.wikipedia.org/wiki/Optimization_(computer_science) - almost always there is a better bottleneck somewhere.

On Feb 10, 2008, at 1:37 PM, Michael McCandless wrote:

Yonik Seeley wrote:
> I wonder how well a single generic quickSort(Object[] arr, int low, int high) would perform vs the type-specific ones? I guess the main overhead would be a cast from Object to the specific class to do the compare? Too bad Java doesn't have true generics/templates.

OK I tested this.
Starting from the patch on LUCENE-1172, which has 3 quickSort methods (one per type), I created a single quickSort method on Object[] that takes a Comparator, and made 3 Comparators instead.

Mac OS X 10.4 (JVM 1.5):
  original patch   --> 247.1
  simplified patch --> 254.9 (3.2% slower)

Windows Server 2003 R64 (JVM 1.6):
  original patch   --> 440.6
  simplified patch --> 452.7 (2.7% slower)

The times are the best of 10 runs. I'm running all tests with these JVM args: -Xms1024M -Xmx1024M -Xbatch -server

I think this is a big enough difference in performance that it's worth keeping 3 separate quickSorts in DocumentsWriter.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
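A minimal sketch of the two shapes being compared above - one generic quicksort driven by a Comparator, versus a type-specialized quicksort whose compare is inlined. This is hypothetical illustration code, not the actual LUCENE-1172 patch; the overhead difference comes from the virtual compare() call and Object handling in the generic version:

```java
import java.util.Comparator;

// Hypothetical sketch of the two quicksort shapes discussed above.
public class SortShapes {

    // Generic version: every comparison is a virtual call through the
    // Comparator, and elements are handled as Object references.
    public static void quickSort(Object[] a, int lo, int hi, Comparator<Object> c) {
        if (lo >= hi) return;
        Object pivot = a[(lo + hi) >>> 1];
        int i = lo, j = hi;
        while (i <= j) {
            while (c.compare(a[i], pivot) < 0) i++;
            while (c.compare(a[j], pivot) > 0) j--;
            if (i <= j) { Object t = a[i]; a[i] = a[j]; a[j] = t; i++; j--; }
        }
        quickSort(a, lo, j, c);
        quickSort(a, i, hi, c);
    }

    // Specialized version: the compare is an inlined primitive comparison,
    // which is what keeping a per-type quickSort buys.
    public static void quickSort(int[] a, int lo, int hi) {
        if (lo >= hi) return;
        int pivot = a[(lo + hi) >>> 1];
        int i = lo, j = hi;
        while (i <= j) {
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; j--; }
        }
        quickSort(a, lo, j);
        quickSort(a, i, hi);
    }
}
```

The ~3% gap Mike measured is consistent with the per-comparison overhead of the generic shape; whether the JIT can devirtualize the Comparator call depends on how many Comparator implementations are live at the call site.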
[jira] Resolved: (LUCENE-325) [PATCH] new method expungeDeleted() added to IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-325.
---------------------------------------
Resolution: Fixed

I just committed this. Thanks John! And sorry for the long delay. I also added a "these APIs are experimental" warning on top of MergePolicy and MergeScheduler (which I should have done before 2.3 :(, though I don't expect a lot of usage of these).

> [PATCH] new method expungeDeleted() added to IndexWriter
>
>                 Key: LUCENE-325
>                 URL: https://issues.apache.org/jira/browse/LUCENE-325
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: CVS Nightly - Specify date in submission
>         Environment: Operating System: Windows XP
>                      Platform: All
>            Reporter: John Wang
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: attachment.txt, IndexWriter.patch, IndexWriter.patch, LUCENE-325.patch, TestExpungeDeleted.java
>
> We make use of the docIDs in lucene. I need a way to compact the docIDs in segments to remove the "holes" created from doing deletes. The only way to do this is by calling IndexWriter.optimize(). This is a very heavy call; for the cases where the index is large but with a very small number of deleted docs, calling optimize is not practical.
> I need a new method: expungeDeleted(), which finds all the segments that have deleted documents and merges only those segments.
> I have implemented this method and have discussed with Otis about submitting a patch. I don't see where I can attach the patch. I will do according to the patch guideline and email the lucene mailing list.
> Thanks
> -John

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-1173) index corruption autoCommit=false
[ https://issues.apache.org/jira/browse/LUCENE-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567849#action_12567849 ]

Michael McCandless commented on LUCENE-1173:
--------------------------------------------
Yes, this is one awesome test case :) Thanks.

> index corruption autoCommit=false
>
>                 Key: LUCENE-1173
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1173
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.3
>            Reporter: Yonik Seeley
>            Assignee: Michael McCandless
>            Priority: Critical
>         Attachments: indexstress.patch, indexstress.patch
>
> In both Lucene 2.3 and trunk, the index becomes corrupted when autoCommit=false
[jira] Updated: (LUCENE-1173) index corruption autoCommit=false
[ https://issues.apache.org/jira/browse/LUCENE-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated LUCENE-1173:
---------------------------------
Attachment: indexstress.patch

Thanks Mike! Attaching a new version of the test that correctly deals with terms with no docs (because of deletions). Other variations were failing before; now it's just those with autoCommit=false.

Note that it's possible to trigger this bug by indexing only 3 documents:
  mergeFactor=2;
  maxBufferedDocs=2;
  Map docs = indexRandom(1, 3, 2, dir1);

I love random testing :-)

> index corruption autoCommit=false
>
>                 Key: LUCENE-1173
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1173
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.3
>            Reporter: Yonik Seeley
>            Assignee: Michael McCandless
>            Priority: Critical
>         Attachments: indexstress.patch, indexstress.patch
>
> In both Lucene 2.3 and trunk, the index becomes corrupted when autoCommit=false
[jira] Updated: (LUCENE-1174) outdated information in Analyzer javadoc
[ https://issues.apache.org/jira/browse/LUCENE-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Naber updated LUCENE-1174:
---------------------------------
Attachment: analyzer-javadoc.diff

> outdated information in Analyzer javadoc
>
>                 Key: LUCENE-1174
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1174
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Javadocs
>    Affects Versions: 2.3
>            Reporter: Daniel Naber
>            Priority: Minor
>         Attachments: analyzer-javadoc.diff
>
> I'm sure you find more ways to improve the javadoc, so feel free to change and extend my patch.
[jira] Created: (LUCENE-1174) outdated information in Analyzer javadoc
outdated information in Analyzer javadoc
----------------------------------------

                Key: LUCENE-1174
                URL: https://issues.apache.org/jira/browse/LUCENE-1174
            Project: Lucene - Java
         Issue Type: Bug
         Components: Javadocs
   Affects Versions: 2.3
           Reporter: Daniel Naber
           Priority: Minor
        Attachments: analyzer-javadoc.diff

I'm sure you find more ways to improve the javadoc, so feel free to change and extend my patch.
[jira] Commented: (LUCENE-1173) index corruption autoCommit=false
[ https://issues.apache.org/jira/browse/LUCENE-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567819#action_12567819 ]

Michael McCandless commented on LUCENE-1173:
--------------------------------------------
Uh oh ... I'll take this!

> index corruption autoCommit=false
>
>                 Key: LUCENE-1173
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1173
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.3
>            Reporter: Yonik Seeley
>            Priority: Critical
>         Attachments: indexstress.patch
>
> In both Lucene 2.3 and trunk, the index becomes corrupted when autoCommit=false
[jira] Commented: (LUCENE-1173) index corruption autoCommit=false
[ https://issues.apache.org/jira/browse/LUCENE-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567818#action_12567818 ]

Yonik Seeley commented on LUCENE-1173:
--------------------------------------
Note: if I reduce the test to indexing with a single thread, it still fails.
  Map docs = indexRandom(1, 50, 50, dir1);

The test still does the indexing in a different thread than the close(), so it's not quite a single-threaded test.

Another thing to note: all of the terms are matching up (the test succeeds if I don't test the stored fields).

> index corruption autoCommit=false
>
>                 Key: LUCENE-1173
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1173
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.3
>            Reporter: Yonik Seeley
>            Priority: Critical
>         Attachments: indexstress.patch
>
> In both Lucene 2.3 and trunk, the index becomes corrupted when autoCommit=false
[jira] Updated: (LUCENE-1173) index corruption autoCommit=false
[ https://issues.apache.org/jira/browse/LUCENE-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated LUCENE-1173:
---------------------------------
Attachment: indexstress.patch

Attaching a patch that can reproduce the corruption. With autoCommit=true, the test passes; change it to false and it fails.

The test basically uses multiple threads to update documents. The last document for any id is kept, and then all these docs are indexed again serially. The two indices are then compared.

> index corruption autoCommit=false
>
>                 Key: LUCENE-1173
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1173
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.3
>            Reporter: Yonik Seeley
>            Priority: Critical
>         Attachments: indexstress.patch
>
> In both Lucene 2.3 and trunk, the index becomes corrupted when autoCommit=false
[jira] Created: (LUCENE-1173) index corruption autoCommit=false
index corruption autoCommit=false
---------------------------------

                Key: LUCENE-1173
                URL: https://issues.apache.org/jira/browse/LUCENE-1173
            Project: Lucene - Java
         Issue Type: Bug
         Components: Index
   Affects Versions: 2.3
           Reporter: Yonik Seeley
           Priority: Critical

In both Lucene 2.3 and trunk, the index becomes corrupted when autoCommit=false
Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter
Grant Ingersoll wrote:

> Also, perhaps we should spin off another thread to discuss how to make DocsWriter easier to maintain. My biggest concern is understanding how the various threads work together, and a few other areas, but, like I said, let's spin up a separate thread to brainstorm what is needed.

I agree we should work on simplifying it with time, and spreading the knowledge of how it works.

> Note that there is some risk in just using Wikipedia for profiling, given its distribution of terms, etc.

Good point. Previously I was using Europarl, but that corpus is just too fast to index. Are you thinking Wikipedia is somewhat "dirty" (lots of extra terms not normally seen with clean content)? Since I'm using StandardAnalyzer and not an analyzer based on the new WikipediaTokenizer, I'm getting extra terms. Also, I think we'd need an HTMLFilter in the chain since Wikipedia content uses HTML markup. Grant, what analyzer chain do you use when you index Wikipedia?

> I also wonder if using the LineDocMaker is all that realistic a profiling scenario. While it is really useful in that it minimizes IO interaction, etc., I can't help but feel that it isn't at all close to typical usage. Most users are not going to have all their docs rolled up into a single file, 1 doc per line, so I wonder if we potentially lose insight into how Lucene performs, given that other issues like I/O/memory used for loading files may force the JVM/Lucene to not have the resources it needs. Of course, I do know it is good to try to isolate things so we can focus just on Lucene, but we also should try to make some accounting for how it lives in the wild.

I agree, this part is not realistic, and the intention is to measure just the indexing time. In fact I expect most apps spend quite a bit more time building up a Document (filtering binary docs, etc) than actually indexing it.
The only real-world app I can think of that would be close to LineDocMaker is using Lucene to search big log files, where one line = one Document.

> Last, I think it would be good to always attach/check in the .alg file that is used when running the test, so that others can verify on different systems/configurations, etc.

I did post the alg (under LUCENE-1172), though I see I forgot to {code} it and it looks messed up now. My recent test to try a single quickSort(Object[]) used the same alg, just repeated 10 times instead of 3. But I agree we should always post the alg for all tests...

Mike
[jira] Resolved: (LUCENE-1044) Behavior on hard power shutdown
[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1044.
----------------------------------------
Resolution: Fixed

> Behavior on hard power shutdown
>
>                 Key: LUCENE-1044
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1044
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>         Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java 1.5
>            Reporter: venkat rangan
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>         Attachments: FSyncPerfTest.java, LUCENE-1044.patch, LUCENE-1044.take2.patch, LUCENE-1044.take3.patch, LUCENE-1044.take4.patch, LUCENE-1044.take5.patch, LUCENE-1044.take6.patch, LUCENE-1044.take7.patch, LUCENE-1044.take8.patch
>
> When indexing a large number of documents, upon a hard power failure (e.g. pull the power cord), the index seems to get corrupted. We start a Java application as a Windows Service, and feed it documents. In some cases (after an index size of 1.7GB, with 30-40 index segment .cfs files), the following is observed.
> The 'segments' file contains only zeros. Its size is 265 bytes - all bytes are zeros.
> The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes are zeros.
> Before corruption, the segments file and deleted file appear to be correct. After this corruption, the index is corrupted and lost.
> This is a problem observed in Lucene 1.4.3. We are not able to upgrade our customer deployments to 1.9 or a later version, but would be happy to back-port a patch, if the patch is small enough and if this problem is already solved.
Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter
OK, I am convinced that this one is useful.

Also, perhaps we should spin off another thread to discuss how to make DocsWriter easier to maintain. My biggest concern is understanding how the various threads work together, and a few other areas, but, like I said, let's spin up a separate thread to brainstorm what is needed.

Note that there is some risk in just using Wikipedia for profiling, given its distribution of terms, etc.

I also wonder if using the LineDocMaker is all that realistic a profiling scenario. While it is really useful in that it minimizes IO interaction, etc., I can't help but feel that it isn't at all close to typical usage. Most users are not going to have all their docs rolled up into a single file, 1 doc per line, so I wonder if we potentially lose insight into how Lucene performs, given that other issues like I/O/memory used for loading files may force the JVM/Lucene to not have the resources it needs. Of course, I do know it is good to try to isolate things so we can focus just on Lucene, but we also should try to make some accounting for how it lives in the wild.

Last, I think it would be good to always attach/check in the .alg file that is used when running the test, so that others can verify on different systems/configurations, etc.

-Grant

On Feb 11, 2008, at 6:14 AM, Michael McCandless wrote:

In fact I've found you need to pursue both the 2x type gains and also the many smaller ones, to reach good performance. And it requires a lot of ongoing vigilance to keep good performance. You lose 3-4% here and there, and very quickly, very easily, you're 2X slower.

These tests are very real. I'm indexing Wikipedia content, using StandardAnalyzer, running under contrib/benchmark. It's true, in a real app more time will be spent pulling documents from the source, but I'm intentionally trying to minimize that in order to measure just the indexing time. Getting a 4% gain by replacing mergesort with quicksort is real.
If the profiler found other 4% gains, with such a small increase in code complexity, I would passionately argue for those as well. So far it hasn't. Robert, if you have some concrete ideas for the 2X type gains, I'm all ears :) I certainly agree there is a point where the complexity cost doesn't offset the performance gain, but I think this particular change is well before that point. Lucene's indexing throughput is an important metric in its competitiveness with other search engines. And I want Lucene to be the best.

Mike

eks dev wrote:

again, as long as you do not make one step forward into actual code, we will continue to have what we have today, as this is the best we have. you made your statement: "Clear code will allow for more radical improvements as more eyes will be able to easily understand the inner workings and offer better algorithms". Not a single person here would ever dispute this statement, but unfortunately there is no compiler that executes such statements. Make a patch that utilizes this "clear-code" paradigm, show us these better algorithms on an actual example, and then say: "without LUCENE-1172 I was able to improve XYZ feature by using ABC algorithm". That would work smoothly. Anyhow, I am not going to write more on this topic, sorry for the noise... And Robert, please do not get this wrong, I see your point and I respect it! I just felt a slight unfairness to the people who get their hands dirty writing as clear and fast code as possible.

----- Original Message -----
From: robert engels <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Monday, 11 February, 2008 9:55:02 AM
Subject: Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter

I am not disputing that there is a speed improvement. I am arguing that the performance gain of many of these patches is not worth the additional complexity in the code.
Clear code will allow for more radical improvements, as more eyes will be able to easily understand the inner workings and offer better algorithms, not just micro improvements that the JVM (eventually) can probably figure out on its own. It is a value judgement, and regretfully I don't have another 30 years to pass down the full knowledge behind my reasoning. Luckily, however, there are some very good books available on the subject...

It's not the fault of the submitter, but many of these timings are suspect due to the difficulty of measuring the improvements accurately. Here is a simple example: you can configure the JVM to not perform aggressive garbage collection, and write a program that generates a lot of garbage - but it runs very fast (not GCing), until the GC eventually occurs (if the program runs long enough). It may be overall much slower than an alternative that runs slower as it executes, but has code to manage the objects as they are created, and rarely if ever hits a GC cycle. But then the JVM (e.g. generational GC) can implement improvements that make choice A faster (and the better choice)... and the cycle continues...
Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter
Michael McCandless wrote:
> In fact I've found you need to pursue both the 2x type gains and also the many smaller ones, to reach good performance.

+1

Put another way, you must address both the asymptotic behavior and the constant factors. A good order-of-algorithms implementation is worthless if its constant factors are huge, and vice-versa.

Doug
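Doug's point can be made concrete with a toy cost model (the constants below are invented purely for illustration): an O(n^2) routine with a small constant factor beats an O(n log n) routine with a large one until n crosses a threshold, so neither the asymptotics nor the constants can be ignored on their own.

```java
public class Crossover {
    // Toy cost model with made-up constants: a cheap quadratic routine
    // versus an expensive n log n routine.
    public static double quadraticCost(long n) {
        return 1.0 * n * n;
    }

    public static double linearithmicCost(long n) {
        return 100.0 * n * (Math.log(n) / Math.log(2));
    }

    // Smallest n at which the asymptotically better routine actually wins.
    // Below this point the "worse" algorithm is the faster one.
    public static long crossover() {
        for (long n = 2; ; n++) {
            if (linearithmicCost(n) < quadraticCost(n)) return n;
        }
    }
}
```

With these particular constants the quadratic routine wins for all inputs up to roughly a thousand elements, which is exactly the regime the per-type quickSorts in DocumentsWriter operate in.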
[jira] Updated: (LUCENE-1173) index corruption autoCommit=false
[ https://issues.apache.org/jira/browse/LUCENE-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1173:
---------------------------------------
Attachment: LUCENE-1173.patch

I just sent email to java-user to give a heads up on this. Attached patch fixes the issue. All tests pass. I think we should spin 2.3.1 for this one?

> index corruption autoCommit=false
>
>                 Key: LUCENE-1173
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1173
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.3
>            Reporter: Yonik Seeley
>            Assignee: Michael McCandless
>            Priority: Critical
>         Attachments: indexstress.patch, indexstress.patch, LUCENE-1173.patch
>
> In both Lucene 2.3 and trunk, the index becomes corrupted when autoCommit=false
Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter
One final thing: the guys responsible for the sorting in Arrays.java are Joshua Bloch and Neal Gafter. Now I KNOW there must be a very good reason for the choices they made...

On Feb 11, 2008, at 9:35 AM, robert engels wrote:

Also, these couple of pages have some very good information on sorting, and why heapsort is even faster than quicksort...

http://users.aims.ac.za/~mackay/sorting/sorting.html
http://www.azillionmonkeys.com/qed/sort.html

On Feb 11, 2008, at 9:29 AM, robert engels wrote:

My intent was not to diminish your hard work. We all appreciate it. I was only trying to caution that 4% gains are not all that they seem to be.

If you look at Arrays.java in the 1.5 JDK, and read through the javadoc, you will quickly see that the sorting is well thought out. They use a tuned quicksort for primitives, which offers O(n log n) performance, and a modified mergesort for Objects, guaranteeing O(n log n) performance. A standard quicksort has worst-case performance of O(n^2)! Both use an insertion sort if the number of elements is small.

I can only assume that in their testing they chose a mergesort for Objects either: 1. to have stable sort times, or, more likely, 2. because the merge sort has a better chance of being optimized by the JIT, and/or the sequential access of elements makes for more efficient object access in the JVM. These people, who are far more capable than me, chose one over the other for what I assume are very good reasons - I just wish I knew what they were.

On Feb 11, 2008, at 5:14 AM, Michael McCandless wrote:

In fact I've found you need to pursue both the 2x type gains and also the many smaller ones, to reach good performance. And it requires a lot of ongoing vigilance to keep good performance. You lose 3-4% here and there, and very quickly, very easily, you're 2X slower. These tests are very real. I'm indexing Wikipedia content, using StandardAnalyzer, running under contrib/benchmark.
It's true, in a real app more time will be spent pulling documents from the source, but I'm intentionally trying to minimize that in order to measure just the indexing time. Getting a 4% gain by replacing mergesort with quicksort is real. If the profiler found other 4% gains, with such a small increase in code complexity, I would passionately argue for those as well. So far it hasn't. Robert, if you have some concrete ideas for the 2X type gains, I'm all ears :) I certainly agree there is a point where the complexity cost doesn't offset the performance gain, but I think this particular change is well before that point. Lucene's indexing throughput is an important metric in its competitiveness with other search engines. And I want Lucene to be the best.

Mike

eks dev wrote:

again, as long as you do not make one step forward into actual code, we will continue to have what we have today, as this is the best we have. you made your statement: "Clear code will allow for more radical improvements as more eyes will be able to easily understand the inner workings and offer better algorithms". Not a single person here would ever dispute this statement, but unfortunately there is no compiler that executes such statements. Make a patch that utilizes this "clear-code" paradigm, show us these better algorithms on an actual example, and then say: "without LUCENE-1172 I was able to improve XYZ feature by using ABC algorithm". That would work smoothly. Anyhow, I am not going to write more on this topic, sorry for the noise... And Robert, please do not get this wrong, I see your point and I respect it! I just felt a slight unfairness to the people who get their hands dirty writing as clear and fast code as possible.

----- Original Message -----
From: robert engels <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Monday, 11 February, 2008 9:55:02 AM
Subject: Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter

I am not disputing that there is a speed improvement.
I am arguing that the performance gain of many of these patches is not worth the additional complexity in the code.

Clear code will allow for more radical improvements, as more eyes will be able to easily understand the inner workings and offer better algorithms, not just micro improvements that the JVM (eventually) can probably figure out on its own. It is a value judgement, and regretfully I don't have another 30 years to pass down the full knowledge behind my reasoning. Luckily, however, there are some very good books available on the subject...

It's not the fault of the submitter, but many of these timings are suspect due to the difficulty of measuring the improvements accurately. Here is a simple example: you can configure the JVM to not perform aggressive garbage collection, and write a program that generates a lot of garbage - but it runs very fast (not GCing), until the GC eventually occurs (if the program runs long enough). It may be overall much slower than an alternative that runs slower as it executes, but has code to manage the objects as they are created, and rarely if ever hits a GC cycle.
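One observable consequence of the mergesort-for-Objects choice discussed in this thread is stability: Arrays.sort on an Object[] is specified to preserve the input order of elements that compare equal, while the tuned quicksort used for primitives makes no such guarantee. A small stdlib-only demonstration (the class and method names here are made up for illustration):

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch: Arrays.sort on Objects is specified to be stable
// (a modified mergesort in the 1.5/1.6 JDK), so ties keep their input
// order. This is one plausible reason for the primitives/Objects split
// speculated about in the mail above.
public class StabilityDemo {
    public static String[] sortByLength(String[] words) {
        String[] copy = words.clone();
        // "bb" and "aa" compare equal by length; a stable sort must keep
        // "bb" before "aa" if it appeared first in the input.
        Arrays.sort(copy, Comparator.comparingInt(String::length));
        return copy;
    }
}
```

For primitives stability is unobservable (equal ints are indistinguishable), which is why the JDK is free to use the faster tuned quicksort there.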
Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter
Also, these couple of pages have some very good information on sorting, and why heapsort is even faster than quicksort...

http://users.aims.ac.za/~mackay/sorting/sorting.html
http://www.azillionmonkeys.com/qed/sort.html

On Feb 11, 2008, at 9:29 AM, robert engels wrote:

My intent was not to diminish your hard work. We all appreciate it. I was only trying to caution that 4% gains are not all that they seem to be.

If you look at Arrays.java in the 1.5 JDK, and read through the javadoc, you will quickly see that the sorting is well thought out. They use a tuned quicksort for primitives, which offers O(n log n) performance, and a modified mergesort for Objects, guaranteeing O(n log n) performance. A standard quicksort has worst-case performance of O(n^2)! Both use an insertion sort if the number of elements is small.

I can only assume that in their testing they chose a mergesort for Objects either: 1. to have stable sort times, or, more likely, 2. because the merge sort has a better chance of being optimized by the JIT, and/or the sequential access of elements makes for more efficient object access in the JVM. These people, who are far more capable than me, chose one over the other for what I assume are very good reasons - I just wish I knew what they were.

On Feb 11, 2008, at 5:14 AM, Michael McCandless wrote:

In fact I've found you need to pursue both the 2x type gains and also the many smaller ones, to reach good performance. And it requires a lot of ongoing vigilance to keep good performance. You lose 3-4% here and there, and very quickly, very easily, you're 2X slower. These tests are very real. I'm indexing Wikipedia content, using StandardAnalyzer, running under contrib/benchmark. It's true, in a real app more time will be spent pulling documents from the source, but I'm intentionally trying to minimize that in order to measure just the indexing time. Getting a 4% gain by replacing mergesort with quicksort is real.
If the profiler found other 4% gains, with such a small increase in code complexity, I would passionately argue for those as well. So far it hasn't. Robert, if you have some concrete ideas for the 2X-type gains, I'm all ears :) I certainly agree there is a point where complexity cost doesn't offset the performance gain, but I think this particular change is well before that point. Lucene's indexing throughput is an important metric in its competitiveness with other search engines. And I want Lucene to be the best. Mike eks dev wrote: again, as long as you do not make one step forward into actual code, we will continue to have what we have today, as this is the best we have. you made your statement: "Clear code will allow for more radical improvements as more eyes will be able to easily understand the inner workings and offer better algorithms". Not a single person here would ever dispute this statement, but unfortunately there is no compiler that executes such statements. Make a patch that utilizes this "clear-code" paradigm, show us these better algorithms on an actual example, and then say: "without LUCENE-1172 I was able to improve XYZ feature by using ABC algorithm". That would work smoothly. Anyhow, I am not going to write more on this topic, sorry for the noise... And Robert, please do not take this wrong, I see your point and I respect it! I just felt a slight unfairness to the people that get their hands dirty writing as clear and fast code as possible. - Original Message From: robert engels <[EMAIL PROTECTED]> To: java-dev@lucene.apache.org Sent: Monday, 11 February, 2008 9:55:02 AM Subject: Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter I am not disputing that there is a speed improvement. I am disputing that the performance gain of many of these patches is worth the additional complexity in the code. 
Clear code will allow for more radical improvements, as more eyes will be able to easily understand the inner workings and offer better algorithms, not just micro-improvements that the JVM (eventually) can probably figure out on its own. It is a value judgement, and regrettably I don't have another 30 years to pass down the full knowledge behind my reasoning. Luckily, however, there are some very good books available on the subject... It's not the fault of the submitter, but many of these timings are suspect due to the difficulty of measuring the improvements accurately. Here is a simple example: You can configure the JVM to not perform aggressive garbage collection, and write a program that generates a lot of garbage - but it runs very fast (not GCing), until the GC eventually occurs (if the program runs long enough). It may be overall much slower than an alternative that runs slower as it executes, but has code to manage the objects as they are created, and rarely if ever hits a GC cycle. But then the JVM (e.g. generational GC) can implement improvements that make choice A faster (and the better choice)... and the cycle continues...
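[Editorial aside, not part of the original thread: the stability property Robert attributes to the JDK's object sort is easy to demonstrate in a self-contained sketch. `Arrays.sort` on objects is a stable merge sort, so elements that compare equal keep their original relative order; the pair data below is purely hypothetical.]

```java
import java.util.Arrays;
import java.util.Comparator;

public class StableSortDemo {
    public static void main(String[] args) {
        // Items tagged with insertion order; "a1" and "a2" share key 'a'.
        String[] items = {"b1", "a1", "b2", "a2"};
        // Arrays.sort on objects is a stable merge sort (O(n log n) worst
        // case), so equal keys keep their original relative order.
        Arrays.sort(items, Comparator.comparing((String s) -> s.charAt(0)));
        System.out.println(Arrays.toString(items)); // [a1, a2, b1, b2]
        if (!Arrays.toString(items).equals("[a1, a2, b1, b2]"))
            throw new AssertionError("expected stable order");
    }
}
```

A quicksort makes no such guarantee, which is one plausible reason the JDK authors reserved it for primitives, where stability is unobservable.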
Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter
My intent was not to diminish your hard work. We all appreciate it. I was only trying to caution that 4% gains are not always what they seem to be. If you look at Arrays.java in the 1.5 JDK, and read through the javadoc, you will quickly see that the sorting is well thought out. They use a tuned quicksort for primitives, which offers O(n log n) performance, and a modified mergesort for Objects, guaranteeing O(n log n) performance. A standard quicksort has worst-case performance of O(n^2)! Both use an insertion sort if the number of elements is small. I can only assume that in their testing they chose a mergesort for objects either to: 1. have stable sort times, or, more likely, 2. because the merge sort has a better chance of being optimized by the JIT, and/or the sequential access of elements makes for more efficient object access in the JVM. These people, who are far more capable than me, chose one over the other for what I assume are very good reasons - I just wish I knew what they were. On Feb 11, 2008, at 5:14 AM, Michael McCandless wrote: In fact I've found you need to pursue both the 2x-type gains and also the many smaller ones to reach good performance. And it requires a lot of ongoing vigilance to keep good performance. You lose 3-4% here and there and very quickly, very easily, you're 2X slower. These tests are very real. I'm indexing Wikipedia content, using StandardAnalyzer, running under contrib/benchmark. It's true, in a real app more time will be spent pulling documents from the source, but I'm intentionally trying to minimize that in order to measure just the indexing time. Getting a 4% gain by replacing mergesort with quicksort is real. If the profiler found other 4% gains, with such a small increase in code complexity, I would passionately argue for those as well. So far it hasn't. 
Robert, if you have some concrete ideas for the 2X-type gains, I'm all ears :) I certainly agree there is a point where complexity cost doesn't offset the performance gain, but I think this particular change is well before that point. Lucene's indexing throughput is an important metric in its competitiveness with other search engines. And I want Lucene to be the best. Mike eks dev wrote: again, as long as you do not make one step forward into actual code, we will continue to have what we have today, as this is the best we have. you made your statement: "Clear code will allow for more radical improvements as more eyes will be able to easily understand the inner workings and offer better algorithms". Not a single person here would ever dispute this statement, but unfortunately there is no compiler that executes such statements. Make a patch that utilizes this "clear-code" paradigm, show us these better algorithms on an actual example, and then say: "without LUCENE-1172 I was able to improve XYZ feature by using ABC algorithm". That would work smoothly. Anyhow, I am not going to write more on this topic, sorry for the noise... And Robert, please do not take this wrong, I see your point and I respect it! I just felt a slight unfairness to the people that get their hands dirty writing as clear and fast code as possible. - Original Message From: robert engels <[EMAIL PROTECTED]> To: java-dev@lucene.apache.org Sent: Monday, 11 February, 2008 9:55:02 AM Subject: Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter I am not disputing that there is a speed improvement. I am disputing that the performance gain of many of these patches is worth the additional complexity in the code. Clear code will allow for more radical improvements, as more eyes will be able to easily understand the inner workings and offer better algorithms, not just micro-improvements that the JVM (eventually) can probably figure out on its own. 
It is a value judgement, and regrettably I don't have another 30 years to pass down the full knowledge behind my reasoning. Luckily, however, there are some very good books available on the subject... It's not the fault of the submitter, but many of these timings are suspect due to the difficulty of measuring the improvements accurately. Here is a simple example: You can configure the JVM to not perform aggressive garbage collection, and write a program that generates a lot of garbage - but it runs very fast (not GCing), until the GC eventually occurs (if the program runs long enough). It may be overall much slower than an alternative that runs slower as it executes, but has code to manage the objects as they are created, and rarely if ever hits a GC cycle. But then the JVM (e.g. generational GC) can implement improvements that make choice A faster (and the better choice)... and the cycle continues... Without detailed timings and other metrics (GC pauses, IO, memory utilization, native compilation, etc.) most benchmarks are not very accurate or useful. There are a lot of variables to consider - maybe more so than can reasonably be considered. That is why a 4% gain is highly suspect.
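[Editorial aside, not part of the original thread: the GC-skew pitfall described above can be made visible by recording collection counts around the timed region. This is an illustrative sketch using the standard java.lang.management beans; the allocation-heavy workload is hypothetical.]

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcAwareTiming {
    // Sum collection counts across all collectors (count is -1 if undefined).
    static long gcCount() {
        long n = 0;
        for (GarbageCollectorMXBean b : ManagementFactory.getGarbageCollectorMXBeans())
            n += Math.max(0, b.getCollectionCount());
        return n;
    }

    public static void main(String[] args) {
        long gcBefore = gcCount();
        long t0 = System.nanoTime();
        // Workload that allocates heavily; a short run may finish before a
        // major collection ever happens, making it look deceptively fast.
        long sum = 0;
        for (int i = 0; i < 100_000; i++)
            sum += new StringBuilder("x").append(i).length();
        long elapsedMs = (System.nanoTime() - t0) / 1_000_000;
        long gcDelta = gcCount() - gcBefore;
        // Reporting GC cycles next to the elapsed time exposes whether the
        // benchmark simply deferred its collection cost past the finish line.
        System.out.println("elapsed=" + elapsedMs + "ms gcCycles=" + gcDelta);
        if (sum <= 0) throw new AssertionError("workload did not run");
    }
}
```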
Re: [jira] Updated: (LUCENE-1173) index corruption autoCommit=false
Michael McCandless (JIRA) wrote: > [ > https://issues.apache.org/jira/browse/LUCENE-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel > ] > > Michael McCandless updated LUCENE-1173: > --- > > Attachment: LUCENE-1173.patch > > I just sent email to java-user to give a heads up on this. > > Attached patch fixes the issue. All tests pass. > > I think we should spin 2.3.1 for this one? > +1 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Assigned: (LUCENE-1173) index corruption autoCommit=false
[ https://issues.apache.org/jira/browse/LUCENE-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-1173: -- Assignee: Michael McCandless > index corruption autoCommit=false > - > > Key: LUCENE-1173 > URL: https://issues.apache.org/jira/browse/LUCENE-1173 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.3 >Reporter: Yonik Seeley >Assignee: Michael McCandless >Priority: Critical > Attachments: indexstress.patch > > > In both Lucene 2.3 and trunk, the index becomes corrupted when > autoCommit=false -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene-based Distributed Index Leveraging Hadoop
I am guessing that the idea behind not putting the indexes in HDFS is (1) to maximize performance, and (2) that they are relatively transient - meaning the data they are created from could be in HDFS, but the indexes themselves are just local. To avoid having to recreate them, a backup copy could be kept in HDFS. Since a goal is to be able to update them (frequently), this seems like a good approach to me. Tim Andrzej Bialecki wrote: Doug Cutting wrote: My primary difference with your proposal is that I would like to support online indexing. Documents could be inserted and removed directly, and shards would synchronize changes amongst replicas, with an "eventual consistency" model. Indexes would not be stored in HDFS, but directly on the local disk of each node. Hadoop would perhaps not play a role. In many ways this would resemble CouchDB, but with explicit support for sharding and failover from the outset. It's true that searching over HDFS is slow - but I'd hate to lose all other HDFS benefits and have to start from scratch ... I wonder what would be the performance of FsDirectory over an HDFS index that is "pinned" to a local disk, i.e. a full local replica is available, with the block size of each index file equal to the file size. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1175) occasional MergeException while indexing
[ https://issues.apache.org/jira/browse/LUCENE-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567900#action_12567900 ] Yonik Seeley commented on LUCENE-1175: -- Another exception, this time during IndexReader.open() after an indexing run. {code} java.io.FileNotFoundException: _a.fdt at org.apache.lucene.store.RAMDirectory.openInput(RAMDirectory.java:234) at org.apache.lucene.store.Directory.openInput(Directory.java:104) at org.apache.lucene.index.FieldsReader.<init>(FieldsReader.java:75) at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:308) at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262) at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:197) at org.apache.lucene.index.MultiSegmentReader.<init>(MultiSegmentReader.java:55) at org.apache.lucene.index.DirectoryIndexReader$1.doBody(DirectoryIndexReader.java:91) at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:651) at org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:79) at org.apache.lucene.index.IndexReader.open(IndexReader.java:209) at org.apache.lucene.index.IndexReader.open(IndexReader.java:192) at org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:161) at org.apache.lucene.index.TestStressIndexing2.testMultiConfig(TestStressIndexing2.java:72) {code} > occasional MergeException while indexing > > > Key: LUCENE-1175 > URL: https://issues.apache.org/jira/browse/LUCENE-1175 > Project: Lucene - Java > Issue Type: Bug >Affects Versions: 2.3 >Reporter: Yonik Seeley > > TestStressIndexing2.testMultiConfig occasionally hits merge exceptions -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1175) occasional MergeException while indexing
[ https://issues.apache.org/jira/browse/LUCENE-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567880#action_12567880 ] Yonik Seeley commented on LUCENE-1175: -- OK, not much info to reproduce at this point, except to set the iterations to 100 on testMultiConfig and let it run for a while. Here is an example exception: {code} Exception in thread "Lucene Merge Thread #1" org.apache.lucene.index.MergePolicy$MergeException: java.io.FileNotFoundException: _5_1.del at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:320) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:297) Caused by: java.io.FileNotFoundException: _5_1.del at org.apache.lucene.store.RAMDirectory.fileLength(RAMDirectory.java:167) at org.apache.lucene.index.SegmentInfo.sizeInBytes(SegmentInfo.java:216) at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3750) at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3354) at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:211) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:266) {code} It could potentially either be a problem in indexing, or in RAMDirectory. > occasional MergeException while indexing > > > Key: LUCENE-1175 > URL: https://issues.apache.org/jira/browse/LUCENE-1175 > Project: Lucene - Java > Issue Type: Bug >Affects Versions: 2.3 >Reporter: Yonik Seeley > > TestStressIndexing2.testMultiConfig occasionally hits merge exceptions -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-1175) occasional MergeException while indexing
occasional MergeException while indexing Key: LUCENE-1175 URL: https://issues.apache.org/jira/browse/LUCENE-1175 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.3 Reporter: Yonik Seeley TestStressIndexing2.testMultiConfig occasionally hits merge exceptions -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-1171) Make DocumentsWriter more robust on hitting OOM
[ https://issues.apache.org/jira/browse/LUCENE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1171. Resolution: Fixed > Make DocumentsWriter more robust on hitting OOM > --- > > Key: LUCENE-1171 > URL: https://issues.apache.org/jira/browse/LUCENE-1171 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.4 > > Attachments: LUCENE-1171.patch > > > I've been stress testing DocumentsWriter by indexing wikipedia, but not > giving enough memory to the JVM, in varying heap sizes to tickle the > different interesting cases. Sometimes DocumentsWriter can deadlock; > other times it will hit a subsequent NPE or AIOOBE or assertion > failure. > I've fixed all the cases I've found, and added some more asserts. Now > it just produces plain OOM exceptions. All changes are contained to > DocumentsWriter.java. > All tests pass. I plan to commit in a day or two! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
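[Editorial aside, not part of the original thread: the hardening described in LUCENE-1171 follows a general pattern: treat an OutOfMemoryError as a poison state so later calls fail cleanly instead of deadlocking or operating on half-updated state. The class below is a generic, hypothetical sketch of that pattern, not the actual DocumentsWriter code.]

```java
public class OomHardenedBuffer {
    private final StringBuilder buf = new StringBuilder();
    private boolean aborting = false;

    // On OutOfMemoryError, mark this writer as aborting and rethrow, so a
    // later flush cannot run against partially updated internal state.
    public void add(String doc) {
        if (aborting)
            throw new IllegalStateException("writer hit OOM; must be aborted/closed");
        try {
            buf.append(doc);
        } catch (OutOfMemoryError e) {
            aborting = true;
            throw e;
        }
    }

    public String flush() {
        if (aborting)
            throw new IllegalStateException("cannot flush after OOM");
        String out = buf.toString();
        buf.setLength(0);
        return out;
    }

    public static void main(String[] args) {
        OomHardenedBuffer w = new OomHardenedBuffer();
        w.add("doc1");
        if (!w.flush().equals("doc1")) throw new AssertionError();
        System.out.println("ok");
    }
}
```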
Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter
In fact I've found you need to pursue both the 2x-type gains and also the many smaller ones to reach good performance. And it requires a lot of ongoing vigilance to keep good performance. You lose 3-4% here and there and very quickly, very easily, you're 2X slower. These tests are very real. I'm indexing Wikipedia content, using StandardAnalyzer, running under contrib/benchmark. It's true, in a real app more time will be spent pulling documents from the source, but I'm intentionally trying to minimize that in order to measure just the indexing time. Getting a 4% gain by replacing mergesort with quicksort is real. If the profiler found other 4% gains, with such a small increase in code complexity, I would passionately argue for those as well. So far it hasn't. Robert, if you have some concrete ideas for the 2X-type gains, I'm all ears :) I certainly agree there is a point where complexity cost doesn't offset the performance gain, but I think this particular change is well before that point. Lucene's indexing throughput is an important metric in its competitiveness with other search engines. And I want Lucene to be the best. Mike eks dev wrote: again, as long as you do not make one step forward into actual code, we will continue to have what we have today, as this is the best we have. you made your statement: "Clear code will allow for more radical improvements as more eyes will be able to easily understand the inner workings and offer better algorithms". Not a single person here would ever dispute this statement, but unfortunately there is no compiler that executes such statements. Make a patch that utilizes this "clear-code" paradigm, show us these better algorithms on an actual example, and then say: "without LUCENE-1172 I was able to improve XYZ feature by using ABC algorithm". That would work smoothly. Anyhow, I am not going to write more on this topic, sorry for the noise... And Robert, please do not take this wrong, I see your point and I respect it! 
I just felt a slight unfairness to the people that get their hands dirty writing as clear and fast code as possible. - Original Message From: robert engels <[EMAIL PROTECTED]> To: java-dev@lucene.apache.org Sent: Monday, 11 February, 2008 9:55:02 AM Subject: Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter I am not disputing that there is a speed improvement. I am disputing that the performance gain of many of these patches is worth the additional complexity in the code. Clear code will allow for more radical improvements, as more eyes will be able to easily understand the inner workings and offer better algorithms, not just micro-improvements that the JVM (eventually) can probably figure out on its own. It is a value judgement, and regrettably I don't have another 30 years to pass down the full knowledge behind my reasoning. Luckily, however, there are some very good books available on the subject... It's not the fault of the submitter, but many of these timings are suspect due to the difficulty of measuring the improvements accurately. Here is a simple example: You can configure the JVM to not perform aggressive garbage collection, and write a program that generates a lot of garbage - but it runs very fast (not GCing), until the GC eventually occurs (if the program runs long enough). It may be overall much slower than an alternative that runs slower as it executes, but has code to manage the objects as they are created, and rarely if ever hits a GC cycle. But then the JVM (e.g. generational GC) can implement improvements that make choice A faster (and the better choice)... and the cycle continues... Without detailed timings and other metrics (GC pauses, IO, memory utilization, native compilation, etc.) most benchmarks are not very accurate or useful. There are a lot of variables to consider - maybe more so than can reasonably be considered. That is why a 4% gain is highly suspect. 
If the gain was 25%, or 50%, or 100%, you have a better chance of it being an innate improvement, and not just the interaction of some other factors. On Feb 11, 2008, at 2:32 AM, eks dev wrote: Robert, you may or may not be right, I do not know. The only way to prove it would be to show you can do it better, no? If you are so convinced this is wrong, you could, much better than quoting textbooks: a) write a better patch, get attention with something you think is a "better bottleneck" b) provide realistic "performance tests", as you dispute the measurement provided here. It has to be that concrete; academic discussions are cool, but at the end of the day, it is the code that executes that counts. cheers, eks - Original Message From: robert engels <[EMAIL PROTECTED]> To: java-dev@lucene.apache.org Sent: Sunday, 10 February, 2008 9:15:30 PM Subject: Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter I am not sure these numbers matter. I think they are skewed because you are probably running too short a test, and the index is in memory (or OS cache).
Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter
again, as long as you do not make one step forward into actual code, we will continue to have what we have today, as this is the best we have. you made your statement: "Clear code will allow for more radical improvements as more eyes will be able to easily understand the inner workings and offer better algorithms". Not a single person here would ever dispute this statement, but unfortunately there is no compiler that executes such statements. Make a patch that utilizes this "clear-code" paradigm, show us these better algorithms on an actual example, and then say: "without LUCENE-1172 I was able to improve XYZ feature by using ABC algorithm". That would work smoothly. Anyhow, I am not going to write more on this topic, sorry for the noise... And Robert, please do not take this wrong, I see your point and I respect it! I just felt a slight unfairness to the people that get their hands dirty writing as clear and fast code as possible. - Original Message From: robert engels <[EMAIL PROTECTED]> To: java-dev@lucene.apache.org Sent: Monday, 11 February, 2008 9:55:02 AM Subject: Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter I am not disputing that there is a speed improvement. I am disputing that the performance gain of many of these patches is worth the additional complexity in the code. Clear code will allow for more radical improvements, as more eyes will be able to easily understand the inner workings and offer better algorithms, not just micro-improvements that the JVM (eventually) can probably figure out on its own. It is a value judgement, and regrettably I don't have another 30 years to pass down the full knowledge behind my reasoning. Luckily, however, there are some very good books available on the subject... It's not the fault of the submitter, but many of these timings are suspect due to the difficulty of measuring the improvements accurately. 
Here is a simple example: You can configure the JVM to not perform aggressive garbage collection, and write a program that generates a lot of garbage - but it runs very fast (not GCing), until the GC eventually occurs (if the program runs long enough). It may be overall much slower than an alternative that runs slower as it executes, but has code to manage the objects as they are created, and rarely if ever hits a GC cycle. But then the JVM (e.g. generational GC) can implement improvements that make choice A faster (and the better choice)... and the cycle continues... Without detailed timings and other metrics (GC pauses, IO, memory utilization, native compilation, etc.) most benchmarks are not very accurate or useful. There are a lot of variables to consider - maybe more so than can reasonably be considered. That is why a 4% gain is highly suspect. If the gain was 25%, or 50%, or 100%, you have a better chance of it being an innate improvement, and not just the interaction of some other factors. On Feb 11, 2008, at 2:32 AM, eks dev wrote: > Robert, > > you may or may not be right, I do not know. The only way to prove > it would be to show you can do it better, no? > If you are so convinced this is wrong, you could, much better than > quoting textbooks: > > a) write a better patch, get attention with something you think is > a "better bottleneck" > b) provide realistic "performance tests", as you dispute the > measurement provided here > > It has to be that concrete; academic discussions are cool, but at > the end of the day, it is the code that executes that counts. > > cheers, > eks > > - Original Message > From: robert engels <[EMAIL PROTECTED]> > To: java-dev@lucene.apache.org > Sent: Sunday, 10 February, 2008 9:15:30 PM > Subject: Re: [jira] Created: (LUCENE-1172) Small speedups to > DocumentsWriter > > I am not sure these numbers matter. I think they are skewed because > you are probably running too short a test, and the index is in memory > (or OS cache). 
> > Once you use a real index that needs to read/write from the disk, the > percentage change will be negligible. > > This is the problem with many of these "performance changes" - they > just aren't real world enough. Even if they were, I would argue that > code simplicity/maintainability is worth more than 6 seconds on an > operation that takes 4 minutes to run... > > There are many people that believe micro benchmarks are next to > worthless. A good rule of thumb is that if the optimization doesn't > result in a 2x speedup, it probably shouldn't be done. In most cases > any efficiency gains are later lost in maintainability issues. > > See http://en.wikipedia.org/wiki/Optimization_(computer_science) > > Almost always there is a better bottleneck somewhere. > > On Feb 10, 2008, at 1:37 PM, Michael McCandless wrote: > >> >> Yonik Seeley wrote: >> >>> I wonder how well a single generic quickSort(Object[] arr, int low, >>> int high) would perform vs the type-specific ones? I guess the main >>> overhead would be a cast from Object to the specific class to do >>> the compare? Too bad Java doesn't have true generics/templates. >> >> OK I tested this.
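[Editorial aside, not part of the original thread: a single generic quickSort along the lines Yonik describes might look like the sketch below; the per-comparison cast to `Comparable` is exactly the overhead he asks about. This is an illustrative version, not the code that was actually benchmarked.]

```java
public class GenericQuickSort {
    // One Object[] quicksort for all element types, paying a cast to
    // Comparable on every comparison instead of keeping a type-specific
    // copy of the sort per class.
    @SuppressWarnings({"unchecked", "rawtypes"})
    static void quickSort(Object[] arr, int low, int high) {
        int i = low, j = high;
        Comparable pivot = (Comparable) arr[(low + high) >>> 1];
        while (i <= j) {
            while (((Comparable) arr[i]).compareTo(pivot) < 0) i++;
            while (((Comparable) arr[j]).compareTo(pivot) > 0) j--;
            if (i <= j) {
                Object tmp = arr[i]; arr[i] = arr[j]; arr[j] = tmp;
                i++; j--;
            }
        }
        if (low < j) quickSort(arr, low, j);
        if (i < high) quickSort(arr, i, high);
    }

    public static void main(String[] args) {
        Integer[] a = {5, 1, 4, 2, 3};
        quickSort(a, 0, a.length - 1);
        System.out.println(java.util.Arrays.toString(a)); // [1, 2, 3, 4, 5]
        if (!java.util.Arrays.toString(a).equals("[1, 2, 3, 4, 5]"))
            throw new AssertionError();
    }
}
```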
[jira] Commented: (LUCENE-167) [PATCH] QueryParser not handling queries containing AND and OR
[ https://issues.apache.org/jira/browse/LUCENE-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567556#action_12567556 ] Graham Maloon commented on LUCENE-167: -- I see that very little has been done with this since 2005. Are there any plans to incorporate a fix into the current build? How can I get my hands on a copy of the fix that will work with 2.3.0? > [PATCH] QueryParser not handling queries containing AND and OR > -- > > Key: LUCENE-167 > URL: https://issues.apache.org/jira/browse/LUCENE-167 > Project: Lucene - Java > Issue Type: Bug > Components: QueryParser >Affects Versions: unspecified > Environment: Operating System: Linux > Platform: PC >Reporter: Morus Walter >Assignee: Erik Hatcher > Attachments: LuceneTest.java, QueryParser.jj.patch, QueryParser.patch > > > The QueryParser does not seem to handle boolean queries containing AND and OR > operators correctly: > e.g. > a AND b OR c AND d gets parsed as +a +b +c +d. > The attached patch fixes this by changing the vector of boolean clauses into a > vector of vectors of boolean clauses in the addClause method of the query > parser. A new sub-vector is created whenever an explicit OR operator is used. > Queries using explicit AND/OR are grouped by precedence of AND over OR. That > is, > a OR b AND c gets parsed as a OR (b AND c). > Queries using implicit AND/OR (depending on the default operator) are handled > as > before (so one can still use a +b -c to create one boolean query, where b is > required, c forbidden and a optional). > It's less clear how a query using both explicit AND/OR and implicit operators > should be handled. > Since the patch groups on explicit OR operators, a query > a OR b c is read as a (b c) > whereas > a AND b c as +a +b c > (given that default operator or is used). > There's one issue left: > The old query parser reads a query > `a OR NOT b' as `a -b' which is the same as `a AND NOT b'. > The modified query parser reads this as `a (-b)'. 
> While this looks better (at least to me), it does not produce the result of a > OR > NOT b. Instead the (-b) part seems to be silently dropped. > While I understand that this query is illegal (just searching for one negative > term) I don't think that silently dropping this part is an appropriate way to > deal with that. But I don't think that's a query parser issue. > The only question is if the query parser should take care of that. > I attached the patch (made against 1.3rc3 but working for 1.3final as well) > and > a test program. > The test program parses a number of queries with the default-or and default-and > operators and reparses the result of the toString method of the created query. > It outputs the initial query, the parsed query with default or, its reparsed > query, the parsed query with default and, and its reparsed query. > If called with a -q option, it also runs the queries against an index > consisting of all documents containing one or none of a, b, c or d. > Using an unpatched and a patched version of lucene in the classpath one can > look > at the effect of the patch in detail. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
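[Editorial aside, not part of the original thread: the precedence rule the patch describes (AND binds tighter than OR, so clauses are grouped into sub-vectors at each explicit OR) can be sketched with plain string handling. This toy version only handles flat queries with explicit operators and is not the QueryParser code itself.]

```java
import java.util.ArrayList;
import java.util.List;

public class AndOrPrecedence {
    // Split on OR first, then let AND bind tighter inside each group:
    // "a OR b AND c" -> "a OR (b AND c)".
    static String parse(String query) {
        List<String> orGroups = new ArrayList<>();
        for (String group : query.split("\\s+OR\\s+")) {
            String[] terms = group.split("\\s+AND\\s+");
            orGroups.add(terms.length == 1 ? terms[0]
                    : "(" + String.join(" AND ", terms) + ")");
        }
        return String.join(" OR ", orGroups);
    }

    public static void main(String[] args) {
        System.out.println(parse("a OR b AND c"));       // a OR (b AND c)
        System.out.println(parse("a AND b OR c AND d")); // (a AND b) OR (c AND d)
        if (!parse("a OR b AND c").equals("a OR (b AND c)"))
            throw new AssertionError();
        if (!parse("a AND b OR c AND d").equals("(a AND b) OR (c AND d)"))
            throw new AssertionError();
    }
}
```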
[jira] Commented: (LUCENE-1170) query with AND and OR not retrieving correct results
[ https://issues.apache.org/jira/browse/LUCENE-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567550#action_12567550 ] Graham Maloon commented on LUCENE-1170: --- Lucene-167 has a patch for the version in 2005. Has this not been incorporated into the newer releases to fix this problem? > query with AND and OR not retrieving correct results > > > Key: LUCENE-1170 > URL: https://issues.apache.org/jira/browse/LUCENE-1170 > Project: Lucene - Java > Issue Type: Bug > Components: QueryParser >Affects Versions: 2.3 > Environment: linux and windows >Reporter: Graham Maloon > > I was working with Lucene 1.4, and have now upgraded to 2.3.0 but there is > still a problem that I am experiencing with the Queryparser > > I am passing the following queries: > > "big brother" - works fine > "big brother" AND dubai - works fine > "big brother" AND football - works fine > "big brother" AND dubai OR football - returns extra documents which contain > "big brother" but do not contain either dubai or football. > "big brother" AND (dubai OR football) gives the same as the one above > > Am I doing something wrong? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter
The reason it needs to (or should) be done on Unix is that it is much easier (and better, I think) at reporting the "real" timings. What the reporter stated was (most likely) real time, which is not the best way to measure performance - especially on multi-user/multitasking OSes. The Unix time facilities give a better picture of exactly why the program took the amount of time to execute. On Feb 11, 2008, at 1:07 AM, Mike Klaas wrote: Certainly others do agree with you to some degree that this case is on the cost/benefit borderline. Again, this case wasn't really the point. My point was it feels to me that you have, on occasion, been over-quick to criticize without paying sufficiently respectful attention to the details of what is being discussed. For instance, the criticism of "these tests should be done on a *nix platform" to someone who has repeated the tests on OS X (yes, a *nix) and Windows. Or that the test is too short and the index in memory (it was 10MM docs with term vecs on FSDirectory. It is possible that some of the index wasn't fsync'd at the end of each test, I suppose, but I would expect this to be a small amount and equivalent in pre- and post-patch scenarios). Or calling a full index run of 10MM docs a "micro benchmark". I do think that I was unchill in sending the original post to the list instead of to you via personal mail. I shouldn't have. regards, -Mike On 10-Feb-08, at 7:33 PM, robert engels wrote: Please chill. You are inferring something that was not implied. You may think it lacks perspective and respect (I disagree on both), but it certainly doesn't lack in correctness. First, depending on how you measure it, a 2x speedup equates to a 50% reduction in time. In my review of the changes that brought about the biggest performance gains from 1.9 on, almost all were related to avoiding disk accesses by buffering more documents and doing more processing in memory. 
I don't think many of the micro-benchmarks mattered much, and in a JVM environment it is very difficult to prove, as it is going to be heavily JVM and configuration dependent. The main point was that ANY disk access is going to be ORDERS OF MAGNITUDE slower than any of these sorts of optimizations. So either you are loading the index completely in memory (only small indexes, so the difference in speed is not going to matter much), or you might be using a federated system of memory indices (to form a large index), but USUALLY at some point the index must first be created in a persistent store (which is what is covered here), in order to provide realistic restart times, etc.

The author of the patch and timings gives no information as to disk speed, IO speed, controllers, RAID configuration, etc. When creating an index in a persistent store, these factors matter more than a 2-4% speedup. Creating an index completely in memory is then bound by the reading of the data from the disk and/or the network - all much slower than the actual indexing.

Usually optimizations like this only matter in areas of development where the data set is small but the processing large (a lot of numerical analysis). In some cases the data set may also be "large", but then usually the processing is exponentially larger. The building of the index in Lucene is not very computationally expensive.

If you are going to spend hundreds of hours "optimizing", you had best be optimizing the right things. That was the point of the link I sent (the quotes are from people far more capable than I). I was trying to make the point that a 2-4% speedup probably doesn't amount to much in a real environment given all of the other factors, so it is probably better for the project/community to err on the side of code clarity and ease of maintenance. The project can continue to do what it wants (obviously) - but what I was pointing out should be nothing new to experienced designers/developers - I was only offering a reminder.
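[Editorial note] The claim above - that buffering in memory dwarfs instruction-level tweaks because each disk access is orders of magnitude slower - can be sketched with a self-contained comparison of per-byte writes against a buffered stream. Nothing here is Lucene-specific; the byte count and file names are arbitrary choices for the demo:

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferingDemo {
    // Write `bytes` single bytes through `out` and return elapsed milliseconds.
    static long timeWrite(OutputStream out, int bytes) throws IOException {
        long t0 = System.nanoTime();
        for (int i = 0; i < bytes; i++) out.write(i & 0xFF);  // one write call per byte
        out.close();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) throws IOException {
        Path a = Files.createTempFile("raw", ".bin");
        Path b = Files.createTempFile("buf", ".bin");
        int n = 200_000;

        // Unbuffered: every write is a trip to the OS.
        long raw = timeWrite(new FileOutputStream(a.toFile()), n);
        // Buffered: writes accumulate in memory and flush in large chunks.
        long buf = timeWrite(new BufferedOutputStream(new FileOutputStream(b.toFile())), n);

        System.out.println("unbuffered: " + raw + " ms, buffered: " + buf + " ms");
        Files.delete(a);
        Files.delete(b);
    }
}
```

On a typical machine the buffered variant wins by a wide margin, which is the same effect as batching more documents in memory before touching the index files.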
It is my observation (others will disagree!), but I think a lot of Lucene has some unneeded esoteric code, where the benefit doesn't match the cost.

On Feb 10, 2008, at 5:48 PM, Mike Klaas wrote:

While I agree in general that excessive optimization at the expense of code clarity is undesirable, you are overstating the point. 2x is a ridiculous threshold to apply to something as performance-critical as a full-text search engine. If search were twice as slow, Lucene would be utterly unusable for me. Indexing is less important than search, of course, but a 2x slowdown would be quite painful there too.

I don't have an opinion in this case: I believe that there is a tradeoff, but that it is the responsibility of the committer(s) to achieve the correct balance--they are the ones who will be maintaining the code, after all. I find your persistence surprising and your tone dangerously near condescending. Telling the gu
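[Editorial note] The "real" vs. CPU time distinction robert wants from the unix time facilities can also be observed from inside the JVM. A minimal sketch using ThreadMXBean (assumes the JVM supports per-thread CPU timing, which most HotSpot builds do):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class TimingDemo {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isCurrentThreadCpuTimeSupported()) {
            System.out.println("per-thread CPU timing unsupported on this JVM");
            return;
        }
        long wallStart = System.nanoTime();
        long cpuStart = bean.getCurrentThreadCpuTime();

        Thread.sleep(200);              // blocked: wall clock advances, CPU time does not
        long x = 0;
        for (int i = 0; i < 50_000_000; i++) x += i;   // busy loop: both advance

        long wallMs = (System.nanoTime() - wallStart) / 1_000_000;
        long cpuMs = (bean.getCurrentThreadCpuTime() - cpuStart) / 1_000_000;
        // Analogous to `time`'s real vs. user: the 200 ms of sleep shows up
        // only in wall-clock time, just as IO waits and scheduling delays would.
        System.out.println("real ~" + wallMs + " ms, cpu ~" + cpuMs + " ms (x=" + x + ")");
    }
}
```

A benchmark reporting only wall-clock time on a multi-tasking OS conflates the program's own work with GC pauses, IO waits, and other processes - which is the measurement complaint made in this thread.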
Re: [jira] Commented: (LUCENE-1173) index corruption autoCommit=false
Yonik Seeley (JIRA) wrote:
> [
> https://issues.apache.org/jira/browse/LUCENE-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567878#action_12567878
> ]
>
> Yonik Seeley commented on LUCENE-1173:
> --------------------------------------
>
> Hold up a bit... my random testing may have hit another bug -
> testMultiConfig hit an error at some point when I cranked up the
> iterations... I'm trying to reproduce.
>

OK, I suggest that we wait a couple of days before we cut 2.3.1 in case there are more problems. We should backport the patches and commit them to the 2.3 branch. At the end of this week I'll then create a 2.3.1 tag, build the release artifacts, and call a vote. Sounds good?

-Michael
[jira] Commented: (LUCENE-1173) index corruption autoCommit=false
[ https://issues.apache.org/jira/browse/LUCENE-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567878#action_12567878 ]

Yonik Seeley commented on LUCENE-1173:
--------------------------------------

Hold up a bit... my random testing may have hit another bug -
testMultiConfig hit an error at some point when I cranked up the
iterations... I'm trying to reproduce.

> index corruption autoCommit=false
> ---------------------------------
>
>                 Key: LUCENE-1173
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1173
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.3
>            Reporter: Yonik Seeley
>            Assignee: Michael McCandless
>            Priority: Critical
>         Attachments: indexstress.patch, indexstress.patch, LUCENE-1173.patch
>
>
> In both Lucene 2.3 and trunk, the index becomes corrupted when
> autoCommit=false
[jira] Commented: (LUCENE-1173) index corruption autoCommit=false
[ https://issues.apache.org/jira/browse/LUCENE-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567873#action_12567873 ]

Yonik Seeley commented on LUCENE-1173:
--------------------------------------

Patch looks good (heh... a one-liner!) At least it won't break previously working code, since autoCommit=true is the default. The only risk is people trying out the new setting and not realizing it can break things.

2.3.1 might be nice, but I'll leave it to others (who have the actual time to do the work) to decide.

> index corruption autoCommit=false
> ---------------------------------
>
>                 Key: LUCENE-1173
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1173
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.3
>            Reporter: Yonik Seeley
>            Assignee: Michael McCandless
>            Priority: Critical
>         Attachments: indexstress.patch, indexstress.patch, LUCENE-1173.patch
>
>
> In both Lucene 2.3 and trunk, the index becomes corrupted when
> autoCommit=false
Re: [jira] Updated: (LUCENE-1173) index corruption autoCommit=false
OK, I'll backport this fix. I'd also like to backport LUCENE-1168 (another corruption case when autoCommit=false) and LUCENE-1171 (deadlock on hitting OOM).

Mike

Michael Busch wrote:

Michael McCandless (JIRA) wrote:
[ https://issues.apache.org/jira/browse/LUCENE-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1173:
---------------------------------------

Attachment: LUCENE-1173.patch

I just sent email to java-user to give a heads-up on this. Attached patch fixes the issue. All tests pass. I think we should spin 2.3.1 for this one?

+1
Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter
Robert, you may or may not be right, I do not know. The only way to prove it would be to show you can do it better, no?

If you are so convinced this is wrong, you could, much better than quoting textbooks:
a) write a better patch, and get attention with something you think is a "better bottleneck"
b) provide realistic "performance tests", as you dispute the measurements provided here

It has to be that concrete; academic discussions are cool, but at the end of the day, it is the code that executes that counts.

cheers, eks

----- Original Message -----
From: robert engels <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Sunday, 10 February, 2008 9:15:30 PM
Subject: Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter

I am not sure these numbers matter. I think they are skewed because you are probably running too short a test, and the index is in memory (or the OS cache). Once you use a real index that needs to read/write from the disk, the percentage change will be negligible.

This is the problem with many of these "performance changes" - they just aren't real-world enough. Even if they were, I would argue that code simplicity/maintainability is worth more than 6 seconds on an operation that takes 4 minutes to run...

There are many people who believe micro benchmarks are next to worthless. A good rule of thumb is that if the optimization doesn't result in a 2x speedup, it probably shouldn't be done. In most cases any efficiency gains are later lost in maintainability issues.

See http://en.wikipedia.org/wiki/Optimization_(computer_science)

Almost always there is a better bottleneck somewhere.

On Feb 10, 2008, at 1:37 PM, Michael McCandless wrote:
>
> Yonik Seeley wrote:
>
>> I wonder how well a single generic quickSort(Object[] arr, int low,
>> int high) would perform vs the type-specific ones? I guess the main
>> overhead would be a cast from Object to the specific class to do the
>> compare? Too bad Java doesn't have true generics/templates.
>
> OK I tested this.
>
> Starting from the patch on LUCENE-1172, which has 3 quickSort methods
> (one per type), I created a single quickSort method on Object[] that
> takes a Comparator, and made 3 Comparators instead.
>
> Mac OS X 10.4 (JVM 1.5):
>
>   original patch   --> 247.1
>   simplified patch --> 254.9 (3.2% slower)
>
> Windows Server 2003 R64 (JVM 1.6):
>
>   original patch   --> 440.6
>   simplified patch --> 452.7 (2.7% slower)
>
> The times are the best of 10 runs. I'm running all tests with these
> JVM args:
>
>   -Xms1024M -Xmx1024M -Xbatch -server
>
> I think this is a big enough difference in performance that it's
> worth keeping 3 separate quickSorts in DocumentsWriter.
>
> Mike
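[Editorial note] The ~3% gap Michael measured between the type-specific quickSorts and a single generic one comes down to paying a virtual Comparator call (plus casts) per comparison, where the type-specific version inlines to a direct primitive compare. A minimal sketch of the two shapes - standalone illustrative methods, not the actual DocumentsWriter patch code:

```java
import java.util.Comparator;

public class SortShapes {
    // Generic form: one quicksort for any Object[], paying a Comparator
    // call (and casts inside the Comparator) on every comparison.
    static void quickSort(Object[] a, int lo, int hi, Comparator<Object> c) {
        if (lo >= hi) return;
        Object pivot = a[(lo + hi) >>> 1];
        int i = lo, j = hi;
        while (i <= j) {
            while (c.compare(a[i], pivot) < 0) i++;
            while (c.compare(a[j], pivot) > 0) j--;
            if (i <= j) {
                Object t = a[i]; a[i] = a[j]; a[j] = t;
                i++; j--;
            }
        }
        quickSort(a, lo, j, c);
        quickSort(a, i, hi, c);
    }

    // Type-specific form: the comparison is a direct int compare the JIT
    // can inline, which is where the measured 2.7-3.2% comes back.
    static void quickSort(int[] a, int lo, int hi) {
        if (lo >= hi) return;
        int pivot = a[(lo + hi) >>> 1];
        int i = lo, j = hi;
        while (i <= j) {
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) {
                int t = a[i]; a[i] = a[j]; a[j] = t;
                i++; j--;
            }
        }
        quickSort(a, lo, j);
        quickSort(a, i, hi);
    }
}
```

Duplicating the method per type trades a small amount of code clarity for avoiding that per-comparison indirection - which is precisely the clarity-vs-speed trade being argued over in this thread.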