[jira] [Commented] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement
[ https://issues.apache.org/jira/browse/ACCUMULO-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15592497#comment-15592497 ] Keith Turner commented on ACCUMULO-4468: i am in favor of merging this in > accumulo.core.data.Key.equals(Key, PartialKey) improvement > -- > > Key: ACCUMULO-4468 > URL: https://issues.apache.org/jira/browse/ACCUMULO-4468 > Project: Accumulo > Issue Type: Improvement > Components: core >Affects Versions: 1.8.0 >Reporter: Will Murnane >Priority: Trivial > Labels: newbie, performance > Attachments: benchmark.tar.gz, key_comparison.patch > > > In the Key.equals(Key, PartialKey) overload, the current method compares > starting at the beginning of the key, and works its way toward the end. This > functions correctly, of course, but one of the typical uses of this method is > to compare adjacent rows to break them into larger chunks. For example, > accumulo.core.iterators.Combiner repeatedly calls this method with subsequent > pairs of keys. > I have a patch which reverses the comparison order. That is, if the method is > called with ROW_COLFAM_COLQUAL_COLVIS, it will compare visibility, cq, cf, > and finally row. This (marginally) improves the speed of comparisons in the > relatively common case where only the last part is changing, with less > complex code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement
[ https://issues.apache.org/jira/browse/ACCUMULO-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15592208#comment-15592208 ] Josh Elser commented on ACCUMULO-4468: -- [~kturner], what do you think we should do here? > accumulo.core.data.Key.equals(Key, PartialKey) improvement > -- > > Key: ACCUMULO-4468 > URL: https://issues.apache.org/jira/browse/ACCUMULO-4468 > Project: Accumulo > Issue Type: Improvement > Components: core >Affects Versions: 1.8.0 >Reporter: Will Murnane >Priority: Trivial > Labels: newbie, performance > Attachments: benchmark.tar.gz, key_comparison.patch > > > In the Key.equals(Key, PartialKey) overload, the current method compares > starting at the beginning of the key, and works its way toward the end. This > functions correctly, of course, but one of the typical uses of this method is > to compare adjacent rows to break them into larger chunks. For example, > accumulo.core.iterators.Combiner repeatedly calls this method with subsequent > pairs of keys. > I have a patch which reverses the comparison order. That is, if the method is > called with ROW_COLFAM_COLQUAL_COLVIS, it will compare visibility, cq, cf, > and finally row. This (marginally) improves the speed of comparisons in the > relatively common case where only the last part is changing, with less > complex code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement
[ https://issues.apache.org/jira/browse/ACCUMULO-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15513616#comment-15513616 ] Keith Turner commented on ACCUMULO-4468: The network code uses the [Key.compress()|https://github.com/apache/accumulo/blob/1a663143c4d81367f7703bceb8b24374d59c154e/core/src/main/java/org/apache/accumulo/core/data/Key.java#L1085] and [Key.decompress()|https://github.com/apache/accumulo/blob/1a663143c4d81367f7703bceb8b24374d59c154e/core/src/main/java/org/apache/accumulo/core/data/Key.java#L1137] functions, maybe the test could use that to simulate what rfile does. Its based on thrift stuff though, so it would clunky to use. > accumulo.core.data.Key.equals(Key, PartialKey) improvement > -- > > Key: ACCUMULO-4468 > URL: https://issues.apache.org/jira/browse/ACCUMULO-4468 > Project: Accumulo > Issue Type: Improvement > Components: core >Affects Versions: 1.8.0 >Reporter: Will Murnane >Priority: Trivial > Labels: newbie, performance > Attachments: benchmark.tar.gz, key_comparison.patch > > > In the Key.equals(Key, PartialKey) overload, the current method compares > starting at the beginning of the key, and works its way toward the end. This > functions correctly, of course, but one of the typical uses of this method is > to compare adjacent rows to break them into larger chunks. For example, > accumulo.core.iterators.Combiner repeatedly calls this method with subsequent > pairs of keys. > I have a patch which reverses the comparison order. That is, if the method is > called with ROW_COLFAM_COLQUAL_COLVIS, it will compare visibility, cq, cf, > and finally row. This (marginally) improves the speed of comparisons in the > relatively common case where only the last part is changing, with less > complex code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement
[ https://issues.apache.org/jira/browse/ACCUMULO-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15513602#comment-15513602 ] Keith Turner commented on ACCUMULO-4468: bq. That said, maybe there would be benefits to storing all the pieces of the Key in a single byte array, and maintaining indices into it to track the individual parts, rather than several smaller arrays... Key used to be like that (one large byte array with pointers). I changed it when adding support for relative compression in rfile. The reasoning behind the change was so that when rfile deserializes a key and a field is the same as the last key, it can just point to the previous byte array. This makes equality comparisons on rows or columns that are the same really fast (because the byte array is the same and equality checks that). The code that serializes keys and transfers them across the network also does this. So it may be interesting to have the test stream keys from an RFile. bq. I think it would be worth revisiting the comparison mechanism in isEqual, too, doing something like the Unsafe method used in Hadoop's FastByteComparisons class but going in reverse. Thats sounds like an interesting line of investigation. The compare methods could also leverage this technique. We may have an issue open for this already. > accumulo.core.data.Key.equals(Key, PartialKey) improvement > -- > > Key: ACCUMULO-4468 > URL: https://issues.apache.org/jira/browse/ACCUMULO-4468 > Project: Accumulo > Issue Type: Improvement > Components: core >Affects Versions: 1.8.0 >Reporter: Will Murnane >Priority: Trivial > Labels: newbie, performance > Attachments: benchmark.tar.gz, key_comparison.patch > > > In the Key.equals(Key, PartialKey) overload, the current method compares > starting at the beginning of the key, and works its way toward the end. This > functions correctly, of course, but one of the typical uses of this method is > to compare adjacent rows to break them into larger chunks. For example, > accumulo.core.iterators.Combiner repeatedly calls this method with subsequent > pairs of keys. > I have a patch which reverses the comparison order. That is, if the method is > called with ROW_COLFAM_COLQUAL_COLVIS, it will compare visibility, cq, cf, > and finally row. This (marginally) improves the speed of comparisons in the > relatively common case where only the last part is changing, with less > complex code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement
[ https://issues.apache.org/jira/browse/ACCUMULO-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15513435#comment-15513435 ] Will Murnane commented on ACCUMULO-4468: [~elserj] I'm not sure why customVanilla and standardEquals behave differently. The difference is small, and perhaps it's due to variance in the JDK used to compile the standard Accumulo JAR versus the one used to compile the benchmark code? Maybe there are effects from having the code loaded from a small JAR versus a large one? Maybe the custom WillKey class gets laid out in memory differently, and it hits instruction cache differently? This is the problem of benchmarking... RE: generation of data, yeah, the current test data... leaves something to be desired. This was basically the least-worst mechanism I could come up with in 5 minutes to generate some test data that kinda-sorta resembles our production data. If anyone has a better strategy I'm willing to do a little legwork testing other data sets. [~kturner] The parts of the key are stored on the heap somewhere, so the problem of row equality is somewhat different than the problem of comparing two contiguous byte arrays. That said, maybe there would be benefits to storing all the pieces of the Key in a single byte array, and maintaining indices into it to track the individual parts, rather than several smaller arrays... That's a big refactor, though, for an unknown change in performance. I think it would be worth revisiting the comparison mechanism in isEqual, too, doing something like the Unsafe method used in Hadoop's FastByteComparisons class but going in reverse. The CPU's speculative prefetch should work in either direction, but doing the comparison byte-at-a-time is going to be more expensive than the 64-bit comparisons that FastByteComparisons does. But that's a topic for another ticket ;) > accumulo.core.data.Key.equals(Key, PartialKey) improvement > -- > > Key: ACCUMULO-4468 > URL: https://issues.apache.org/jira/browse/ACCUMULO-4468 > Project: Accumulo > Issue Type: Improvement > Components: core >Affects Versions: 1.8.0 >Reporter: Will Murnane >Priority: Trivial > Labels: newbie, performance > Attachments: benchmark.tar.gz, key_comparison.patch > > > In the Key.equals(Key, PartialKey) overload, the current method compares > starting at the beginning of the key, and works its way toward the end. This > functions correctly, of course, but one of the typical uses of this method is > to compare adjacent rows to break them into larger chunks. For example, > accumulo.core.iterators.Combiner repeatedly calls this method with subsequent > pairs of keys. > I have a patch which reverses the comparison order. That is, if the method is > called with ROW_COLFAM_COLQUAL_COLVIS, it will compare visibility, cq, cf, > and finally row. This (marginally) improves the speed of comparisons in the > relatively common case where only the last part is changing, with less > complex code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement
[ https://issues.apache.org/jira/browse/ACCUMULO-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511949#comment-15511949 ] Josh Elser commented on ACCUMULO-4468: -- Reran Will's original test and can vouch for the results. I also modified the benchmark to do a full equality check (Row to Deletion flag) with the same data generation: {noformat} Benchmark Mode Cnt Score Error Units MyBenchmark.customVanilla thrpt 200 81.514 ± 3.989 ops/s MyBenchmark.customWill thrpt 200 96.185 ± 1.736 ops/s {noformat} I'm not super happy with the data generation actually being representative, but I am warming up to these changes having a positive net effect. I commonly think of the following representation. For each row (which would be relatively close to each other): * a few column families * 10 to 15 qualifiers spread across the families * A few timestamps spread across the keys in one row This models attributes on some "object" which is stored in one row. There is some logical partitioning of the attributes. Most attributes are written once, some are updated over time. > accumulo.core.data.Key.equals(Key, PartialKey) improvement > -- > > Key: ACCUMULO-4468 > URL: https://issues.apache.org/jira/browse/ACCUMULO-4468 > Project: Accumulo > Issue Type: Improvement > Components: core >Affects Versions: 1.8.0 >Reporter: Will Murnane >Priority: Trivial > Labels: newbie, performance > Attachments: benchmark.tar.gz, key_comparison.patch > > > In the Key.equals(Key, PartialKey) overload, the current method compares > starting at the beginning of the key, and works its way toward the end. This > functions correctly, of course, but one of the typical uses of this method is > to compare adjacent rows to break them into larger chunks. For example, > accumulo.core.iterators.Combiner repeatedly calls this method with subsequent > pairs of keys. > I have a patch which reverses the comparison order. That is, if the method is > called with ROW_COLFAM_COLQUAL_COLVIS, it will compare visibility, cq, cf, > and finally row. This (marginally) improves the speed of comparisons in the > relatively common case where only the last part is changing, with less > complex code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement
[ https://issues.apache.org/jira/browse/ACCUMULO-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511430#comment-15511430 ] Josh Elser commented on ACCUMULO-4468: -- Looking at the repo, it's important that the code is not doing a {{previous.equals(next)}} but actually {{previous.equals(next, PartialKey.ROW_COLFAM)}}. > accumulo.core.data.Key.equals(Key, PartialKey) improvement > -- > > Key: ACCUMULO-4468 > URL: https://issues.apache.org/jira/browse/ACCUMULO-4468 > Project: Accumulo > Issue Type: Improvement > Components: core >Affects Versions: 1.8.0 >Reporter: Will Murnane >Priority: Trivial > Labels: newbie, performance > Attachments: benchmark.tar.gz, key_comparison.patch > > > In the Key.equals(Key, PartialKey) overload, the current method compares > starting at the beginning of the key, and works its way toward the end. This > functions correctly, of course, but one of the typical uses of this method is > to compare adjacent rows to break them into larger chunks. For example, > accumulo.core.iterators.Combiner repeatedly calls this method with subsequent > pairs of keys. > I have a patch which reverses the comparison order. That is, if the method is > called with ROW_COLFAM_COLQUAL_COLVIS, it will compare visibility, cq, cf, > and finally row. This (marginally) improves the speed of comparisons in the > relatively common case where only the last part is changing, with less > complex code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement
[ https://issues.apache.org/jira/browse/ACCUMULO-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511418#comment-15511418 ] Keith Turner commented on ACCUMULO-4468: [~wmurnane] I like the change. I wrote the comment and made the changes in Key that you referenced. I remember doing the performance testing for that. When I first experimented with the change, I tried comparing the byte arrays in reveres order. That was much slower than comparing forward. So I avoiding that and found another strategy that worked well. I think the concept is sound, but the performance testing is definitely needed to make sure it works as intended. I also like the switch statement fall through, I think its slick. If it doesn't exists, It would be nice to add a unit test that checks for correctness for keys that only differ by one field. Basically ensure that each of the key field comparisons is tested. > accumulo.core.data.Key.equals(Key, PartialKey) improvement > -- > > Key: ACCUMULO-4468 > URL: https://issues.apache.org/jira/browse/ACCUMULO-4468 > Project: Accumulo > Issue Type: Improvement > Components: core >Affects Versions: 1.8.0 >Reporter: Will Murnane >Priority: Trivial > Labels: newbie, performance > Attachments: benchmark.tar.gz, key_comparison.patch > > > In the Key.equals(Key, PartialKey) overload, the current method compares > starting at the beginning of the key, and works its way toward the end. This > functions correctly, of course, but one of the typical uses of this method is > to compare adjacent rows to break them into larger chunks. For example, > accumulo.core.iterators.Combiner repeatedly calls this method with subsequent > pairs of keys. > I have a patch which reverses the comparison order. That is, if the method is > called with ROW_COLFAM_COLQUAL_COLVIS, it will compare visibility, cq, cf, > and finally row. This (marginally) improves the speed of comparisons in the > relatively common case where only the last part is changing, with less > complex code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement
[ https://issues.apache.org/jira/browse/ACCUMULO-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511398#comment-15511398 ] Josh Elser commented on ACCUMULO-4468: -- Great stuff, [~wmurnane]! A few thoughts: bq. the original equals() copied to the new class is called customVanilla, and the original equals() in the original Key class is called standardEquals. The numbers to compare are really the two custom* ones; the standardEquals value is given just to show that it's in the same ballpark. I'm not grok'ing the difference between customVanilla and standardEquals. Why the variance in these two? Shouldn't they be essentially equivalent? bq. As a user of the API, I'd rather not have to think about equalsForward() versus equalsBackward(). I concur with you here. bq. I agree that any change should start with prejudice against it. However, I think the numbers above prove my case: when keys are presented in sorted order, which happens often in Accumulo, the proposed method of comparing is slightly but noticeably faster. The degree of improvement depends on the data, but it doesn't perform worse than the current solution in any case that I tested. Great, I hoped you didn't take my initial prejudice badly :). This is a great start. I'm curious trying to tweak a couple of other things. Sharing your project was super useful. Ultimately, if these numbers are as they appear (better in some cases, no worse in others), this is a great improvement. Expecting large contiguous blocks of keys where row or row+cf change very infrequently makes sense to optimize. It appears that {{compareTo(Object}} is also using a separate code path, so I don't think this would have a big affect on things like creating RFiles for bulk imports. I need to search through usages though. > accumulo.core.data.Key.equals(Key, PartialKey) improvement > -- > > Key: ACCUMULO-4468 > URL: https://issues.apache.org/jira/browse/ACCUMULO-4468 > Project: Accumulo > Issue Type: Improvement > Components: core >Affects Versions: 1.8.0 >Reporter: Will Murnane >Priority: Trivial > Labels: newbie, performance > Attachments: benchmark.tar.gz, key_comparison.patch > > > In the Key.equals(Key, PartialKey) overload, the current method compares > starting at the beginning of the key, and works its way toward the end. This > functions correctly, of course, but one of the typical uses of this method is > to compare adjacent rows to break them into larger chunks. For example, > accumulo.core.iterators.Combiner repeatedly calls this method with subsequent > pairs of keys. > I have a patch which reverses the comparison order. That is, if the method is > called with ROW_COLFAM_COLQUAL_COLVIS, it will compare visibility, cq, cf, > and finally row. This (marginally) improves the speed of comparisons in the > relatively common case where only the last part is changing, with less > complex code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement
[ https://issues.apache.org/jira/browse/ACCUMULO-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511154#comment-15511154 ] Will Murnane commented on ACCUMULO-4468: Responding to [~ctubbsii]'s points first: # Most things in Accumulo are dealt with in sorted order, so I think it's reasonable to optimize for the case where things are sorted. Whether this is true in the general case is a function of how one calls the function, but it's not a change to the functionality of the code, just its performance. All the places I can see it used in Accumulo, the keys that are being compared are coming from SortedKeyValueIterator, so it's reasonable that we're comparing keys which are sorted, and the new behavior will be an improvement. # I ran some tests with JMH to see what the difference is like. I can post the whole project, but here are some selected results. There are three functions being compared. I implemented a subclass of Key which has new methods: my proposed equals() function is called customWill, the original equals() copied to the new class is called customVanilla, and the original equals() in the original Key class is called standardEquals. The numbers to compare are really the two custom* ones; the standardEquals value is given just to show that it's in the same ballpark. \\ The test data set is generated before the benchmark begins, and contains 1m keys to go through, calling previous.equals(next, ROW_COLFAM). This is a worst-case scenario for sorted input, because the most that it can manage to avoid doing (as compared to the current implementation) is comparing the row values. ## I generated rows whose rowID are all equal, and the column family changes every key. {noformat} BenchmarkMode Cnt Score Error Units MyBenchmark.customVanilla thrpt 30 46.320 ± 3.803 ops/s MyBenchmark.customWill thrpt 30 88.349 ± 2.723 ops/s MyBenchmark.standardEquals thrpt 30 36.736 ± 0.883 ops/s {noformat} ## I generated rows whose rowID are all equal, and the column family changes every 3 keys. {noformat} BenchmarkMode Cnt Score Error Units MyBenchmark.customVanilla thrpt 30 30.684 ± 1.258 ops/s MyBenchmark.customWill thrpt 30 34.292 ± 1.339 ops/s MyBenchmark.standardEquals thrpt 30 27.277 ± 0.984 ops/s {noformat} ## I generated keys whose rowID are all equal, and the column family changes every 5 keys. {noformat} BenchmarkMode Cnt Score Error Units MyBenchmark.customVanilla thrpt 30 27.195 ± 0.895 ops/s MyBenchmark.customWill thrpt 30 30.048 ± 0.838 ops/s MyBenchmark.standardEquals thrpt 30 25.044 ± 0.731 ops/s {noformat} ## Finally, I generated keys whose rowID are all equal, and the column family changes every 1000 keys. {noformat} BenchmarkMode Cnt Score Error Units MyBenchmark.customVanilla thrpt 30 23.447 ± 1.010 ops/s MyBenchmark.customWill thrpt 30 23.427 ± 0.371 ops/s MyBenchmark.standardEquals thrpt 30 22.356 ± 0.192 ops/s {noformat} # As a user of the API, I'd rather not have to think about equalsForward() versus equalsBackward(). Maybe add an optional flag to specify direction of comparison, for # I agree in general, but I would argue that the code of the current equals() method is messier and harder to read. It repeats the same code quite a bit. I wrote a comment in the patch remarking on the fact that there's fallthrough and it's used intentionally, to try to prevent future confusion. [~elserj] I agree that any change should start with prejudice against it. However, I think the numbers above prove my case: when keys are presented in sorted order, which happens often in Accumulo, the proposed method of comparing is slightly but noticeably faster. The degree of improvement depends on the data, but it doesn't perform worse than the current solution in any case that I tested. > accumulo.core.data.Key.equals(Key, PartialKey) improvement > -- > > Key: ACCUMULO-4468 > URL: https://issues.apache.org/jira/browse/ACCUMULO-4468 > Project: Accumulo > Issue Type: Improvement > Components: core >Affects Versions: 1.8.0 >Reporter: Will Murnane >Priority: Trivial > Labels: newbie, performance > Attachments: key_comparison.patch > > > In the Key.equals(Key, PartialKey) overload, the current method compares > starting at the beginning of the key, and works its way toward the end. This > functions correctly, of course, but one of the typical uses of this method is > to compare adjacent rows to break them into larger chunks. For example, > accumulo.core.iterators.Combiner repeatedly calls this method with subsequent > pairs of keys. > I have a patch which reverses
[jira] [Commented] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement
[ https://issues.apache.org/jira/browse/ACCUMULO-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510743#comment-15510743 ] Josh Elser commented on ACCUMULO-4468: -- {quote} bq.Do you have other examples of where this might be used in a tight loop? I think there are lots of other examples in Accumulo itself; for example, in iterators.system.DeletingIterator and iterators.user.TransformingIterator, this method is called in a loop. In our use case, we're doing a similar thing: building a larger object out of multiple rows, by finding groups of rows which are equal under ROW_COLFAM. When each object is built from only a few rows, the CF equality comparison returns false pretty often (which is to be expected), but only after comparing row IDs, which are always the same in practice. {quote} Perhaps my concern didn't come across. From my perspective: I am concerned with a performance change that makes one case better. We need to understand if the case you outlined is "the norm" or "the exception". I was hoping you had context on this. {quote} bq. How did you test this? What types of numbers did you see? I haven't been able to install it on a cluster to test. The test suite does pass with this patch applied. I think it's a minor change; in the "rows are equal" case the same amount of work is done as with the existing code, although the parts are accessed in the opposite order. They're still compared mostly-in-order, as isEqual does, but the comment in that function was inspiration to try reversing the comparison order. Aside from performance, the code seems cleaner to me: there's no more repetition of e.g. the check of row equality. The bytecode (with Oracle javac 1.8.0_92) is substantially smaller: 389 bytes versus 167. {quote} At risk of being anti-social, I am -1 on any change for performance without numbers coming with it. There are many great tools out there to benchmark changes and don't necessarily require use of a cluster (it might actually be harder to test on a cluster). [JMH|http://openjdk.java.net/projects/code-tools/jmh/] and [Google Caliper|https://github.com/google/caliper] are two good tools for micro-benchmarking. > accumulo.core.data.Key.equals(Key, PartialKey) improvement > -- > > Key: ACCUMULO-4468 > URL: https://issues.apache.org/jira/browse/ACCUMULO-4468 > Project: Accumulo > Issue Type: Improvement > Components: core >Affects Versions: 1.8.0 >Reporter: Will Murnane >Priority: Trivial > Labels: newbie, performance > Attachments: key_comparison.patch > > > In the Key.equals(Key, PartialKey) overload, the current method compares > starting at the beginning of the key, and works its way toward the end. This > functions correctly, of course, but one of the typical uses of this method is > to compare adjacent rows to break them into larger chunks. For example, > accumulo.core.iterators.Combiner repeatedly calls this method with subsequent > pairs of keys. > I have a patch which reverses the comparison order. That is, if the method is > called with ROW_COLFAM_COLQUAL_COLVIS, it will compare visibility, cq, cf, > and finally row. This (marginally) improves the speed of comparisons in the > relatively common case where only the last part is changing, with less > complex code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement
[ https://issues.apache.org/jira/browse/ACCUMULO-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510736#comment-15510736 ] Christopher Tubbs commented on ACCUMULO-4468: - Hi [~wmurnane]. Thanks for the patch! After reviewing, I had a few comments: 1. This may speed things up only if the probability of not being equal is greater in the lower dimensions of the key than the higher ones. I'm not sure this is the case, as I don't have a strong sense of when this method is called, or how frequently. What analysis have you done to determine these relative probabilities? 2. Have you run any experiments to determine the performance differences in various use cases? 3. If this method is primarily used in user code, would a new method to compare in reverse order be better, to account for both cases? Perhaps it'd be better to optimize the Combiner code, rather than change the default behavior for all cases? 4. I think relying on fall-through behavior in switch statements should be avoided. It's prone to error, especially as code is refactored over time. I think it's better to avoid it than to suppress the warning. This may be a style choice, but it's a preference that the java compiler weighs in on (by making fallthrough a default compiler warning), and I'd prefer to avoid behavior which results in compiler warnings whenever possible. > accumulo.core.data.Key.equals(Key, PartialKey) improvement > -- > > Key: ACCUMULO-4468 > URL: https://issues.apache.org/jira/browse/ACCUMULO-4468 > Project: Accumulo > Issue Type: Improvement > Components: core >Affects Versions: 1.8.0 >Reporter: Will Murnane >Priority: Trivial > Labels: newbie, performance > Attachments: key_comparison.patch > > > In the Key.equals(Key, PartialKey) overload, the current method compares > starting at the beginning of the key, and works its way toward the end. This > functions correctly, of course, but one of the typical uses of this method is > to compare adjacent rows to break them into larger chunks. For example, > accumulo.core.iterators.Combiner repeatedly calls this method with subsequent > pairs of keys. > I have a patch which reverses the comparison order. That is, if the method is > called with ROW_COLFAM_COLQUAL_COLVIS, it will compare visibility, cq, cf, > and finally row. This (marginally) improves the speed of comparisons in the > relatively common case where only the last part is changing, with less > complex code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement
[ https://issues.apache.org/jira/browse/ACCUMULO-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510679#comment-15510679 ] Will Murnane commented on ACCUMULO-4468: bq. Do you have other examples of where this might be used in a tight loop? I think there are lots of other examples in Accumulo itself; for example, in iterators.system.DeletingIterator and iterators.user.TransformingIterator, this method is called in a loop. In our use case, we're doing a similar thing: building a larger object out of multiple rows, by finding groups of rows which are equal under ROW_COLFAM. When each object is built from only a few rows, the CF equality comparison returns false pretty often (which is to be expected), but only after comparing row IDs, which are always the same in practice. bq. How did you test this? What types of numbers did you see? I haven't been able to install it on a cluster to test. The test suite does pass with this patch applied. I think it's a minor change; in the "rows are equal" case the same amount of work is done as with the existing code, although the parts are accessed in the opposite order. They're still compared mostly-in-order, as isEqual does, but the comment in that function was inspiration to try reversing the comparison order. Aside from performance, the code seems cleaner to me: there's no more repetition of e.g. the check of row equality. The bytecode (with Oracle javac 1.8.0_92) is substantially smaller: 389 bytes versus 167. bq. Why not consolidate this to: That's fine with me. I just wrote all the cases to look the same, instead of having a "special case" for the last comparison made. If some special work were required in the compare-equal case, it could go before the return statement. > accumulo.core.data.Key.equals(Key, PartialKey) improvement > -- > > Key: ACCUMULO-4468 > URL: https://issues.apache.org/jira/browse/ACCUMULO-4468 > Project: Accumulo > Issue Type: Improvement > Components: core >Affects Versions: 1.8.0 >Reporter: Will Murnane >Priority: Trivial > Labels: newbie, performance > Attachments: key_comparison.patch > > > In the Key.equals(Key, PartialKey) overload, the current method compares > starting at the beginning of the key, and works its way toward the end. This > functions correctly, of course, but one of the typical uses of this method is > to compare adjacent rows to break them into larger chunks. For example, > accumulo.core.iterators.Combiner repeatedly calls this method with subsequent > pairs of keys. > I have a patch which reverses the comparison order. That is, if the method is > called with ROW_COLFAM_COLQUAL_COLVIS, it will compare visibility, cq, cf, > and finally row. This (marginally) improves the speed of comparisons in the > relatively common case where only the last part is changing, with less > complex code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement
[ https://issues.apache.org/jira/browse/ACCUMULO-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510541#comment-15510541 ] Josh Elser commented on ACCUMULO-4468: -- Hi [~wmurnane]. Thanks for the patch. bq. This functions correctly, of course, but one of the typical uses of this method is to compare adjacent rows to break them into larger chunks. For example, accumulo.core.iterators.Combiner repeatedly calls this method with subsequent pairs of keys. Do you have other examples of where this might be used in a tight loop? bq. This (marginally) improves the speed of comparisons in the relatively common case where only the last part is changing, with less complex code. How did you test this? What types of numbers did you see? {code} + case ROW: +if (!isEqual(row, other.row)) + return false; +break; default: throw new IllegalArgumentException("Unrecognized partial key specification " + part); } +return true; {code} Why not consolidate this to: {code} case ROW: return isEquals(row, other.row); {code} > accumulo.core.data.Key.equals(Key, PartialKey) improvement > -- > > Key: ACCUMULO-4468 > URL: https://issues.apache.org/jira/browse/ACCUMULO-4468 > Project: Accumulo > Issue Type: Improvement > Components: core >Affects Versions: 1.8.0 >Reporter: Will Murnane >Priority: Trivial > Labels: newbie, performance > Attachments: key_comparison.patch > > > In the Key.equals(Key, PartialKey) overload, the current method compares > starting at the beginning of the key, and works its way toward the end. This > functions correctly, of course, but one of the typical uses of this method is > to compare adjacent rows to break them into larger chunks. For example, > accumulo.core.iterators.Combiner repeatedly calls this method with subsequent > pairs of keys. > I have a patch which reverses the comparison order. That is, if the method is > called with ROW_COLFAM_COLQUAL_COLVIS, it will compare visibility, cq, cf, > and finally row. This (marginally) improves the speed of comparisons in the > relatively common case where only the last part is changing, with less > complex code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)