[jira] [Commented] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement

2016-09-22 Thread Will Murnane (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15513435#comment-15513435
 ] 

Will Murnane commented on ACCUMULO-4468:


[~elserj] I'm not sure why customVanilla and standardEquals behave differently. 
The difference is small, and perhaps it's due to variance in the JDK used to 
compile the standard Accumulo JAR versus the one used to compile the benchmark 
code? Maybe there are effects from having the code loaded from a small JAR 
versus a large one? Maybe the custom WillKey class gets laid out in memory 
differently, and it hits instruction cache differently? This is the problem of 
benchmarking...

RE: generation of data, yeah, the current test data... leaves something to be 
desired. This was basically the least-worst mechanism I could come up with in 5 
minutes to generate some test data that kinda-sorta resembles our production 
data. If anyone has a better strategy I'm willing to do a little legwork 
testing other data sets.

[~kturner] The parts of the key are stored on the heap somewhere, so the 
problem of row equality is somewhat different than the problem of comparing two 
contiguous byte arrays. That said, maybe there would be benefits to storing all 
the pieces of the Key in a single byte array, and maintaining indices into it 
to track the individual parts, rather than several smaller arrays... That's a 
big refactor, though, for an unknown change in performance.

I think it would be worth revisiting the comparison mechanism in isEqual, too, 
doing something like the Unsafe method used in Hadoop's FastByteComparisons 
class but going in reverse. The CPU's speculative prefetch should work in 
either direction, but doing the comparison byte-at-a-time is going to be more 
expensive than the 64-bit comparisons that FastByteComparisons does. But that's 
a topic for another ticket ;)

> accumulo.core.data.Key.equals(Key, PartialKey) improvement
> --
>
> Key: ACCUMULO-4468
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4468
> Project: Accumulo
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 1.8.0
>Reporter: Will Murnane
>Priority: Trivial
>  Labels: newbie, performance
> Attachments: benchmark.tar.gz, key_comparison.patch
>
>
> In the Key.equals(Key, PartialKey) overload, the current method compares 
> starting at the beginning of the key, and works its way toward the end. This 
> functions correctly, of course, but one of the typical uses of this method is 
> to compare adjacent rows to break them into larger chunks. For example, 
> accumulo.core.iterators.Combiner repeatedly calls this method with subsequent 
> pairs of keys.
> I have a patch which reverses the comparison order. That is, if the method is 
> called with ROW_COLFAM_COLQUAL_COLVIS, it will compare visibility, cq, cf, 
> and finally row. This (marginally) improves the speed of comparisons in the 
> relatively common case where only the last part is changing, with less 
> complex code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement

2016-09-21 Thread Will Murnane (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Murnane updated ACCUMULO-4468:
---
Attachment: benchmark.tar.gz

mvn clean install
java -jar target/benchmark.jar -help

> accumulo.core.data.Key.equals(Key, PartialKey) improvement
> --
>
> Key: ACCUMULO-4468
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4468
> Project: Accumulo
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 1.8.0
>Reporter: Will Murnane
>Priority: Trivial
>  Labels: newbie, performance
> Attachments: benchmark.tar.gz, key_comparison.patch
>
>
> In the Key.equals(Key, PartialKey) overload, the current method compares 
> starting at the beginning of the key, and works its way toward the end. This 
> functions correctly, of course, but one of the typical uses of this method is 
> to compare adjacent rows to break them into larger chunks. For example, 
> accumulo.core.iterators.Combiner repeatedly calls this method with subsequent 
> pairs of keys.
> I have a patch which reverses the comparison order. That is, if the method is 
> called with ROW_COLFAM_COLQUAL_COLVIS, it will compare visibility, cq, cf, 
> and finally row. This (marginally) improves the speed of comparisons in the 
> relatively common case where only the last part is changing, with less 
> complex code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement

2016-09-21 Thread Will Murnane (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511154#comment-15511154
 ] 

Will Murnane commented on ACCUMULO-4468:


Responding to [~ctubbsii]'s points first:

# Most things in Accumulo are dealt with in sorted order, so I think it's 
reasonable to optimize for the case where things are sorted. Whether this is 
true in the general case is a function of how one calls the function, but it's 
not a change to the functionality of the code, just its performance. All the 
places I can see it used in Accumulo, the keys that are being compared are 
coming from SortedKeyValueIterator, so it's reasonable that we're comparing 
keys which are sorted, and the new behavior will be an improvement.
# I ran some tests with JMH to see what the difference is like. I can post the 
whole project, but here are some selected results. There are three functions 
being compared. I implemented a subclass of Key which has new methods: my 
proposed equals() function is called customWill, the original equals() copied 
to the new class is called customVanilla, and the original equals() in the 
original Key class is called standardEquals. The numbers to compare are really 
the two custom* ones; the standardEquals value is given just to show that it's 
in the same ballpark. \\ The test data set is generated before the benchmark 
begins, and contains 1m keys to go through, calling previous.equals(next, 
ROW_COLFAM). This is a worst-case scenario for sorted input, because the most 
that it can manage to avoid doing (as compared to the current implementation) 
is comparing the row values.
## I generated rows whose rowID are all equal, and the column family changes 
every key.
{noformat}
BenchmarkMode  Cnt   Score   Error  Units
MyBenchmark.customVanilla   thrpt   30  46.320 ± 3.803  ops/s
MyBenchmark.customWill  thrpt   30  88.349 ± 2.723  ops/s
MyBenchmark.standardEquals  thrpt   30  36.736 ± 0.883  ops/s
{noformat}
## I generated rows whose rowID are all equal, and the column family changes 
every 3 keys.
{noformat}
BenchmarkMode  Cnt   Score   Error  Units
MyBenchmark.customVanilla   thrpt   30  30.684 ± 1.258  ops/s
MyBenchmark.customWill  thrpt   30  34.292 ± 1.339  ops/s
MyBenchmark.standardEquals  thrpt   30  27.277 ± 0.984  ops/s
{noformat}
## I generated keys whose rowID are all equal, and the column family changes 
every 5 keys. 
{noformat}
BenchmarkMode  Cnt   Score   Error  Units
MyBenchmark.customVanilla   thrpt   30  27.195 ± 0.895  ops/s
MyBenchmark.customWill  thrpt   30  30.048 ± 0.838  ops/s
MyBenchmark.standardEquals  thrpt   30  25.044 ± 0.731  ops/s
{noformat}
## Finally, I generated keys whose rowID are all equal, and the column family 
changes every 1000 keys.
{noformat}
BenchmarkMode  Cnt   Score   Error  Units
MyBenchmark.customVanilla   thrpt   30  23.447 ± 1.010  ops/s
MyBenchmark.customWill  thrpt   30  23.427 ± 0.371  ops/s
MyBenchmark.standardEquals  thrpt   30  22.356 ± 0.192  ops/s
{noformat}
# As a user of the API, I'd rather not have to think about equalsForward() 
versus equalsBackward(). Maybe add an optional flag to specify direction of 
comparison, for 
# I agree in general, but I would argue that the code of the current equals() 
method is messier and harder to read. It repeats the same code quite a bit. I 
wrote a comment in the patch remarking on the fact that there's fallthrough and 
it's used intentionally, to try to prevent future confusion.

[~elserj] I agree that any change should start with prejudice against it. 
However, I think the numbers above prove my case: when keys are presented in 
sorted order, which happens often in Accumulo, the proposed method of comparing 
is slightly but noticeably faster. The degree of improvement depends on the 
data, but it doesn't perform worse than the current solution in any case that I 
tested.

> accumulo.core.data.Key.equals(Key, PartialKey) improvement
> --
>
> Key: ACCUMULO-4468
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4468
> Project: Accumulo
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 1.8.0
>Reporter: Will Murnane
>Priority: Trivial
>  Labels: newbie, performance
> Attachments: key_comparison.patch
>
>
> In the Key.equals(Key, PartialKey) overload, the current method compares 
> starting at the beginning of the key, and works its way toward the end. This 
> functions correctly, of course, but one of the typical uses of this method is 
> to compare adjacent rows to break them into larger chunks. For example, 
> accumulo.core.iterators.Combiner repeatedly calls this method with subsequent 
> pairs of keys.
> I have a patch which reverses 

[jira] [Commented] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement

2016-09-21 Thread Will Murnane (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510679#comment-15510679
 ] 

Will Murnane commented on ACCUMULO-4468:


bq. Do you have other examples of where this might be used in a tight loop?
I think there are lots of other examples in Accumulo itself; for example, in 
iterators.system.DeletingIterator and iterators.user.TransformingIterator, this 
method is called in a loop. In our use case, we're doing a similar thing: 
building a larger object out of multiple rows, by finding groups of rows which 
are equal under ROW_COLFAM. When each object is built from only a few rows, the 
CF equality comparison returns false pretty often (which is to be expected), 
but only after comparing row IDs, which are always the same in practice.

bq. How did you test this? What types of numbers did you see?
I haven't been able to install it on a cluster to test. The test suite does 
pass with this patch applied. I think it's a minor change; in the "rows are 
equal" case the same amount of work is done as with the existing code, although 
the parts are accessed in the opposite order. They're still compared 
mostly-in-order, as isEqual does, but the comment in that function was 
inspiration to try reversing the comparison order.

Aside from performance, the code seems cleaner to me: there's no more 
repetition of e.g. the check of row equality. The bytecode (with Oracle javac 
1.8.0_92) is substantially smaller: 389 bytes versus 167.

bq. Why not consolidate this to:
That's fine with me. I just wrote all the cases to look the same, instead of 
having a "special case" for the last comparison made. If some special work were 
required in the compare-equal case, it could go before the return statement.

> accumulo.core.data.Key.equals(Key, PartialKey) improvement
> --
>
> Key: ACCUMULO-4468
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4468
> Project: Accumulo
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 1.8.0
>Reporter: Will Murnane
>Priority: Trivial
>  Labels: newbie, performance
> Attachments: key_comparison.patch
>
>
> In the Key.equals(Key, PartialKey) overload, the current method compares 
> starting at the beginning of the key, and works its way toward the end. This 
> functions correctly, of course, but one of the typical uses of this method is 
> to compare adjacent rows to break them into larger chunks. For example, 
> accumulo.core.iterators.Combiner repeatedly calls this method with subsequent 
> pairs of keys.
> I have a patch which reverses the comparison order. That is, if the method is 
> called with ROW_COLFAM_COLQUAL_COLVIS, it will compare visibility, cq, cf, 
> and finally row. This (marginally) improves the speed of comparisons in the 
> relatively common case where only the last part is changing, with less 
> complex code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement

2016-09-21 Thread Will Murnane (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Murnane updated ACCUMULO-4468:
---
Attachment: key_comparison.patch

> accumulo.core.data.Key.equals(Key, PartialKey) improvement
> --
>
> Key: ACCUMULO-4468
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4468
> Project: Accumulo
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 1.8.0
>Reporter: Will Murnane
>Priority: Trivial
>  Labels: newbie, performance
> Attachments: key_comparison.patch
>
>
> In the Key.equals(Key, PartialKey) overload, the current method compares 
> starting at the beginning of the key, and works its way toward the end. This 
> functions correctly, of course, but one of the typical uses of this method is 
> to compare adjacent rows to break them into larger chunks. For example, 
> accumulo.core.iterators.Combiner repeatedly calls this method with subsequent 
> pairs of keys.
> I have a patch which reverses the comparison order. That is, if the method is 
> called with ROW_COLFAM_COLQUAL_COLVIS, it will compare visibility, cq, cf, 
> and finally row. This (marginally) improves the speed of comparisons in the 
> relatively common case where only the last part is changing, with less 
> complex code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement

2016-09-21 Thread Will Murnane (JIRA)
Will Murnane created ACCUMULO-4468:
--

 Summary: accumulo.core.data.Key.equals(Key, PartialKey) improvement
 Key: ACCUMULO-4468
 URL: https://issues.apache.org/jira/browse/ACCUMULO-4468
 Project: Accumulo
  Issue Type: Improvement
  Components: core
Affects Versions: 1.8.0
Reporter: Will Murnane
Priority: Trivial


In the Key.equals(Key, PartialKey) overload, the current method compares 
starting at the beginning of the key, and works its way toward the end. This 
functions correctly, of course, but one of the typical uses of this method is 
to compare adjacent rows to break them into larger chunks. For example, 
accumulo.core.iterators.Combiner repeatedly calls this method with subsequent 
pairs of keys.

I have a patch which reverses the comparison order. That is, if the method is 
called with ROW_COLFAM_COLQUAL_COLVIS, it will compare visibility, cq, cf, and 
finally row. This (marginally) improves the speed of comparisons in the 
relatively common case where only the last part is changing, with less complex 
code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)