[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2018-01-08 Thread henrify
Github user henrify commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r160318348 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java --- @@ -0,0 +1,605 @@ +/* + * Licensed

[GitHub] spark issue #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2018-01-09 Thread henrify
Github user henrify commented on the issue: https://github.com/apache/spark/pull/19943 Great job, guys! Also, check the spam folder of your public GitHub email address for a small gift @dongjoon-hyun @cloud-fan @viirya @kiszk @HyukjinKwon @mmccline

[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2018-01-07 Thread henrify
Github user henrify commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r160078679 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/JavaOrcColumnarBatchReader.java --- @@ -0,0 +1,503

[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2018-01-07 Thread henrify
Github user henrify commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r160084073 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/JavaOrcColumnarBatchReader.java --- @@ -0,0 +1,503

[GitHub] spark issue #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2018-01-08 Thread henrify
Github user henrify commented on the issue: https://github.com/apache/spark/pull/19943 @cloud-fan Oh, you are right, it is indeed byte[][]. The BytesColumnVector has separate per-row offset and length arrays, which seemed to indicate that it would be a contiguous block

[GitHub] spark issue #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2018-01-08 Thread henrify
Github user henrify commented on the issue: https://github.com/apache/spark/pull/19943 @dongjoon-hyun Great, so it is a bit faster with putX, but not by much. I'm still concerned about how well the big nextBatch() method gets optimized; the JVM can bail out of optimizing complex

[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2018-01-08 Thread henrify
Github user henrify commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r160094778 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/JavaOrcColumnarBatchReader.java --- @@ -0,0 +1,482

[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2018-01-08 Thread henrify
Github user henrify commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r160108646 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/JavaOrcColumnarBatchReader.java --- @@ -0,0 +1,510

[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2018-01-07 Thread henrify
Github user henrify commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r160075503 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/JavaOrcColumnarBatchReader.java --- @@ -0,0 +1,503

[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2018-01-07 Thread henrify
Github user henrify commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r160082127 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/JavaOrcColumnarBatchReader.java --- @@ -0,0 +1,503

[GitHub] spark pull request #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2018-01-08 Thread henrify
Github user henrify commented on a diff in the pull request: https://github.com/apache/spark/pull/19943#discussion_r160088081 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/JavaOrcColumnarBatchReader.java --- @@ -0,0 +1,503

[GitHub] spark issue #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2018-01-08 Thread henrify
Github user henrify commented on the issue: https://github.com/apache/spark/pull/19943 @dongjoon-hyun OK, thanks. It is a pity that the single buffer cannot be used; it would have reduced the number of arraycopy() calls by 5 orders of magnitude. By the way, have you tested the inlining behaviour

[GitHub] spark issue #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2018-01-08 Thread henrify
Github user henrify commented on the issue: https://github.com/apache/spark/pull/19943 @dongjoon-hyun Thanks. I don't think it matters whether nextBatch() is inlined or not. I think what matters is 1) how the putX() and similar method calls inside the tight loops are inlined and 2) how complex

[GitHub] spark issue #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2018-01-08 Thread henrify
Github user henrify commented on the issue: https://github.com/apache/spark/pull/19943 @dongjoon-hyun nextBatch() is invoked 4096x less often than the main copy loops, so it doesn't matter much

[GitHub] spark issue #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2018-01-08 Thread henrify
Github user henrify commented on the issue: https://github.com/apache/spark/pull/19943 @dongjoon-hyun Thank you for testing the split methods. If anything, the benchmark results look a couple of percent slower now? Oh well, at least it is good to know that your code is as fast as it can

[GitHub] spark issue #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2018-01-08 Thread henrify
Github user henrify commented on the issue: https://github.com/apache/spark/pull/19943 @dongjoon-hyun It is possible that the "multiple byte arrays" case happens only on the write side when consumer code explicitly does it, and it is fine to use the single byte array an

[GitHub] spark issue #19943: [SPARK-16060][SQL] Support Vectorized ORC Reader

2017-12-26 Thread henrify
Github user henrify commented on the issue: https://github.com/apache/spark/pull/19943 If I've understood the Spark development process correctly, the 2.3 branch cut date is in a couple of days, and if this PR doesn't get merged to master very soon, it'll have to wait until 2.4, about 6