Github user henrify commented on a diff in the pull request:
https://github.com/apache/spark/pull/19943#discussion_r160318348
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java ---
@@ -0,0 +1,605 @@
+/*
+ * Licensed
Github user henrify commented on the issue:
https://github.com/apache/spark/pull/19943
Great job, guys! Also, check the spam folder of your public GitHub email
address for a small gift @dongjoon-hyun @cloud-fan @viirya @kiszk @HyukjinKwon
@mmccline
Github user henrify commented on a diff in the pull request:
https://github.com/apache/spark/pull/19943#discussion_r160078679
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/JavaOrcColumnarBatchReader.java ---
@@ -0,0 +1,503
Github user henrify commented on a diff in the pull request:
https://github.com/apache/spark/pull/19943#discussion_r160084073
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/JavaOrcColumnarBatchReader.java ---
@@ -0,0 +1,503
Github user henrify commented on the issue:
https://github.com/apache/spark/pull/19943
@cloud-fan Oh, you are right, it is indeed byte[][]. The BytesColumnVector
has separate per-row offset and length vectors/arrays, which seemed to indicate
that it would be a contiguous block
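The layout under discussion can be mirrored with plain arrays for illustration. This is a hand-rolled sketch of the BytesColumnVector-style representation, not the actual Hive class: each row carries its own byte[] reference plus a per-row offset and length, so the backing data is a byte[][] and is not guaranteed to be one contiguous block.

```java
// Illustrative mirror of the BytesColumnVector-style layout (not the Hive class):
// a byte[][] of per-row buffers plus parallel start/length arrays. Rows may share
// one buffer or point at entirely separate ones.
public class BytesLayoutSketch {
    static byte[][] vector = new byte[3][]; // per-row buffers (may be shared or distinct)
    static int[] start = new int[3];        // per-row offset into vector[row]
    static int[] length = new int[3];       // per-row byte length

    static String[] rowsAsStrings() {
        byte[] shared = "foobar".getBytes();
        vector[0] = shared;           start[0] = 0; length[0] = 3; // "foo"
        vector[1] = shared;           start[1] = 3; length[1] = 3; // "bar" (same buffer)
        vector[2] = "baz".getBytes(); start[2] = 0; length[2] = 3; // separate buffer

        String[] out = new String[3];
        for (int row = 0; row < 3; row++) {
            out[row] = new String(vector[row], start[row], length[row]);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : rowsAsStrings()) {
            System.out.println(s); // prints foo, bar, baz on separate lines
        }
    }
}
```

Because row 2 lives in its own buffer, no single base pointer plus offsets can address all rows, which is why the data cannot be treated as one contiguous block.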
Github user henrify commented on the issue:
https://github.com/apache/spark/pull/19943
@dongjoon-hyun Great, so it is a bit faster with putX, but not that much.
I'm still concerned about how well the big nextBatch() method gets optimized; the
JVM can bail out of optimizing complex
Github user henrify commented on a diff in the pull request:
https://github.com/apache/spark/pull/19943#discussion_r160094778
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/JavaOrcColumnarBatchReader.java ---
@@ -0,0 +1,482
Github user henrify commented on a diff in the pull request:
https://github.com/apache/spark/pull/19943#discussion_r160108646
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/JavaOrcColumnarBatchReader.java ---
@@ -0,0 +1,510
Github user henrify commented on a diff in the pull request:
https://github.com/apache/spark/pull/19943#discussion_r160075503
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/JavaOrcColumnarBatchReader.java ---
@@ -0,0 +1,503
Github user henrify commented on a diff in the pull request:
https://github.com/apache/spark/pull/19943#discussion_r160082127
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/JavaOrcColumnarBatchReader.java ---
@@ -0,0 +1,503
Github user henrify commented on a diff in the pull request:
https://github.com/apache/spark/pull/19943#discussion_r160088081
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/JavaOrcColumnarBatchReader.java ---
@@ -0,0 +1,503
Github user henrify commented on the issue:
https://github.com/apache/spark/pull/19943
@dongjoon-hyun Ok, thanks. It is a pity that the single buffer cannot be used;
it would have reduced the number of arraycopy() calls by 5 orders of magnitude. Btw,
have you tested the inlining behaviour
Github user henrify commented on the issue:
https://github.com/apache/spark/pull/19943
@dongjoon-hyun Thanks. I don't think it matters whether nextBatch() is inlined
or not. I think what matters is 1) how the putX() etc. method calls inside the
tight loops are inlined, and 2) how complex
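The inlining concern can be sketched like this. The putLong/putLongs names below are stand-ins modeled on Spark's writable column vector methods, not the actual class: the idea is that each tight copy loop lives in its own small, single-purpose method, so the per-row put-style call inside it stays trivially inlinable for the JIT.

```java
// Keeping each copy loop in a small dedicated method lets the JIT compile the
// loop and inline the per-row call into it, which is harder when the same loop
// sits inside one huge nextBatch() method.
public class CopyLoopSketch {
    static long[] longColumn = new long[4096]; // stand-in for a column's backing store

    static void putLong(int rowId, long value) { longColumn[rowId] = value; }

    // Small, monomorphic method body: easy target for JIT compilation and inlining.
    static void putLongs(long[] src, int batchSize) {
        for (int i = 0; i < batchSize; i++) {
            putLong(i, src[i]);
        }
    }

    public static void main(String[] args) {
        long[] src = new long[4096];
        for (int i = 0; i < src.length; i++) src[i] = i * 2L;
        putLongs(src, src.length);
        System.out.println(longColumn[4095]); // prints 8190
    }
}
```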
Github user henrify commented on the issue:
https://github.com/apache/spark/pull/19943
@dongjoon-hyun the nextBatch() is invoked 4096x less often than the main
copy loops, so it doesn't matter much
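A back-of-envelope check of the frequency argument, assuming the 4096x figure refers to a columnar batch size of 4096 rows (the row count below is hypothetical):

```java
// nextBatch() runs once per batch, while the copy loops run once per row, so
// for a 4096-row batch the per-row work is issued 4096x more often.
public class CallFrequencySketch {
    static long nextBatchCalls(long rows, int batchSize) {
        return (rows + batchSize - 1) / batchSize; // one nextBatch() per batch, rounded up
    }

    public static void main(String[] args) {
        long rows = 40_960_000L; // hypothetical scan size
        int batchSize = 4096;
        long batches = nextBatchCalls(rows, batchSize);
        System.out.println(batches);        // prints 10000
        System.out.println(rows / batches); // prints 4096
    }
}
```

This is why optimizing the per-row loop bodies matters far more than whether nextBatch() itself gets inlined.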
Github user henrify commented on the issue:
https://github.com/apache/spark/pull/19943
@dongjoon-hyun Thank you for testing the split methods. If anything, the
benchmark results look a couple of percent slower now? Oh well, at least it is
good to know that your code is as fast as it can
Github user henrify commented on the issue:
https://github.com/apache/spark/pull/19943
@dongjoon-hyun It is possible that the "multiple byte arrays" case happens
only on the write side, when consumer code explicitly does it, and that it is fine
to use the single byte array an
Github user henrify commented on the issue:
https://github.com/apache/spark/pull/19943
If I've understood the Spark development process correctly, the 2.3 branch cut
date is in a couple of days, and if this PR doesn't get merged to master real
soon, it'll have to wait until 2.4, about 6