[GitHub] spark pull request #22348: [SPARK-25354][SQL] Reduce unneeded operation in n...

2018-09-11 Thread SongYadong
Github user SongYadong closed the pull request at:

https://github.com/apache/spark/pull/22348


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22348: [SPARK-25354][SQL] Reduce unneeded operation in n...

2018-09-10 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22348#discussion_r216405506
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java
 ---
@@ -154,8 +159,6 @@ public void close() throws IOException {
 
   @Override
   public boolean nextKeyValue() throws IOException {
-resultBatch();
--- End diff --

Ah, i see. Thank you for explanation.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22348: [SPARK-25354][SQL] Reduce unneeded operation in n...

2018-09-07 Thread SongYadong
Github user SongYadong commented on a diff in the pull request:

https://github.com/apache/spark/pull/22348#discussion_r215870108
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java
 ---
@@ -154,8 +159,6 @@ public void close() throws IOException {
 
   @Override
   public boolean nextKeyValue() throws IOException {
-resultBatch();
--- End diff --

Thanks for your review.
calling sequence is optional. `initBatch()` was moved to `nextBatch()`, if 
`hasNext()` called before `initBatch()`, eventually `initBatch()` will be 
called first of all in `nextBatch()`:
```
  public boolean nextBatch() throws IOException {
 if (columnarBatch == null) initBatch();
```

Orc has a much simple way: 
```
  public boolean nextKeyValue() throws IOException {
return nextBatch();
  }
```



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22348: [SPARK-25354][SQL] Reduce unneeded operation in n...

2018-09-06 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22348#discussion_r215860694
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java
 ---
@@ -154,8 +159,6 @@ public void close() throws IOException {
 
   @Override
   public boolean nextKeyValue() throws IOException {
-resultBatch();
--- End diff --

I like this direction to reduce overhead. I confirmed that the current code 
calls `initBatch()` before calling `nextKeyValue()`, too.  
My question is that is calling sequence is mandatory or optional. In the 
future, I wonder a developer may call this method thru `hasNext()` before 
calling `initBatch()`.

How about moving `if-statement` into here like? Does this help performance 
improvement?
```
if (columnarBatch == null) resultBatch();
```
I am curious how orc reader handle this.
cc @dongjoon-hyun @cloud-fan 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org