[GitHub] [spark] parthchandra edited a comment on pull request #35262: [SPARK-37974][SQL] Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support

2022-03-07 Thread GitBox


parthchandra edited a comment on pull request #35262:
URL: https://github.com/apache/spark/pull/35262#issuecomment-1061141261


   @sunchao I merged your changes into the PR and also updated the benchmarks.





[GitHub] [spark] parthchandra edited a comment on pull request #35262: [SPARK-37974][SQL] Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support

2022-01-27 Thread GitBox


parthchandra edited a comment on pull request #35262:
URL: https://github.com/apache/spark/pull/35262#issuecomment-1023832057


   @LuciferYang you were right, the vector needed initialization.
   I also added a memory mode parameter to the unit test, increased the size of
the generated data to reproduce the issue, and then applied the fix. I also ran
the benchmark with off-heap memory enabled.
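
   (For context, a minimal sketch of how a test helper might allocate the output
vector per memory mode; `MemoryMode`, `OnHeapColumnVector`, `OffHeapColumnVector`,
and `WritableColumnVector` are existing Spark classes, but the helper itself is
illustrative and not the actual test code in this PR.)

```java
import org.apache.spark.memory.MemoryMode;
import org.apache.spark.sql.execution.vectorized.OffHeapColumnVector;
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
import org.apache.spark.sql.execution.vectorized.WritableColumnVector;
import org.apache.spark.sql.types.DataTypes;

public class MemoryModeVectors {
  // Illustrative helper: choose the on-heap or off-heap vector implementation
  // based on the memory mode the test is parameterized with.
  static WritableColumnVector allocate(int capacity, MemoryMode mode) {
    WritableColumnVector vector = (mode == MemoryMode.OFF_HEAP)
        ? new OffHeapColumnVector(capacity, DataTypes.BinaryType)
        : new OnHeapColumnVector(capacity, DataTypes.BinaryType);
    // Clear any previously buffered state before decoding into the vector.
    vector.reset();
    return vector;
  }
}
```

   Running the benchmark with off-heap column vectors generally corresponds to
enabling `spark.sql.columnVector.offheap.enabled` in the SQL configuration.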





[GitHub] [spark] parthchandra edited a comment on pull request #35262: [SPARK-37974][SQL] Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support

2022-01-26 Thread GitBox


parthchandra edited a comment on pull request #35262:
URL: https://github.com/apache/spark/pull/35262#issuecomment-1022462107


   > > Updated the JDK 8 benchmark results as well.
   > 
   > After comparing the new benchmark data, I find that the numbers for
   > `Parquet Data Page V2` in the two test cases `String with Nulls Scan (50.0%)`
   > and `String with Nulls Scan (95.0%)` are noticeably slower than in the
   > previous run of this PR (although the CPU frequency of the testing machine
   > is lower):
   > 
   >                                     before             after
   >     String with Nulls Scan (50.0%)  145.7 ns per row   228.7 ns per row
   >     String with Nulls Scan (95.0%)   25.2 ns per row    77.9 ns per row
   
   It's hard to compare the numbers meaningfully across runs (even though the
difference is substantial) because the environments differ.
   Incidentally, with nulls the decoder doesn't even get called, so such a
precipitous drop is somewhat suspicious. _And_ it appears that the vectorized
decoder is being called one record at a time (this may not be a problem, since
the decoding has mostly been done already, just not yet written into the output
vector).
   I made a change to determine runs of null/non-null values and increase the
number of values written to the output vector in each call, but saw no
significant change (running the benchmark on a laptop).
   See: 
https://github.com/apache/spark/blob/6e64e9252a821651a8984babfac79a9ea433/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java#L237
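
   (To illustrate the run-batching idea above with a hypothetical sketch, not the
actual code at the linked line: scan the definition levels once, group consecutive
rows with the same null-ness into runs, and decode each non-null run with a single
call instead of one record per call.)

```java
import org.apache.spark.sql.execution.vectorized.WritableColumnVector;

public class NullRunBatching {
  // Hypothetical stand-in for the vectorized values reader; the real code path
  // goes through VectorizedRleValuesReader/VectorizedValuesReader instead.
  interface BatchReader {
    void readBinaryBatch(int rowId, int count, WritableColumnVector out);
  }

  // Scan the definition levels once, grouping consecutive rows with the same
  // null-ness into runs, so each non-null run is decoded in a single call
  // rather than one record at a time.
  static void readByRuns(int total, int maxDefLevel, int[] defLevels,
                         WritableColumnVector out, BatchReader reader) {
    int i = 0;
    while (i < total) {
      int start = i;
      boolean nonNull = defLevels[start] == maxDefLevel;
      while (i < total && (defLevels[i] == maxDefLevel) == nonNull) {
        i++;
      }
      int runLen = i - start;
      if (nonNull) {
        reader.readBinaryBatch(start, runLen, out);   // decode a whole run
      } else {
        out.putNulls(start, runLen);                  // mark a run of nulls
      }
    }
  }
}
```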
   
   Let me do a profile run to see if any obvious bottlenecks stand out. 

