[GitHub] [spark] parthchandra edited a comment on pull request #35262: [SPARK-37974][SQL] Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
parthchandra edited a comment on pull request #35262:
URL: https://github.com/apache/spark/pull/35262#issuecomment-1061141261

@sunchao I merged your changes into the PR. Also updated the benchmarks.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
parthchandra edited a comment on pull request #35262:
URL: https://github.com/apache/spark/pull/35262#issuecomment-1023832057

@LuciferYang you were right, the vector needed initialization. I also added a memory mode parameter to the unit test, increased the size of the generated data to reproduce the issue, and then applied the fix. I also ran the benchmark with off-heap memory enabled.
parthchandra edited a comment on pull request #35262:
URL: https://github.com/apache/spark/pull/35262#issuecomment-1022462107

> > Updated the JDK 8 benchmark results as well.
>
> After comparing the new benchmark data, I find that the `Parquet Data Page V2` results in the two test cases `String with Nulls Scan (50.0%)` and `String with Nulls Scan (95.0%)` are noticeably slower than in the previous PR (although the CPU frequency of the testing machine is lower):
>
>                                     before            after
>     String with Nulls Scan (50.0%)  145.7 ns/row      228.7 ns/row
>     String with Nulls Scan (95.0%)  25.2 ns/row       77.9 ns/row

It's hard to compare the numbers fairly across runs (even though the difference is substantial) because of the difference in the environment. Incidentally, with nulls the decoder doesn't even get called, so such a precipitous drop is somewhat suspicious. _And_ it appears that the vectorized decoder is being called one record at a time (this may not be a problem, because the decoding has mostly been done already, just not yet written into the output vector). I made a change to determine runs of null/non-null values and increase the number of values written to the output vector in each call, but saw no significant change (running the benchmark on a laptop). See: https://github.com/apache/spark/blob/6e64e9252a821651a8984babfac79a9ea433/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java#L237

Let me do a profiling run to see if any obvious bottlenecks stand out.
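For readers following along, the "determine runs of null/non-null values" idea mentioned above can be sketched roughly as below. This is a hypothetical standalone helper, not the actual `VectorizedRleValuesReader` code: it scans Parquet definition levels and groups consecutive null or non-null positions into runs, so a decoder could be invoked once per run of non-null values instead of once per record.

```java
import java.util.ArrayList;
import java.util.List;

public class NullRuns {
    // A run of consecutive values that are all null or all non-null.
    static final class Run {
        final int start;     // index of the first value in the run
        final int length;    // number of values in the run
        final boolean isNull;
        Run(int start, int length, boolean isNull) {
            this.start = start;
            this.length = length;
            this.isNull = isNull;
        }
    }

    // Group definition levels into runs. Following the Parquet convention,
    // a value is null when its definition level is below maxDefLevel.
    static List<Run> findRuns(int[] defLevels, int maxDefLevel) {
        List<Run> runs = new ArrayList<>();
        int i = 0;
        while (i < defLevels.length) {
            boolean isNull = defLevels[i] < maxDefLevel;
            int start = i;
            // Extend the run while the null-ness stays the same.
            while (i < defLevels.length && (defLevels[i] < maxDefLevel) == isNull) {
                i++;
            }
            runs.add(new Run(start, i - start, isNull));
        }
        return runs;
    }

    public static void main(String[] args) {
        // Definition levels for: value, value, null, null, null, value.
        int[] defLevels = {1, 1, 0, 0, 0, 1};
        for (Run r : findRuns(defLevels, 1)) {
            System.out.println(r.start + "," + r.length + "," + r.isNull);
        }
        // Prints three runs: 0,2,false / 2,3,true / 5,1,false
    }
}
```

With runs in hand, the non-null runs can each be handed to the decoder as a single batch write into the output vector; as the comment notes, this alone did not change the benchmark numbers appreciably.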