[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...
Github user pwoody commented on the issue: https://github.com/apache/spark/pull/15835 Hey - yeah, definitely a real concern, as it requires driver heap to scale with the size of the metadata of the table you are going to read. We could get creative and add heuristics to skip reading the metadata entirely if it is too large, or add the ability to spill the cached metadata to disk. Do you have any preferences/thoughts? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
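The "skip if too large / spill" idea above could be sketched as a size-capped cache that simply declines to hold metadata above a driver-heap budget. This is a minimal illustration, not the PR's code; the names (`MetadataCache`, `maxBytes`, `sizeOf`) are hypothetical:

```scala
import scala.collection.mutable

// Hypothetical sketch: a size-bounded metadata cache that evicts oldest
// entries to stay under a driver-heap budget, and refuses to cache any
// single piece of metadata that is larger than the whole budget.
class MetadataCache[K, V](maxBytes: Long, sizeOf: V => Long) {
  private val entries = mutable.LinkedHashMap.empty[K, V]
  private var usedBytes = 0L

  def get(key: K): Option[V] = entries.get(key)

  // Returns true if the value was cached, false if it was too large.
  def put(key: K, value: V): Boolean = {
    val size = sizeOf(value)
    if (size > maxBytes) return false // heuristic: metadata too large, skip caching
    while (usedBytes + size > maxBytes && entries.nonEmpty) {
      val (oldKey, oldVal) = entries.head // evict in insertion order
      entries.remove(oldKey)
      usedBytes -= sizeOf(oldVal)
    }
    entries.put(key, value)
    usedBytes += size
    true
  }
}
```

Spilling to disk would replace the `return false` branch with a write-through to local storage, at the cost of an extra read on a cache hit.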
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15835 This creates huge problems when the table is big, doesn't it? We just made a big change to get rid of the per-table file status cache, because its existence made Spark unstable when dealing with tables with a lot of files.
Github user pwoody commented on the issue: https://github.com/apache/spark/pull/15835 I've restructured the PR so that the cache is global across FileFormat instances, has expiry, and reuses known filters. Here is a new benchmark to highlight the filter reuse (I flipped the results to make them easier to read):
```
withSQLConf(ParquetOutputFormat.ENABLE_JOB_SUMMARY -> "true") {
  withTempPath { path =>
    spark.range(0, 200, 1, 200)
      .write.parquet(path.getCanonicalPath)
    val benchmark = new Benchmark("Parquet partition pruning benchmark", 200)
    benchmark.addCase("Parquet partition pruning disabled") { iter =>
      withSQLConf(SQLConf.PARQUET_PARTITION_PRUNING_ENABLED.key -> "false") {
        var df = spark.read.parquet(path.getCanonicalPath).filter("id = 0")
        for (i <- 1 to 10) {
          df = df.filter(s"id < $i")
          df.collect()
        }
      }
    }
    benchmark.addCase("Parquet partition pruning enabled") { iter =>
      var df = spark.read.parquet(path.getCanonicalPath).filter("id = 0")
      for (i <- 1 to 10) {
        df = df.filter(s"id < $i")
        df.collect()
      }
    }
    benchmark.run()
  }
}
```
```
Running benchmark: Parquet partition pruning benchmark
Running case: Parquet partition pruning disabled
Stopped after 2 iterations, 8744 ms
Running case: Parquet partition pruning enabled
Stopped after 5 iterations, 2187 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26 on Mac OS X 10.10.5
Intel(R) Core(TM) i7-3635QM CPU @ 2.40GHz

Parquet partition pruning benchmark:   Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------
Parquet partition pruning disabled           4332 / 4372         0.0    21659450.2        1.0X
Parquet partition pruning enabled             399 /  438         0.0     1995877.1       10.9X
```
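A process-global cache with expiry, along the lines described above, might look like this sketch. It is illustrative only, not the PR's implementation; the object name and the TTL value are assumptions:

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical sketch of a cache shared across FileFormat instances,
// with time-based expiry. The 10-minute TTL is an assumed value.
object GlobalFooterCache {
  private case class Entry(value: Any, insertedAtMs: Long)
  private val ttlMs = 10 * 60 * 1000L
  private val cache = new ConcurrentHashMap[String, Entry]()

  // Returns the cached value for `path` if it is still fresh,
  // otherwise evaluates `load` and caches the result.
  def getOrLoad[V](path: String, nowMs: Long)(load: => V): V = {
    val cached = cache.get(path)
    if (cached != null && nowMs - cached.insertedAtMs < ttlMs) {
      cached.value.asInstanceOf[V]
    } else {
      val v = load
      cache.put(path, Entry(v, nowMs))
      v
    }
  }
}
```

Keying on path plus modification time (rather than path alone, as here) would additionally invalidate entries when a file is rewritten in place.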
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15835 I guess we should ping @liancheng as he was reviewing the previous one.
Github user pwoody commented on the issue: https://github.com/apache/spark/pull/15835 @rxin @HyukjinKwon ready for more review on my end.
Github user pwoody commented on the issue: https://github.com/apache/spark/pull/15835 I've also pushed up the ability to enable or disable this feature via configuration. Here is a benchmark when writing out 200 files with this code:
```
// numPartitions = 200, per the description above
withSQLConf(ParquetOutputFormat.ENABLE_JOB_SUMMARY -> "true") {
  withTempPath { path =>
    spark.range(0, numPartitions, 1, numPartitions)
      .write.parquet(path.getCanonicalPath)
    val benchmark = new Benchmark("Parquet partition pruning benchmark", numPartitions)
    benchmark.addCase("Parquet partition pruning enabled") { iter =>
      spark.read.parquet(path.getCanonicalPath).filter("id = 0").collect()
    }
    benchmark.addCase("Parquet partition pruning disabled") { iter =>
      withSQLConf(SQLConf.PARQUET_PARTITION_PRUNING_ENABLED.key -> "false") {
        spark.read.parquet(path.getCanonicalPath).filter("id = 0").collect()
      }
    }
    benchmark.run()
  }
}
```
```
Running benchmark: Parquet partition pruning benchmark
Running case: Parquet partition pruning enabled
Stopped after 12 iterations, 2049 ms
Running case: Parquet partition pruning disabled
Stopped after 5 iterations, 2177 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26 on Mac OS X 10.10.5
Intel(R) Core(TM) i7-3635QM CPU @ 2.40GHz

Parquet partition pruning benchmark:   Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------
Parquet partition pruning enabled              145 / 171         0.0      723119.3        1.0X
Parquet partition pruning disabled             414 / 436         0.0     2070279.4        0.3X
```
Github user pwoody commented on the issue: https://github.com/apache/spark/pull/15835 Cool - I've added the caching, fixed style issues, and added pruning to the bucketed reads.
Github user pwoody commented on the issue: https://github.com/apache/spark/pull/15835 Ah, awesome, thanks - my IDE settings must not be properly catching the style; I will fix it and make sure this doesn't happen in the future. I also just took a look at the bucketing. I've implemented the same logic, which will prune a file if no splits survive, but it is still all-or-nothing for a given file because tasks are launched per bucket. Commit coming soon. Thanks again.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15835 Ah, I missed the last comment from the old PR. Okay, we can shape this up nicely. BTW, Spark collects small partitions for each task, so I guess this would not always introduce a lot of tasks, but yes, it is still a valid point to reduce the number of tasks. Right, I am fine with this. I thought the original PR had been taken over without the courtesy of a notification or a discussion ahead of time. For the benchmark, I have a benchmark in PR 14660, which I have also referred to from other PRs. I have just a few minor notes: maybe `Closes #14649` can be added at the end of the PR description so that the committers' merge script can close the original one if this one gets merged. Also, there are some style guidelines I usually refer to: the [wiki](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and [databricks/scala-style-guide](https://github.com/databricks/scala-style-guide). I will leave some comments on the changed files.
Github user pwoody commented on the issue: https://github.com/apache/spark/pull/15835 Hey @HyukjinKwon - appreciate the feedback! Re: file touching - if I add the cache for the `_metadata` file, then this PR will end up touching at most one file per rootPath on the driver side (generally just one total). Re: files vs. splits - the main difference when pruning splits instead of files is that with larger files you will end up spawning tasks that are immediately filtered out executor-side after grabbing the footer. For example, if maxSplitBytes equals the Parquet row group size, then a single hit in a file will spawn a task for every row group even if the file has only one matching block. This overhead can be expensive in a setup with dynamicAllocation and multi-tenancy. I generally wish to reduce the total number of tasks. Re: bucketed - yep, sorry, will fix! Re: benchmarks - yeah, totally happy to poke around and make some benchmarks. ParquetReadBenchmark is the appropriate place, I suppose? Re: old PR - the code there is out of date and I don't believe Andrew is actively working on it, based on his last comment. This is the follow-up to the original PR in what I believe is a more comprehensive and reliable way. Thanks!
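The files-vs-splits point above can be illustrated with a small overlap check: given the byte ranges of row groups that survive footer-level filtering, only the splits overlapping a surviving row group need a task. This is a simplified model of driver-side split pruning, not the PR's code:

```scala
// Simplified model: a split only needs a task if its byte range
// overlaps a row group that survived the footer-level filter.
// Ranges are (startOffset, length); names here are illustrative.
case class ByteRange(start: Long, length: Long) {
  def end: Long = start + length
  def overlaps(other: ByteRange): Boolean =
    start < other.end && other.start < end
}

def pruneSplits(splits: Seq[ByteRange],
                matchingRowGroups: Seq[ByteRange]): Seq[ByteRange] =
  splits.filter(split => matchingRowGroups.exists(split.overlaps))
```

With maxSplitBytes equal to the row-group size, every row group becomes one split; a filter like this keeps only the splits for matching row groups instead of launching a task per row group.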
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15835 I think we should cc @liancheng as well, who is insightful in this area.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15835 Hi @pwoody, so if I understood this correctly, the original PR only filters out the files to touch ahead of time, but this one also proposes filtering splits via offsets from Parquet's metadata on the driver side, right? IIRC, each task will already read the footer before it actually starts to read in executors, and will then drop the splits on the Parquet side. This works fine in both Spark's vectorized Parquet reader and the normal Parquet reader. It might be worth reducing the files to touch, for the reasons I and others described in the original PR, but I am not sure about pruning splits ahead of time. Another potential problem I see here is that it does not seem to consider bucketed table reads, whereas the original PR does. In addition, I guess we really need a benchmark to demonstrate the proposed performance improvement. It is fine if I am wrong and this PR has a benchmark showing the improvement with a reasonable explanation. Lastly, I guess this is a follow-up including the changes proposed in 14649. Maybe we can wait until that is merged before submitting a follow-up. I guess it is being reviewed and I think @andreweduffy is still responsive.
Github user pwoody commented on the issue: https://github.com/apache/spark/pull/15835 There is a performance issue here in that we re-fetch the metadata file for each data file. My understanding is that FileFormat is meant to be stateless, but if we can add in a cache, that would be very useful.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15835 Can one of the admins verify this patch?
Github user pwoody commented on the issue: https://github.com/apache/spark/pull/15835 @andreweduffy @robert3005 @HyukjinKwon