[GitHub] spark issue #22357: [SPARK-25363][SQL] Fix schema pruning in where clause by...
Github user DaimonPl commented on the issue: https://github.com/apache/spark/pull/22357

@mallman regarding "But how do we know that pushing down IsNotNull(employer) does not negate that instruction?": isn't it safe to say that when you read 'employer.id' with the predicate 'employer.id = 0', you can skip all records where 'employer' is null? If 'employer' is null then 'employer.id' is null as well, so it can never satisfy the provided predicate.

Of course I agree that pushing down on 'employer.id' itself might be even better (if that's possible).

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
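The null-propagation argument above can be sketched in plain Scala. This is a hypothetical toy model, not Spark's actual filter pushdown code; the `Contact`/`Employer` names and data are assumptions made for illustration:

```scala
// Toy model (not Spark code): if `employer` is null, then `employer.id`
// is null under SQL semantics, so the predicate `employer.id = 0` can
// never hold for that record. Pre-filtering nulls changes nothing.
case class Employer(id: Int)
case class Contact(name: String, employer: Employer) // employer may be null

object NullPropagation {
  // SQL three-valued logic: comparing NULL to anything is never true,
  // so the predicate already implies `employer IS NOT NULL`.
  val predicate: Contact => Boolean =
    c => c.employer != null && c.employer.id == 0

  val rows = Seq(
    Contact("a", Employer(0)),
    Contact("b", null),
    Contact("c", Employer(1)))

  // Pushing down IsNotNull(employer) first yields the same result set:
  val withPushdown    = rows.filter(_.employer != null).filter(predicate)
  val withoutPushdown = rows.filter(predicate)

  def main(args: Array[String]): Unit = {
    assert(withPushdown == withoutPushdown)
    println(withPushdown.map(_.name).mkString(",")) // prints "a"
  }
}
```

This only demonstrates why the extra `IsNotNull` filter cannot drop rows the original predicate would have kept; whether the optimizer actually generates it that way is the subject of the PR.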
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user DaimonPl commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r195704995

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---

```
@@ -1288,8 +1288,18 @@ object SQLConf {
         "issues. Turn on this config to insert a local sort before actually doing repartition " +
         "to generate consistent repartition results. The performance of repartition() may go " +
         "down since we insert extra local sort before it.")
+      .booleanConf
+      .createWithDefault(true)
+
+  val NESTED_SCHEMA_PRUNING_ENABLED =
+    buildConf("spark.sql.nestedSchemaPruning.enabled")
+      .internal()
+      .doc("Prune nested fields from a logical relation's output which are unnecessary in " +
+        "satisfying a query. This optimization allows columnar file format readers to avoid " +
+        "reading unnecessary nested column data. Currently Parquet is the only data source that " +
+        "implements this optimization.")
       .booleanConf
-      .createWithDefault(true)
+      .createWithDefault(false)
```

--- End diff --

How about enabling it by default? There should be enough time before 2.4.0 to find any unexpected problems, and nested column pruning would then be exercised by all the other automated tests.
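Even with the flag defaulting to `false`, a user can opt in per session. A minimal sketch, assuming an existing `SparkSession` named `spark` and a hypothetical Parquet dataset with a nested `employer` struct:

```scala
// Sketch: enable the internal flag for this session only.
spark.conf.set("spark.sql.nestedSchemaPruning.enabled", "true")

// With pruning on, selecting only `employer.id` should let the Parquet
// reader skip the other fields of the `employer` struct on disk.
// `/path/to/contacts` is a placeholder path, not a real dataset.
val ids = spark.read.parquet("/path/to/contacts").select("employer.id")
ids.explain() // check the ReadSchema in the plan to confirm pruning
```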
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user DaimonPl commented on the issue: https://github.com/apache/spark/pull/16578

So if it's not going to be included in `2.3.0`, maybe we could change `spark.sql.nestedSchemaPruning.enabled` to default to `true`? I hope this PR can be finalized early in the `2.4.0` cycle so there would be plenty of time to fix any unforeseen problems.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user DaimonPl commented on the issue: https://github.com/apache/spark/pull/16578

New year, new review? ;)
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user DaimonPl commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r151917066

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---

```
@@ -961,6 +961,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)

+  val NESTED_SCHEMA_PRUNING_ENABLED =
+    buildConf("spark.sql.nestedSchemaPruning.enabled")
+      .internal()
+      .doc("Prune nested fields from a logical relation's output which are unnecessary in " +
+        "satisfying a query. This optimization allows columnar file format readers to avoid " +
+        "reading unnecessary nested column data.")
+      .booleanConf
+      .createWithDefault(true)
```

--- End diff --

So maybe at least make it true for some core sql/parquet test suites?
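In Spark's SQL test suites, a conf can be forced on for a single block with the `withSQLConf` helper from `SQLTestUtils`. A sketch of what "make it true for some core suites" could look like inside such a suite (the test body and data are hypothetical):

```scala
// Sketch, inside a suite mixing in SQLTestUtils; not a complete test file.
test("query nested field with schema pruning enabled") {
  withSQLConf(SQLConf.NESTED_SCHEMA_PRUNING_ENABLED.key -> "true") {
    // hypothetical assertion body: run a query selecting a nested field
    // and verify the result (and, ideally, the pruned read schema)
    checkAnswer(sql("select employer.id from contacts"), Row(0) :: Nil)
  }
}
```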
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user DaimonPl commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r151485646

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---

```
@@ -961,6 +961,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)

+  val NESTED_SCHEMA_PRUNING_ENABLED =
+    buildConf("spark.sql.nestedSchemaPruning.enabled")
+      .internal()
+      .doc("Prune nested fields from a logical relation's output which are unnecessary in " +
+        "satisfying a query. This optimization allows columnar file format readers to avoid " +
+        "reading unnecessary nested column data.")
+      .booleanConf
+      .createWithDefault(true)
```

--- End diff --

Nope :( Maybe @viirya can give input on it?
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user DaimonPl commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r151476679

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---

```
@@ -961,6 +961,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)

+  val NESTED_SCHEMA_PRUNING_ENABLED =
+    buildConf("spark.sql.nestedSchemaPruning.enabled")
+      .internal()
+      .doc("Prune nested fields from a logical relation's output which are unnecessary in " +
+        "satisfying a query. This optimization allows columnar file format readers to avoid " +
+        "reading unnecessary nested column data.")
+      .booleanConf
+      .createWithDefault(true)
```

--- End diff --

Just to be clear: I mean making it default to true for all tests in Spark, not only those explicitly related to this feature :)
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user DaimonPl commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r151344821

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---

```
@@ -961,6 +961,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)

+  val NESTED_SCHEMA_PRUNING_ENABLED =
+    buildConf("spark.sql.nestedSchemaPruning.enabled")
+      .internal()
+      .doc("Prune nested fields from a logical relation's output which are unnecessary in " +
+        "satisfying a query. This optimization allows columnar file format readers to avoid " +
+        "reading unnecessary nested column data.")
+      .booleanConf
+      .createWithDefault(true)
```

--- End diff --

How about making it default to true for tests?
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user DaimonPl commented on the issue: https://github.com/apache/spark/pull/16578

@mallman how about finalizing it as is? IMHO the performance improvements are worth more than a (possibly) redundant workaround; it could be cleaned up later.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user DaimonPl commented on the issue: https://github.com/apache/spark/pull/16578

@mallman @viirya From my understanding, the current workaround is for the case of reading columns which are not in the file schema:

> Parquet-mr will throw an exception if we try to read a superset of the file's schema.

Isn't that somehow dependent on the schema merging setting? http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging

> Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by
> * setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or
> * setting the global SQL option spark.sql.parquet.mergeSchema to true.

Wouldn't it work fine with `spark.sql.parquet.mergeSchema` enabled?
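For reference, the two ways of enabling schema merging quoted above look like this. A sketch assuming an existing `SparkSession` named `spark`; the path is a placeholder:

```scala
// Option 1: per-read data source option
val df1 = spark.read.option("mergeSchema", "true").parquet("/data/table")

// Option 2: global SQL option for the whole session
spark.conf.set("spark.sql.parquet.mergeSchema", "true")
val df2 = spark.read.parquet("/data/table")
```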
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user DaimonPl commented on the issue: https://github.com/apache/spark/pull/16578

@mallman how about adding a comment explaining why this workaround was done, plus the bug number in parquet-mr? Then, once that bug is fixed, the code can be cleaned up.

Also, maybe it's time to remove "DO NOT MERGE" from the title? As I understand it, most of the comments have been addressed :)

Thank you very much for your work on this feature. I must admit we are looking forward to having this merged. For us this will be the most important improvement in Spark 2.3.0 (I hope it will be part of 2.3.0 :) )
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user DaimonPl commented on the issue: https://github.com/apache/spark/pull/16578

@mallman (just pure curiosity :) ) how is it possible that this NPE was not caught by the automated tests? Is there not a single test case in the entire Spark suite that verifies scenarios like the one described by @snir?