[GitHub] spark issue #22357: [SPARK-25363][SQL] Fix schema pruning in where clause by...

2018-09-10 Thread DaimonPl
Github user DaimonPl commented on the issue:

https://github.com/apache/spark/pull/22357
  
@mallman regarding

"But how do we know that pushing down IsNotNull(employer) does not negate 
that instruction? "

Isn't it pretty obvious that when you read 'employer.id' with the predicate 
'employer.id = 0', you can safely skip all records where 'employer' is null? 
If 'employer' is null, then 'employer.id' is null as well, so it would never 
satisfy the provided predicate.

Of course, I agree that pushing the filter down on 'employer.id' itself might 
be even better (if it's possible).
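
For what it's worth, here's a minimal, self-contained sketch of that 
argument (the `Contact`/`Employer` schema is hypothetical, chosen only to 
match the field names in this thread):

import org.apache.spark.sql.SparkSession

// Hypothetical nested schema, for illustration only.
case class Employer(id: Int, name: String)
case class Contact(name: String, employer: Employer)

object NullPropagationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val contacts = Seq(
      Contact("Jane", Employer(0, "Acme")),
      Contact("John", null) // null employer => employer.id evaluates to null
    ).toDS()

    // Under SQL three-valued logic, null = 0 is never true, so a row with
    // a null employer can never satisfy the predicate. Hence pushing down
    // IsNotNull(employer) alongside it cannot drop a matching record.
    contacts.where($"employer.id" === 0).show()

    spark.stop()
  }
}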


---




[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-06-15 Thread DaimonPl
Github user DaimonPl commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r195704995
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1288,8 +1288,18 @@ object SQLConf {
         "issues. Turn on this config to insert a local sort before actually doing repartition " +
         "to generate consistent repartition results. The performance of repartition() may go " +
         "down since we insert extra local sort before it.")
+      .booleanConf
+      .createWithDefault(true)
+
+  val NESTED_SCHEMA_PRUNING_ENABLED =
+    buildConf("spark.sql.nestedSchemaPruning.enabled")
+      .internal()
+      .doc("Prune nested fields from a logical relation's output which are unnecessary in " +
+        "satisfying a query. This optimization allows columnar file format readers to avoid " +
+        "reading unnecessary nested column data. Currently Parquet is the only data source that " +
+        "implements this optimization.")
       .booleanConf
-      .createWithDefault(true)
+      .createWithDefault(false)
--- End diff --

How about enabling it by default? There should be enough time before the 
2.4.0 release to find any unexpected problems.

Plus, nested column pruning would then be exercised by all the other 
automated tests.
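
For anyone who wants to try it out in the meantime, a small sketch of 
opting in per session (assuming the flag keeps the name from the diff 
above; note it is marked `.internal()`):

// Enable nested schema pruning for the current session only.
spark.conf.set("spark.sql.nestedSchemaPruning.enabled", "true")

// Or at submit time:
//   spark-submit --conf spark.sql.nestedSchemaPruning.enabled=true ...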


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-02-05 Thread DaimonPl
Github user DaimonPl commented on the issue:

https://github.com/apache/spark/pull/16578
  
So if it's not going to be included in `2.3.0`, maybe we could change 
`spark.sql.nestedSchemaPruning.enabled` to default to `true`? I hope that 
this time this PR can be finalized at an early stage of `2.4.0`, so there 
would be plenty of time to fix any unforeseen problems.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-01-02 Thread DaimonPl
Github user DaimonPl commented on the issue:

https://github.com/apache/spark/pull/16578
  
New year, new review? ;)


---




[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-19 Thread DaimonPl
Github user DaimonPl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16578#discussion_r151917066
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -961,6 +961,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val NESTED_SCHEMA_PRUNING_ENABLED =
+    buildConf("spark.sql.nestedSchemaPruning.enabled")
+      .internal()
+      .doc("Prune nested fields from a logical relation's output which are unnecessary in " +
+        "satisfying a query. This optimization allows columnar file format readers to avoid " +
+        "reading unnecessary nested column data.")
+      .booleanConf
+      .createWithDefault(true)
--- End diff --

So maybe at least make it default to true for some core sql/parquet test suites?
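
Something along these lines, using the standard `withSQLConf` helper; the 
suite name and test body here are hypothetical:

import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.SharedSQLContext

// Hypothetical opt-in suite: re-runs nested-column queries with pruning on.
class NestedSchemaPruningOptInSuite extends QueryTest with SharedSQLContext {
  test("nested column queries with schema pruning enabled") {
    withSQLConf(SQLConf.NESTED_SCHEMA_PRUNING_ENABLED.key -> "true") {
      // run the existing core sql/parquet nested-field queries here
    }
  }
}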


---




[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-16 Thread DaimonPl
Github user DaimonPl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16578#discussion_r151485646
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -961,6 +961,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val NESTED_SCHEMA_PRUNING_ENABLED =
+    buildConf("spark.sql.nestedSchemaPruning.enabled")
+      .internal()
+      .doc("Prune nested fields from a logical relation's output which are unnecessary in " +
+        "satisfying a query. This optimization allows columnar file format readers to avoid " +
+        "reading unnecessary nested column data.")
+      .booleanConf
+      .createWithDefault(true)
--- End diff --

Nope :( Maybe @viirya can give some input on it?


---




[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-16 Thread DaimonPl
Github user DaimonPl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16578#discussion_r151476679
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -961,6 +961,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val NESTED_SCHEMA_PRUNING_ENABLED =
+    buildConf("spark.sql.nestedSchemaPruning.enabled")
+      .internal()
+      .doc("Prune nested fields from a logical relation's output which are unnecessary in " +
+        "satisfying a query. This optimization allows columnar file format readers to avoid " +
+        "reading unnecessary nested column data.")
+      .booleanConf
+      .createWithDefault(true)
--- End diff --

Just to be clear: I mean making it default to true for all tests in Spark, 
not only those explicitly related to this feature :)


---




[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-16 Thread DaimonPl
Github user DaimonPl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16578#discussion_r151344821
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -961,6 +961,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val NESTED_SCHEMA_PRUNING_ENABLED =
+    buildConf("spark.sql.nestedSchemaPruning.enabled")
+      .internal()
+      .doc("Prune nested fields from a logical relation's output which are unnecessary in " +
+        "satisfying a query. This optimization allows columnar file format readers to avoid " +
+        "reading unnecessary nested column data.")
+      .booleanConf
+      .createWithDefault(true)
--- End diff --

How about making it default to true for tests?


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-10-26 Thread DaimonPl
Github user DaimonPl commented on the issue:

https://github.com/apache/spark/pull/16578
  
@mallman how about finalizing it as is? IMHO the performance improvements are 
worth more than a (possibly) redundant workaround - it could be cleaned up later.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-10-09 Thread DaimonPl
Github user DaimonPl commented on the issue:

https://github.com/apache/spark/pull/16578
  
@mallman @viirya from my understanding, the current workaround is for the 
case of reading columns which are not in the file schema.

> Parquet-mr will throw an exception if we try to read a superset of the file's schema.

Isn't it somehow dependent on the schema evolution setting? 
http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging

> Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by
> * setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or
> * setting the global SQL option spark.sql.parquet.mergeSchema to true.

Wouldn't it work fine with `spark.sql.parquet.mergeSchema` enabled?
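
Concretely, the two options quoted above would look like this (the path is 
a placeholder):

// Per-read: merge the schemas of all part-files for this load only.
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("/path/to/parquet/table")

// Or globally for the whole session:
spark.conf.set("spark.sql.parquet.mergeSchema", "true")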


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-09-21 Thread DaimonPl
Github user DaimonPl commented on the issue:

https://github.com/apache/spark/pull/16578
  
@mallman how about adding a comment explaining why such a workaround was 
done, plus the bug number in parquet-mr? That way, once that bug is fixed, 
the code can be cleaned up.

Also, maybe it's time to remove "DO NOT MERGE" from the title? As I 
understand it, most of the comments have been addressed :)

Thank you very much for your work on this feature. I must admit that we are 
looking forward to having this merged. For us this will be the most 
important improvement in Spark 2.3.0 (I hope it will be part of 2.3.0 :) )


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-09-08 Thread DaimonPl
Github user DaimonPl commented on the issue:

https://github.com/apache/spark/pull/16578
  
@mallman (just pure curiosity :) ) how is it possible that this NPE was not 
caught by the automated tests? Is there really not a single test case in the 
entire Spark suite that verifies scenarios like the one described by @snir?


---
