[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-12 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22357


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-11 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216776055
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
 val projectionRootFields = projects.flatMap(getRootFields)
 val filterRootFields = filters.flatMap(getRootFields)
 
-(projectionRootFields ++ filterRootFields).distinct
+// Kind of expressions don't need to access any fields of a root 
fields, e.g., `IsNotNull`.
+// For them, if there are any nested fields accessed in the query, we 
don't need to add root
+// field access of above expressions.
+// For example, for a query `SELECT name.first FROM contacts WHERE 
name IS NOT NULL`,
+// we don't need to read nested fields of `name` struct other than 
`first` field.
--- End diff --

Instead of reading any arbitrary field of simple type (which may not exist 
if it's a deeply nested struct), I think we should implement the pushdown with 
complex type in parquet with similar logic, and let parquet reader handle it. 

@viirya Can you create a followup JIRA for this? 

Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-11 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216714387
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
 val projectionRootFields = projects.flatMap(getRootFields)
 val filterRootFields = filters.flatMap(getRootFields)
 
-(projectionRootFields ++ filterRootFields).distinct
+// Kind of expressions don't need to access any fields of a root 
fields, e.g., `IsNotNull`.
+// For them, if there are any nested fields accessed in the query, we 
don't need to add root
+// field access of above expressions.
+// For example, for a query `SELECT name.first FROM contacts WHERE 
name IS NOT NULL`,
+// we don't need to read nested fields of `name` struct other than 
`first` field.
--- End diff --

Yeah, I'm okay with leaving it as-is.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-11 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216708046
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
 val projectionRootFields = projects.flatMap(getRootFields)
 val filterRootFields = filters.flatMap(getRootFields)
 
-(projectionRootFields ++ filterRootFields).distinct
+// Kind of expressions don't need to access any fields of a root 
fields, e.g., `IsNotNull`.
+// For them, if there are any nested fields accessed in the query, we 
don't need to add root
+// field access of above expressions.
+// For example, for a query `SELECT name.first FROM contacts WHERE 
name IS NOT NULL`,
+// we don't need to read nested fields of `name` struct other than 
`first` field.
--- End diff --

Btw, I think this is not seen as schema pruning case in the sense of 
original PR, so maybe we can leave it as it for now.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-11 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216702003
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
 val projectionRootFields = projects.flatMap(getRootFields)
 val filterRootFields = filters.flatMap(getRootFields)
 
-(projectionRootFields ++ filterRootFields).distinct
+// Kind of expressions don't need to access any fields of a root 
fields, e.g., `IsNotNull`.
+// For them, if there are any nested fields accessed in the query, we 
don't need to add root
+// field access of above expressions.
+// For example, for a query `SELECT name.first FROM contacts WHERE 
name IS NOT NULL`,
+// we don't need to read nested fields of `name` struct other than 
`first` field.
--- End diff --

Currently under this PR, `name` will be fully read. This is not perfect. 
However, to pick one arbitrary field from `name` sounds a little bit hacky to 
me. WDYT? cc @dbtsai @cloud-fan 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-11 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216694201
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala
 ---
@@ -155,6 +163,60 @@ class ParquetSchemaPruningSuite
   Row(null) :: Row(null) :: Nil)
   }
 
+  testSchemaPruning("select a single complex field and in where clause") {
+val query1 = sql("select name.first from contacts where name.first = 
'Jane'")
+checkScan(query1, "struct>")
+checkAnswer(query1, Row("Jane") :: Nil)
+
+val query2 = sql("select name.first, name.last from contacts where 
name.first = 'Jane'")
+checkScan(query2, "struct>")
+checkAnswer(query2, Row("Jane", "Doe") :: Nil)
+
+val query3 = sql("select name.first from contacts " +
+  "where employer.company.name = 'abc' and p = 1")
+checkScan(query3, "struct," +
+  "employer:struct>>")
+checkAnswer(query3, Row("Jane") :: Nil)
+
+val query4 = sql("select name.first, employer.company.name from 
contacts " +
+  "where employer.company is not null and p = 1")
+checkScan(query4, "struct," +
+  "employer:struct>>")
+checkAnswer(query4, Row("Jane", "abc") :: Nil)
+  }
+
+  testSchemaPruning("select nullable complex field and having is null 
predicate") {
--- End diff --

Oops, yes, thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-11 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216686762
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala
 ---
@@ -155,6 +163,60 @@ class ParquetSchemaPruningSuite
   Row(null) :: Row(null) :: Nil)
   }
 
+  testSchemaPruning("select a single complex field and in where clause") {
+val query1 = sql("select name.first from contacts where name.first = 
'Jane'")
+checkScan(query1, "struct>")
+checkAnswer(query1, Row("Jane") :: Nil)
+
+val query2 = sql("select name.first, name.last from contacts where 
name.first = 'Jane'")
+checkScan(query2, "struct>")
+checkAnswer(query2, Row("Jane", "Doe") :: Nil)
+
+val query3 = sql("select name.first from contacts " +
+  "where employer.company.name = 'abc' and p = 1")
+checkScan(query3, "struct," +
+  "employer:struct>>")
+checkAnswer(query3, Row("Jane") :: Nil)
+
+val query4 = sql("select name.first, employer.company.name from 
contacts " +
+  "where employer.company is not null and p = 1")
+checkScan(query4, "struct," +
+  "employer:struct>>")
+checkAnswer(query4, Row("Jane", "abc") :: Nil)
+  }
+
+  testSchemaPruning("select nullable complex field and having is null 
predicate") {
--- End diff --

Do you mean `having is not null predicate`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-11 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216683076
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
 val projectionRootFields = projects.flatMap(getRootFields)
 val filterRootFields = filters.flatMap(getRootFields)
 
-(projectionRootFields ++ filterRootFields).distinct
+// Kind of expressions don't need to access any fields of a root 
fields, e.g., `IsNotNull`.
+// For them, if there are any nested fields accessed in the query, we 
don't need to add root
+// field access of above expressions.
+// For example, for a query `SELECT name.first FROM contacts WHERE 
name IS NOT NULL`,
+// we don't need to read nested fields of `name` struct other than 
`first` field.
--- End diff --

@viirya, I see your point about the difference between a complex type being 
null and a subfield being null. So to answer the following query

select address from contacts where name is not null

do we need to read any of the fields in `name`? Or perhaps just read one 
arbitrary field of simple type, like `name.first`? That's surprising, but I'm 
starting to believe it's true.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-11 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216601125
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
 val projectionRootFields = projects.flatMap(getRootFields)
 val filterRootFields = filters.flatMap(getRootFields)
 
-(projectionRootFields ++ filterRootFields).distinct
+// Kind of expressions don't need to access any fields of a root 
fields, e.g., `IsNotNull`.
+// For them, if there are any nested fields accessed in the query, we 
don't need to add root
+// field access of above expressions.
+// For example, for a query `SELECT name.first FROM contacts WHERE 
name IS NOT NULL`,
+// we don't need to read nested fields of `name` struct other than 
`first` field.
--- End diff --

A complex column is null and its fields are null are different. I think we 
don't need to read all the fields to check if the complex column is null. In 
other words, in above case, when we only read `employer.id` and it is null, the 
predicate `employer is not null` will still be true because it is a complex 
column containing a null field.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-11 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216565635
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
 val projectionRootFields = projects.flatMap(getRootFields)
 val filterRootFields = filters.flatMap(getRootFields)
 
-(projectionRootFields ++ filterRootFields).distinct
+// Kind of expressions don't need to access any fields of a root 
fields, e.g., `IsNotNull`.
+// For them, if there are any nested fields accessed in the query, we 
don't need to add root
+// field access of above expressions.
+// For example, for a query `SELECT name.first FROM contacts WHERE 
name IS NOT NULL`,
+// we don't need to read nested fields of `name` struct other than 
`first` field.
--- End diff --

> I checked in the ParquetFilter, IsNotNull(employer) will be ignored since 
it's not a valid parquet filter as parquet doesn't support pushdown on the 
struct; thus, with this PR, this query will return wrong answer.

We may not worry about wrong answer from datasource like Parquet in 
predicate pushdown. As not all predicates are supported by pushdown, we always 
have a SparkSQL Filter on top of scan node to make sure to receive correct 
answer.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-11 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216559045
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
 val projectionRootFields = projects.flatMap(getRootFields)
 val filterRootFields = filters.flatMap(getRootFields)
 
-(projectionRootFields ++ filterRootFields).distinct
+// Kind of expressions don't need to access any fields of a root 
fields, e.g., `IsNotNull`.
+// For them, if there are any nested fields accessed in the query, we 
don't need to add root
+// field access of above expressions.
+// For example, for a query `SELECT name.first FROM contacts WHERE 
name IS NOT NULL`,
+// we don't need to read nested fields of `name` struct other than 
`first` field.
--- End diff --

For the first query, the constrain is `employer is not null`. 

When `employer.id` is not `null`, `employer` will always  not be `null`; as 
a result, this PR will work. 

However, when `employer.id` is `null`, `employer` can be `null` or 
`something`, so we need to check if `employer` is `something` to return a null 
of `employer.id`.

I checked in the `ParquetFilter`, `IsNotNull(employer)` will be ignored 
since it's not a valid parquet filter as parquet doesn't support pushdown on 
the struct; thus, with this PR, this query will return wrong answer. 

I think in this scenario, as @mallman suggested, we might need to read the 
full data. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-10 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216545091
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
 val projectionRootFields = projects.flatMap(getRootFields)
 val filterRootFields = filters.flatMap(getRootFields)
 
-(projectionRootFields ++ filterRootFields).distinct
+// Kind of expressions don't need to access any fields of a root 
fields, e.g., `IsNotNull`.
+// For them, if there are any nested fields accessed in the query, we 
don't need to add root
+// field access of above expressions.
+// For example, for a query `SELECT name.first FROM contacts WHERE 
name IS NOT NULL`,
+// we don't need to read nested fields of `name` struct other than 
`first` field.
--- End diff --

I'm having trouble accepting this, but perhaps I'm reading too much into it 
(or not enough). Let me illustrate with a couple of queries and their physical 
plans.

Assuming the data model in `ParquetSchemaPruningSuite.scala`, the physical 
plan for the query

select employer.id from contacts where employer is not null

is

```
== Physical Plan ==
*(1) Project [employer#36.id AS id#46]
+- *(1) Filter isnotnull(employer#36)
   +- *(1) FileScan parquet [employer#36,p#37] Batched: false, Format: 
Parquet,
PartitionCount: 2, PartitionFilters: [], PushedFilters: 
[IsNotNull(employer)],
ReadSchema: struct>
```

The physical plan for the query

select employer.id from contacts where employer.id is not null

is

```
== Physical Plan ==
*(1) Project [employer#36.id AS id#47]
+- *(1) Filter (isnotnull(employer#36) && isnotnull(employer#36.id))
   +- *(1) FileScan parquet [employer#36,p#37] Batched: false, Format: 
Parquet,
PartitionCount: 2, PartitionFilters: [], PushedFilters: 
[IsNotNull(employer)],
ReadSchema: struct>
```

The read schemata are the same, but the query filters are not. The file 
scan for the second query looks as I would expect, but the scan for the first 
query appears to only read `employer.id` even though it needs to check 
`employer is not null`. If it only reads `employer.id`, how does it check that 
`employer.company` is not null? Perhaps `employer.id` is null but 
`employer.company` is not null for some row...

I have run some tests to validate that this PR is returning the correct 
results for both queries, and it is. But I don't understand why.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-10 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216256434
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -199,6 +209,15 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
   case att: Attribute =>
 RootField(StructField(att.name, att.dataType, att.nullable), 
derivedFromAtt = true) :: Nil
   case SelectedField(field) => RootField(field, derivedFromAtt = 
false) :: Nil
+  // Root field accesses by `IsNotNull` and `IsNull` are special cases 
as the expressions
+  // don't actually use any nested fields. These root field accesses 
might be excluded later
+  // if there are any nested fields accesses in the query plan.
+  case IsNotNull(SelectedField(field)) =>
+RootField(field, derivedFromAtt = false, contentAccessed = false) 
:: Nil
+  case IsNull(SelectedField(field)) =>
+RootField(field, derivedFromAtt = false, contentAccessed = false) 
:: Nil
--- End diff --

@dbtsai The question you mentioned at 
https://github.com/apache/spark/pull/22357/files#r216204022 was addressed by 
this.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-10 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216255549
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala
 ---
@@ -155,6 +161,47 @@ class ParquetSchemaPruningSuite
   Row(null) :: Row(null) :: Nil)
   }
 
+  testSchemaPruning("select a single complex field and in where clause") {
+val query1 = sql("select name.first from contacts where name.first = 
'Jane'")
+checkScan(query1, "struct>")
+checkAnswer(query1, Row("Jane") :: Nil)
+
+val query2 = sql("select name.first, name.last from contacts where 
name.first = 'Jane'")
+checkScan(query2, "struct>")
+checkAnswer(query2, Row("Jane", "Doe") :: Nil)
+
+val query3 = sql("select name.first from contacts " +
+  "where employer.company.name = 'abc' and p = 1")
--- End diff --

When there is a nested field access in the query like 
`employer.company.name`, then we don't need other fields inside 
`employ.company` other than `name`.

But if there is no such access but just `employer.company is not null` in 
where clause, it will read full schema of `employ.company`.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-10 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216244620
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala
 ---
@@ -155,6 +161,47 @@ class ParquetSchemaPruningSuite
   Row(null) :: Row(null) :: Nil)
   }
 
+  testSchemaPruning("select a single complex field and in where clause") {
+val query1 = sql("select name.first from contacts where name.first = 
'Jane'")
+checkScan(query1, "struct>")
+checkAnswer(query1, Row("Jane") :: Nil)
+
+val query2 = sql("select name.first, name.last from contacts where 
name.first = 'Jane'")
+checkScan(query2, "struct>")
+checkAnswer(query2, Row("Jane", "Doe") :: Nil)
+
+val query3 = sql("select name.first from contacts " +
+  "where employer.company.name = 'abc' and p = 1")
--- End diff --

Added one query test for this case. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216218951
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -156,7 +161,7 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
 // in the resulting schema may differ from their ordering in the 
logical relation's
 // original schema
 val mergedSchema = requestedRootFields
-  .map { case RootField(field, _) => StructType(Array(field)) }
+  .map { case RootField(field, _, _) => StructType(Array(field)) }
--- End diff --

Not a big deal but `.map { root: RootField => StructType(Array(root.field)) 
}` per https://github.com/databricks/scala-style-guide#pattern-matching


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216218409
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -196,6 +201,9 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
*/
   private def getRootFields(expr: Expression): Seq[RootField] = {
 expr match {
+  // Those expressions don't really use the nested fields of a root 
field.
+  case i@(IsNotNull(_: Attribute) | IsNull(_: Attribute)) =>
--- End diff --

nit: -> `i @ (IsNotNull(_: ...`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216204022
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala
 ---
@@ -155,6 +161,47 @@ class ParquetSchemaPruningSuite
   Row(null) :: Row(null) :: Nil)
   }
 
+  testSchemaPruning("select a single complex field and in where clause") {
+val query1 = sql("select name.first from contacts where name.first = 
'Jane'")
+checkScan(query1, "struct>")
+checkAnswer(query1, Row("Jane") :: Nil)
+
+val query2 = sql("select name.first, name.last from contacts where 
name.first = 'Jane'")
+checkScan(query2, "struct>")
+checkAnswer(query2, Row("Jane", "Doe") :: Nil)
+
+val query3 = sql("select name.first from contacts " +
+  "where employer.company.name = 'abc' and p = 1")
--- End diff --

Let's say a user adds  `where employer.company is not null`, can we still 
read schema with `employer:struct>>` as we only 
mark `contentAccessed = false` when `IsNotNull` is on an attribute?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216202879
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -110,7 +110,12 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
 val projectionRootFields = projects.flatMap(getRootFields)
 val filterRootFields = filters.flatMap(getRootFields)
 
-(projectionRootFields ++ filterRootFields).distinct
+val (rootFields, optRootFields) = (projectionRootFields ++ 
filterRootFields)
+  .distinct.partition(_.contentAccessed)
--- End diff --

Some comments here please. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216200358
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -17,7 +17,7 @@
 
 package org.apache.spark.sql.execution.datasources.parquet
 
-import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
AttributeReference, Expression, NamedExpression}
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
AttributeReference, Expression, IsNotNull, IsNull, NamedExpression}
--- End diff --

which can be wildcard when there are more than 6 entities


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216199370
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -17,7 +17,7 @@
 
 package org.apache.spark.sql.execution.datasources.parquet
 
-import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
AttributeReference, Expression, NamedExpression}
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
AttributeReference, Expression, IsNotNull, IsNull, NamedExpression}
--- End diff --

line too long.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216199294
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -250,8 +258,9 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
 }
 
   /**
-   * A "root" schema field (aka top-level, no-parent) and whether it was 
derived from
-   * an attribute or had a proper child.
+   * A "root" schema field (aka top-level, no-parent), whether it was 
derived from
+   * an attribute or had a proper child, and whether it was accessed with 
its content.
*/
-  private case class RootField(field: StructField, derivedFromAtt: Boolean)
+  private case class RootField(field: StructField, derivedFromAtt: Boolean,
+   contentAccessed: Boolean = true)
--- End diff --

Formatting and please elaborate the comment


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216198335
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -196,6 +201,9 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
*/
   private def getRootFields(expr: Expression): Seq[RootField] = {
 expr match {
+  // Those expressions don't really use the nested fields of a root 
field.
+  case i@(IsNotNull(_: Attribute) | IsNull(_: Attribute)) =>
+getRootFields(i.children(0)).map(_.copy(contentAccessed = false))
   case att: Attribute =>
 RootField(StructField(att.name, att.dataType, att.nullable), 
derivedFromAtt = true) :: Nil
   case SelectedField(field) => RootField(field, derivedFromAtt = 
false) :: Nil
--- End diff --

How about 

```scala
  case IsNotNull(_: Attribute) | IsNull(_: Attribute) =>
expr.children.flatMap(getRootFields).map(_.copy(contentAccessed = 
false))
  case _ =>
expr.children.flatMap(getRootFields)
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r215899595
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala
 ---
@@ -155,6 +155,30 @@ class ParquetSchemaPruningSuite
   Row(null) :: Row(null) :: Nil)
   }
 
+  testSchemaPruning("select a single complex field and in where clause") {
+val query = sql("select name.first from contacts where name.first = 
'Jane'")
+checkScan(query, "struct>")
+checkAnswer(query, Row("Jane") :: Nil)
--- End diff --

Yes. Added test case for it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r215899485
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -196,6 +196,7 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
*/
   private def getRootFields(expr: Expression): Seq[RootField] = {
 expr match {
+  case IsNotNull(_: Attribute) | IsNull(_: Attribute) => Seq.empty
--- End diff --

Thanks. This was a case I didn't test. Fixed it and added test case.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r215877729
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -196,6 +196,7 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
*/
   private def getRootFields(expr: Expression): Seq[RootField] = {
 expr match {
+  case IsNotNull(_: Attribute) | IsNull(_: Attribute) => Seq.empty
--- End diff --

I mean, for instance, this case `select address is not null, name.last from 
contacts` it wouldn't work. I thought this is a quick bandaid fix to resolve a 
basic case.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r215866501
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -196,6 +196,7 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
*/
   private def getRootFields(expr: Expression): Seq[RootField] = {
 expr match {
+  case IsNotNull(_: Attribute) | IsNull(_: Attribute) => Seq.empty
--- End diff --

But the case mentioned here looks specific to the pushed filter itself. Can 
we add a simple test for project case as well?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r215864442
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -196,6 +196,7 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
*/
   private def getRootFields(expr: Expression): Seq[RootField] = {
 expr match {
+  case IsNotNull(_: Attribute) | IsNull(_: Attribute) => Seq.empty
--- End diff --

If this is in projects, I think we also don't need to include all nested 
fields?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r215857463
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -196,6 +196,7 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
*/
   private def getRootFields(expr: Expression): Seq[RootField] = {
 expr match {
+  case IsNotNull(_: Attribute) | IsNull(_: Attribute) => Seq.empty
--- End diff --

H .. shouldn't we exclude this for filters only in `identifyRootFields`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-07 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r215853912
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala
 ---
@@ -155,6 +155,30 @@ class ParquetSchemaPruningSuite
   Row(null) :: Row(null) :: Nil)
   }
 
+  testSchemaPruning("select a single complex field and in where clause") {
+val query = sql("select name.first from contacts where name.first = 
'Jane'")
+checkScan(query, "struct>")
+checkAnswer(query, Row("Jane") :: Nil)
--- End diff --

can you add another tests that select `name.first` and `name.last,` and 
apply `where clause` on `name.first`. We should only read `name.first` and 
`name.last` without name.middle.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-06 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/22357

[SPARK-25363][SQL] Fix schema pruning in where clause by ignoring 
unnecessary root fields 

## What changes were proposed in this pull request?


Schema pruning doesn't work if nested column is used in where clause.

For example,
```
sql("select name.first from contacts where name.first = 'David'")

== Physical Plan ==
*(1) Project [name#19.first AS first#40]
+- *(1) Filter (isnotnull(name#19) && (name#19.first = David))
   +- *(1) FileScan parquet [name#19] Batched: false, Format: Parquet, 
PartitionFilters: [], 
PushedFilters: [IsNotNull(name)], ReadSchema: 
struct>
```

In above query plan, the scan node reads the entire schema of `name` column.

This issue is reported by:
https://github.com/apache/spark/pull/21320#issuecomment-419290197

The cause is that we infer a root field from expression `IsNotNull(name)`. 
However, for such expression, we don't really use the nested fields of this 
root field, so we can ignore the unnecessary nested fields.

## How was this patch tested?

Unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 SPARK-25363

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22357.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22357


commit 3de6ee2c28e1242fead892f922d0053e31509a9f
Author: Liang-Chi Hsieh 
Date:   2018-09-07T05:06:07Z

Fix schema pruning when used in where clause.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org