[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2020-01-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29721:
--
Affects Version/s: 3.0.0

> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Kai Kang
>Priority: Major
>
> This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
>  
> Part 2: reading it back and explaining the queries
> {noformat}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
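> // Note: nested schema pruning is an optimizer flag; on versions where it is
> // off by default (e.g. the 2.4.x line) it can equivalently be enabled in SQL:
> //   SET spark.sql.optimizer.nestedSchemaPruning.enabled=true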
> // pruned, only loading itemId
> // ReadSchema: struct<items:array<struct<itemId:bigint>>>
> read.select($"items.itemId").explain(true)
> // not pruned, loading both itemId and itemData
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> read.select(explode($"items.itemId")).explain(true)
> {noformat}
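> 
> As a sanity check, here is a minimal sketch (assuming internal Spark APIs;
> FileSourceScanExec and its requiredSchema field are not a stable interface)
> for reading the scan's required schema out of the executed plan instead of
> eyeballing the explain output:
> {noformat}
> import org.apache.spark.sql.execution.FileSourceScanExec
> 
> // Collect the schema every file scan in the executed plan was asked to read.
> def readSchemas(df: org.apache.spark.sql.DataFrame): Seq[String] =
>   df.queryExecution.executedPlan.collect {
>     case scan: FileSourceScanExec => scan.requiredSchema.catalogString
>   }
> 
> // Pruned case is expected to report struct<items:array<struct<itemId:bigint>>>.
> readSchemas(read.select($"items.itemId"))
> readSchemas(read.select(explode($"items.itemId")))
> {noformat}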
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2020-01-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29721:
--
Affects Version/s: 2.4.0
   2.4.1
   2.4.2
   2.4.3

> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>Reporter: Kai Kang
>Priority: Major
>
> This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
>  
> Part 2: reading it back and explaining the queries
> {noformat}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> // pruned, only loading itemId
> // ReadSchema: struct<items:array<struct<itemId:bigint>>>
> read.select($"items.itemId").explain(true)
> // not pruned, loading both itemId and itemData
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> read.select(explode($"items.itemId")).explain(true)
> {noformat}
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29721:
-
Priority: Major  (was: Critical)

> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Major
>
> This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
>  
> Part 2: reading it back and explaining the queries
> {noformat}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> // pruned, only loading itemId
> // ReadSchema: struct<items:array<struct<itemId:bigint>>>
> read.select($"items.itemId").explain(true)
> // not pruned, loading both itemId and itemData
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> read.select(explode($"items.itemId")).explain(true)
> {noformat}
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{noformat}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{noformat}
 
Part 2: reading it back and explaining the queries
{noformat}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

// pruned, only loading itemId
// ReadSchema: struct<items:array<struct<itemId:bigint>>>
read.select($"items.itemId").explain(true)

// not pruned, loading both itemId and itemData
// ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
read.select(explode($"items.itemId")).explain(true)
{noformat}
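
A possible workaround sketch (hypothetical; not verified on the affected
versions): project the nested field into a top-level array column first and
explode that column afterwards, so that the scan itself only needs itemId.
{noformat}
// The first select is the pruned case from Part 2; exploding the resulting
// top-level array column should not widen what the Parquet scan reads
// (unverified on the affected versions).
val ids = read.select($"items.itemId".as("itemIds"))
ids.select(explode($"itemIds").as("itemId")).explain(true)
{noformat}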
 

  was:
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{noformat}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{noformat}
 
Part 2: reading it back and explaining the queries
{noformat}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
{noformat}
 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
>  
> Part 2: reading it back and explaining the queries
> {noformat}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> // pruned, only loading itemId
> // ReadSchema: struct<items:array<struct<itemId:bigint>>>
> read.select($"items.itemId").explain(true)
> // not pruned, loading both itemId and itemData
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> read.select(explode($"items.itemId")).explain(true)
> {noformat}
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{noformat}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{noformat}
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
 

  was:
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{quote}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{quote}
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
>  
> Part 2: reading it back and explaining the queries
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select($"items.itemId").explain(true) // pruned, only loading itemId
> read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{noformat}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{noformat}
 
Part 2: reading it back and explaining the queries
{noformat}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
{noformat}
 

  was:
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{noformat}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{noformat}
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
>  
> Part 2: reading it back and explaining the queries
> {noformat}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select($"items.itemId").explain(true) // pruned, only loading itemId
> read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
> {noformat}
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{quote}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{quote}
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
 

  was:
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

bq. val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {quote}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {quote}
>  
> Part 2: reading it back and explaining the queries
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select($"items.itemId").explain(true) // pruned, only loading itemId
> read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

bq. val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
 

  was:
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

{quote}val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted") {quote}
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> bq. val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
>  
> Part 2: reading it back and explaining the queries
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select($"items.itemId").explain(true) // pruned, only loading itemId
> read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

{quote}val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted") {quote}
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
 

  was:
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted") 
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {quote}val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted") {quote}
>  
> Part 2: reading it back and explaining the queries
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select($"items.itemId").explain(true) // pruned, only loading itemId
> read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted") 
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
 

  was:
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

val jsonStr = """{
 "items": [
 {
 "itemId": 1,
 "itemData": "a"
 },
 {
 "itemId": 1,
 "itemData": "b"
 }
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted") 
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted") 
>  
> Part 2: reading it back and explaining the queries
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select($"items.itemId").explain(true) // pruned, only loading itemId
> read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

val jsonStr = """{
 "items": [
 {
 "itemId": 1,
 "itemData": "a"
 },
 {
 "itemId": 1,
 "itemData": "b"
 }
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted") 
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
 

  was:
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

val jsonStr = """{
 "items": [
 {
 "itemId": 1,
 "itemData": "a"
 },
 {
 "itemId": 1,
 "itemData": "b"
 }
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted") 

 

Part 2: reading it back and explaining the queries

{{val read = spark.table("persisted")}}
{{spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)}}
{{read.select($"items.itemId").explain(true) // pruned, only loading itemId}}
{{read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData}}

 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> val jsonStr = """{
>  "items": [
>  {
>  "itemId": 1,
>  "itemData": "a"
>  },
>  {
>  "itemId": 1,
>  "itemData": "b"
>  }
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted") 
>  
> Part 2: reading it back and explaining the queries
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select($"items.itemId").explain(true) // pruned, only loading itemId
> read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

val jsonStr = """{
 "items": [
 {
 "itemId": 1,
 "itemData": "a"
 },
 {
 "itemId": 1,
 "itemData": "b"
 }
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted") 

 

Part 2: reading it back and explaining the queries

{{val read = spark.table("persisted")}}
{{spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)}}
{{read.select($"items.itemId").explain(true) // pruned, only loading itemId}}
{{read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData}}

 

  was:
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

val jsonStr = """{
 "items": [
 {
 "itemId": 1,
 "itemData": "a"
 },
 {
 "itemId": 1,
 "itemData": "b"
 }
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")

 

 

Part 2: reading it back and explaining the queries

{{val read = spark.table("persisted")}}
{{spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)}}
{{read.select($"items.itemId").explain(true) // pruned, only loading itemId}}
{{read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData}}

 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> val jsonStr = """{
>  "items": [
>  {
>  "itemId": 1,
>  "itemData": "a"
>  },
>  {
>  "itemId": 1,
>  "itemData": "b"
>  }
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted") 
>  
> Part 2: reading it back and explaining the queries
> {{val read = spark.table("persisted")}}
> {{spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)}}
> {{read.select($"items.itemId").explain(true) // pruned, only loading itemId}}
> {{read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData}}
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{quote}{{import spark.implicits._}}
val jsonStr = """{
 "items": [
 {
 "itemId": 1,
 "itemData": "a"
 },
 {
 "itemId": 1,
 "itemData": "b"
 }
 ]
}"""
 {{val df = spark.read.json(Seq(jsonStr).toDS)}}
 {{df.write.format("parquet").mode("overwrite").saveAsTable("persisted")}}
{quote}
Part 2: reading it back and explaining the queries
{quote}val read = spark.table("persisted")
 spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
 read.select($"items.itemId").explain(true) // pruned, only loading itemId

read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
{quote}
 

 

  was:
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{quote}{{import spark.implicits._}}
{{val jsonStr = """{}}
{{ "items": [}}
{{  {}}
{{    "itemId": 1,}}
{{    "itemData": "a"}}
{{  },}}
{{  {}}
{{    "itemId": 1,}}
{{    "itemData": "b"}}
{{  }}}
{{ ]}"""}}
{{val df = spark.read.json(Seq(jsonStr).toDS)}}
{{df.write.format("parquet").mode("overwrite").saveAsTable("persisted")}}
{quote}
Part 2: reading it back and explaining the queries
{quote}val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId

read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
{quote}
 

 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {quote}{{import spark.implicits._}}
> val jsonStr = """{
>  "items": [
>  {
>  "itemId": 1,
>  "itemData": "a"
>  },
>  {
>  "itemId": 1,
>  "itemData": "b"
>  }
>  ]
> }"""
>  {{val df = spark.read.json(Seq(jsonStr).toDS)}}
>  {{df.write.format("parquet").mode("overwrite").saveAsTable("persisted")}}
> {quote}
> Part 2: reading it back and explaining the queries
> {quote}val read = spark.table("persisted")
>  spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
>  read.select($"items.itemId").explain(true) // pruned, only loading itemId
> read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
> {quote}
>  
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

val jsonStr = """{
 "items": [
 {
 "itemId": 1,
 "itemData": "a"
 },
 {
 "itemId": 1,
 "itemData": "b"
 }
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")

 

 

Part 2: reading it back and explaining the queries

{{val read = spark.table("persisted")}}
{{spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)}}
{{read.select($"items.itemId").explain(true) // pruned, only loading itemId}}
{{read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData}}

 

  was:
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

{{val jsonStr = """{}}
{{ "items": [}}
{{ {}}
{{ "itemId": 1,}}
{{ "itemData": "a"}}
{{ },}}
{{ {}}
{{ "itemId": 1,}}
{{ "itemData": "b"}}
{{ }}}
{{ ]}}
{{}"""}}

Part 2: reading it back and explaining the queries

{{val read = spark.table("persisted")}}
{{spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)}}
{{read.select($"items.itemId").explain(true) // pruned, only loading itemId}}
{{read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData}}

 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> val jsonStr = """{
>  "items": [
>  {
>  "itemId": 1,
>  "itemData": "a"
>  },
>  {
>  "itemId": 1,
>  "itemData": "b"
>  }
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
>  
>  
> Part 2: reading it back and explaining the queries
> {{val read = spark.table("persisted")}}
> {{spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)}}
> {{read.select($"items.itemId").explain(true) // pruned, only loading itemId}}
> {{read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData}}
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

{{val jsonStr = """{}}
{{ "items": [}}
{{ {}}
{{ "itemId": 1,}}
{{ "itemData": "a"}}
{{ },}}
{{ {}}
{{ "itemId": 1,}}
{{ "itemData": "b"}}
{{ }}}
{{ ]}}
{{}"""}}

Part 2: reading it back and explaining the queries

{{val read = spark.table("persisted")}}
{{spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)}}
{{read.select($"items.itemId").explain(true) // pruned, only loading itemId}}
{{read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData}}

 

  was:
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{quote}{{import spark.implicits._}}
val jsonStr = """{
 "items": [
 {
 "itemId": 1,
 "itemData": "a"
 },
 {
 "itemId": 1,
 "itemData": "b"
 }
 ]
}"""
 {{val df = spark.read.json(Seq(jsonStr).toDS)}}
 {{df.write.format("parquet").mode("overwrite").saveAsTable("persisted")}}
{quote}
Part 2: reading it back and explaining the queries
{quote}val read = spark.table("persisted")
 spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
 read.select($"items.itemId").explain(true) // pruned, only loading itemId

read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
{quote}
 

 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {{val jsonStr = """{}}
> {{ "items": [}}
> {{ {}}
> {{ "itemId": 1,}}
> {{ "itemData": "a"}}
> {{ },}}
> {{ {}}
> {{ "itemId": 1,}}
> {{ "itemData": "b"}}
> {{ }}}
> {{ ]}}
> {{}"""}}
> Part 2: reading it back and explaining the queries
> {{val read = spark.table("persisted")}}
> {{spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)}}
> {{read.select($"items.itemId").explain(true) // pruned, only loading itemId}}
> {{read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData}}
>  


