[jira] [Created] (SPARK-42879) Spark SQL reads unnecessary nested fields

2023-03-21 Thread Jiri Humpolicek (Jira)
Jiri Humpolicek created SPARK-42879:
---

 Summary: Spark SQL reads unnecessary nested fields
 Key: SPARK-42879
 URL: https://issues.apache.org/jira/browse/SPARK-42879
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.2
Reporter: Jiri Humpolicek


When we use more than one field from the structure after explode, all fields are 
read.

Example:
1) Loading data
{code:scala}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData1": "a", "itemData2": 11},
   {"itemId": 2, "itemData1": "b", "itemData2": 22}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{code}
2) read query with explain
{code:scala}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

read
.select(explode('items).as('item))
.select($"item.itemId", $"item.itemData1")
.explain
// ReadSchema: struct<items:array<struct<itemData1:string,itemData2:bigint,itemId:bigint>>>
{code}
We use only the *itemId* and *itemData1* fields from the structure in the array, but 
the read schema contains the *itemData2* field as well.
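
For comparison, projecting a single field after the explode is pruned on the same 
data, which suggests the problem is specific to multi-field projections. A minimal 
sketch; the exact ReadSchema string shown is my assumption:
{code:scala}
// Single-field projection after explode: nested-schema pruning applies here,
// so only itemId should be read from the Parquet files.
read
  .select(explode($"items").as("item"))
  .select($"item.itemId")
  .explain()
// Expected ReadSchema (assumed): struct<items:array<struct<itemId:bigint>>>
{code}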



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42872) Spark SQL reads unnecessary nested fields

2023-03-20 Thread Jiri Humpolicek (Jira)
Jiri Humpolicek created SPARK-42872:
---

 Summary: Spark SQL reads unnecessary nested fields
 Key: SPARK-42872
 URL: https://issues.apache.org/jira/browse/SPARK-42872
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.2
Reporter: Jiri Humpolicek


When we use higher-order functions in a Spark SQL query, it would be great if the 
following example could somehow be written so that Spark reads only the necessary 
nested fields.

Example:
1) Loading data
{code:scala}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{code}
2) read query with explain
{code:scala}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

read.select(transform($"items", i => i.getItem("itemId")).as('itemIds)).explain(true)
// ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
{code}
We use only the *itemId* field from the structure in the array, but the read schema 
contains all fields of the structure.
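
For comparison, the existing pruning does apply when the nested field is extracted 
through a plain dot path instead of a higher-order function (SPARK-29721 / 
SPARK-4502). A workaround sketch; the ReadSchema shown is my assumption:
{code:scala}
// Extract itemId via the dot path instead of transform(); the result is an
// array<bigint> column equivalent to the transform() projection above, and
// nested-schema pruning can handle it.
read.select($"items.itemId".as("itemIds")).explain(true)
// Expected ReadSchema (assumed): struct<items:array<struct<itemId:bigint>>>
{code}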



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39696) Uncaught exception in thread executor-heartbeater java.util.ConcurrentModificationException: mutation occurred during iteration

2022-09-15 Thread Jiri Humpolicek (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605615#comment-17605615
 ] 

Jiri Humpolicek commented on SPARK-39696:
-

Same for me. Is it possible to extend any timeout?
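
For reference, a sketch of the heartbeat-related settings that could be raised in 
the meantime (whether raising them actually helps with this race is only an 
assumption):
{code:scala}
// Sketch (assumption): relax the heartbeat cadence and network timeout.
// This does not fix the underlying ConcurrentModificationException; it only
// changes how often the heartbeater runs.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.executor.heartbeatInterval", "30s") // default is 10s
  .config("spark.network.timeout", "600s")           // keep well above the heartbeat interval
  .getOrCreate()
{code}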

> Uncaught exception in thread executor-heartbeater 
> java.util.ConcurrentModificationException: mutation occurred during iteration
> ---
>
> Key: SPARK-39696
> URL: https://issues.apache.org/jira/browse/SPARK-39696
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
> Environment: Spark 3.3.0 (spark-3.3.0-bin-hadoop3-scala2.13 
> distribution)
> Scala 2.13.8 / OpenJDK 17.0.3 application compilation
> Alpine Linux 3.14.3
> JVM OpenJDK 64-Bit Server VM Temurin-17.0.1+12
>Reporter: Stephen Mcmullan
>Priority: Major
>
> {noformat}
> 2022-06-21 18:17:49.289Z ERROR [executor-heartbeater] 
> org.apache.spark.util.Utils - Uncaught exception in thread 
> executor-heartbeater
> java.util.ConcurrentModificationException: mutation occurred during iteration
> at 
> scala.collection.mutable.MutationTracker$.checkMutations(MutationTracker.scala:43)
>  ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.mutable.CheckedIndexedSeqView$CheckedIterator.hasNext(CheckedIndexedSeqView.scala:47)
>  ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.IterableOnceOps.copyToArray(IterableOnce.scala:873) 
> ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.IterableOnceOps.copyToArray$(IterableOnce.scala:869) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.AbstractIterator.copyToArray(Iterator.scala:1293) 
> ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.IterableOnceOps.copyToArray(IterableOnce.scala:852) 
> ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.IterableOnceOps.copyToArray$(IterableOnce.scala:852) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.AbstractIterator.copyToArray(Iterator.scala:1293) 
> ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.immutable.VectorStatics$.append1IfSpace(Vector.scala:1959) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.immutable.Vector1.appendedAll0(Vector.scala:425) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.immutable.Vector.appendedAll(Vector.scala:203) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.immutable.Vector.appendedAll(Vector.scala:113) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.SeqOps.concat(Seq.scala:187) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.SeqOps.concat$(Seq.scala:187) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.AbstractSeq.concat(Seq.scala:1161) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.IterableOps.$plus$plus(Iterable.scala:726) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.IterableOps.$plus$plus$(Iterable.scala:726) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.AbstractIterable.$plus$plus(Iterable.scala:926) 
> ~[scala-library-2.13.8.jar:?]
> at 
> org.apache.spark.executor.TaskMetrics.accumulators(TaskMetrics.scala:261) 
> ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at 
> org.apache.spark.executor.Executor.$anonfun$reportHeartBeat$1(Executor.scala:1042)
>  ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.AbstractIterable.foreach(Iterable.scala:926) 
> ~[scala-library-2.13.8.jar:?]
> at 
> org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1036) 
> ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at 
> org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:238) 
> ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) 
> ~[scala-library-2.13.8.jar:?]
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2066) 
> ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46) 
> ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) 
> ~[?:?]
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
>  ~[?:?]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>  ~[?:?]
>   

[jira] [Comment Edited] (SPARK-37450) Spark SQL reads unnecessary nested fields (another type of pruning case)

2021-11-23 Thread Jiri Humpolicek (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448412#comment-17448412
 ] 

Jiri Humpolicek edited comment on SPARK-37450 at 11/24/21, 7:39 AM:


Thanks for the answer. Maybe I don't know the internals of Parquet, but when I need 
to know the number of rows in a Parquet table:
{code:java}
read.select(count(lit(1))).explain(true)
// ReadSchema: struct<>{code}
the read schema is empty (I suppose no column is accessed).

So is there any way to get the size of an array in Parquet without reading the whole 
sub-structure? If there is not, you have at least shown an optimization: read the 
"smallest" attribute of the sub-structure.

And I think that is a job for the optimizer; the user just needs the size of the array.
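
A sketch of the "smallest attribute" workaround mentioned above, assuming {{itemId}} 
is the cheapest field and that the dot-path extraction is pruned as in SPARK-34638:
{code:scala}
// Compute the array length from a single pruned nested field instead of
// forcing a read of the whole struct.
read.select(size($"items.itemId").as("itemCount")).explain(true)
// Expected ReadSchema (assumed): struct<items:array<struct<itemId:bigint>>>
{code}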


was (Author: yuryn):
Thanks for the answer. Maybe I don't know the internals of Parquet, but when I need 
to know the number of rows in a Parquet table:
{code:java}
read.select(count(lit(1))).explain(true)
// ReadSchema: struct<>{code}
the read schema is empty (I suppose no column is accessed).

So is there any way to get the size of an array in Parquet without reading the whole 
sub-structure? If there is not, you have at least shown an optimization: read the 
"smallest" attribute of the sub-structure.

> Spark SQL reads unnecessary nested fields (another type of pruning case)
> 
>
> Key: SPARK-37450
> URL: https://issues.apache.org/jira/browse/SPARK-37450
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Jiri Humpolicek
>Priority: Major
>
> Based on [SPARK-34638|https://issues.apache.org/jira/browse/SPARK-34638], maybe I 
> have found another nested-field pruning case. In this case I found a full read 
> with the `count` function.
> Example:
> 1) Loading data
> {code:scala}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {code}
> 2) read query with explain
> {code:scala}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select(explode($"items").as('item)).select(count(lit(true))).explain(true)
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37450) Spark SQL reads unnecessary nested fields (another type of pruning case)

2021-11-23 Thread Jiri Humpolicek (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448412#comment-17448412
 ] 

Jiri Humpolicek commented on SPARK-37450:
-

Thanks for the answer. Maybe I don't know the internals of Parquet, but when I need 
to know the number of rows in a Parquet table:
{code:java}
read.select(count(lit(1))).explain(true)
// ReadSchema: struct<>{code}
the read schema is empty (I suppose no column is accessed).

So is there any way to get the size of an array in Parquet without reading the whole 
sub-structure? If there is not, you have at least shown an optimization: read the 
"smallest" attribute of the sub-structure.

> Spark SQL reads unnecessary nested fields (another type of pruning case)
> 
>
> Key: SPARK-37450
> URL: https://issues.apache.org/jira/browse/SPARK-37450
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Jiri Humpolicek
>Priority: Major
>
> Based on [SPARK-34638|https://issues.apache.org/jira/browse/SPARK-34638], maybe I 
> have found another nested-field pruning case. In this case I found a full read 
> with the `count` function.
> Example:
> 1) Loading data
> {code:scala}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {code}
> 2) read query with explain
> {code:scala}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select(explode($"items").as('item)).select(count(lit(true))).explain(true)
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37450) Spark SQL reads unnecessary nested fields (another type of pruning case)

2021-11-23 Thread Jiri Humpolicek (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448001#comment-17448001
 ] 

Jiri Humpolicek commented on SPARK-37450:
-

Same situation with the {{size}} function:
{code:scala}
read.select(size('items)).explain(true)
// ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
{code}

> Spark SQL reads unnecessary nested fields (another type of pruning case)
> 
>
> Key: SPARK-37450
> URL: https://issues.apache.org/jira/browse/SPARK-37450
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Jiri Humpolicek
>Priority: Major
>
> Based on [SPARK-34638|https://issues.apache.org/jira/browse/SPARK-34638], maybe I 
> have found another nested-field pruning case. In this case I found a full read 
> with the `count` function.
> Example:
> 1) Loading data
> {code:scala}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {code}
> 2) read query with explain
> {code:scala}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select(explode($"items").as('item)).select(count(lit(true))).explain(true)
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37450) Spark SQL reads unnecessary nested fields (another type of pruning case)

2021-11-23 Thread Jiri Humpolicek (Jira)
Jiri Humpolicek created SPARK-37450:
---

 Summary: Spark SQL reads unnecessary nested fields (another type 
of pruning case)
 Key: SPARK-37450
 URL: https://issues.apache.org/jira/browse/SPARK-37450
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Jiri Humpolicek


Based on [SPARK-34638|https://issues.apache.org/jira/browse/SPARK-34638], maybe I 
have found another nested-field pruning case. In this case I found a full read with 
the `count` function.

Example:
1) Loading data

{code:scala}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{code}

2) read query with explain

{code:scala}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

read.select(explode($"items").as('item)).select(count(lit(true))).explain(true)
// ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
{code}
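
For comparison (as also noted in the comments on this issue), a plain row count over 
the table reads no columns at all, so the full struct read above comes entirely from 
the explode:
{code:scala}
read.select(count(lit(1))).explain(true)
// ReadSchema: struct<>
{code}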




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34640) unable to access grouping column after groupBy

2021-03-05 Thread Jiri Humpolicek (Jira)
Jiri Humpolicek created SPARK-34640:
---

 Summary: unable to access grouping column after groupBy
 Key: SPARK-34640
 URL: https://issues.apache.org/jira/browse/SPARK-34640
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
Reporter: Jiri Humpolicek


When I group by a nested column, I am unable to reference it after the groupBy 
operation.

 Example:
 1) Preparing dataframe with nested column:
{code:scala}
case class Sub(a2: String)
case class Top(a1: String, s: Sub)

val s = Seq(
Top("r1", Sub("s1")),
Top("r2", Sub("s3"))
)
val df = s.toDF

df.printSchema
// root
//  |-- a1: string (nullable = true)
//  |-- s: struct (nullable = true)
//  ||-- a2: string (nullable = true)
{code}
2) try to access grouping column after groupBy:
{code:scala}
df.groupBy($"s.a2").count.select('a2)
// org.apache.spark.sql.AnalysisException: cannot resolve '`a2`' given input columns: [count, s.a2];
{code}
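
A workaround sketch (my assumption, not a fix): alias the nested grouping expression 
so that the grouped output gets a plain top-level column name that can be referenced 
afterwards.
{code:scala}
// Aliasing the grouping column gives the aggregate output a selectable name.
df.groupBy($"s.a2".as("a2")).count.select($"a2").show()
{code}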
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29721) Spark SQL reads unnecessary nested fields after using explode

2021-03-05 Thread Jiri Humpolicek (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295859#comment-17295859
 ] 

Jiri Humpolicek commented on SPARK-29721:
-

Here it is: [SPARK-34638|https://issues.apache.org/jira/browse/SPARK-34638]

> Spark SQL reads unnecessary nested fields after using explode
> -
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kai Kang
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>
> This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column 
> pruning for nested structures. However, when explode() is called on a nested 
> field, all columns for that nested structure are still fetched from the data 
> source.
> We are working on a project to create a Parquet store for a big pre-joined 
> table between two tables that have a one-to-many relationship, and this is a 
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
>  
> Part 2: reading it back and explaining the queries
> {noformat}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> // pruned, only loading itemId
> // ReadSchema: struct<items:array<struct<itemId:bigint>>>
> read.select($"items.itemId").explain(true)
> // not pruned, loading both itemId and itemData
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> read.select(explode($"items.itemId")).explain(true)
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34638) Spark SQL reads unnecessary nested fields (another type of pruning case)

2021-03-05 Thread Jiri Humpolicek (Jira)
Jiri Humpolicek created SPARK-34638:
---

 Summary: Spark SQL reads unnecessary nested fields (another type 
of pruning case)
 Key: SPARK-34638
 URL: https://issues.apache.org/jira/browse/SPARK-34638
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1
Reporter: Jiri Humpolicek


Based on [SPARK-29721|https://issues.apache.org/jira/browse/SPARK-29721], I found 
another nested-field pruning case.

Example:
1) Loading data

{code:scala}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{code}

2) read query with explain

{code:scala}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

read.select(explode($"items").as('item)).select($"item.itemId").explain(true)
// ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
{code}
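
For comparison, this is the behaviour from SPARK-29721 referenced above: the same 
projection without the explode is pruned.
{code:scala}
read.select($"items.itemId").explain(true)
// ReadSchema (pruned): struct<items:array<struct<itemId:bigint>>>
{code}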




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29721) Spark SQL reads unnecessary nested fields after using explode

2021-03-03 Thread Jiri Humpolicek (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17294414#comment-17294414
 ] 

Jiri Humpolicek commented on SPARK-29721:
-

I tested this on spark-3.1.1 and it works properly. But when I change the query to:

{{read.select(explode($"items").as('item)).select($"item.itemId").explain(true)}}

the read schema is still "complete":

{{struct<items:array<struct<itemData:string,itemId:bigint>>>}}

 

> Spark SQL reads unnecessary nested fields after using explode
> -
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kai Kang
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>
> This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column 
> pruning for nested structures. However, when explode() is called on a nested 
> field, all columns for that nested structure are still fetched from the data 
> source.
> We are working on a project to create a Parquet store for a big pre-joined 
> table between two tables that have a one-to-many relationship, and this is a 
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
>  
> Part 2: reading it back and explaining the queries
> {noformat}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> // pruned, only loading itemId
> // ReadSchema: struct<items:array<struct<itemId:bigint>>>
> read.select($"items.itemId").explain(true)
> // not pruned, loading both itemId and itemData
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> read.select(explode($"items.itemId")).explain(true)
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26631) Issue while reading Parquet data from Hadoop Archive files (.har)

2020-05-22 Thread Jiri Humpolicek (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114013#comment-17114013
 ] 

Jiri Humpolicek commented on SPARK-26631:
-

I have just tried the same thing with the same result: reading Parquet from a har 
file does not work for me. Moreover, I can't even read a JSON file from a har. My 
aim was to minimize the number of files on HDFS by using har files.
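
Since the failure is at schema inference, one thing worth trying (an assumption on 
my part, not a verified fix) is to supply the schema explicitly when reading from 
the har path:
{code:scala}
import org.apache.spark.sql.types._

// Hypothetical sketch: the field list must match userdata1.parquet; only the
// first two fields from the issue description are shown here.
val schema = StructType(Seq(
  StructField("registration_dttm", TimestampType),
  StructField("id", IntegerType)
  // ... remaining fields of userdata1.parquet
))
val hardf = spark.read.schema(schema)
  .parquet("har:///tmp/testarchive.har/userdata1.parquet")
{code}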

> Issue while reading Parquet data from Hadoop Archive files (.har)
> -
>
> Key: SPARK-26631
> URL: https://issues.apache.org/jira/browse/SPARK-26631
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sathish
>Priority: Minor
>
> While reading a Parquet file from a Hadoop Archive file, Spark fails with the 
> exception below:
>  
> {code:java}
> scala> val hardf = 
> sqlContext.read.parquet("har:///tmp/testarchive.har/userdata1.parquet") 
> org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. 
> It must be specified manually.;   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
>    at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
>    at scala.Option.getOrElse(Option.scala:121)   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207)
>    at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:393)
>    at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)  
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)   at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:622)   at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:606)   ... 
> 49 elided
> {code}
>  
> Whereas the same parquet file can be read normally without any issues
> {code:java}
> scala> val df = 
> sqlContext.read.parquet("hdfs:///tmp/testparquet/userdata1.parquet")
> df: org.apache.spark.sql.DataFrame = [registration_dttm: timestamp, id: int 
> ... 11 more fields]{code}
>  
> +Here are the steps to reproduce the issue+
>  
> a) hadoop fs -mkdir /tmp/testparquet
> b) Get sample parquet data and rename the file to userdata1.parquet
> wget 
> [https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet?raw=true]
> c) hadoop fs -put userdata.parquet /tmp/testparquet
> d) hadoop archive -archiveName testarchive.har -p /tmp/testparquet /tmp
> e) We should be able to see the file under har file
> hadoop fs -ls har:///tmp/testarchive.har
> f) Launch spark2 / spark shell
> g)
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>     val df = 
> sqlContext.read.parquet("har:///tmp/testarchive.har/userdata1.parquet"){code}
> Is there anything I am missing here?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24018) Spark-without-hadoop package fails to create or read parquet files with snappy compression

2018-06-29 Thread Jiri Humpolicek (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527327#comment-16527327
 ] 

Jiri Humpolicek commented on SPARK-24018:
-

Hi, I have exactly the same problem with spark-2.3.1 without Hadoop, and I spent a 
whole morning before I found this issue. It would be great to have it fixed.
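
Until the classpath issue is sorted out, a workaround sketch (my assumption) is to 
write with a codec other than snappy, which avoids the snappy-java native call 
entirely:
{code:scala}
import spark.implicits._

// Same example as in the issue description, but with gzip instead of snappy.
val df = List(1, 2, 3, 4).toDF
df.write
  .format("parquet")
  .option("compression", "gzip")
  .mode("overwrite")
  .save("test.parquet")
{code}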

> Spark-without-hadoop package fails to create or read parquet files with 
> snappy compression
> --
>
> Key: SPARK-24018
> URL: https://issues.apache.org/jira/browse/SPARK-24018
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.0
>Reporter: Jean-Francis Roy
>Priority: Minor
>
> On a brand-new installation of Spark 2.3.0 with a user-provided hadoop-2.8.3, 
> Spark fails to read or write dataframes in parquet format with snappy 
> compression.
> This is due to an incompatibility between the snappy-java version that is 
> required by parquet (parquet is provided in Spark jars but snappy isn't) and 
> the version that is available from hadoop-2.8.3.
>  
> Steps to reproduce:
>  * Download and extract hadoop-2.8.3
>  * Download and extract spark-2.3.0-without-hadoop
>  * export JAVA_HOME, HADOOP_HOME, SPARK_HOME, PATH
>  * Following instructions from 
> [https://spark.apache.org/docs/latest/hadoop-provided.html], set 
> SPARK_DIST_CLASSPATH=$(hadoop classpath) in spark-env.sh
>  * Start a spark-shell, enter the following:
>  
> {code:java}
> import spark.implicits._
> val df = List(1, 2, 3, 4).toDF
> df.write
>   .format("parquet")
>   .option("compression", "snappy")
>   .mode("overwrite")
>   .save("test.parquet")
> {code}
>  
>  
> This fails with the following:
> {noformat}
> java.lang.UnsatisfiedLinkError: 
> org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
> at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
> at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
> at 
> org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
> at 
> org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
> at 
> org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
> at 
> org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
> at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
> at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
> at 
> org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
> at 
> org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109)
> at 
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:396)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at 
> org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at