[jira] [Updated] (SPARK-46981) Driver OOM happens in query planning phase with empty tables
[ https://issues.apache.org/jira/browse/SPARK-46981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Noritaka Sekiyama updated SPARK-46981: -- Description:
We have observed that a Driver OOM happens in the query planning phase with empty tables when we run specific patterns of queries.
h2. Issue details
If we run the query with the where condition {{pt>='20231004' and pt<='20231004'}}, the query fails in the planning phase due to a Driver OOM, more specifically {{java.lang.OutOfMemoryError: GC overhead limit exceeded}}.
If we change the where condition from {{pt>='20231004' and pt<='20231004'}} to {{pt='20231004' or pt='20231005'}}, the SQL runs without any error.
This issue happens even with an empty table, before any actual data is loaded, so it appears to be an issue on the Catalyst side.
h2. Reproduction steps
Attaching a script and a query to reproduce the issue.
* create_sanitized_tables.py: Script to create the table definitions
** No need to place any data files, as the issue reproduces with an empty location.
* test_and_twodays_simplified.sql: Query to reproduce the issue
Here's the typical stacktrace:
{noformat}
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at scala.collection.immutable.Vector.iterator(Vector.scala:100)
    at scala.collection.immutable.Vector.iterator(Vector.scala:69)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at scala.collection.generic.GenericTraversableTemplate.transpose(GenericTraversableTemplate.scala:219)
    at scala.collection.generic.GenericTraversableTemplate.transpose$(GenericTraversableTemplate.scala:211)
    at scala.collection.AbstractTraversable.transpose(Traversable.scala:108)
    at org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:461)
    at org.apache.spark.sql.catalyst.plans.logical.Window.output(basicLogicalOperators.scala:1205)
    at org.apache.spark.sql.catalyst.planning.PhysicalOperation$.$anonfun$unapply$2(patterns.scala:119)
    at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$$Lambda$1874/539825188.apply(Unknown Source)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.catalyst.planning.PhysicalOperation$.unapply(patterns.scala:119)
    at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:307)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2114/1104718965.apply(Unknown Source)
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
    at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:70)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2117/2079515765.apply(Unknown Source)
    at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)
    at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)
    at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192)
    at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1431)
{noformat}
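For reference, the two predicate forms described above are not logically equivalent: the failing range condition covers one day, while the reported workaround covers two. A minimal pure-Python sketch (not Spark code; partition values are strings compared lexicographically, as in the report) makes the difference explicit, which suggests the planner problem is tied to the `>=`/`<=` comparison operators themselves rather than to the partitions selected (the table is empty in either case):

```python
# Pure-Python sketch (not Spark): compare the partitions selected by the
# failing range predicate and by the reported workaround predicate.
# Partition values are strings like '20231004', compared lexicographically.

def range_predicate(pt: str) -> bool:
    # Form that reportedly triggers the planner OOM
    return pt >= '20231004' and pt <= '20231004'

def workaround_predicate(pt: str) -> bool:
    # Rewritten form that reportedly plans without error
    return pt == '20231004' or pt == '20231005'

partitions = ['20231003', '20231004', '20231005', '20231006']
print([pt for pt in partitions if range_predicate(pt)])       # ['20231004']
print([pt for pt in partitions if workaround_predicate(pt)])  # ['20231004', '20231005']
```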
[jira] [Updated] (SPARK-46981) Driver OOM happens in query planning phase with empty tables
[ https://issues.apache.org/jira/browse/SPARK-46981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Noritaka Sekiyama updated SPARK-46981: -- Attachment: test_and_twodays_simplified.sql
> Key: SPARK-46981
> URL: https://issues.apache.org/jira/browse/SPARK-46981
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.0
> Environment: * OSS Spark 3.5.0
> * Amazon EMR Spark 3.3.0 (EMR release label 6.9.0)
> * AWS Glue Spark 3.3.0 (Glue version 4.0)
> Reporter: Noritaka Sekiyama
> Priority: Major
> Attachments: create_sanitized_tables.py, test_and_twodays_simplified.sql
[jira] [Created] (SPARK-46981) Driver OOM happens in query planning phase with empty tables
Noritaka Sekiyama created SPARK-46981: - Summary: Driver OOM happens in query planning phase with empty tables Key: SPARK-46981 URL: https://issues.apache.org/jira/browse/SPARK-46981 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Environment: * OSS Spark 3.5.0 * Amazon EMR Spark 3.3.0 (EMR release label 6.9.0) * AWS Glue Spark 3.3.0 (Glue version 4.0) Reporter: Noritaka Sekiyama Attachments: create_sanitized_tables.py
[jira] [Updated] (SPARK-46981) Driver OOM happens in query planning phase with empty tables
[ https://issues.apache.org/jira/browse/SPARK-46981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Noritaka Sekiyama updated SPARK-46981: -- Attachment: create_sanitized_tables.py
[jira] [Updated] (SPARK-33266) Add total duration, read duration, and write duration as task level metrics
[ https://issues.apache.org/jira/browse/SPARK-33266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Noritaka Sekiyama updated SPARK-33266: -- Description:
Sometimes we need to identify performance bottlenecks, for example, how long it took to read from a data store, or how long it took to write into another data store.
It would be great if we could have total duration, read duration, and write duration as task-level metrics.
Currently, it seems that neither `InputMetrics` nor `OutputMetrics` has duration-related metrics.
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/InputMetrics.scala#L42-L58]
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/OutputMetrics.scala#L41-L56]
On the other hand, other metrics such as `ShuffleWriteMetrics` have a write time. We might need similar metrics for input/output.
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ShuffleWriteMetrics.scala]
> Key: SPARK-33266
> URL: https://issues.apache.org/jira/browse/SPARK-33266
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.1
> Reporter: Noritaka Sekiyama
> Priority: Major
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
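Until such metrics exist, one user-side workaround is to time the read and write calls directly. This only gives coarse, driver-side wall-clock durations, not the per-task metrics the issue requests; a hypothetical sketch with stand-in I/O functions (the `timed` helper and the lambdas are ours, not Spark APIs):

```python
import time

# Hypothetical workaround: since InputMetrics/OutputMetrics expose no
# durations, wrap the read and write calls and time them with a wall clock.
# The lambdas below stand in for calls like spark.read.parquet(...) and
# df.write.parquet(...).

def timed(label, fn, *args, **kwargs):
    """Run fn, report its wall-clock duration, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label} took {elapsed:.3f}s")
    return result, elapsed

# Example with stand-in read/write functions:
data, read_s = timed("read", lambda: list(range(1000)))
_, write_s = timed("write", lambda rows: sum(rows), data)
total_s = read_s + write_s
```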
[jira] [Created] (SPARK-33266) Add total duration, read duration, and write duration as task level metrics
Noritaka Sekiyama created SPARK-33266: - Summary: Add total duration, read duration, and write duration as task level metrics Key: SPARK-33266 URL: https://issues.apache.org/jira/browse/SPARK-33266 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.1 Reporter: Noritaka Sekiyama
[jira] [Updated] (SPARK-32432) Add support for reading ORC/Parquet files with SymlinkTextInputFormat
[ https://issues.apache.org/jira/browse/SPARK-32432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Noritaka Sekiyama updated SPARK-32432: -- Description:
Hive-style symlink tables (SymlinkTextInputFormat) are commonly used in different analytic engines, including prestodb and prestosql.
Currently, SymlinkTextInputFormat works with JSON/CSV files but does not work with ORC/Parquet files in Apache Spark (or Apache Hive). On the other hand, prestodb and prestosql support SymlinkTextInputFormat with ORC/Parquet files.
This issue is to add support for reading ORC/Parquet files with SymlinkTextInputFormat in Apache Spark.
Related links:
* Hive's SymlinkTextInputFormat: [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.java]
* prestosql's implementation adding support for reading Avro files with SymlinkTextInputFormat: [https://github.com/vincentpoon/prestosql/blob/master/presto-hive/src/main/java/io/prestosql/plugin/hive/BackgroundHiveSplitLoader.java]
> Key: SPARK-32432
> URL: https://issues.apache.org/jira/browse/SPARK-32432
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Noritaka Sekiyama
> Priority: Major
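For context, a SymlinkTextInputFormat table location does not hold the data itself: it holds small manifest ("symlink") files, and each line of a manifest is the path of an actual data file to scan. A hypothetical Python sketch of that resolution step (the directory layout, file names, and the `resolve_symlink_dir` helper are illustrative, not Hive or Spark code):

```python
import os
import tempfile

# Sketch of the SymlinkTextInputFormat layout: each manifest file in the
# table location lists the real data-file paths, one per line.

def resolve_symlink_dir(manifest_dir: str) -> list:
    """Collect the data-file paths listed in every manifest under manifest_dir."""
    paths = []
    for name in sorted(os.listdir(manifest_dir)):
        with open(os.path.join(manifest_dir, name)) as f:
            paths.extend(line.strip() for line in f if line.strip())
    return paths

# Build a toy manifest directory and resolve it.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "symlink.txt"), "w") as f:
        f.write("s3://bucket/data/part-0001.parquet\n"
                "s3://bucket/data/part-0002.parquet\n")
    files = resolve_symlink_dir(d)
print(files)  # ['s3://bucket/data/part-0001.parquet', 's3://bucket/data/part-0002.parquet']
```

The issue is that Spark's ORC/Parquet read paths never perform this indirection, while its text-based readers (and presto's split loader) do.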
[jira] [Created] (SPARK-32432) Add support for reading ORC/Parquet files with SymlinkTextInputFormat
Noritaka Sekiyama created SPARK-32432: - Summary: Add support for reading ORC/Parquet files with SymlinkTextInputFormat Key: SPARK-32432 URL: https://issues.apache.org/jira/browse/SPARK-32432 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Noritaka Sekiyama
[jira] [Updated] (SPARK-32112) Easier way to repartition/coalesce DataFrames based on the number of parallel tasks that Spark can process at the same time
[ https://issues.apache.org/jira/browse/SPARK-32112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noritaka Sekiyama updated SPARK-32112:
--
Description:
Repartition/coalesce is very important for optimizing a Spark application's performance; however, a lot of users struggle with determining the number of partitions. This issue is to add an easier way to repartition/coalesce DataFrames based on the number of parallel tasks that Spark can process at the same time. It will help Spark users determine the optimal number of partitions.

Expected use-cases:
- repartition with the calculated number of parallel tasks

Notes:
- `SparkContext.maxNumConcurrentTasks` might help, but it cannot be accessed by Spark apps.
- `SparkContext.getExecutorMemoryStatus` might help to calculate the number of available slots to process tasks.

was:
Repartition/coalesce is very important for optimizing a Spark application's performance; however, a lot of users struggle with determining the number of partitions. This issue is to add an easier way to repartition/coalesce DataFrames based on the number of parallel tasks that Spark can process at the same time. It will help Spark users determine the optimal number of partitions.

Expected use-cases:
- repartition with the calculated number of parallel tasks

There is `SparkContext.maxNumConcurrentTasks`, but it cannot be accessed by Spark apps.

> Easier way to repartition/coalesce DataFrames based on the number of parallel
> tasks that Spark can process at the same time
> ---
>
> Key: SPARK-32112
> URL: https://issues.apache.org/jira/browse/SPARK-32112
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Noritaka Sekiyama
> Priority: Major
>
> Repartition/coalesce is very important for optimizing a Spark application's performance; however, a lot of users struggle with determining the number of partitions.
> This issue is to add an easier way to repartition/coalesce DataFrames based on the number of parallel tasks that Spark can process at the same time.
> It will help Spark users determine the optimal number of partitions.
>
> Expected use-cases:
> - repartition with the calculated number of parallel tasks
>
> Notes:
> - `SparkContext.maxNumConcurrentTasks` might help, but it cannot be accessed by Spark apps.
> - `SparkContext.getExecutorMemoryStatus` might help to calculate the number of available slots to process tasks.
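The calculation the issue asks Spark to expose can be sketched as follows. This is a hedged illustration, not a proposed API: the helper name and the idea of deriving slot counts from executor/core counts (as `getExecutorMemoryStatus` might allow) are assumptions of this sketch.

```python
def max_parallel_tasks(num_executors, cores_per_executor, cpus_per_task=1):
    """Estimate how many tasks Spark can run at the same time.

    Assumes a static cluster: each executor offers
    cores_per_executor // cpus_per_task task slots
    (cpus_per_task mirrors the spark.task.cpus setting).
    """
    return num_executors * (cores_per_executor // cpus_per_task)

# e.g. 10 executors x 8 cores with 1 CPU per task -> 80 concurrent tasks,
# which a Spark app could then use as: df = df.repartition(slots)
slots = max_parallel_tasks(10, 8)
```

For comparison, `sc.defaultParallelism` already exists and approximates total cores on some cluster managers, but it does not account for `spark.task.cpus` or give apps the scheduler's own `maxNumConcurrentTasks` value, which is the gap the issue describes.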
[jira] [Updated] (SPARK-32112) Easier way to repartition/coalesce DataFrames based on the number of parallel tasks that Spark can process at the same time
[ https://issues.apache.org/jira/browse/SPARK-32112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noritaka Sekiyama updated SPARK-32112:
--
Description:
Repartition/coalesce is very important for optimizing a Spark application's performance; however, a lot of users struggle with determining the number of partitions. This issue is to add an easier way to repartition/coalesce DataFrames based on the number of parallel tasks that Spark can process at the same time. It will help Spark users determine the optimal number of partitions.

Expected use-cases:
- repartition with the calculated number of parallel tasks

There is `SparkContext.maxNumConcurrentTasks`, but it cannot be accessed by Spark apps.

was:
Repartition/coalesce is very important for optimizing a Spark application's performance; however, a lot of users struggle with determining the number of partitions. This issue is to add a method to calculate the number of parallel tasks that Spark can process at the same time. It will help Spark users determine the optimal number of partitions.

Expected use-cases:
- repartition with the calculated number of parallel tasks

> Easier way to repartition/coalesce DataFrames based on the number of parallel
> tasks that Spark can process at the same time
> ---
>
> Key: SPARK-32112
> URL: https://issues.apache.org/jira/browse/SPARK-32112
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Noritaka Sekiyama
> Priority: Major
>
> Repartition/coalesce is very important for optimizing a Spark application's performance; however, a lot of users struggle with determining the number of partitions.
> This issue is to add an easier way to repartition/coalesce DataFrames based on the number of parallel tasks that Spark can process at the same time.
> It will help Spark users determine the optimal number of partitions.
>
> Expected use-cases:
> - repartition with the calculated number of parallel tasks
>
> There is `SparkContext.maxNumConcurrentTasks`, but it cannot be accessed by Spark apps.
[jira] [Updated] (SPARK-32112) Easier way to repartition/coalesce DataFrames based on the number of parallel tasks that Spark can process at the same time
[ https://issues.apache.org/jira/browse/SPARK-32112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noritaka Sekiyama updated SPARK-32112:
--
Summary: Easier way to repartition/coalesce DataFrames based on the number of parallel tasks that Spark can process at the same time (was: Add a method to calculate the number of parallel tasks that Spark can process at the same time)

> Easier way to repartition/coalesce DataFrames based on the number of parallel
> tasks that Spark can process at the same time
> ---
>
> Key: SPARK-32112
> URL: https://issues.apache.org/jira/browse/SPARK-32112
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Noritaka Sekiyama
> Priority: Major
>
> Repartition/coalesce is very important for optimizing a Spark application's performance; however, a lot of users struggle with determining the number of partitions.
> This issue is to add a method to calculate the number of parallel tasks that Spark can process at the same time.
> It will help Spark users determine the optimal number of partitions.
>
> Expected use-cases:
> - repartition with the calculated number of parallel tasks
[jira] [Created] (SPARK-32112) Add a method to calculate the number of parallel tasks that Spark can process at the same time
Noritaka Sekiyama created SPARK-32112:
-
Summary: Add a method to calculate the number of parallel tasks that Spark can process at the same time
Key: SPARK-32112
URL: https://issues.apache.org/jira/browse/SPARK-32112
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.0.0
Reporter: Noritaka Sekiyama

Repartition/coalesce is very important for optimizing a Spark application's performance; however, a lot of users struggle with determining the number of partitions. This issue is to add a method to calculate the number of parallel tasks that Spark can process at the same time. It will help Spark users determine the optimal number of partitions.

Expected use-cases:
- repartition with the calculated number of parallel tasks
[jira] [Updated] (SPARK-32013) Support query execution before/after reading/writing over JDBC
[ https://issues.apache.org/jira/browse/SPARK-32013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noritaka Sekiyama updated SPARK-32013:
--
Description:
For ETL workloads, there is a common requirement to execute SQL statements before/after reading/writing over JDBC. Here are some examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure (also requested in SPARK-32014)

Currently the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame. However, this query is only for reading data, and it does not support the common examples listed above. On the other hand, there is the `sessionInitStatement` option, which runs before writing data from a DataFrame. This option runs custom SQL to implement session initialization code. Since it runs per session, it cannot be used for non-idempotent operations.

If Spark can support executing SQL statements against JDBC data sources before/after reading/writing over JDBC, it can cover a lot of common use-cases.

Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`. [https://github.com/databricks/spark-redshift]

was:
For ETL workloads, there is a common requirement to execute SQL statements before/after reading/writing over JDBC. Here are some examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure (also requested in SPARK-32014)

Currently the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame. However, this query is only for reading data, and it does not support the common examples listed above. On the other hand, there is the `sessionInitStatement` option, which runs before writing data from a DataFrame. This option runs custom SQL to implement session initialization code. Since it runs per session, it cannot be used for write operations.

If Spark can support executing SQL statements against JDBC data sources before/after reading/writing over JDBC, it can cover a lot of common use-cases.

Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`. [https://github.com/databricks/spark-redshift]

> Support query execution before/after reading/writing over JDBC
> --
>
> Key: SPARK-32013
> URL: https://issues.apache.org/jira/browse/SPARK-32013
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Noritaka Sekiyama
> Priority: Major
>
> For ETL workloads, there is a common requirement to execute SQL statements before/after reading/writing over JDBC.
> Here are some examples:
> - Create a view with specific conditions
> - Delete/update some records
> - Truncate a table (already possible via the `truncate` option)
> - Execute a stored procedure (also requested in SPARK-32014)
>
> Currently the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame.
> However, this query is only for reading data, and it does not support the common examples listed above.
> On the other hand, there is the `sessionInitStatement` option, which runs before writing data from a DataFrame.
> This option runs custom SQL to implement session initialization code. Since it runs per session, it cannot be used for non-idempotent operations.
>
> If Spark can support executing SQL statements against JDBC data sources before/after reading/writing over JDBC, it can cover a lot of common use-cases.
>
> Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`.
> [https://github.com/databricks/spark-redshift]
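The requested pattern can be sketched independently of Spark. Below, a hypothetical `write_with_actions` helper runs arbitrary SQL before and after a write, in the spirit of the old spark-redshift connector's `preactions`/`postactions` options; sqlite3 stands in for the JDBC connection, and none of this is an existing Spark API:

```python
import sqlite3

def write_with_actions(conn, preactions, write, postactions):
    """Run SQL statements before and after a write over one connection.

    preactions/postactions are lists of SQL strings; write is a callable
    that performs the actual data load. (Illustrative helper only.)
    """
    cur = conn.cursor()
    for stmt in preactions:
        cur.execute(stmt)
    write(conn)
    for stmt in postactions:
        cur.execute(stmt)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")

write_with_actions(
    conn,
    preactions=["DELETE FROM t"],                       # e.g. clear stale rows before the load
    write=lambda c: c.executemany("INSERT INTO t VALUES (?)", [(2,), (3,)]),
    postactions=["CREATE VIEW v AS SELECT * FROM t"],   # e.g. create a view after the load
)
rows = [r[0] for r in conn.execute("SELECT id FROM v ORDER BY id")]
```

Running the pre/post statements on the same session as the write is the key difference from `sessionInitStatement`, which fires once per session rather than once per read/write operation.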
[jira] [Updated] (SPARK-32013) Support query execution before/after reading/writing over JDBC
[ https://issues.apache.org/jira/browse/SPARK-32013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noritaka Sekiyama updated SPARK-32013:
--
Description:
For ETL workloads, there is a common requirement to execute SQL statements before/after reading/writing over JDBC. Here are some examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure (also requested in SPARK-32014)

Currently the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame. However, this query is only for reading data, and it does not support the common examples listed above. On the other hand, there is the `sessionInitStatement` option, which runs before writing data from a DataFrame. This option runs custom SQL to implement session initialization code. Since it runs per session, it cannot be used for write operations.

If Spark can support executing SQL statements against JDBC data sources before/after reading/writing over JDBC, it can cover a lot of common use-cases.

Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`. [https://github.com/databricks/spark-redshift]

was:
For ETL workloads, there is a common requirement to execute SQL statements before/after reading/writing over JDBC. Here are some examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure (also requested in SPARK-32014)

Currently the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame. [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html] However, this query is only for reading data, and it does not support the common examples listed above.

If Spark can support executing SQL statements against JDBC data sources before/after reading/writing over JDBC, it can cover a lot of common use-cases.

Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`. [https://github.com/databricks/spark-redshift]

> Support query execution before/after reading/writing over JDBC
> --
>
> Key: SPARK-32013
> URL: https://issues.apache.org/jira/browse/SPARK-32013
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Noritaka Sekiyama
> Priority: Major
>
> For ETL workloads, there is a common requirement to execute SQL statements before/after reading/writing over JDBC.
> Here are some examples:
> - Create a view with specific conditions
> - Delete/update some records
> - Truncate a table (already possible via the `truncate` option)
> - Execute a stored procedure (also requested in SPARK-32014)
>
> Currently the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame.
> However, this query is only for reading data, and it does not support the common examples listed above.
> On the other hand, there is the `sessionInitStatement` option, which runs before writing data from a DataFrame.
> This option runs custom SQL to implement session initialization code. Since it runs per session, it cannot be used for write operations.
>
> If Spark can support executing SQL statements against JDBC data sources before/after reading/writing over JDBC, it can cover a lot of common use-cases.
>
> Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`. [https://github.com/databricks/spark-redshift]
[jira] [Updated] (SPARK-32013) Support query execution before/after reading/writing over JDBC
[ https://issues.apache.org/jira/browse/SPARK-32013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noritaka Sekiyama updated SPARK-32013:
--
Description:
For ETL workloads, there is a common requirement to execute SQL statements before/after reading/writing over JDBC. Here are some examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure (also requested in SPARK-32014)

Currently the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame. [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html] However, this query is only for reading data, and it does not support the common examples listed above.

If Spark can support executing SQL statements against JDBC data sources before/after reading/writing over JDBC, it can cover a lot of common use-cases.

Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`. [https://github.com/databricks/spark-redshift]

was:
For ETL workloads, there is a common requirement to execute SQL statements before/after reading/writing over JDBC. Here are some examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure

Currently the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame. [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html] However, this query is only for reading data, and it does not support the common examples listed above.

If Spark can support executing SQL statements against JDBC data sources before/after reading/writing over JDBC, it can cover a lot of common use-cases.

Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`. https://github.com/databricks/spark-redshift

> Support query execution before/after reading/writing over JDBC
> --
>
> Key: SPARK-32013
> URL: https://issues.apache.org/jira/browse/SPARK-32013
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Noritaka Sekiyama
> Priority: Major
>
> For ETL workloads, there is a common requirement to execute SQL statements before/after reading/writing over JDBC.
> Here are some examples:
> - Create a view with specific conditions
> - Delete/update some records
> - Truncate a table (already possible via the `truncate` option)
> - Execute a stored procedure (also requested in SPARK-32014)
>
> Currently the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame.
> [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html]
> However, this query is only for reading data, and it does not support the common examples listed above.
>
> If Spark can support executing SQL statements against JDBC data sources before/after reading/writing over JDBC, it can cover a lot of common use-cases.
>
> Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`. [https://github.com/databricks/spark-redshift]
[jira] [Updated] (SPARK-32013) Support query execution before/after reading/writing over JDBC
[ https://issues.apache.org/jira/browse/SPARK-32013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noritaka Sekiyama updated SPARK-32013:
--
Description:
For ETL workloads, there is a common requirement to execute SQL statements before/after reading/writing over JDBC. Here are some examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure

Currently the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame. [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html] However, this query is only for reading data, and it does not support the common examples listed above.

If Spark can support executing SQL statements against JDBC data sources before/after reading/writing over JDBC, it can cover a lot of common use-cases.

Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`. https://github.com/databricks/spark-redshift

was:
For ETL workloads, there is a common requirement to execute SQL statements before/after reading/writing over JDBC. Here are some examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure

Currently the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame. https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html However, this query is only for reading data, and it does not support the common examples listed above.

If Spark can support executing SQL statements against JDBC data sources before/after reading/writing over JDBC, it can cover a lot of common use-cases.

Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`.

> Support query execution before/after reading/writing over JDBC
> --
>
> Key: SPARK-32013
> URL: https://issues.apache.org/jira/browse/SPARK-32013
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Noritaka Sekiyama
> Priority: Major
>
> For ETL workloads, there is a common requirement to execute SQL statements before/after reading/writing over JDBC.
> Here are some examples:
> - Create a view with specific conditions
> - Delete/update some records
> - Truncate a table (already possible via the `truncate` option)
> - Execute a stored procedure
>
> Currently the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame.
> [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html]
> However, this query is only for reading data, and it does not support the common examples listed above.
>
> If Spark can support executing SQL statements against JDBC data sources before/after reading/writing over JDBC, it can cover a lot of common use-cases.
>
> Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`. https://github.com/databricks/spark-redshift
[jira] [Updated] (SPARK-32013) Support query execution before/after reading/writing over JDBC
[ https://issues.apache.org/jira/browse/SPARK-32013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noritaka Sekiyama updated SPARK-32013:
--
Description:
For ETL workloads, there is a common requirement to execute SQL statements before/after reading/writing over JDBC. Here are some examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure

Currently the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame. https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html However, this query is only for reading data, and it does not support the common examples listed above.

If Spark can support executing SQL statements against JDBC data sources before/after reading/writing over JDBC, it can cover a lot of common use-cases.

Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`.

was:
For ETL workloads, there is a common requirement to execute SQL statements before/after reading/writing over JDBC. Here are some examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure

Currently the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame. https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html However, this query is only for reading data, and it does not support the common examples listed above.

If Spark can support executing SQL statements against JDBC data sources before/after reading/writing over JDBC, it can cover a lot of common use-cases.

> Support query execution before/after reading/writing over JDBC
> --
>
> Key: SPARK-32013
> URL: https://issues.apache.org/jira/browse/SPARK-32013
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Noritaka Sekiyama
> Priority: Major
>
> For ETL workloads, there is a common requirement to execute SQL statements before/after reading/writing over JDBC.
> Here are some examples:
> - Create a view with specific conditions
> - Delete/update some records
> - Truncate a table (already possible via the `truncate` option)
> - Execute a stored procedure
>
> Currently the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame.
> https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
> However, this query is only for reading data, and it does not support the common examples listed above.
>
> If Spark can support executing SQL statements against JDBC data sources before/after reading/writing over JDBC, it can cover a lot of common use-cases.
>
> Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`.
[jira] [Created] (SPARK-32013) Support query execution before/after reading/writing over JDBC
Noritaka Sekiyama created SPARK-32013:
-
Summary: Support query execution before/after reading/writing over JDBC
Key: SPARK-32013
URL: https://issues.apache.org/jira/browse/SPARK-32013
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.0.0
Reporter: Noritaka Sekiyama

For ETL workloads, there is a common requirement to execute SQL statements before/after reading/writing over JDBC. Here are some examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure

Currently the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame. https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html However, this query is only for reading data, and it does not support the common examples listed above.

If Spark can support executing SQL statements against JDBC data sources before/after reading/writing over JDBC, it can cover a lot of common use-cases.
[jira] [Updated] (SPARK-28069) Switch log directory from Spark UI without restarting history server
[ https://issues.apache.org/jira/browse/SPARK-28069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noritaka Sekiyama updated SPARK-28069:
--
Description:
The history server polls the directory specified in spark.history.fs.logDirectory and displays the event logs based on the data included in that directory. In the cloud, event logs might be written into different directories from different clusters. Currently it is not so easy to launch a history server for multiple clusters because there is no way to switch the log directory after starting the history server. If users can switch the log directory from the Spark UI without restarting the history server, it would be beneficial for users who want to use cloud storage for the event log directory.

Suggested implementation:
* The history server polls the directory specified in spark.history.fs.logDirectory when it starts.
* Add a "Switch log directory" button to the top page of the Web UI.
* The directory that users select is stored only in memory. (It won't be a persistent configuration.)
* The next time the history server starts, the directory specified in spark.history.fs.logDirectory is used.

was:
The history server polls the directory specified in spark.history.fs.logDirectory and displays the event logs based on the data included in that directory. In the cloud, event logs might be written into different directories from different clusters. Currently it is not so easy to launch a history server for multiple clusters because there is no way to switch the log directory after starting the history server. If users can switch the log directory from the Spark UI without restarting the history server, it would be beneficial for users who want to use cloud storage for the event log directory.

Suggested implementation
* The history server polls the directory specified in spark.history.fs.logDirectory when it starts.
* Add a "Switch log directory" button to the top page of the Web UI.
* The directory that users select is stored only in memory. (It won't be a persistent configuration.)
* The next time the history server starts, the directory specified in spark.history.fs.logDirectory is used.

> Switch log directory from Spark UI without restarting history server
>
>
> Key: SPARK-28069
> URL: https://issues.apache.org/jira/browse/SPARK-28069
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 2.4.3
> Reporter: Noritaka Sekiyama
> Priority: Minor
>
> The history server polls the directory specified in spark.history.fs.logDirectory and displays the event logs based on the data included in that directory.
> In the cloud, event logs might be written into different directories from different clusters.
> Currently it is not so easy to launch a history server for multiple clusters because there is no way to switch the log directory after starting the history server.
> If users can switch the log directory from the Spark UI without restarting the history server, it would be beneficial for users who want to use cloud storage for the event log directory.
>
> Suggested implementation:
> * The history server polls the directory specified in spark.history.fs.logDirectory when it starts.
> * Add a "Switch log directory" button to the top page of the Web UI.
> * The directory that users select is stored only in memory. (It won't be a persistent configuration.)
> * The next time the history server starts, the directory specified in spark.history.fs.logDirectory is used.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
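The suggested switch-without-persistence behaviour can be sketched in a few lines. This is purely illustrative state-handling logic (not actual History Server code; the class and method names are invented for the sketch):

```python
class HistoryServerState:
    """Sketch: the polled log directory can be switched at runtime,
    but the change lives only in memory, so a restart falls back to
    the configured spark.history.fs.logDirectory value."""

    def __init__(self, configured_dir):
        self.configured_dir = configured_dir  # from spark.history.fs.logDirectory
        self.current_dir = configured_dir     # the directory the poller reads

    def switch(self, new_dir):
        # Triggered by the proposed "Switch log directory" button;
        # deliberately not persisted anywhere.
        self.current_dir = new_dir

    def restart(self):
        # A restart re-reads only the persistent configuration.
        return HistoryServerState(self.configured_dir)

s = HistoryServerState("s3://logs/cluster-a/")
s.switch("s3://logs/cluster-b/")   # UI now shows cluster-b's event logs
after_restart = s.restart()        # back to cluster-a, per the proposal
```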
[jira] [Updated] (SPARK-28069) Switch log directory from Spark UI without restarting history server
[ https://issues.apache.org/jira/browse/SPARK-28069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Noritaka Sekiyama updated SPARK-28069: -- Description: History server polls the directory specified in spark.history.fs.logDirectory, and displays the event logs based on the data included in the directory. In the Cloud, event logs might be written into different directories from different clusters. Currently it is not so easy to launch history server for multiple clusters because there is no ways to switch log directory after starting the history server. If users can switch log directory from Spark UI without restarting history server, it would be beneficial for users who wants to use Cloud storage for event log directory. Suggested implementation * History server polls the directory specified in spark.history.fs.logDirectory when it started. * Add a button of "Switch log directory" at the top page of Web UI. * The directory which users selected is stored only in memory. (It won't be persistent configuration.) * Next time when history server started, the directory specified in spark.history.fs.logDirectory is used. was: History server polls the directory specified in spark.history.fs.logDirectory, and displays the event logs based on the data included in the directory. In the Cloud, event logs might be written into different directories from different clusters. Currently it is not so easy to launch history server for multiple clusters because there is no ways to switch log directory after starting the history server. If users can switch log directory from Spark UI without restarting history server, it would be beneficial for users who wants to use Cloud storage for event log directory. 
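The proposed behavior above can be modeled in a few lines. This is a minimal sketch in plain Python (the real history server is Scala, and `HistoryServerSketch` and its method names are hypothetical, not Spark APIs): the directory chosen from the UI lives only in memory, so a restart falls back to the configured spark.history.fs.logDirectory.

```python
# Hypothetical sketch of the proposal: an in-memory, non-persistent
# override of the configured event log directory.

class HistoryServerSketch:
    def __init__(self, conf):
        # Configured value, read from spark.history.fs.logDirectory at startup.
        self.configured_dir = conf["spark.history.fs.logDirectory"]
        # In-memory override set from the "Switch log directory" button.
        self._override_dir = None

    @property
    def log_directory(self):
        # The override wins for as long as the server stays up.
        return self._override_dir or self.configured_dir

    def switch_log_directory(self, new_dir):
        # Called from the Web UI; never written back to the configuration.
        self._override_dir = new_dir

    def restart(self, conf):
        # A restart re-reads the configuration, dropping the override.
        return HistoryServerSketch(conf)


conf = {"spark.history.fs.logDirectory": "hdfs:///spark-logs"}
server = HistoryServerSketch(conf)
server.switch_log_directory("s3a://bucket-b/spark-logs")
print(server.log_directory)                 # s3a://bucket-b/spark-logs
print(server.restart(conf).log_directory)   # hdfs:///spark-logs
```

This matches the last two bullets of the proposal: the switch is effective immediately but deliberately not persisted.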
[jira] [Created] (SPARK-28069) Switch log directory from Spark UI without restarting history server
Noritaka Sekiyama created SPARK-28069:
--
Summary: Switch log directory from Spark UI without restarting history server
Key: SPARK-28069
URL: https://issues.apache.org/jira/browse/SPARK-28069
Project: Spark
Issue Type: Improvement
Components: Web UI
Affects Versions: 2.4.3
Reporter: Noritaka Sekiyama

The history server polls the directory specified in spark.history.fs.logDirectory and displays the event logs based on the data included in that directory. In the cloud, event logs might be written into different directories by different clusters. Currently it is not easy to launch a history server for multiple clusters because there is no way to switch the log directory after the history server has started. If users could switch the log directory from the Spark UI without restarting the history server, it would be beneficial for those who want to use cloud storage for the event log directory.
[jira] [Commented] (SPARK-21514) Hive has updated with new support for S3 and InsertIntoHiveTable.scala should update also
[ https://issues.apache.org/jira/browse/SPARK-21514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865228#comment-16865228 ]

Noritaka Sekiyama commented on SPARK-21514:
---
There is a problem with moving data from S3 (s3a) to HDFS: the current implementation in Hive 1.2 does not support data movement across different file systems (Hive 2.0 does). If we try to implement this without a Hive version upgrade, we need to backport part of the implementation from Hive 2.0. When I tried, the resulting patch was a very large diff. It would be better to upgrade the Hive version first; then I can submit the patch without backporting.

> Hive has updated with new support for S3 and InsertIntoHiveTable.scala should update also
>
> Key: SPARK-21514
> URL: https://issues.apache.org/jira/browse/SPARK-21514
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Javier Ros
> Priority: Major
>
> Hive has been updated with new parameters to optimize the usage of S3: you can now avoid using S3 as the staging dir via the parameters hive.blobstore.supported.schemes and hive.blobstore.optimizations.enabled. InsertIntoHiveTable.scala should be updated with the same improvement to match the behavior of Hive.
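For reference, the two Hive 2.x parameters the issue mentions are ordinary hive-site.xml properties. A sketch of how they might be set (the scheme list shown is illustrative, not a recommendation):

```xml
<!-- hive-site.xml: the Hive 2.x blobstore settings referred to above.
     Values here are an illustrative sketch, not verified defaults. -->
<property>
  <name>hive.blobstore.supported.schemes</name>
  <value>s3,s3a,s3n</value>
</property>
<property>
  <name>hive.blobstore.optimizations.enabled</name>
  <value>true</value>
</property>
```

With schemes registered as blobstores and optimizations enabled, Hive can avoid using the blobstore itself as the staging directory, which is the behavior the comment proposes matching in InsertIntoHiveTable.scala.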
[jira] [Commented] (SPARK-21514) Hive has updated with new support for S3 and InsertIntoHiveTable.scala should update also
[ https://issues.apache.org/jira/browse/SPARK-21514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16724763#comment-16724763 ]

Noritaka Sekiyama commented on SPARK-21514:
---
I'm working on fixing this and will update once it is done. Please let me know if anyone is already working on this.
[jira] [Created] (SPARK-18432) Fix HDFS block size in programming guide
Noritaka Sekiyama created SPARK-18432:
--
Summary: Fix HDFS block size in programming guide
Key: SPARK-18432
URL: https://issues.apache.org/jira/browse/SPARK-18432
Project: Spark
Issue Type: Documentation
Components: Documentation
Affects Versions: 2.0.1
Reporter: Noritaka Sekiyama
Priority: Minor

http://spark.apache.org/docs/latest/programming-guide.html says: "By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS)". The default block size in HDFS is currently 128MB; it was already increased in Hadoop 2.2.0, the oldest version supported by Spark (https://issues.apache.org/jira/browse/HDFS-4053). Since this explanation is confusing, I'd like to fix the value from 64MB to 128MB.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
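The "one partition per block" rule the guide describes is simple arithmetic: the default partition count for a file is its size divided by the HDFS block size, rounded up. A rough sketch in plain Python (the actual count also depends on the input format and any minPartitions argument, so treat this as an approximation of the documented default):

```python
# Rough sketch: number of HDFS blocks for a file, which is the default
# number of partitions Spark creates when reading it.
import math

def default_partitions(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    # One partition per HDFS block, with a minimum of one for tiny files.
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

gb = 1024 ** 3
print(default_partitions(1 * gb))                    # 8 with the 128 MB default
print(default_partitions(1 * gb, 64 * 1024 * 1024))  # 16 under the old 64 MB figure
```

The doc fix matters precisely because the two figures diverge by a factor of two: a reader planning around 64 MB blocks would expect twice as many partitions as Spark actually creates on a modern HDFS.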