[jira] [Resolved] (SPARK-28723) Upgrade to Hive 2.3.6 for HiveMetastore Client and Hadoop-3.2 profile
[ https://issues.apache.org/jira/browse/SPARK-28723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-28723.
-----------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 25443
[https://github.com/apache/spark/pull/25443]

> Upgrade to Hive 2.3.6 for HiveMetastore Client and Hadoop-3.2 profile
> ---------------------------------------------------------------------
>
>                 Key: SPARK-28723
>                 URL: https://issues.apache.org/jira/browse/SPARK-28723
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Build, SQL
>    Affects Versions: 3.0.0
>            Reporter: Dongjoon Hyun
>            Assignee: Yuming Wang
>            Priority: Major
>             Fix For: 3.0.0
>

--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28833) Document ALTER VIEW Statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914795#comment-16914795 ]

kevin yu commented on SPARK-28833:
----------------------------------

@aman omer: I am about halfway there. Could you help review? Thanks.

> Document ALTER VIEW Statement in SQL Reference.
> -----------------------------------------------
>
>                 Key: SPARK-28833
>                 URL: https://issues.apache.org/jira/browse/SPARK-28833
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Documentation, SQL
>    Affects Versions: 2.4.3
>            Reporter: jobit mathew
>            Priority: Major
>
[jira] [Commented] (SPARK-28778) Shuffle jobs fail due to incorrect advertised address when running in virtual network
[ https://issues.apache.org/jira/browse/SPARK-28778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914762#comment-16914762 ]

Dongjoon Hyun commented on SPARK-28778:
---------------------------------------

You are added to the Apache Spark contributor group. Thank you for your first contribution and welcome!

> Shuffle jobs fail due to incorrect advertised address when running in virtual network
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-28778
>                 URL: https://issues.apache.org/jira/browse/SPARK-28778
>             Project: Spark
>          Issue Type: Bug
>          Components: Mesos
>    Affects Versions: 2.2.3, 2.3.0, 2.4.3
>            Reporter: Anton Kirillov
>            Assignee: Anton Kirillov
>            Priority: Major
>              Labels: Mesos
>             Fix For: 3.0.0
>
> When shuffle jobs are launched by Mesos in a virtual network, the Mesos scheduler sets the executor {{--hostname}} parameter to {{0.0.0.0}} when {{spark.mesos.network.name}} is provided. This makes executors use {{0.0.0.0}} as their advertised address and, in the presence of shuffle, executors fail to fetch shuffle blocks from each other using {{0.0.0.0}} as the origin. When a virtual network is used, the hostname or IP address is not known upfront and is assigned to a container at start time, so the executor process needs to advertise the correct, dynamically assigned address to be reachable by other executors.
>
> The bug described above prevents Mesos users from running any jobs that involve shuffle, due to the inability of executors to fetch shuffle blocks because of an incorrect advertised address when a virtual network is used.
[jira] [Assigned] (SPARK-28778) Shuffle jobs fail due to incorrect advertised address when running in virtual network
[ https://issues.apache.org/jira/browse/SPARK-28778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-28778:
-------------------------------------
    Assignee: Anton Kirillov
[jira] [Resolved] (SPARK-28778) Shuffle jobs fail due to incorrect advertised address when running in virtual network
[ https://issues.apache.org/jira/browse/SPARK-28778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-28778.
-----------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/25500
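The thread above is about advertising a container's dynamically assigned address instead of {{0.0.0.0}}. As an illustration of the general technique only (this is a hypothetical Python helper, not the actual Scala fix merged in the pull request), a process can discover its routable local address with the standard UDP-connect trick:

```python
import socket

def advertised_address(fallback="127.0.0.1"):
    """Return a routable local IPv4 address to advertise instead of 0.0.0.0.

    Illustrative sketch of the general technique; names and behavior here
    are hypothetical, not Spark's or Mesos's actual implementation.
    """
    try:
        # Connecting a UDP socket sends no packets; it only makes the OS
        # pick the local interface it would use to reach the given peer.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.connect(("192.0.2.1", 80))  # TEST-NET address, never contacted
            addr = s.getsockname()[0]
    except OSError:
        # No route available (e.g. offline host): fall back to loopback.
        addr = fallback
    return addr if addr != "0.0.0.0" else fallback
```

The point of the ticket is exactly that the advertised address must be computed at container start time, once the virtual network has assigned one, rather than hard-coded to the wildcard address.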
[jira] [Resolved] (SPARK-28858) add tree-based transformation in the py side
[ https://issues.apache.org/jira/browse/SPARK-28858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Cutler resolved SPARK-28858.
----------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 25566
[https://github.com/apache/spark/pull/25566]

> add tree-based transformation in the py side
> --------------------------------------------
>
>                 Key: SPARK-28858
>                 URL: https://issues.apache.org/jira/browse/SPARK-28858
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Minor
>             Fix For: 3.0.0
>
> Expose the newly added tree-based transformation on the Python side.
[jira] [Assigned] (SPARK-28858) add tree-based transformation in the py side
[ https://issues.apache.org/jira/browse/SPARK-28858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Cutler reassigned SPARK-28858:
------------------------------------
    Assignee: zhengruifeng
[jira] [Comment Edited] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914660#comment-16914660 ]

hemanth meka edited comment on SPARK-23519 at 8/23/19 9:50 PM:
---------------------------------------------------------------

PR raised: [25570|https://github.com/apache/spark/pull/25570]
Someone please review.

was (Author: hem1891):
PR raised: [25570|https://github.com/apache/spark/pull/25570].
Someone please review.

> Create View Commands Fails with The view output (col1,col1) contains duplicate column name
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-23519
>                 URL: https://issues.apache.org/jira/browse/SPARK-23519
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.2.1
>            Reporter: Franck Tago
>            Priority: Major
>              Labels: bulk-closed
>         Attachments: image-2018-05-10-10-48-57-259.png
>
> 1. Create and populate a hive table. I did this in a hive CLI session. [Not that this matters.]
> create table atable (col1 int);
> insert into atable values (10), (100);
> 2. Create a view from the table.
> [These actions were performed from a spark shell.]
> spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 from atable ")
> java.lang.AssertionError: assertion failed: The view output (col1,col1) contains duplicate column name.
>     at scala.Predef$.assert(Predef.scala:170)
>     at org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>     at org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>     at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>     at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
>     at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>     at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)
[jira] [Comment Edited] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914660#comment-16914660 ]

hemanth meka edited comment on SPARK-23519 at 8/23/19 9:50 PM:
---------------------------------------------------------------

PR raised: [https://github.com/apache/spark/pull/25570]
Someone please review.

was (Author: hem1891):
PR raised: [25570|https://github.com/apache/spark/pull/25570]
Someone please review.
[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914660#comment-16914660 ]

hemanth meka commented on SPARK-23519:
--------------------------------------

PR raised: [25570|https://github.com/apache/spark/pull/25570]. Someone please review.
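The assertion in this thread fires because Spark validates the view's *query output* names (col1, col1) for duplicates, regardless of the column list given to CREATE VIEW. A plain-Python sketch of that duplicate-name check (a hypothetical helper for illustration; Spark's real check is the Scala assertion in ViewHelper.generateViewProperties shown in the stack trace):

```python
def check_view_output(column_names):
    """Raise if the view's output column names contain duplicates.

    Hypothetical mirror, in plain Python, of the assertion that
    CreateViewCommand performs on the view query's output columns.
    """
    seen = set()
    # Collect names already seen; seen.add() returns None (falsy), so a
    # name lands in `dupes` only on its second and later occurrences.
    dupes = [c for c in column_names if c in seen or seen.add(c)]
    if dupes:
        raise AssertionError(
            "assertion failed: The view output (%s) contains duplicate "
            "column name." % ",".join(column_names))
```

A common workaround for the reported failure is to alias the columns inside the SELECT itself, e.g. `select col1 AS int1, col1 AS int2 from atable`, so the query output names are already distinct before the check runs.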
[jira] [Created] (SPARK-28863) Add an AlreadyPlanned logical node that skips query planning
Burak Yavuz created SPARK-28863:
-----------------------------------

             Summary: Add an AlreadyPlanned logical node that skips query planning
                 Key: SPARK-28863
                 URL: https://issues.apache.org/jira/browse/SPARK-28863
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Burak Yavuz

With the DataSourceV2 write operations, we have a way to fall back to the V1 writer APIs using InsertableRelation. The gross part is that we're in physical land, but the InsertableRelation takes a logical plan, so we have to pass the logical plans to these physical nodes, and then potentially go through re-planning. A useful primitive could be specifying that a plan is ready for execution through a logical node AlreadyPlanned. This would wrap a physical plan, and then we can go straight to execution.
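The idea in the ticket, wrapping a finished physical plan in a logical node so the planner can short-circuit, can be sketched with a toy planner. All class and function names below are hypothetical illustrations, not Spark's API:

```python
from dataclasses import dataclass

@dataclass
class PhysicalPlan:
    description: str

@dataclass
class LogicalPlan:
    description: str

@dataclass
class AlreadyPlanned:
    """A 'logical' node that already carries its physical plan."""
    physical: PhysicalPlan

def plan(node):
    """Return a physical plan; short-circuit for AlreadyPlanned nodes."""
    if isinstance(node, AlreadyPlanned):
        return node.physical  # skip planning, go straight to execution
    # ...normal (re-)planning of a logical plan would happen here...
    return PhysicalPlan(f"planned({node.description})")
```

The design point is that a node which has already been planned is returned untouched, so a physical plan handed back into the logical layer (as with the V1 InsertableRelation fallback) is never re-planned.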
[jira] [Assigned] (SPARK-28839) ExecutorMonitor$Tracker NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-28839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin reassigned SPARK-28839:
--------------------------------------
    Assignee: Hyukjin Kwon

> ExecutorMonitor$Tracker NullPointerException
> --------------------------------------------
>
>                 Key: SPARK-28839
>                 URL: https://issues.apache.org/jira/browse/SPARK-28839
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Assignee: Hyukjin Kwon
>            Priority: Major
>
> {noformat}
> 19/08/21 06:44:01 ERROR AsyncEventQueue: Listener ExecutorMonitor threw an exception
> java.lang.NullPointerException
>     at org.apache.spark.scheduler.dynalloc.ExecutorMonitor$Tracker.removeShuffle(ExecutorMonitor.scala:479)
>     at org.apache.spark.scheduler.dynalloc.ExecutorMonitor.$anonfun$cleanupShuffle$2(ExecutorMonitor.scala:408)
>     at org.apache.spark.scheduler.dynalloc.ExecutorMonitor.$anonfun$cleanupShuffle$2$adapted(ExecutorMonitor.scala:407)
>     at scala.collection.Iterator.foreach(Iterator.scala:941)
>     at scala.collection.Iterator.foreach$(Iterator.scala:941)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>     at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>     at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>     at org.apache.spark.scheduler.dynalloc.ExecutorMonitor.cleanupShuffle(ExecutorMonitor.scala:407)
>     at org.apache.spark.scheduler.dynalloc.ExecutorMonitor.onOtherEvent(ExecutorMonitor.scala:351)
>     at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:82)
>     at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
>     at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>     at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>     at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:99)
>     at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:84)
>     at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:102)
>     at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:102)
>     at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
>     at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
>     at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:97)
>     at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:93)
>     at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
>     at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:93)
> {noformat}
[jira] [Resolved] (SPARK-28839) ExecutorMonitor$Tracker NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-28839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin resolved SPARK-28839.
------------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 25551
[https://github.com/apache/spark/pull/25551]
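The stack trace shows Tracker.removeShuffle being reached inside cleanupShuffle for an executor whose tracked state is null. A hypothetical Python sketch of the failure mode and the defensive guard (an illustration of the pattern only, not the actual change in the pull request):

```python
def cleanup_shuffle(executors, shuffle_id):
    """Remove a finished shuffle id from each executor's tracked set.

    `executors` maps executor id -> set of tracked shuffle ids, or None
    when an executor has no tracker entry. Unguarded code that called
    methods on a missing (None) entry would raise, like the NPE above;
    the guard simply skips absent trackers. Names are hypothetical.
    """
    for exec_id, tracked in executors.items():
        if tracked is None:  # the unguarded version would fail here
            continue
        tracked.discard(shuffle_id)
```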
[jira] [Resolved] (SPARK-28482) Data incomplete when using pandas udf in Python 3
[ https://issues.apache.org/jira/browse/SPARK-28482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Cutler resolved SPARK-28482.
----------------------------------
    Resolution: Not A Problem

No problem [~jiangyu1211]! I will resolve this then. In general, I wouldn't rely on printouts from the workers. If they are run as a subprocess, then you won't see them. Not sure if that's what happened in your case or not.

> Data incomplete when using pandas udf in Python 3
> -------------------------------------------------
>
>                 Key: SPARK-28482
>                 URL: https://issues.apache.org/jira/browse/SPARK-28482
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.3, 2.4.3
>         Environment: centos 7.4
> pyarrow 0.10.0 0.14.0
> python 2.7 3.5 3.6
>            Reporter: jiangyu
>            Priority: Major
>         Attachments: py2.7.png, py3.6.png, test.csv, test.py, worker.png
>
> Hi,
>
> Since Spark 2.3.x, pandas udf has been introduced as the default ser/des method when using udf. However, an issue arises with Python >= 3.5.x versions.
> We use pandas udf to process batches of data, but we find the data is incomplete in Python 3.x. At first, I thought the processing logic might be wrong, so I changed the code to a very simple one, and it has the same problem. After investigating for a week, I found it is related to pyarrow.
>
> *Reproduce procedure:*
> 1. Prepare data
> The data have seven columns, a, b, c, d, e, f and g; the data type is Integer.
> a,b,c,d,e,f,g
> 1,2,3,4,5,6,7
> 1,2,3,4,5,6,7
> 1,2,3,4,5,6,7
> 1,2,3,4,5,6,7
> Produce 100,000 rows, name the file test.csv, upload it to hdfs, then load it and repartition it to 1 partition.
> {code:java}
> df=spark.read.format('csv').option("header","true").load('/test.csv')
> df=df.select(*(col(c).cast("int").alias(c) for c in df.columns))
> df=df.repartition(1)
> spark_context = SparkContext.getOrCreate()
> {code}
> 2. Register pandas udf
> {code:java}
> def add_func(a,b,c,d,e,f,g):
>     print('iterator one time')
>     return a
>
> add = pandas_udf(add_func, returnType=IntegerType())
> df_result=df.select(add(col("a"),col("b"),col("c"),col("d"),col("e"),col("f"),col("g")))
> {code}
> 3. Apply pandas udf
> {code:java}
> def trigger_func(iterator):
>     yield iterator
>
> df_result.rdd.foreachPartition(trigger_func)
> {code}
> 4. Execute it in pyspark (local or yarn)
> Run it with conf spark.sql.execution.arrow.maxRecordsPerBatch=10000. As mentioned before, the total row number is 100,000, so it should print "iterator one time" 10 times.
> (1) Python 2.7 envs:
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/py2.7/bin/python pyspark --conf spark.sql.execution.arrow.maxRecordsPerBatch=10000 --conf spark.executor.pyspark.memory=2g --conf spark.sql.execution.arrow.enabled=true --executor-cores 1
> {code}
> !py2.7.png!
> The result is right: 10 prints of "iterator one time".
>
> (2) Python 3.5 or 3.6 envs:
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/python3.6/bin/python pyspark --conf spark.sql.execution.arrow.maxRecordsPerBatch=10000 --conf spark.executor.pyspark.memory=2g --conf spark.sql.execution.arrow.enabled=true --executor-cores 1
> {code}
> !py3.6.png!
> The data is incomplete. The exception is printed by JVM Spark code which we added; I will explain it later.
>
> h3. *Investigation*
> The "process done" is added in worker.py.
> !worker.png!
> In order to get the exception, change the Spark code under core/src/main/scala/org/apache/spark/util/Utils.scala and add this code to print the exception:
> {code:java}
> @@ -1362,6 +1362,8 @@ private[spark] object Utils extends Logging {
>        case t: Throwable =>
>          // Purposefully not using NonFatal, because even fatal exceptions
>          // we don't want to have our finallyBlock suppress
> +        logInfo(t.getLocalizedMessage)
> +        t.printStackTrace()
>          originalThrowable = t
>          throw originalThrowable
>      } finally {
> {code}
> It seems pyspark gets the data from the JVM, but pyarrow reads it incompletely. The pyarrow side thinks the data is finished and shuts down the socket. At the same time, the JVM side is still writing to the same socket and gets a socket-closed exception.
> The pyarrow part is in ipc.pxi:
> {code:java}
> cdef class _RecordBatchReader:
>     cdef:
>         shared_ptr[CRecordBatchReader] reader
>         shared_ptr[InputStream] in_stream
>
>     cdef readonly:
>         Schema schema
>
>     def __cinit__(self):
>         pass
>
>     def _open(self, source):
>         get_input_stream(source, _stream)
>         with nogil:
>             check_status(CRecordBatchStreamReader.Open(
>                 self.in_stream.get(), ))
>         self.schema = pyarrow_wrap_schema(self.reader.get().schema())
>
>     def __iter__(self):
>         while True:
>             yield self.read_next_batch()
>
>     def get_next_batch(self):
>         import warnings
>         warnings.warn('Please use read_next_batch
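The reporter's expectation in step 4 above, 10 prints for 100,000 rows, is just ceiling division by the Arrow batch size. A small helper makes the arithmetic explicit (assuming, as the report does, that spark.sql.execution.arrow.maxRecordsPerBatch bounds the rows per Arrow record batch):

```python
def expected_batches(num_rows, max_records_per_batch):
    """Number of Arrow record batches a partition should produce.

    The reporter counts prints of 'iterator one time' (one per batch)
    against this number to detect dropped batches.
    """
    # Ceiling division without importing math.
    return -(-num_rows // max_records_per_batch)
```

Seeing fewer prints than this (as in the Python 3 runs above) is what the reporter means by "the data is incomplete".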
[jira] [Comment Edited] (SPARK-28833) Document ALTER VIEW Statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914462#comment-16914462 ]

Aman Omer edited comment on SPARK-28833 at 8/23/19 4:48 PM:
------------------------------------------------------------

[~kevinyu98] Since I have worked on CREATE VIEW document ([https://github.com/apache/spark/pull/25543]) and this is related. If you permit, I would like to start working on this. :)

was (Author: aman_omer):
[~kevinyu98] Since I have worked on CREATE VIEW document ([https://github.com/apache/spark/pull/25543]) and it is related. If you permit, I would like to start working on this. :)
[jira] [Comment Edited] (SPARK-28833) Document ALTER VIEW Statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914462#comment-16914462 ]

Aman Omer edited comment on SPARK-28833 at 8/23/19 4:47 PM:
------------------------------------------------------------

[~kevinyu98] Since I have worked on CREATE VIEW document ([https://github.com/apache/spark/pull/25543]) and it is related. If you permit, I would like to start working on this. :)

was (Author: aman_omer):
[~kevinyu98] Since I have worked on CREATE VIEW document ([https://github.com/apache/spark/pull/25543]). If you permit, I would like to start working on this. :)
[jira] [Commented] (SPARK-28833) Document ALTER VIEW Statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914462#comment-16914462 ]

Aman Omer commented on SPARK-28833:
-----------------------------------

[~kevinyu98] Since I have worked on CREATE VIEW document ([https://github.com/apache/spark/pull/25543]). If you permit, I would like to start working on this. :)
[jira] [Created] (SPARK-28862) Read Impala view with UNION ALL operator
francesco created SPARK-28862:
---------------------------------

             Summary: Read Impala view with UNION ALL operator
                 Key: SPARK-28862
                 URL: https://issues.apache.org/jira/browse/SPARK-28862
             Project: Spark
          Issue Type: Bug
          Components: Build
    Affects Versions: 2.2.0
            Reporter: francesco

I would like to report an issue in pySpark 2.2.0 when it is used to read hive views that contain the UNION ALL operator. I attach the error:

WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.

Is there any solution different from materializing this view?
Thanks,
[jira] [Commented] (SPARK-27330) ForeachWriter is not being closed once a batch is aborted
[ https://issues.apache.org/jira/browse/SPARK-27330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914321#comment-16914321 ] Dongjoon Hyun commented on SPARK-27330: --- Thank you, [~eyalzit]. You are added to the Apache Spark contributor group. > ForeachWriter is not being closed once a batch is aborted > - > > Key: SPARK-27330 > URL: https://issues.apache.org/jira/browse/SPARK-27330 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Eyal Zituny >Assignee: Eyal Zituny >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > in cases where a micro batch is being killed (interrupted), not during actual > processing done by the {{ForeachDataWriter}} (when iterating the iterator), > {{DataWritingSparkTask}} will handle the interruption and call > {{dataWriter.abort()}} > the problem is that {{ForeachDataWriter}} has an empty implementation for the > abort method. > due to that, I have tasks which uses the foreach writer and according to the > [documentation|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach] > they are opening connections in the "open" method and closing the > connections on the "close" method but since the "close" is never called, the > connections are never closed > this wasn't the behavior pre spark 2.4 > my suggestion is to call {{ForeachWriter.abort()}} when > {{DataWriter.abort()}} is called, in order to notify the foreach writer that > this task has failed > > {code:java} > stack trace from the exception i have encountered: > org.apache.spark.TaskKilledException: null > at > org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:149) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:117) > at 
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394) > at > org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146) > at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67) > at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66) > {code} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
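The lifecycle gap reported in SPARK-27330 can be modeled outside Spark. The classes below are illustrative stand-ins, not Spark's actual internals; a minimal sketch showing why a no-op abort() leaks the connection opened in open(), and how forwarding abort() to close() fixes it:

```python
# Illustrative model of the reported bug; class names are stand-ins,
# not Spark's real ForeachDataWriter / DataWritingSparkTask classes.
class ConnectionWriter:
    """Stand-in for a user ForeachWriter that acquires a connection in
    open() and must release it in close()."""
    def __init__(self):
        self.opened = False
        self.closed = False

    def open(self):
        self.opened = True   # e.g. acquire a DB connection

    def close(self, error=None):
        self.closed = True   # release the connection


class DataWriterBeforeFix:
    """Models the pre-fix behavior: abort() is an empty implementation."""
    def __init__(self, writer):
        self.writer = writer
        writer.open()

    def abort(self):
        pass  # close() never runs, so the connection leaks


class DataWriterAfterFix:
    """Models the suggested fix: abort() notifies the user writer by
    forwarding the failure to close()."""
    def __init__(self, writer):
        self.writer = writer
        writer.open()

    def abort(self, cause=None):
        self.writer.close(cause)


# A killed task calls abort(); only the fixed writer releases the connection.
leaky, fixed = ConnectionWriter(), ConnectionWriter()
DataWriterBeforeFix(leaky).abort()
DataWriterAfterFix(fixed).abort(RuntimeError("task killed"))
print(leaky.closed, fixed.closed)  # False True
```

Both writers were opened; only the post-fix path ever closes, which is exactly the asymmetry the reporter observed across Spark 2.3 and 2.4.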
[jira] [Assigned] (SPARK-27330) ForeachWriter is not being closed once a batch is aborted
[ https://issues.apache.org/jira/browse/SPARK-27330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-27330: - Assignee: Eyal Zituny > ForeachWriter is not being closed once a batch is aborted > - > > Key: SPARK-27330 > URL: https://issues.apache.org/jira/browse/SPARK-27330 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Eyal Zituny >Assignee: Eyal Zituny >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > in cases where a micro batch is being killed (interrupted), not during actual > processing done by the {{ForeachDataWriter}} (when iterating the iterator), > {{DataWritingSparkTask}} will handle the interruption and call > {{dataWriter.abort()}} > the problem is that {{ForeachDataWriter}} has an empty implementation for the > abort method. > due to that, I have tasks which uses the foreach writer and according to the > [documentation|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach] > they are opening connections in the "open" method and closing the > connections on the "close" method but since the "close" is never called, the > connections are never closed > this wasn't the behavior pre spark 2.4 > my suggestion is to call {{ForeachWriter.abort()}} when > {{DataWriter.abort()}} is called, in order to notify the foreach writer that > this task has failed > > {code:java} > stack trace from the exception i have encountered: > org.apache.spark.TaskKilledException: null > at > org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:149) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:117) > at > 
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394) > at > org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146) > at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67) > at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66) > {code} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27330) ForeachWriter is not being closed once a batch is aborted
[ https://issues.apache.org/jira/browse/SPARK-27330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27330. --- Fix Version/s: 3.0.0 2.4.4 Resolution: Fixed > ForeachWriter is not being closed once a batch is aborted > - > > Key: SPARK-27330 > URL: https://issues.apache.org/jira/browse/SPARK-27330 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Eyal Zituny >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > in cases where a micro batch is being killed (interrupted), not during actual > processing done by the {{ForeachDataWriter}} (when iterating the iterator), > {{DataWritingSparkTask}} will handle the interruption and call > {{dataWriter.abort()}} > the problem is that {{ForeachDataWriter}} has an empty implementation for the > abort method. > due to that, I have tasks which uses the foreach writer and according to the > [documentation|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach] > they are opening connections in the "open" method and closing the > connections on the "close" method but since the "close" is never called, the > connections are never closed > this wasn't the behavior pre spark 2.4 > my suggestion is to call {{ForeachWriter.abort()}} when > {{DataWriter.abort()}} is called, in order to notify the foreach writer that > this task has failed > > {code:java} > stack trace from the exception i have encountered: > org.apache.spark.TaskKilledException: null > at > org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:149) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:117) > at > 
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394) > at > org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146) > at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67) > at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66) > {code} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28025: -- Fix Version/s: 2.4.4 > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. > These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
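The file counts quoted in SPARK-28025 can be reproduced with a small script instead of two separate `find | wc -l` invocations; the checkpoint path is whatever your state store uses. A sketch:

```python
import os

def count_files(root):
    """Return (total_files, crc_files) under root, mirroring the
    `find . | wc -l` and `find . -name "*.crc" | wc -l` checks
    from the issue report."""
    total = crc = 0
    for _dirpath, _dirnames, files in os.walk(root):
        for name in files:
            total += 1
            if name.endswith(".crc"):
                crc += 1
    return total, crc
```

Run against a checkpoint volume, a crc count close to the total (as in the report: 418053 of 431796) indicates the leak described above.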
[jira] [Updated] (SPARK-25474) Support `spark.sql.statistics.fallBackToHdfs` in data source tables
[ https://issues.apache.org/jira/browse/SPARK-25474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25474: -- Fix Version/s: (was: 3.0.0) > Support `spark.sql.statistics.fallBackToHdfs` in data source tables > --- > > Key: SPARK-25474 > URL: https://issues.apache.org/jira/browse/SPARK-25474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.3 > Environment: Spark 2.3.1 > Hadoop 2.7.2 >Reporter: Ayush Anubhava >Priority: Major > > *Description :* The size in bytes of the query is reported in EB for Parquet > data sources. This impacts performance, since join queries > always fall back to Sort Merge Join. > *Precondition :* spark.sql.statistics.fallBackToHdfs = true > Steps: > {code:java} > 0: jdbc:hive2://10.xx:23040/default> create table t1110 (a int, b string) > using parquet PARTITIONED BY (b) ; > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (2,'b'); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (1,'a'); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.xx.xx:23040/default> select * from t1110; > +++--+ > | a | b | > +++--+ > | 1 | a | > | 2 | b | > +++--+ > {code} > *{color:#d04437}Cost of the query shows sizeInBytes in EB{color}* > {code:java} > explain cost select * from t1110; > | == Optimized Logical Plan == > Relation[a#23,b#24] parquet, Statistics(sizeInBytes=8.0 EB, hints=none) > == Physical Plan == > *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: Parquet, > Location: > CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], > PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: > struct | > {code} > *{color:#d04437}This would lead to Sort Merge Join in case of join > query{color}* > {code:java} > 0: jdbc:hive2://10.xx.xx:23040/default> create table t110 (a int, b string) > using 
parquet PARTITIONED BY (b) ; > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.xx.xx:23040/default> insert into t110 values (1,'a'); > +-+--+ > | Result | > +-+--+ > +-+--+ > explain select * from t1110 t1 join t110 t2 on t1.a=t2.a; > | == Physical Plan == > *(5) SortMergeJoin [a#23], [a#55], Inner > :- *(2) Sort [a#23 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(a#23, 200) > : +- *(1) Project [a#23, b#24] > : +- *(1) Filter isnotnull(a#23) > : +- *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: > Parquet, Location: > CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], > PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(a)], > ReadSchema: struct > +- *(4) Sort [a#55 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(a#55, 200) > +- *(3) Project [a#55, b#56] > +- *(3) Filter isnotnull(a#55) > +- *(3) FileScan parquet open.t110[a#55,b#56] Batched: true, Format: Parquet, > Location: > CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t110], > PartitionCount: 1, PartitionFilters: [], PushedFilters: [IsNotNull(a)], > ReadSchema: struct | > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
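The 8.0 EB in the plan above is the planner's "statistics unknown" default (Long.MaxValue bytes, roughly 8 exabytes), which always exceeds the broadcast threshold. A simplified model of that planning decision; the function name is illustrative, though the 10 MB default for spark.sql.autoBroadcastJoinThreshold matches Spark's documented default:

```python
LONG_MAX = 2**63 - 1                # sizeInBytes when stats are unknown (~8.0 EB)
BROADCAST_THRESHOLD = 10 * 1024**2  # spark.sql.autoBroadcastJoinThreshold default

def choose_join(size_in_bytes):
    """Pick a join strategy the way the planner does: broadcast only when
    the smaller side's estimated size is under the threshold."""
    if size_in_bytes <= BROADCAST_THRESHOLD:
        return "BroadcastHashJoin"
    return "SortMergeJoin"

# Without fallBackToHdfs, a stats-less parquet table reports LONG_MAX bytes:
print(choose_join(LONG_MAX))   # SortMergeJoin
# With fallBackToHdfs honored, the real HDFS size (here a few KB) is used:
print(choose_join(4 * 1024))   # BroadcastHashJoin
```

This is why honoring `spark.sql.statistics.fallBackToHdfs` for data source tables lets the tiny table above be broadcast instead of sort-merge joined.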
[jira] [Reopened] (SPARK-25474) Support `spark.sql.statistics.fallBackToHdfs` in data source tables
[ https://issues.apache.org/jira/browse/SPARK-25474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-25474: --- Assignee: (was: shahid) > Support `spark.sql.statistics.fallBackToHdfs` in data source tables > --- > > Key: SPARK-25474 > URL: https://issues.apache.org/jira/browse/SPARK-25474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.3 > Environment: Spark 2.3.1 > Hadoop 2.7.2 >Reporter: Ayush Anubhava >Priority: Major > Fix For: 3.0.0 > > > *Description :* Size in bytes of the query is coming in EB in case of parquet > datasource. this would impact the performance , since join queries would > always go as Sort Merge Join. > *Precondition :* spark.sql.statistics.fallBackToHdfs = true > Steps: > {code:java} > 0: jdbc:hive2://10.xx:23040/default> create table t1110 (a int, b string) > using parquet PARTITIONED BY (b) ; > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (2,'b'); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (1,'a'); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.xx.xx:23040/default> select * from t1110; > +++--+ > | a | b | > +++--+ > | 1 | a | > | 2 | b | > +++--+ > {code} > *{color:#d04437}Cost of the query shows sizeInBytes in EB{color}* > {code:java} > explain cost select * from t1110; > | == Optimized Logical Plan == > Relation[a#23,b#24] parquet, Statistics(sizeInBytes=8.0 EB, hints=none) > == Physical Plan == > *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: Parquet, > Location: > CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], > PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: > struct | > {code} > *{color:#d04437}This would lead to Sort Merge Join in case of join > query{color}* > {code:java} > 0: jdbc:hive2://10.xx.xx:23040/default> create table t110 (a int, b 
string) > using parquet PARTITIONED BY (b) ; > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.xx.xx:23040/default> insert into t110 values (1,'a'); > +-+--+ > | Result | > +-+--+ > +-+--+ > explain select * from t1110 t1 join t110 t2 on t1.a=t2.a; > | == Physical Plan == > *(5) SortMergeJoin [a#23], [a#55], Inner > :- *(2) Sort [a#23 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(a#23, 200) > : +- *(1) Project [a#23, b#24] > : +- *(1) Filter isnotnull(a#23) > : +- *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: > Parquet, Location: > CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], > PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(a)], > ReadSchema: struct > +- *(4) Sort [a#55 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(a#55, 200) > +- *(3) Project [a#55, b#56] > +- *(3) Filter isnotnull(a#55) > +- *(3) FileScan parquet open.t110[a#55,b#56] Batched: true, Format: Parquet, > Location: > CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t110], > PartitionCount: 1, PartitionFilters: [], PushedFilters: [IsNotNull(a)], > ReadSchema: struct | > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28861) Jetty property handling
Ketan created SPARK-28861: - Summary: Jetty property handling Key: SPARK-28861 URL: https://issues.apache.org/jira/browse/SPARK-28861 Project: Spark Issue Type: Wish Components: Spark Submit Affects Versions: 2.4.3 Reporter: Ketan While processing data from certain files, a NumberFormatException was seen in the logs. The processing itself completed fine, but the following stacktrace was observed: {"time":"2019-08-16 08:21:36,733","level":"DEBUG","class":"o.s.j.u.Jetty","message":"","thread":"Driver","appName":"app-name","appVersion":"APPLICATION_VERSION","type":"APPLICATION","errorCode":"ERROR_CODE","errorId":""} java.lang.NumberFormatException: For input string: "unknown". On investigation it was found that the Jetty class contains the following: BUILD_TIMESTAMP = formatTimestamp(__buildProperties.getProperty("timestamp", "unknown")); which indicates that the config should contain the 'timestamp' property. If the property is absent, the default value is set to 'unknown', and this value causes the stacktrace to show up in our application's logs. It has no detrimental effect on the application as such, but could be addressed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
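The failure mode in SPARK-28861, a default of "unknown" flowing into a numeric parse, can be sketched generically. The function name mirrors Jetty's formatTimestamp but the body is an illustrative defensive version, not Jetty's actual code:

```python
def format_timestamp(raw):
    """Parse an epoch-millis build timestamp, tolerating the 'unknown'
    placeholder instead of letting the parse failure (Java's
    NumberFormatException) leak a stacktrace into the logs."""
    try:
        millis = int(raw)
    except ValueError:
        return "unknown"  # keep the placeholder rather than raising
    return str(millis)

props = {}  # build properties file missing the 'timestamp' key
print(format_timestamp(props.get("timestamp", "unknown")))  # unknown
```

Guarding the parse (or never passing the "unknown" sentinel into it) is the kind of handling the wish requests.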
[jira] [Resolved] (SPARK-28741) New optional mode: Throw exceptions when casting to integers causes overflow
[ https://issues.apache.org/jira/browse/SPARK-28741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-28741. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25461 [https://github.com/apache/spark/pull/25461] > New optional mode: Throw exceptions when casting to integers causes overflow > > > Key: SPARK-28741 > URL: https://issues.apache.org/jira/browse/SPARK-28741 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > To follow ANSI SQL, we should support a configurable mode that throws > exceptions when casting to integers causes overflow. > The behavior is similar to https://issues.apache.org/jira/browse/SPARK-26218, > which throws exceptions on arithmetical operation overflow. > To unify it, the configuration is renamed from > "spark.sql.arithmeticOperations.failOnOverFlow" to > "spark.sql.failOnIntegerOverFlow" -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
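The two cast behaviors the new mode in SPARK-28741 switches between can be sketched as follows; this is an illustrative model of the semantics, not Spark's implementation:

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def cast_to_int(value, fail_on_overflow):
    """Cast a long to a 32-bit int. Legacy behavior silently wraps;
    the mode gated by spark.sql.failOnIntegerOverFlow raises instead,
    matching ANSI SQL."""
    if INT_MIN <= value <= INT_MAX:
        return value
    if fail_on_overflow:
        raise ArithmeticError(f"casting {value} to int causes overflow")
    wrapped = value & 0xFFFFFFFF                 # keep the low 32 bits
    return wrapped - 2**32 if wrapped > INT_MAX else wrapped

# One past INT_MAX silently wraps to INT_MIN in legacy mode:
print(cast_to_int(2**31, fail_on_overflow=False))  # -2147483648
```

With `fail_on_overflow=True` the same input raises, which is the configurable exception-throwing behavior the ticket adds, analogous to the arithmetic-overflow mode from SPARK-26218.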
[jira] [Assigned] (SPARK-28741) New optional mode: Throw exceptions when casting to integers causes overflow
[ https://issues.apache.org/jira/browse/SPARK-28741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-28741: --- Assignee: Gengliang Wang > New optional mode: Throw exceptions when casting to integers causes overflow > > > Key: SPARK-28741 > URL: https://issues.apache.org/jira/browse/SPARK-28741 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > To follow ANSI SQL, we should support a configurable mode that throws > exceptions when casting to integers causes overflow. > The behavior is similar to https://issues.apache.org/jira/browse/SPARK-26218, > which throws exceptions on arithmetical operation overflow. > To unify it, the configuration is renamed from > "spark.sql.arithmeticOperations.failOnOverFlow" to > "spark.sql.failOnIntegerOverFlow" -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28716) Add id to Exchange and Subquery's stringArgs method for easier identifying their reuses in query plans
[ https://issues.apache.org/jira/browse/SPARK-28716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-28716. --- Fix Version/s: 3.0.0 Assignee: Ali Afroozeh Resolution: Fixed > Add id to Exchange and Subquery's stringArgs method for easier identifying > their reuses in query plans > -- > > Key: SPARK-28716 > URL: https://issues.apache.org/jira/browse/SPARK-28716 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Afroozeh >Assignee: Ali Afroozeh >Priority: Minor > Fix For: 3.0.0 > > > Add id to Exchange and Subquery's stringArgs method for easier identifying > their reuses in query plans, for example: > {{ReusedExchange [d_date_sk#827], BroadcastExchange > HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))) > [id=#2710]}} > Where {{2710}} is the id of the reused exchange. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
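The motivation in SPARK-28716, matching a ReusedExchange back to the exchange it reuses by a printed id, can be modeled with a tiny sketch (illustrative classes, not Spark's plan nodes):

```python
import itertools

_ids = itertools.count(1)  # stand-in for the per-plan id counter

class BroadcastExchange:
    """Plan node whose string form now carries its id."""
    def __init__(self):
        self.id = next(_ids)
    def __str__(self):
        return f"BroadcastExchange [id=#{self.id}]"

class ReusedExchange:
    """Refers to an existing exchange; printing the same id makes the
    reuse easy to spot in a large plan dump."""
    def __init__(self, target):
        self.target = target
    def __str__(self):
        return f"ReusedExchange [id=#{self.target.id}]"

ex = BroadcastExchange()
print(ex)                  # BroadcastExchange [id=#1]
print(ReusedExchange(ex))  # ReusedExchange [id=#1]
```

Without the id in stringArgs, both nodes would print only their operator names, and pairing a reuse with its source in a TPC-DS-sized plan required manual matching on output attributes.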
[jira] [Resolved] (SPARK-28836) Remove the canonicalize(attributes) method from PlanExpression
[ https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-28836. --- Fix Version/s: 3.0.0 Assignee: Ali Afroozeh Resolution: Fixed > Remove the canonicalize(attributes) method from PlanExpression > -- > > Key: SPARK-28836 > URL: https://issues.apache.org/jira/browse/SPARK-28836 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Afroozeh >Assignee: Ali Afroozeh >Priority: Minor > Fix For: 3.0.0 > > > The canonicalize(attrs: AttributeSeq) method in PlanExpression is somewhat > confusing. > First, it is not clear why `PlanExpression` should have this method, and why > the canonicalization is not handled > by the canonicalized method of its parent, the Expression class. Second, the > QueryPlan.normalizeExpressionId > is the only place where PlanExpression.canonicalized is being called. > This PR removes the canonicalize method from the PlanExpression class and > delegates the normalization of expression ids to > the QueryPlan.normalizedExpressionId method. Also, the name > normalizedExpressions is more suitable for this method, > therefore, the method has also been renamed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914192#comment-16914192 ] Rahul Kulkarni commented on SPARK-18112: Getting this issue on Spark 2.4.3 with Hive 2.3.3. Also tried with Hive 2.3.5. Tried using explicit config variables spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars, but no luck. cdh235m1:/spark-2.4.3-bin-without-hadoop/conf # spark-shell \ > --jars > /opt/teradata/tdqg/connector/tdqg-spark-connector/02.05.00.04-159/lib/spark-loaderfactory-02.05.00.04-159.jar, > \ > /opt/teradata/tdqg/connector/tdqgspark-connector/02.05.00.04-159/lib/log4j-api-2.7.jar, > \ > /opt/teradata/tdqg/connector/tdqgspark-connector/02.05.00.04-159/lib/log4j-core-2.7.jar, > \ > /opt/teradata/tdqg/connector/tdqgspark-connector/02.05.00.04-159/lib/qgc-spark-02.05.00.04-159.jar, > \ > /opt/teradata/tdqg/connector/tdqgspark-connector/02.05.00.04-159/lib/spark-loader-02.05.00.04-159.jar, > \ > /opt/teradata/tdqg/connector/tdqgspark-connector/02.05.00.04-159/lib/json-simple-1.1.1.jar > \ > --conf spark.sql.hive.metastore.version=2.3.3 \ > --conf spark.sql.hive.metastore.jars=/apache-hive-2.3.3-bin/lib/* \ > --master spark://cdh235m1.labs.teradata.com:7077 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/apache-hive-2.3.3-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 19/08/23 07:15:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable [INFO] Unable to bind key for unsupported operation: backward-delete-word [INFO] Unable to bind key for unsupported operation: down-history [INFO] Unable to bind key for unsupported operation: up-history Spark context Web UI available at http://cdh235m1.labs.teradata.com:4040 Spark context available as 'sc' (master = spark://cdh235m1.labs.teradata.com:7077, app id = app-20190823071550-). Spark session available as 'spark'. Welcome to Spark version 2.4.3 Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_202) Type in expressions to have them evaluated. Type :help for more information. scala> import tdqg.ForeignServer import tdqg.ForeignServer scala> ForeignServer.getDatasetFromSql("SELECT * FROM GDCData.Telco_Churn_Anal_Train_V") 19/08/23 07:16:03 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect. 
java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT at org.apache.spark.sql.hive.HiveUtils$.formatTimeVarsForHiveClient(HiveUtils.scala:204) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:285) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:215) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:214) at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114) at
[jira] [Updated] (SPARK-28836) Remove the canonicalize(attributes) method from PlanExpression
[ https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ali Afroozeh updated SPARK-28836: - Summary: Remove the canonicalize(attributes) method from PlanExpression (was: Remove the canonicalize() method ) > Remove the canonicalize(attributes) method from PlanExpression > -- > > Key: SPARK-28836 > URL: https://issues.apache.org/jira/browse/SPARK-28836 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Afroozeh >Priority: Minor > > The canonicalize(attrs: AttributeSeq) method in PlanExpression is somewhat > confusing. > First, it is not clear why `PlanExpression` should have this method, and why > the canonicalization is not handled > by the canonicalized method of its parent, the Expression class. Second, the > QueryPlan.normalizeExpressionId > is the only place where PlanExpression.canonicalized is being called. > This PR removes the canonicalize method from the PlanExpression class and > delegates the normalization of expression ids to > the QueryPlan.normalizedExpressionId method. Also, the name > normalizedExpressions is more suitable for this method, > therefore, the method has also been renamed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28836) Remove the canonicalize() method
[ https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ali Afroozeh updated SPARK-28836: - Summary: Remove the canonicalize() method (was: Improve canonicalize API) > Remove the canonicalize() method > - > > Key: SPARK-28836 > URL: https://issues.apache.org/jira/browse/SPARK-28836 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Afroozeh >Priority: Minor > > The canonicalize(attrs: AttributeSeq) method in PlanExpression is somewhat > confusing. > First, it is not clear why `PlanExpression` should have this method, and why > the canonicalization is not handled > by the canonicalized method of its parent, the Expression class. Second, the > QueryPlan.normalizeExpressionId > is the only place where PlanExpression.canonicalized is being called. > This PR removes the canonicalize method from the PlanExpression class and > delegates the normalization of expression ids to > the QueryPlan.normalizedExpressionId method. Also, the name > normalizedExpressions is more suitable for this method, > therefore, the method has also been renamed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28836) Improve canonicalize API
[ https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ali Afroozeh updated SPARK-28836: - Description: The canonicalize(attrs: AttributeSeq) method in PlanExpression is somewhat confusing. First, it is not clear why `PlanExpression` should have this method, and why the canonicalization is not handled by the canonicalized method of its parent, the Expression class. Second, the QueryPlan.normalizeExpressionId is the only place where PlanExpression.canonicalized is being called. This PR removes the canonicalize method from the PlanExpression class and delegates the normalization of expression ids to the QueryPlan.normalizedExpressionId method. Also, the name normalizedExpressions is more suitable for this method, therefore, the method has also been renamed. was: The canonicalize(attrs: AttributeSeq) method in PlanExpression is somewhat confusing. First, it is not clear why `PlanExpression` should have this method, and why the canonicalization is not handled by the canonicalized method of its parent, the Expression class. Second, the QueryPlan.normalizeExpressionId is the only place where PlanExpression.canonicalized is being called. This PR simplifies the canonicalize method on PlanExpression and delegates the normalization of expression ids to the QueryPlan.normalizedExpressionId method. Also, the name normalizedExpressions is more suitable for this method, therefore, the method has also been renamed. > Improve canonicalize API > > > Key: SPARK-28836 > URL: https://issues.apache.org/jira/browse/SPARK-28836 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Afroozeh >Priority: Minor > > The canonicalize(attrs: AttributeSeq) method in PlanExpression is somewhat > confusing. > First, it is not clear why `PlanExpression` should have this method, and why > the canonicalization is not handled > by the canonicalized method of its parent, the Expression class. 
Second, the > QueryPlan.normalizeExpressionId > is the only place where PlanExpression.canonicalized is being called. > This PR removes the canonicalize method from the PlanExpression class and > delegates the normalization of expression ids to > the QueryPlan.normalizedExpressionId method. Also, the name > normalizedExpressions is more suitable for this method, > therefore, the method has also been renamed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28836) Improve canonicalize API
[ https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ali Afroozeh updated SPARK-28836: - Description: The canonicalize(attrs: AttributeSeq) method in PlanExpression is somewhat confusing. First, it is not clear why `PlanExpression` should have this method, and why the canonicalization is not handled by the canonicalized method of its parent, the Expression class. Second, the QueryPlan.normalizeExpressionId is the only place where PlanExpression.canonicalized is being called. This PR simplifies the canonicalize method on PlanExpression and delegates the normalization of expression ids to the QueryPlan.normalizedExpressionId method. Also, the name normalizedExpressions is more suitable for this method, therefore, the method has also been renamed. was:This PR improves the `canonicalize` API by removing the method `def canonicalize(attrs: AttributeSeq): PlanExpression[T]` in `PlanExpression` and taking care of normalizing expressions in `QueryPlan`. > Improve canonicalize API > > > Key: SPARK-28836 > URL: https://issues.apache.org/jira/browse/SPARK-28836 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Afroozeh >Priority: Minor > > The canonicalize(attrs: AttributeSeq) method in PlanExpression is somewhat > confusing. > First, it is not clear why `PlanExpression` should have this method, and why > the canonicalization is not handled > by the canonicalized method of its parent, the Expression class. Second, the > QueryPlan.normalizeExpressionId > is the only place where PlanExpression.canonicalized is being called. > This PR simplifies the canonicalize method on PlanExpression and delegates > the normalization of expression ids to > the QueryPlan.normalizedExpressionId method. Also, the name > normalizedExpressions is more suitable for this method, > therefore, the method has also been renamed. 
-- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28851) Connect HBase using Spark SQL in Spark 2.x
[ https://issues.apache.org/jira/browse/SPARK-28851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914094#comment-16914094 ] Hyukjin Kwon commented on SPARK-28851: -- That will likely be an issue in the HBase connector. For questions, the mailing list is the best place to ask; see https://spark.apache.org/community.html. > Connect HBase using Spark SQL in Spark 2.x > -- > > Key: SPARK-28851 > URL: https://issues.apache.org/jira/browse/SPARK-28851 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.4.0 >Reporter: ARUN KINDRA >Priority: Major > > Hi, > > I am trying a sample Spark SQL application that reads data from Oracle and stores it into HBase. I found a Spark-HBase connector for writing data into HBase, where I need to provide a catalog. But it seems it was only available until Spark 1.6. Now, what is the way to connect to HBase using Spark SqlContext? > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28851) Connect HBase using Spark SQL in Spark 2.x
[ https://issues.apache.org/jira/browse/SPARK-28851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28851. -- Resolution: Invalid > Connect HBase using Spark SQL in Spark 2.x > -- > > Key: SPARK-28851 > URL: https://issues.apache.org/jira/browse/SPARK-28851 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.4.0 >Reporter: ARUN KINDRA >Priority: Major > > Hi, > > I am trying a sample Spark SQL application that reads data from Oracle and stores it into HBase. I found a Spark-HBase connector for writing data into HBase, where I need to provide a catalog. But it seems it was only available until Spark 1.6. Now, what is the way to connect to HBase using Spark SqlContext? > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28854) Zipping iterators in mapPartitions will fail
[ https://issues.apache.org/jira/browse/SPARK-28854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28854. -- Resolution: Invalid > Zipping iterators in mapPartitions will fail > > > Key: SPARK-28854 > URL: https://issues.apache.org/jira/browse/SPARK-28854 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Hao Yang Ang >Priority: Minor > > {code} > scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => > xs.map(2*).zip(xs)).collect.foreach(println) > ... > java.util.NoSuchElementException: next on empty iterator > {code} > > > Workaround - implement zip with mapping to tuple: > {code} > scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, > x))).collect.foreach(println) > (2,1) > (4,2) > (6,3) > {code} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28854) Zipping iterators in mapPartitions will fail
[ https://issues.apache.org/jira/browse/SPARK-28854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914091#comment-16914091 ] Hyukjin Kwon commented on SPARK-28854: -- It seems {{xs}} is an iterator, and it is fully consumed by the first {{xs.map}}. I think you can do it via, for instance {code} sc.parallelize(Seq(1, 2, 3)).mapPartitions { xs => val xsa = xs.toArray xsa.map(2 *).zip(xsa).toIterator }.collect.foreach(println) {code} > Zipping iterators in mapPartitions will fail > > > Key: SPARK-28854 > URL: https://issues.apache.org/jira/browse/SPARK-28854 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Hao Yang Ang >Priority: Minor > > {code} > scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => > xs.map(2*).zip(xs)).collect.foreach(println) > ... > java.util.NoSuchElementException: next on empty iterator > {code} > > > Workaround - implement zip with mapping to tuple: > {code} > scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, > x))).collect.foreach(println) > (2,1) > (4,2) > (6,3) > {code} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
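The failure above comes from the single-pass nature of iterators, not from Spark itself. The same pitfall can be reproduced outside Spark; the sketch below is a hypothetical Python analogue (the original issue concerns Scala `Iterator`s inside `mapPartitions`), showing both the broken zip over a shared iterator and the buffer-first fix that mirrors the `xs.toArray` workaround in the comment above:

```python
def doubled_zip_broken(xs_iter):
    # `map` lazily reads from xs_iter; `zip` then pulls from BOTH arguments,
    # which draw from the SAME underlying iterator, so the elements get
    # interleaved instead of paired.
    doubled = map(lambda x: 2 * x, xs_iter)
    return list(zip(doubled, xs_iter))

def doubled_zip_fixed(xs_iter):
    # Buffer the data first (mirrors the xs.toArray workaround), then zip
    # two independent traversals of the buffered list.
    xs = list(xs_iter)
    return list(zip((2 * x for x in xs), xs))

print(doubled_zip_broken(iter([1, 2, 3])))  # → [(2, 2)]: interleaved, wrong
print(doubled_zip_fixed(iter([1, 2, 3])))   # → [(2, 1), (4, 2), (6, 3)]
```

In Scala the shared-iterator version raises `NoSuchElementException` instead of silently interleaving, but the root cause is the same: a partition iterator can be traversed only once.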
[jira] [Updated] (SPARK-28854) Zipping iterators in mapPartitions will fail
[ https://issues.apache.org/jira/browse/SPARK-28854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28854: - Description: {code} scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(2*).zip(xs)).collect.foreach(println) ... java.util.NoSuchElementException: next on empty iterator {code} Workaround - implement zip with mapping to tuple: {code} scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, x))).collect.foreach(println) (2,1) (4,2) (6,3) {code} was: scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(2*).zip(xs)).collect.foreach(println) warning: there was one feature warning; re-run with -feature for details 19/08/22 21:13:18 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1) java.util.NoSuchElementException: next on empty iterator Workaround - implement zip with mapping to tuple: scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, x))).collect.foreach(println) (2,1) (4,2) (6,3) > Zipping iterators in mapPartitions will fail > > > Key: SPARK-28854 > URL: https://issues.apache.org/jira/browse/SPARK-28854 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Hao Yang Ang >Priority: Minor > > {code} > scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => > xs.map(2*).zip(xs)).collect.foreach(println) > ... > java.util.NoSuchElementException: next on empty iterator > {code} > > > Workaround - implement zip with mapping to tuple: > {code} > scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, > x))).collect.foreach(println) > (2,1) > (4,2) > (6,3) > {code} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-28854) Zipping iterators in mapPartitions will fail
[ https://issues.apache.org/jira/browse/SPARK-28854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28854: - Comment: was deleted (was: Your {{xs.map(2*)}} produces: {code} scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(2*)).collect.foreach(println) 2 4 6 {code} So, it cannot be zipped. {{zip}} in your codes is Scala library, not Spark.) > Zipping iterators in mapPartitions will fail > > > Key: SPARK-28854 > URL: https://issues.apache.org/jira/browse/SPARK-28854 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Hao Yang Ang >Priority: Minor > > scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => > xs.map(2*).zip(xs)).collect.foreach(println) > warning: there was one feature warning; re-run with -feature for details > 19/08/22 21:13:18 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1) > java.util.NoSuchElementException: next on empty iterator > > > Workaround - implement zip with mapping to tuple: > scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, > x))).collect.foreach(println) > (2,1) > (4,2) > (6,3) > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28854) Zipping iterators in mapPartitions will fail
[ https://issues.apache.org/jira/browse/SPARK-28854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914085#comment-16914085 ] Hyukjin Kwon commented on SPARK-28854: -- Your {{xs.map(2*)}} produces: {code} scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(2*)).collect.foreach(println) 2 4 6 {code} So, it cannot be zipped. {{zip}} in your codes is Scala library, not Spark. > Zipping iterators in mapPartitions will fail > > > Key: SPARK-28854 > URL: https://issues.apache.org/jira/browse/SPARK-28854 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Hao Yang Ang >Priority: Minor > > scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => > xs.map(2*).zip(xs)).collect.foreach(println) > warning: there was one feature warning; re-run with -feature for details > 19/08/22 21:13:18 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1) > java.util.NoSuchElementException: next on empty iterator > > > Workaround - implement zip with mapping to tuple: > scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, > x))).collect.foreach(println) > (2,1) > (4,2) > (6,3) > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28857) Clean up the comments of PR template during merging
[ https://issues.apache.org/jira/browse/SPARK-28857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28857: Assignee: Dongjoon Hyun > Clean up the comments of PR template during merging > --- > > Key: SPARK-28857 > URL: https://issues.apache.org/jira/browse/SPARK-28857 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28857) Clean up the comments of PR template during merging
[ https://issues.apache.org/jira/browse/SPARK-28857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28857. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25564 [https://github.com/apache/spark/pull/25564] > Clean up the comments of PR template during merging > --- > > Key: SPARK-28857 > URL: https://issues.apache.org/jira/browse/SPARK-28857 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28836) Improve canonicalize API
[ https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914059#comment-16914059 ] Hyukjin Kwon commented on SPARK-28836: -- Sure, that's fine but [~afroozeh] please clarify what issue it is in the JIRA description. What problem is there with {{canonicalize}} and what does this JIRA mean? > Improve canonicalize API > > > Key: SPARK-28836 > URL: https://issues.apache.org/jira/browse/SPARK-28836 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Afroozeh >Priority: Minor > > This PR improves the `canonicalize` API by removing the method `def > canonicalize(attrs: AttributeSeq): PlanExpression[T]` in `PlanExpression` and > taking care of normalizing expressions in `QueryPlan`. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28860) Using ColumnStats of join key to get TableAccessCardinality when finding star joins in ReorderJoinRule
[ https://issues.apache.org/jira/browse/SPARK-28860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lai Zhou updated SPARK-28860: - Description: Now the star-schema detection uses TableAccessCardinality to reorder DimTables when there is a selectiveStarJoin. [StarSchemaDetection.scala#L341|https://github.com/apache/spark/blob/98e1a4cea44d7cb2f6d502c0202ad3cac2a1ad8d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/StarSchemaDetection.scala#L341] {code:java} if (isSelectiveStarJoin(dimTables, conditions)) { val reorderDimTables = dimTables.map { plan => TableAccessCardinality(plan, getTableAccessCardinality(plan)) } .sortBy(_.size).map { case TableAccessCardinality(p1, _) => p1 }{code} But the getTableAccessCardinality method doesn't consider the ColumnStats of the equi-join key. I'm not sure if we should compute join cardinality for the dimTable based on its join key here. [~ioana-delaney] was: Now the star-schema detection uses TableAccessCardinality to reorder DimTables when there is a selectiveStarJoin. [StarSchemaDetection.scala#L341|https://github.com/apache/spark/blob/98e1a4cea44d7cb2f6d502c0202ad3cac2a1ad8d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/StarSchemaDetection.scala#L341] {code:java} if (isSelectiveStarJoin(dimTables, conditions)) { val reorderDimTables = dimTables.map { plan => TableAccessCardinality(plan, getTableAccessCardinality(plan)) }.sortBy(_.size).map { case TableAccessCardinality(p1, _) => p1 }{code} But the getTableAccessCardinality method doesn't consider the ColumnStats of the equi-join key. I'm not sure if we should compute join cardinality for the dimTable based on its join key here. 
[~ioana-delaney] > Using ColumnStats of join key to get TableAccessCardinality when finding > star joins in ReorderJoinRule > --- > > Key: SPARK-28860 > URL: https://issues.apache.org/jira/browse/SPARK-28860 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Lai Zhou >Priority: Minor > > Now the star-schema detection uses TableAccessCardinality to reorder > DimTables when there is a selectiveStarJoin. > [StarSchemaDetection.scala#L341|https://github.com/apache/spark/blob/98e1a4cea44d7cb2f6d502c0202ad3cac2a1ad8d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/StarSchemaDetection.scala#L341] > {code:java} > if (isSelectiveStarJoin(dimTables, conditions)) { > val reorderDimTables = dimTables.map { > plan => TableAccessCardinality(plan, getTableAccessCardinality(plan)) } > .sortBy(_.size).map { > case TableAccessCardinality(p1, _) => p1 > }{code} > > But the getTableAccessCardinality method doesn't consider the ColumnStats of > the equi-join key. I'm not sure if we should compute join cardinality for the > dimTable based on its join key here. > [~ioana-delaney] > > > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28860) Using ColumnStats of join key to get TableAccessCardinality when finding star joins in ReorderJoinRule
Lai Zhou created SPARK-28860: Summary: Using ColumnStats of join key to get TableAccessCardinality when finding star joins in ReorderJoinRule Key: SPARK-28860 URL: https://issues.apache.org/jira/browse/SPARK-28860 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.3 Reporter: Lai Zhou Now the star-schema detection uses TableAccessCardinality to reorder DimTables when there is a selectiveStarJoin. [StarSchemaDetection.scala#L341|https://github.com/apache/spark/blob/98e1a4cea44d7cb2f6d502c0202ad3cac2a1ad8d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/StarSchemaDetection.scala#L341] {code:java} if (isSelectiveStarJoin(dimTables, conditions)) { val reorderDimTables = dimTables.map { plan => TableAccessCardinality(plan, getTableAccessCardinality(plan)) }.sortBy(_.size).map { case TableAccessCardinality(p1, _) => p1 }{code} But the getTableAccessCardinality method doesn't consider the ColumnStats of the equi-join key. I'm not sure if we should compute join cardinality for the dimTable based on its join key here. [~ioana-delaney] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
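The quoted Scala snippet sorts dimension tables ascending by estimated access cardinality so the most selective tables are joined first. As a hedged illustration of that reorder step only (names and cardinality numbers below are hypothetical, not Spark's actual implementation, which lives in StarSchemaDetection.scala):

```python
from collections import namedtuple

# Mirror of the case class in the quoted snippet: a plan paired with its
# estimated access cardinality.
TableAccessCardinality = namedtuple("TableAccessCardinality", ["plan", "size"])

def reorder_dim_tables(dim_tables, cardinality_of):
    # Pair each dimension table with its estimated cardinality, then sort
    # ascending so the most selective (smallest) tables come first.
    with_card = [TableAccessCardinality(t, cardinality_of(t)) for t in dim_tables]
    return [tc.plan for tc in sorted(with_card, key=lambda tc: tc.size)]

# Toy estimates; the JIRA proposes refining these using the join key's
# ColumnStats rather than the raw table cardinality.
estimates = {"d_date": 73049, "d_store": 1002, "d_item": 204000}
print(reorder_dim_tables(["d_date", "d_store", "d_item"], estimates.get))
# → ['d_store', 'd_date', 'd_item']
```

The JIRA's point is that `cardinality_of` here uses only table-level statistics; folding in the join key's ColumnStats would yield a cardinality estimate closer to the rows actually accessed through the join.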
[jira] [Updated] (SPARK-28859) Remove value check of MEMORY_OFFHEAP_SIZE in declaration section
[ https://issues.apache.org/jira/browse/SPARK-28859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-28859: - Affects Version/s: (was: 2.4.3) 3.0.0 > Remove value check of MEMORY_OFFHEAP_SIZE in declaration section > > > Key: SPARK-28859 > URL: https://issues.apache.org/jira/browse/SPARK-28859 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yang Jie >Priority: Minor > > Now MEMORY_OFFHEAP_SIZE has a default value of 0, but it should be greater than 0 > when MEMORY_OFFHEAP_ENABLED is true; should we check this condition in code? > > SPARK-28577 added this check before requesting memory resources from YARN > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28859) Remove value check of MEMORY_OFFHEAP_SIZE in declaration section
[ https://issues.apache.org/jira/browse/SPARK-28859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914016#comment-16914016 ] Yang Jie commented on SPARK-28859: -- cc [~tgraves] > Remove value check of MEMORY_OFFHEAP_SIZE in declaration section > > > Key: SPARK-28859 > URL: https://issues.apache.org/jira/browse/SPARK-28859 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Yang Jie >Priority: Minor > > Now MEMORY_OFFHEAP_SIZE has a default value of 0, but it should be greater than 0 > when MEMORY_OFFHEAP_ENABLED is true; should we check this condition in code? > > SPARK-28577 added this check before requesting memory resources from YARN > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28642) Hide credentials in show create table
[ https://issues.apache.org/jira/browse/SPARK-28642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28642: -- Fix Version/s: 2.4.4 > Hide credentials in show create table > - > > Key: SPARK-28642 > URL: https://issues.apache.org/jira/browse/SPARK-28642 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > {code:sql} > spark-sql> show create table mysql_federated_sample; > CREATE TABLE `mysql_federated_sample` (`TBL_ID` BIGINT, `CREATE_TIME` INT, > `DB_ID` BIGINT, `LAST_ACCESS_TIME` INT, `OWNER` STRING, `RETENTION` INT, > `SD_ID` BIGINT, `TBL_NAME` STRING, `TBL_TYPE` STRING, `VIEW_EXPANDED_TEXT` > STRING, `VIEW_ORIGINAL_TEXT` STRING, `IS_REWRITE_ENABLED` BOOLEAN) > USING org.apache.spark.sql.jdbc > OPTIONS ( > `url` 'jdbc:mysql://localhost/hive?user=root=mypasswd', > `driver` 'com.mysql.jdbc.Driver', > `dbtable` 'TBLS' > ) > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
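The SPARK-28642 fix above hides credentials when printing table options in SHOW CREATE TABLE output. As a hedged toy sketch of the idea only (Spark's actual fix applies its configurable redaction regex, `spark.redaction.regex`, via SQLConf; the pattern and replacement text below are illustrative assumptions):

```python
import re

# Hard-coded pattern matching common sensitive keys in a key=value option
# string such as a JDBC URL. A real implementation would make this
# configurable rather than fixing the key list.
SENSITIVE = re.compile(r"(password|user|secret|token)=([^&'\s]+)", re.IGNORECASE)

def redact_option(value: str) -> str:
    # Replace the value of each sensitive key, keeping the key visible so
    # the printed DDL still shows which options were set.
    return SENSITIVE.sub(lambda m: f"{m.group(1)}=*********(redacted)", value)

url = "jdbc:mysql://localhost/hive?user=root&password=mypasswd"
print(redact_option(url))
# → jdbc:mysql://localhost/hive?user=*********(redacted)&password=*********(redacted)
```

The design choice worth noting is redacting values while preserving keys: the output remains useful for reconstructing the table definition without leaking secrets into logs or terminals.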
[jira] [Created] (SPARK-28859) Remove value check of MEMORY_OFFHEAP_SIZE in declaration section
Yang Jie created SPARK-28859: Summary: Remove value check of MEMORY_OFFHEAP_SIZE in declaration section Key: SPARK-28859 URL: https://issues.apache.org/jira/browse/SPARK-28859 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.3 Reporter: Yang Jie Now MEMORY_OFFHEAP_SIZE has a default value of 0, but it should be greater than 0 when MEMORY_OFFHEAP_ENABLED is true; should we check this condition in code? SPARK-28577 added this check before requesting memory resources from YARN -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
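The constraint discussed in SPARK-28859 is simple: when off-heap memory is enabled, its size must be positive. Spark declares these configs in Scala (where such a check could live at declaration time); the sketch below is only a hypothetical Python illustration of the validation logic itself, using the real config names `spark.memory.offHeap.enabled` and `spark.memory.offHeap.size`:

```python
def validate_offheap(offheap_enabled: bool, offheap_size_bytes: int) -> None:
    # The invariant from the JIRA: a size of 0 (the default) is only valid
    # while off-heap memory is disabled.
    if offheap_enabled and offheap_size_bytes <= 0:
        raise ValueError(
            "spark.memory.offHeap.size must be > 0 when "
            "spark.memory.offHeap.enabled is true"
        )

validate_offheap(False, 0)           # fine: off-heap disabled, default size
validate_offheap(True, 1024 * 1024)  # fine: enabled with a positive size
try:
    validate_offheap(True, 0)        # the invalid combination
except ValueError as e:
    print(e)
```

Checking this once at configuration time, rather than at each resource request (as the SPARK-28577 YARN-side check does), fails fast before any executors are launched.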
[jira] [Created] (SPARK-28858) add tree-based transformation in the py side
zhengruifeng created SPARK-28858: Summary: add tree-based transformation in the py side Key: SPARK-28858 URL: https://issues.apache.org/jira/browse/SPARK-28858 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng Expose the newly added tree-based transformation on the Python side -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28836) Improve canonicalize API
[ https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913973#comment-16913973 ] Dongjoon Hyun commented on SPARK-28836: --- This issue content is switched with SPARK-28835. > Improve canonicalize API > > > Key: SPARK-28836 > URL: https://issues.apache.org/jira/browse/SPARK-28836 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Afroozeh >Priority: Minor > > This PR improves the `canonicalize` API by removing the method `def > canonicalize(attrs: AttributeSeq): PlanExpression[T]` in `PlanExpression` and > taking care of normalizing expressions in `QueryPlan`. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28835) Introduce TPCDSSchema
[ https://issues.apache.org/jira/browse/SPARK-28835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28835. --- Fix Version/s: 3.0.0 Assignee: Ali Afroozeh Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/25535 > Introduce TPCDSSchema > - > > Key: SPARK-28835 > URL: https://issues.apache.org/jira/browse/SPARK-28835 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Afroozeh >Assignee: Ali Afroozeh >Priority: Minor > Fix For: 3.0.0 > > > This PR extracts the schema information of TPCDS tables into a separate class > called `TPCDSSchema` which can be reused for other testing purposes -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28836) Improve canonicalize API
[ https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28836: -- Description: This PR improves the `canonicalize` API by removing the method `def canonicalize(attrs: AttributeSeq): PlanExpression[T]` in `PlanExpression` and taking care of normalizing expressions in `QueryPlan`. (was: This PR extracts the schema information of TPCDS tables into a separate class called `TPCDSSchema` which can be reused for other testing purposes) > Improve canonicalize API > > > Key: SPARK-28836 > URL: https://issues.apache.org/jira/browse/SPARK-28836 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Afroozeh >Priority: Minor > > This PR improves the `canonicalize` API by removing the method `def > canonicalize(attrs: AttributeSeq): PlanExpression[T]` in `PlanExpression` and > taking care of normalizing expressions in `QueryPlan`. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28835) Introduce TPCDSSchema
[ https://issues.apache.org/jira/browse/SPARK-28835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28835: -- Description: This PR extracts the schema information of TPCDS tables into a separate class called `TPCDSSchema` which can be reused for other testing purposes (was: This PR improves the `canonicalize` API by removing the method `def canonicalize(attrs: AttributeSeq): PlanExpression[T]` in `PlanExpression` and taking care of normalizing expressions in `QueryPlan`.) > Introduce TPCDSSchema > - > > Key: SPARK-28835 > URL: https://issues.apache.org/jira/browse/SPARK-28835 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Afroozeh >Priority: Minor > > This PR extracts the schema information of TPCDS tables into a separate class > called `TPCDSSchema` which can be reused for other testing purposes -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28836) Improve canonicalize API
[ https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28836: -- Summary: Improve canonicalize API (was: Introduce TPCDSSchema) > Improve canonicalize API > > > Key: SPARK-28836 > URL: https://issues.apache.org/jira/browse/SPARK-28836 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Afroozeh >Priority: Minor > > This PR extracts the schema information of TPCDS tables into a separate class > called `TPCDSSchema` which can be reused for other testing purposes -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28835) Introduce TPCDSSchema
[ https://issues.apache.org/jira/browse/SPARK-28835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28835: -- Summary: Introduce TPCDSSchema (was: Improve canonicalize API) > Introduce TPCDSSchema > - > > Key: SPARK-28835 > URL: https://issues.apache.org/jira/browse/SPARK-28835 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Afroozeh >Priority: Minor > > This PR improves the `canonicalize` API by removing the method `def > canonicalize(attrs: AttributeSeq): PlanExpression[T]` in `PlanExpression` and > taking care of normalizing expressions in `QueryPlan`. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-28836) Introduce TPCDSSchema
[ https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-28836: --- > Introduce TPCDSSchema > - > > Key: SPARK-28836 > URL: https://issues.apache.org/jira/browse/SPARK-28836 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Afroozeh >Priority: Minor > > This PR extracts the schema information of TPCDS tables into a separate class > called `TPCDSSchema` which can be reused for other testing purposes -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28836) Introduce TPCDSSchema
[ https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913971#comment-16913971 ] Dongjoon Hyun commented on SPARK-28836: --- Oops. Sorry, [~hyukjin.kwon]. I merged this because the PR was opened with a wrong JIRA id. - https://github.com/apache/spark/pull/25535 > Introduce TPCDSSchema > - > > Key: SPARK-28836 > URL: https://issues.apache.org/jira/browse/SPARK-28836 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Afroozeh >Priority: Minor > > This PR extracts the schema information of TPCDS tables into a separate class > called `TPCDSSchema` which can be reused for other testing purposes -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28319) DataSourceV2: Support SHOW TABLES
[ https://issues.apache.org/jira/browse/SPARK-28319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-28319. - Fix Version/s: 3.0.0 Assignee: Terry Kim Resolution: Fixed > DataSourceV2: Support SHOW TABLES > - > > Key: SPARK-28319 > URL: https://issues.apache.org/jira/browse/SPARK-28319 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > SHOW TABLES needs to support v2 catalogs. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-28025: Assignee: Jungtaek Lim > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Assignee: Jungtaek Lim >Priority: Major > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. > These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat}
[jira] [Resolved] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-28025. -- Fix Version/s: 3.0.0 Resolution: Fixed
[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913965#comment-16913965 ] hemanth meka commented on SPARK-23519: -- I have a fix for this: checkColumnNameDuplication checks the analyzed schema (id, id) when it should check the aliased schema (int1, int2). I got it to work; I will run tests and submit a PR. > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Major > Labels: bulk-closed > Attachments: image-2018-05-10-10-48-57-259.png > > > 1. Create and populate a Hive table (done in a Hive CLI session, not that this matters): > create table atable (col1 int) ; > insert into atable values (10) , (100) ; > 2. Create a view from the table (run from a spark shell): > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name.
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)
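The fix the commenter describes amounts to running the duplicate-name check against the user-supplied view column list (int1, int2) instead of the analyzed query output (col1, col1). A hedged sketch of such a check (Spark's real implementation lives elsewhere in its Scala codebase; this only illustrates the logic):

```python
def check_column_name_duplication(names):
    """Raise if the same name (case-insensitively) appears more than once."""
    seen = set()
    for n in names:
        key = n.lower()
        if key in seen:
            raise ValueError(
                f"The view output ({','.join(names)}) contains duplicate column name."
            )
        seen.add(key)

# The reported bug: the check ran on the analyzed output (col1, col1) and
# failed, even though the view declared distinct aliases (int1, int2).
check_column_name_duplication(["int1", "int2"])   # aliased schema: passes
```

A practical workaround until the fix lands is to alias in the SELECT itself, e.g. `select col1 as int1, col1 as int2 from atable`, so the analyzed output already carries distinct names.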