[jira] [Resolved] (SPARK-28723) Upgrade to Hive 2.3.6 for HiveMetastore Client and Hadoop-3.2 profile

2019-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28723.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25443
[https://github.com/apache/spark/pull/25443]

> Upgrade to Hive 2.3.6 for HiveMetastore Client and Hadoop-3.2 profile
> -
>
> Key: SPARK-28723
> URL: https://issues.apache.org/jira/browse/SPARK-28723
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28833) Document ALTER VIEW Statement in SQL Reference.

2019-08-23 Thread kevin yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914795#comment-16914795
 ] 

kevin yu commented on SPARK-28833:
--

@aman omer: I am about halfway there. Could you help review? Thanks

> Document ALTER VIEW Statement in SQL Reference.
> ---
>
> Key: SPARK-28833
> URL: https://issues.apache.org/jira/browse/SPARK-28833
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: jobit mathew
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28778) Shuffle jobs fail due to incorrect advertised address when running in virtual network

2019-08-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914762#comment-16914762
 ] 

Dongjoon Hyun commented on SPARK-28778:
---

You have been added to the Apache Spark contributor group. Thank you for your first 
contribution, and welcome!

> Shuffle jobs fail due to incorrect advertised address when running in virtual 
> network
> -
>
> Key: SPARK-28778
> URL: https://issues.apache.org/jira/browse/SPARK-28778
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.2.3, 2.3.0, 2.4.3
>Reporter: Anton Kirillov
>Assignee: Anton Kirillov
>Priority: Major
>  Labels: Mesos
> Fix For: 3.0.0
>
>
> When shuffle jobs are launched by Mesos in a virtual network, Mesos scheduler 
> sets executor {{--hostname}} parameter to {{0.0.0.0}} in the case when 
> {{spark.mesos.network.name}} is provided. This makes executors use 
> {{0.0.0.0}} as their advertised address and, in the presence of shuffle, 
> executors fail to fetch shuffle blocks from each other using {{0.0.0.0}} as 
> the origin. When a virtual network is used, the hostname or IP address is not 
> known upfront; it is assigned to the container at start time, so the executor 
> process needs to advertise the correct dynamically assigned address in order to 
> be reachable by other executors.
> The bug described above prevents Mesos users from running any job that 
> involves shuffle, because executors cannot fetch shuffle blocks when the 
> advertised address is incorrect in a virtual network.
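To make the failure mode concrete, here is a minimal sketch (illustrative only, and not necessarily the change that was merged; resolveAdvertisedHost is a hypothetical helper) of how an executor could derive its advertised host at start time instead of trusting a statically passed 0.0.0.0:

{code:java}
import java.net.InetAddress

// Hypothetical sketch: prefer an explicitly passed, non-wildcard hostname, then the
// SPARK_LOCAL_HOSTNAME environment variable, and finally the address the container
// actually received at start time. This mirrors the idea in the description above.
def resolveAdvertisedHost(passedHostname: Option[String]): String =
  passedHostname.filter(h => h.nonEmpty && h != "0.0.0.0")
    .orElse(sys.env.get("SPARK_LOCAL_HOSTNAME"))
    .getOrElse(InetAddress.getLocalHost.getHostAddress)
{code}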



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28778) Shuffle jobs fail due to incorrect advertised address when running in virtual network

2019-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28778:
-

Assignee: Anton Kirillov

> Shuffle jobs fail due to incorrect advertised address when running in virtual 
> network
> -
>
> Key: SPARK-28778
> URL: https://issues.apache.org/jira/browse/SPARK-28778
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.2.3, 2.3.0, 2.4.3
>Reporter: Anton Kirillov
>Assignee: Anton Kirillov
>Priority: Major
>  Labels: Mesos
> Fix For: 3.0.0
>
>
> When shuffle jobs are launched by Mesos in a virtual network, Mesos scheduler 
> sets executor {{--hostname}} parameter to {{0.0.0.0}} in the case when 
> {{spark.mesos.network.name}} is provided. This makes executors use 
> {{0.0.0.0}} as their advertised address and, in the presence of shuffle, 
> executors fail to fetch shuffle blocks from each other using {{0.0.0.0}} as 
> the origin. When a virtual network is used, the hostname or IP address is not 
> known upfront; it is assigned to the container at start time, so the executor 
> process needs to advertise the correct dynamically assigned address in order to 
> be reachable by other executors.
> The bug described above prevents Mesos users from running any job that 
> involves shuffle, because executors cannot fetch shuffle blocks when the 
> advertised address is incorrect in a virtual network.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28778) Shuffle jobs fail due to incorrect advertised address when running in virtual network

2019-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28778.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/25500

> Shuffle jobs fail due to incorrect advertised address when running in virtual 
> network
> -
>
> Key: SPARK-28778
> URL: https://issues.apache.org/jira/browse/SPARK-28778
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.2.3, 2.3.0, 2.4.3
>Reporter: Anton Kirillov
>Priority: Major
>  Labels: Mesos
> Fix For: 3.0.0
>
>
> When shuffle jobs are launched by Mesos in a virtual network, Mesos scheduler 
> sets executor {{--hostname}} parameter to {{0.0.0.0}} in the case when 
> {{spark.mesos.network.name}} is provided. This makes executors use 
> {{0.0.0.0}} as their advertised address and, in the presence of shuffle, 
> executors fail to fetch shuffle blocks from each other using {{0.0.0.0}} as 
> the origin. When a virtual network is used, the hostname or IP address is not 
> known upfront; it is assigned to the container at start time, so the executor 
> process needs to advertise the correct dynamically assigned address in order to 
> be reachable by other executors.
> The bug described above prevents Mesos users from running any job that 
> involves shuffle, because executors cannot fetch shuffle blocks when the 
> advertised address is incorrect in a virtual network.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28858) add tree-based transformation in the py side

2019-08-23 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-28858.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25566
[https://github.com/apache/spark/pull/25566]

> add tree-based transformation in the py side
> 
>
> Key: SPARK-28858
> URL: https://issues.apache.org/jira/browse/SPARK-28858
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> Expose the newly added tree-based transformation on the Python side



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28858) add tree-based transformation in the py side

2019-08-23 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-28858:


Assignee: zhengruifeng

> add tree-based transformation in the py side
> 
>
> Key: SPARK-28858
> URL: https://issues.apache.org/jira/browse/SPARK-28858
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> Expose the newly added tree-based transformation on the Python side



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2019-08-23 Thread hemanth meka (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914660#comment-16914660
 ] 

hemanth meka edited comment on SPARK-23519 at 8/23/19 9:50 PM:
---

PR raised [25570|[https://github.com/apache/spark/pull/25570]] Someone please 
review.


was (Author: hem1891):
PR raised [25570|[https://github.com/apache/spark/pull/25570]]. Someone please 
review.

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Major
>  Labels: bulk-closed
> Attachments: image-2018-05-10-10-48-57-259.png
>
>
> 1. Create and populate a Hive table. I did this in a Hive CLI session (not 
> that this matters).
> create table atable (col1 int);
> insert into atable values (10), (100);
> 2. Create a view from the table. 
> (These actions were performed from a spark shell.)
> spark.sql("create view default.aview (int1, int2) as select col1, col1 
> from atable")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2019-08-23 Thread hemanth meka (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914660#comment-16914660
 ] 

hemanth meka edited comment on SPARK-23519 at 8/23/19 9:50 PM:
---

PR raised [https://github.com/apache/spark/pull/25570] Someone please review.


was (Author: hem1891):
PR raised [25570|[https://github.com/apache/spark/pull/25570]] Someone please 
review.

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Major
>  Labels: bulk-closed
> Attachments: image-2018-05-10-10-48-57-259.png
>
>
> 1. Create and populate a Hive table. I did this in a Hive CLI session (not 
> that this matters).
> create table atable (col1 int);
> insert into atable values (10), (100);
> 2. Create a view from the table. 
> (These actions were performed from a spark shell.)
> spark.sql("create view default.aview (int1, int2) as select col1, col1 
> from atable")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2019-08-23 Thread hemanth meka (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914660#comment-16914660
 ] 

hemanth meka commented on SPARK-23519:
--

PR raised [25570|[https://github.com/apache/spark/pull/25570]]. Someone please 
review.

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Major
>  Labels: bulk-closed
> Attachments: image-2018-05-10-10-48-57-259.png
>
>
> 1. Create and populate a Hive table. I did this in a Hive CLI session (not 
> that this matters).
> create table atable (col1 int);
> insert into atable values (10), (100);
> 2. Create a view from the table. 
> (These actions were performed from a spark shell.)
> spark.sql("create view default.aview (int1, int2) as select col1, col1 
> from atable")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28863) Add an AlreadyPlanned logical node that skips query planning

2019-08-23 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-28863:
---

 Summary: Add an AlreadyPlanned logical node that skips query 
planning
 Key: SPARK-28863
 URL: https://issues.apache.org/jira/browse/SPARK-28863
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


With the DataSourceV2 write operations, we have a way to fall back to the V1 
writer APIs using InsertableRelation.

The gross part is that we're in physical land, but the InsertableRelation takes 
a logical plan, so we have to pass the logical plans to these physical nodes, 
and then potentially go through re-planning.

A useful primitive could be specifying that a plan is ready for execution 
through a logical node AlreadyPlanned. This would wrap a physical plan, and 
then we can go straight to execution.
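A rough sketch of what such a node could look like (illustrative only; it glosses over analysis, canonicalization, and how the planner would map it back to the wrapped plan):

{code:java}
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LeafNode
import org.apache.spark.sql.execution.SparkPlan

// Illustrative only: a logical leaf node that carries an already-built physical plan,
// so the planner can translate it 1:1 into that physical plan and skip re-planning.
case class AlreadyPlanned(physicalPlan: SparkPlan) extends LeafNode {
  override def output: Seq[Attribute] = physicalPlan.output
}
{code}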

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28839) ExecutorMonitor$Tracker NullPointerException

2019-08-23 Thread Marcelo Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-28839:
--

Assignee: Hyukjin Kwon

> ExecutorMonitor$Tracker NullPointerException
> 
>
> Key: SPARK-28839
> URL: https://issues.apache.org/jira/browse/SPARK-28839
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Hyukjin Kwon
>Priority: Major
>
> {noformat}
> 19/08/21 06:44:01 ERROR AsyncEventQueue: Listener ExecutorMonitor threw an 
> exception
> java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.dynalloc.ExecutorMonitor$Tracker.removeShuffle(ExecutorMonitor.scala:479)
>   at 
> org.apache.spark.scheduler.dynalloc.ExecutorMonitor.$anonfun$cleanupShuffle$2(ExecutorMonitor.scala:408)
>   at 
> org.apache.spark.scheduler.dynalloc.ExecutorMonitor.$anonfun$cleanupShuffle$2$adapted(ExecutorMonitor.scala:407)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.scheduler.dynalloc.ExecutorMonitor.cleanupShuffle(ExecutorMonitor.scala:407)
>   at 
> org.apache.spark.scheduler.dynalloc.ExecutorMonitor.onOtherEvent(ExecutorMonitor.scala:351)
>   at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:82)
>   at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>   at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:99)
>   at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:84)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:102)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:102)
>   at 
> scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:97)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:93)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:93)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28839) ExecutorMonitor$Tracker NullPointerException

2019-08-23 Thread Marcelo Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-28839.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25551
[https://github.com/apache/spark/pull/25551]

> ExecutorMonitor$Tracker NullPointerException
> 
>
> Key: SPARK-28839
> URL: https://issues.apache.org/jira/browse/SPARK-28839
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> {noformat}
> 19/08/21 06:44:01 ERROR AsyncEventQueue: Listener ExecutorMonitor threw an 
> exception
> java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.dynalloc.ExecutorMonitor$Tracker.removeShuffle(ExecutorMonitor.scala:479)
>   at 
> org.apache.spark.scheduler.dynalloc.ExecutorMonitor.$anonfun$cleanupShuffle$2(ExecutorMonitor.scala:408)
>   at 
> org.apache.spark.scheduler.dynalloc.ExecutorMonitor.$anonfun$cleanupShuffle$2$adapted(ExecutorMonitor.scala:407)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.scheduler.dynalloc.ExecutorMonitor.cleanupShuffle(ExecutorMonitor.scala:407)
>   at 
> org.apache.spark.scheduler.dynalloc.ExecutorMonitor.onOtherEvent(ExecutorMonitor.scala:351)
>   at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:82)
>   at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>   at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:99)
>   at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:84)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:102)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:102)
>   at 
> scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:97)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:93)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:93)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28482) Data incomplete when using pandas udf in Python 3

2019-08-23 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-28482.
--
Resolution: Not A Problem

No problem [~jiangyu1211] ! I will resolve this then. In general, I wouldn't 
rely on print outs from the workers. If they are run as a subprocess, then you 
won't see them. Not sure if that's what happened in your case or not.

> Data incomplete when using pandas udf in Python 3
> -
>
> Key: SPARK-28482
> URL: https://issues.apache.org/jira/browse/SPARK-28482
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 2.4.3
> Environment: centos 7.4   
> pyarrow 0.10.0 0.14.0
> python 2.7 3.5 3.6
>Reporter: jiangyu
>Priority: Major
> Attachments: py2.7.png, py3.6.png, test.csv, test.py, worker.png
>
>
> Hi,
>   
>  Since Spark 2.3.x, pandas UDF has been introduced as the default ser/de method 
> when using UDFs. However, an issue arises with Python >= 3.5.x.
>  We use pandas UDFs to process batches of data, but we find the data is 
> incomplete in Python 3.x. At first I thought the processing logic might be wrong, 
> so I changed the code to a very simple one, and it had the same problem. After 
> investigating for a week, I found it is related to pyarrow.
>   
>  *Reproduce procedure:*
> 1. Prepare the data.
>  The data has seven columns, a, b, c, d, e, f and g; the data type is Integer.
>  a,b,c,d,e,f,g
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>   Produce 100,000 rows, name the file test.csv, upload it to HDFS, then load 
> it and repartition it to 1 partition.
>   
> {code:java}
> df=spark.read.format('csv').option("header","true").load('/test.csv')
> df=df.select(*(col(c).cast("int").alias(c) for c in df.columns))
> df=df.repartition(1)
> spark_context = SparkContext.getOrCreate() {code}
>  
>  2. Register the pandas UDF.
>   
> {code:java}
> def add_func(a,b,c,d,e,f,g):
> print('iterator one time')
> return a
> add = pandas_udf(add_func, returnType=IntegerType())
> df_result=df.select(add(col("a"),col("b"),col("c"),col("d"),col("e"),col("f"),col("g"))){code}
>  
>  3. Apply the pandas UDF.
>   
> {code:java}
> def trigger_func(iterator):
>       yield iterator
> df_result.rdd.foreachPartition(trigger_func){code}
>  
>  4. Execute it in pyspark (local or yarn).
>  Run it with conf spark.sql.execution.arrow.maxRecordsPerBatch=10. As 
> mentioned before, the total row number is 100, so it should print "iterator 
> one time" 10 times.
>  (1)Python 2.7 envs:
>   
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/py2.7/bin/python pyspark --conf 
> spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf 
> spark.executor.pyspark.memory=2g --conf 
> spark.sql.execution.arrow.enabled=true --executor-cores 1{code}
>  
>  !py2.7.png!   
>  The result is correct: "iterator one time" is printed 10 times.
>  
>  
> (2)Python 3.5 or 3.6 envs:
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/python3.6/bin/python pyspark --conf 
> spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf 
> spark.executor.pyspark.memory=2g --conf 
> spark.sql.execution.arrow.enabled=true --executor-cores{code}
>  
> !py3.6.png!
> The data is incomplete. The exception is printed by JVM-side Spark code that we 
> added; I will explain it later.
>   
>   
> h3. *Investigation*
> The “process done” output comes from a print we added in worker.py.
>  !worker.png!
>  In order to capture the exception, change the Spark code under 
> core/src/main/scala/org/apache/spark/util/Utils.scala and add the following to 
> print the exception.
>   
>  
> {code:java}
> @@ -1362,6 +1362,8 @@ private[spark] object Utils extends Logging {
>  case t: Throwable =>
>  // Purposefully not using NonFatal, because even fatal exceptions
>  // we don't want to have our finallyBlock suppress
> + logInfo(t.getLocalizedMessage)
> + t.printStackTrace()
>  originalThrowable = t
>  throw originalThrowable
>  } finally {{code}
>  
>  
>  It seems PySpark gets the data from the JVM, but pyarrow receives the data 
> incomplete. The pyarrow side thinks the data is finished and shuts down the 
> socket. At the same time, the JVM side still writes to the same socket, but 
> gets a socket-closed exception.
>  The pyarrow part is in ipc.pxi:
>   
> {code:java}
> cdef class _RecordBatchReader:
>  cdef:
>  shared_ptr[CRecordBatchReader] reader
>  shared_ptr[InputStream] in_stream
> cdef readonly:
>  Schema schema
> def __cinit__(self):
>  pass
> def _open(self, source):
>  get_input_stream(source, _stream)
>  with nogil:
>  check_status(CRecordBatchStreamReader.Open(
>  self.in_stream.get(), ))
> self.schema = pyarrow_wrap_schema(self.reader.get().schema())
> def __iter__(self):
>  while True:
>  yield self.read_next_batch()
> def get_next_batch(self):
>  import warnings
>  warnings.warn('Please use read_next_batch 

[jira] [Comment Edited] (SPARK-28833) Document ALTER VIEW Statement in SQL Reference.

2019-08-23 Thread Aman Omer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914462#comment-16914462
 ] 

Aman Omer edited comment on SPARK-28833 at 8/23/19 4:48 PM:


[~kevinyu98] Since I have worked on CREATE VIEW document 
([https://github.com/apache/spark/pull/25543]) and this is related.

If you permit, I would like to start working on this. :)


was (Author: aman_omer):
[~kevinyu98] Since I have worked on CREATE VIEW document 
([https://github.com/apache/spark/pull/25543]) and it is related. If you 
permit, I would like to start working on this. :)

> Document ALTER VIEW Statement in SQL Reference.
> ---
>
> Key: SPARK-28833
> URL: https://issues.apache.org/jira/browse/SPARK-28833
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: jobit mathew
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28833) Document ALTER VIEW Statement in SQL Reference.

2019-08-23 Thread Aman Omer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914462#comment-16914462
 ] 

Aman Omer edited comment on SPARK-28833 at 8/23/19 4:47 PM:


[~kevinyu98] Since I have worked on CREATE VIEW document 
([https://github.com/apache/spark/pull/25543]) and it is related. If you 
permit, I would like to start working on this. :)


was (Author: aman_omer):
[~kevinyu98] Since I have worked on CREATE VIEW document 
([https://github.com/apache/spark/pull/25543]). If you permit, I would like to 
start working on this. :)

> Document ALTER VIEW Statement in SQL Reference.
> ---
>
> Key: SPARK-28833
> URL: https://issues.apache.org/jira/browse/SPARK-28833
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: jobit mathew
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28833) Document ALTER VIEW Statement in SQL Reference.

2019-08-23 Thread Aman Omer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914462#comment-16914462
 ] 

Aman Omer commented on SPARK-28833:
---

[~kevinyu98] Since I have worked on CREATE VIEW document 
([https://github.com/apache/spark/pull/25543]). If you permit, I would like to 
start working on this. :)

> Document ALTER VIEW Statement in SQL Reference.
> ---
>
> Key: SPARK-28833
> URL: https://issues.apache.org/jira/browse/SPARK-28833
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: jobit mathew
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28862) Read Impala view with UNION ALL operator

2019-08-23 Thread francesco (Jira)
francesco created SPARK-28862:
-

 Summary: Read Impala view with UNION ALL operator
 Key: SPARK-28862
 URL: https://issues.apache.org/jira/browse/SPARK-28862
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.2.0
Reporter: francesco


I would like to report an issue in PySpark 2.2.0 when it is used to read Hive 
views that contain the UNION ALL operator. 

 

I am attaching the error: 

WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try 
again. 

 

Is there any solution other than materializing this view? 


Thanks,



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27330) ForeachWriter is not being closed once a batch is aborted

2019-08-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914321#comment-16914321
 ] 

Dongjoon Hyun commented on SPARK-27330:
---

Thank you, [~eyalzit]. You have been added to the Apache Spark contributor group.

> ForeachWriter is not being closed once a batch is aborted
> -
>
> Key: SPARK-27330
> URL: https://issues.apache.org/jira/browse/SPARK-27330
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Eyal Zituny
>Assignee: Eyal Zituny
>Priority: Major
> Fix For: 2.4.4, 3.0.0
>
>
> In cases where a micro batch is killed (interrupted) outside the actual 
> processing done by the {{ForeachDataWriter}} (when iterating the iterator), 
> {{DataWritingSparkTask}} will handle the interruption and call 
> {{dataWriter.abort()}}.
> The problem is that {{ForeachDataWriter}} has an empty implementation of the 
> abort method.
> Because of that, I have tasks which use the foreach writer and, according to the 
> [documentation|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach],
>  open connections in the "open" method and close them in the "close" method; 
> but since "close" is never called, the connections are never closed.
> This wasn't the behavior before Spark 2.4.
> My suggestion is to call {{ForeachWriter.abort()}} when 
> {{DataWriter.abort()}} is called, in order to notify the foreach writer that 
> this task has failed.
>  
> {code:java}
> stack trace from the exception i have encountered:
>  org.apache.spark.TaskKilledException: null
>  at 
> org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:149)
>  at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>  at 
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:117)
>  at 
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
>  at 
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146)
>  at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67)
>  at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66)
> {code}
>  
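For context, here is a minimal ForeachWriter sketch following the documented open/process/close contract (Connection is a hypothetical stand-in for a real client). The point is that close() is the only place the resource is released, which is why a batch abort that never reaches close() leaks connections:

{code:java}
import org.apache.spark.sql.ForeachWriter

// Hypothetical connection type, purely for illustration.
class Connection {
  def send(value: String): Unit = ()
  def close(): Unit = ()
}

// Minimal sketch of the documented lifecycle: open() acquires the resource,
// process() uses it, close() releases it. If close() is never invoked after an
// aborted batch, the connection acquired in open() is never released.
class SimpleWriter extends ForeachWriter[String] {
  @transient private var conn: Connection = _

  override def open(partitionId: Long, epochId: Long): Boolean = {
    conn = new Connection()
    true
  }

  override def process(value: String): Unit = conn.send(value)

  override def close(errorOrNull: Throwable): Unit = {
    if (conn != null) conn.close()
  }
}
{code}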



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27330) ForeachWriter is not being closed once a batch is aborted

2019-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-27330:
-

Assignee: Eyal Zituny

> ForeachWriter is not being closed once a batch is aborted
> -
>
> Key: SPARK-27330
> URL: https://issues.apache.org/jira/browse/SPARK-27330
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Eyal Zituny
>Assignee: Eyal Zituny
>Priority: Major
> Fix For: 2.4.4, 3.0.0
>
>
> In cases where a micro batch is killed (interrupted) outside the actual 
> processing done by the {{ForeachDataWriter}} (when iterating the iterator), 
> {{DataWritingSparkTask}} will handle the interruption and call 
> {{dataWriter.abort()}}.
> The problem is that {{ForeachDataWriter}} has an empty implementation of the 
> abort method.
> Because of that, I have tasks which use the foreach writer and, according to the 
> [documentation|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach],
>  open connections in the "open" method and close them in the "close" method; 
> but since "close" is never called, the connections are never closed.
> This wasn't the behavior before Spark 2.4.
> My suggestion is to call {{ForeachWriter.abort()}} when 
> {{DataWriter.abort()}} is called, in order to notify the foreach writer that 
> this task has failed.
>  
> {code:java}
> stack trace from the exception i have encountered:
>  org.apache.spark.TaskKilledException: null
>  at 
> org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:149)
>  at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>  at 
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:117)
>  at 
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
>  at 
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146)
>  at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67)
>  at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27330) ForeachWriter is not being closed once a batch is aborted

2019-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27330.
---
Fix Version/s: 3.0.0
   2.4.4
   Resolution: Fixed

> ForeachWriter is not being closed once a batch is aborted
> -
>
> Key: SPARK-27330
> URL: https://issues.apache.org/jira/browse/SPARK-27330
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Eyal Zituny
>Priority: Major
> Fix For: 2.4.4, 3.0.0
>
>
> In cases where a micro batch is killed (interrupted) outside the actual 
> processing done by the {{ForeachDataWriter}} (when iterating the iterator), 
> {{DataWritingSparkTask}} will handle the interruption and call 
> {{dataWriter.abort()}}.
> The problem is that {{ForeachDataWriter}} has an empty implementation of the 
> abort method.
> Because of that, I have tasks which use the foreach writer and, according to the 
> [documentation|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach],
>  open connections in the "open" method and close them in the "close" method; 
> but since "close" is never called, the connections are never closed.
> This wasn't the behavior before Spark 2.4.
> My suggestion is to call {{ForeachWriter.abort()}} when 
> {{DataWriter.abort()}} is called, in order to notify the foreach writer that 
> this task has failed.
>  
> {code:java}
> stack trace from the exception i have encountered:
>  org.apache.spark.TaskKilledException: null
>  at 
> org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:149)
>  at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>  at 
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:117)
>  at 
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
>  at 
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146)
>  at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67)
>  at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files

2019-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28025:
--
Fix Version/s: 2.4.4

> HDFSBackedStateStoreProvider should not leak .crc files 
> 
>
> Key: SPARK-28025
> URL: https://issues.apache.org/jira/browse/SPARK-28025
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3
> Kubernetes 1.11(?) (OpenShift)
> StateStore storage on a mounted PVC. Viewed as a local filesystem by the 
> `FileContextBasedCheckpointFileManager` : 
> {noformat}
> scala> glusterfm.isLocal
> res17: Boolean = true{noformat}
>Reporter: Gerard Maas
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.4.4, 3.0.0
>
>
> The HDFSBackedStateStoreProvider, when using the default CheckpointFileManager, 
> leaves '.crc' files behind. There is a .crc file created for each 
> `atomicFile` operation of the CheckpointFileManager.
> Over time, the number of files becomes very large. It makes the state store 
> file system constantly increase in size and, in our case, deteriorates the 
> file system performance.
> Here's a sample of one of our spark storage volumes after 2 days of execution 
> (4 stateful streaming jobs, each on a different sub-dir):
> {noformat}
> Total files in PVC (used for checkpoints and state store)
> $find . | wc -l
> 431796
> # .crc files
> $find . -name "*.crc" | wc -l
> 418053{noformat}
> With each .crc file taking one storage block, the used storage runs into the 
> GBs of data.
> These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, 
> shows serious performance deterioration with this large number of files:
> {noformat}
> DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat}
>  
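As a stopgap, orphaned checksum files can be swept periodically. The sketch below is illustrative only (it assumes the state directory is a plain local filesystem path, as in the environment above) and is not the fix that went into Spark:

{code:java}
import java.io.File

// Illustrative cleanup sweep: a Hadoop local checksum sidecar is named ".<name>.crc".
// Delete the sidecar when the file it shadows no longer exists in the same directory.
def sweepOrphanCrc(dir: File): Unit = {
  val children = Option(dir.listFiles()).getOrElse(Array.empty[File])
  children.filter(_.isDirectory).foreach(sweepOrphanCrc)
  children
    .filter(f => f.isFile && f.getName.startsWith(".") && f.getName.endsWith(".crc"))
    .foreach { crc =>
      val shadowed = new File(dir, crc.getName.stripPrefix(".").stripSuffix(".crc"))
      if (!shadowed.exists()) crc.delete()
    }
}

// Example (hypothetical path): sweepOrphanCrc(new File("/checkpoints/state"))
{code}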



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25474) Support `spark.sql.statistics.fallBackToHdfs` in data source tables

2019-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25474:
--
Fix Version/s: (was: 3.0.0)

> Support `spark.sql.statistics.fallBackToHdfs` in data source tables
> ---
>
> Key: SPARK-25474
> URL: https://issues.apache.org/jira/browse/SPARK-25474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
> Environment: Spark 2.3.1
> Hadoop 2.7.2
>Reporter: Ayush Anubhava
>Priority: Major
>
> *Description:* The size in bytes of the query comes out in EB for a Parquet 
> data source. This impacts performance, since join queries are then always 
> planned as Sort Merge Join.
> *Precondition :* spark.sql.statistics.fallBackToHdfs = true
> Steps:
> {code:java}
> 0: jdbc:hive2://10.xx:23040/default> create table t1110 (a int, b string) 
> using parquet PARTITIONED BY (b) ;
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (2,'b');
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (1,'a');
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.xx.xx:23040/default> select * from t1110;
> +++--+
> | a | b |
> +++--+
> | 1 | a |
> | 2 | b |
> +++--+
> {code}
> *{color:#d04437}Cost of the query shows sizeInBytes in EB{color}*
> {code:java}
>  explain cost select * from t1110;
> | == Optimized Logical Plan ==
> Relation[a#23,b#24] parquet, Statistics(sizeInBytes=8.0 EB, hints=none)
> == Physical Plan ==
> *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: Parquet, 
> Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], 
> PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct |
> {code}
> *{color:#d04437}This would lead to Sort Merge Join in case of join 
> query{color}*
> {code:java}
> 0: jdbc:hive2://10.xx.xx:23040/default> create table t110 (a int, b string) 
> using parquet PARTITIONED BY (b) ;
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.xx.xx:23040/default> insert into t110 values (1,'a');
> +-+--+
> | Result |
> +-+--+
> +-+--+
>  explain select * from t1110 t1 join t110 t2 on t1.a=t2.a;
> | == Physical Plan ==
> *(5) SortMergeJoin [a#23], [a#55], Inner
> :- *(2) Sort [a#23 ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(a#23, 200)
> : +- *(1) Project [a#23, b#24]
> : +- *(1) Filter isnotnull(a#23)
> : +- *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: 
> Parquet, Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], 
> PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(a)], 
> ReadSchema: struct
> +- *(4) Sort [a#55 ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(a#55, 200)
> +- *(3) Project [a#55, b#56]
> +- *(3) Filter isnotnull(a#55)
> +- *(3) FileScan parquet open.t110[a#55,b#56] Batched: true, Format: Parquet, 
> Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t110], 
> PartitionCount: 1, PartitionFilters: [], PushedFilters: [IsNotNull(a)], 
> ReadSchema: struct |
> {code}
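For reference, the effect described above can be observed from a spark-shell roughly as follows (a sketch; spark.sql.statistics.fallBackToHdfs is the configuration named in this issue):

{code:java}
// Sketch: toggle the fallback and compare the estimated statistics and join strategy.
spark.conf.set("spark.sql.statistics.fallBackToHdfs", "true")
spark.sql("explain cost select * from t1110").show(false)
// With the 8.0 EB sizeInBytes shown above, the join below stays a SortMergeJoin
// instead of being planned as a BroadcastHashJoin.
spark.sql("explain select * from t1110 t1 join t110 t2 on t1.a = t2.a").show(false)
{code}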



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-25474) Support `spark.sql.statistics.fallBackToHdfs` in data source tables

2019-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-25474:
---
  Assignee: (was: shahid)

> Support `spark.sql.statistics.fallBackToHdfs` in data source tables
> ---
>
> Key: SPARK-25474
> URL: https://issues.apache.org/jira/browse/SPARK-25474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
> Environment: Spark 2.3.1
> Hadoop 2.7.2
>Reporter: Ayush Anubhava
>Priority: Major
> Fix For: 3.0.0
>
>
> *Description:* The size in bytes of the query comes out in EB for a Parquet 
> data source. This impacts performance, since join queries are then always 
> planned as Sort Merge Join.
> *Precondition :* spark.sql.statistics.fallBackToHdfs = true
> Steps:
> {code:java}
> 0: jdbc:hive2://10.xx:23040/default> create table t1110 (a int, b string) 
> using parquet PARTITIONED BY (b) ;
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (2,'b');
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (1,'a');
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.xx.xx:23040/default> select * from t1110;
> +++--+
> | a | b |
> +++--+
> | 1 | a |
> | 2 | b |
> +++--+
> {code}
> *{color:#d04437}Cost of the query shows sizeInBytes in EB{color}*
> {code:java}
>  explain cost select * from t1110;
> | == Optimized Logical Plan ==
> Relation[a#23,b#24] parquet, Statistics(sizeInBytes=8.0 EB, hints=none)
> == Physical Plan ==
> *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: Parquet, 
> Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], 
> PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct |
> {code}
> *{color:#d04437}This would lead to Sort Merge Join in case of join 
> query{color}*
> {code:java}
> 0: jdbc:hive2://10.xx.xx:23040/default> create table t110 (a int, b string) 
> using parquet PARTITIONED BY (b) ;
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.xx.xx:23040/default> insert into t110 values (1,'a');
> +-+--+
> | Result |
> +-+--+
> +-+--+
>  explain select * from t1110 t1 join t110 t2 on t1.a=t2.a;
> | == Physical Plan ==
> *(5) SortMergeJoin [a#23], [a#55], Inner
> :- *(2) Sort [a#23 ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(a#23, 200)
> : +- *(1) Project [a#23, b#24]
> : +- *(1) Filter isnotnull(a#23)
> : +- *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: 
> Parquet, Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], 
> PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(a)], 
> ReadSchema: struct
> +- *(4) Sort [a#55 ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(a#55, 200)
> +- *(3) Project [a#55, b#56]
> +- *(3) Filter isnotnull(a#55)
> +- *(3) FileScan parquet open.t110[a#55,b#56] Batched: true, Format: Parquet, 
> Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t110], 
> PartitionCount: 1, PartitionFilters: [], PushedFilters: [IsNotNull(a)], 
> ReadSchema: struct |
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28861) Jetty property handling

2019-08-23 Thread Ketan (Jira)
Ketan created SPARK-28861:
-

 Summary: Jetty property handling
 Key: SPARK-28861
 URL: https://issues.apache.org/jira/browse/SPARK-28861
 Project: Spark
  Issue Type: Wish
  Components: Spark Submit
Affects Versions: 2.4.3
Reporter: Ketan


While processing data from certain files, a NumberFormatException was seen in 
the logs. The processing itself was fine, but the following stack trace was observed:

{"time":"2019-08-16 
08:21:36,733","level":"DEBUG","class":"o.s.j.u.Jetty","message":"","thread":"Driver","appName":"app-name","appVersion":"APPLICATION_VERSION","type":"APPLICATION","errorCode":"ERROR_CODE","errorId":""}
java.lang.NumberFormatException: For input string: "unknown".

On investigation, it was found that the Jetty class contains the following:

BUILD_TIMESTAMP = formatTimestamp(__buildProperties.getProperty("timestamp", 
"unknown")); 

which indicates that the configuration should contain a 'timestamp' property. If the 
property is not present, the default value 'unknown' is used, and parsing that value 
causes the stack trace to show up in our application's logs. It has no 
detrimental effect on the application as such, but it could be addressed.
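A minimal sketch of the failure mode (illustrative only, not Jetty's actual code): a default value of "unknown" fed into a numeric timestamp parse raises NumberFormatException, which then surfaces in DEBUG-level logging.

{code:java}
import java.util.Properties

// Illustrative reproduction of the described behavior: when the "timestamp" property
// is missing, the default "unknown" is returned, and a numeric parse of it fails with
// java.lang.NumberFormatException: For input string: "unknown"
val buildProperties = new Properties() // no "timestamp" entry loaded
val raw = buildProperties.getProperty("timestamp", "unknown")
val parsed: Option[Long] =
  try Some(raw.toLong)
  catch { case _: NumberFormatException => None }
{code}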
 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28741) New optional mode: Throw exceptions when casting to integers causes overflow

2019-08-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28741.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25461
[https://github.com/apache/spark/pull/25461]

> New optional mode: Throw exceptions when casting to integers causes overflow
> 
>
> Key: SPARK-28741
> URL: https://issues.apache.org/jira/browse/SPARK-28741
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> To follow ANSI SQL, we should support a configurable mode that throws 
> exceptions when casting to integers causes overflow.
> The behavior is similar to https://issues.apache.org/jira/browse/SPARK-26218, 
> which throws exceptions on arithmetical operation overflow.
> To unify it, the configuration is renamed from 
> "spark.sql.arithmeticOperations.failOnOverFlow" to 
> "spark.sql.failOnIntegerOverFlow"



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28741) New optional mode: Throw exceptions when casting to integers causes overflow

2019-08-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28741:
---

Assignee: Gengliang Wang

> New optional mode: Throw exceptions when casting to integers causes overflow
> 
>
> Key: SPARK-28741
> URL: https://issues.apache.org/jira/browse/SPARK-28741
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> To follow ANSI SQL, we should support a configurable mode that throws 
> exceptions when casting to integers causes overflow.
> The behavior is similar to https://issues.apache.org/jira/browse/SPARK-26218, 
> which throws exceptions on arithmetical operation overflow.
> To unify it, the configuration is renamed from 
> "spark.sql.arithmeticOperations.failOnOverFlow" to 
> "spark.sql.failOnIntegerOverFlow"



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28716) Add id to Exchange and Subquery's stringArgs method for easier identifying their reuses in query plans

2019-08-23 Thread Herman van Hovell (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-28716.
---
Fix Version/s: 3.0.0
 Assignee: Ali Afroozeh
   Resolution: Fixed

> Add id to Exchange and Subquery's stringArgs method for easier identifying 
> their reuses in query plans
> --
>
> Key: SPARK-28716
> URL: https://issues.apache.org/jira/browse/SPARK-28716
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Assignee: Ali Afroozeh
>Priority: Minor
> Fix For: 3.0.0
>
>
> Add id to Exchange and Subquery's stringArgs method for easier identifying 
> their reuses in query plans, for example:
> {{ReusedExchange [d_date_sk#827], BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))) 
> [id=#2710]}}
> Where {{2710}} is the id of the reused exchange.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28836) Remove the canonicalize(attributes) method from PlanExpression

2019-08-23 Thread Herman van Hovell (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-28836.
---
Fix Version/s: 3.0.0
 Assignee: Ali Afroozeh
   Resolution: Fixed

> Remove the canonicalize(attributes) method from PlanExpression
> --
>
> Key: SPARK-28836
> URL: https://issues.apache.org/jira/browse/SPARK-28836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Assignee: Ali Afroozeh
>Priority: Minor
> Fix For: 3.0.0
>
>
> The canonicalize(attrs: AttributeSeq) method in PlanExpression is somewhat 
> confusing. 
>  First, it is not clear why `PlanExpression` should have this method, and why 
> the canonicalization is not handled
>  by the canonicalized method of its parent, the Expression class. Second, the 
> QueryPlan.normalizeExpressionId
>  is the only place where PlanExpression.canonicalized is being called.
> This PR removes the canonicalize method from the PlanExpression class and 
> delegates the normalization of expression ids to
>  the QueryPlan.normalizedExpressionId method. Also, the name 
> normalizedExpressions is more suitable for this method,
>  therefore, the method has also been renamed.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2019-08-23 Thread Rahul Kulkarni (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914192#comment-16914192
 ] 

Rahul Kulkarni commented on SPARK-18112:


Getting this issue on Spark 2.4.3 with Hive 2.3.3. Also tried with Hive 2.3.5. 
Tried using explicit config variables spark.sql.hive.metastore.version and 
spark.sql.hive.metastore.jars, but no luck.

 

cdh235m1:/spark-2.4.3-bin-without-hadoop/conf # spark-shell \
> --jars 
> /opt/teradata/tdqg/connector/tdqg-spark-connector/02.05.00.04-159/lib/spark-loaderfactory-02.05.00.04-159.jar,
>  \
> /opt/teradata/tdqg/connector/tdqgspark-connector/02.05.00.04-159/lib/log4j-api-2.7.jar,
>  \
> /opt/teradata/tdqg/connector/tdqgspark-connector/02.05.00.04-159/lib/log4j-core-2.7.jar,
>  \
> /opt/teradata/tdqg/connector/tdqgspark-connector/02.05.00.04-159/lib/qgc-spark-02.05.00.04-159.jar,
>  \
> /opt/teradata/tdqg/connector/tdqgspark-connector/02.05.00.04-159/lib/spark-loader-02.05.00.04-159.jar,
>  \
> /opt/teradata/tdqg/connector/tdqgspark-connector/02.05.00.04-159/lib/json-simple-1.1.1.jar
>  \
> --conf spark.sql.hive.metastore.version=2.3.3 \
> --conf spark.sql.hive.metastore.jars=/apache-hive-2.3.3-bin/lib/* \
> --master spark://cdh235m1.labs.teradata.com:7077
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/apache-hive-2.3.3-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/08/23 07:15:43 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
[INFO] Unable to bind key for unsupported operation: backward-delete-word
[INFO] Unable to bind key for unsupported operation: backward-delete-word
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
Spark context Web UI available at http://cdh235m1.labs.teradata.com:4040
Spark context available as 'sc' (master = 
spark://cdh235m1.labs.teradata.com:7077, app id = app-20190823071550-).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import tdqg.ForeignServer
import tdqg.ForeignServer

scala> ForeignServer.getDatasetFromSql("SELECT * FROM 
GDCData.Telco_Churn_Anal_Train_V")
19/08/23 07:16:03 WARN SparkSession$Builder: Using an existing SparkSession; 
some configuration may not take effect.
java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
 at 
org.apache.spark.sql.hive.HiveUtils$.formatTimeVarsForHiveClient(HiveUtils.scala:204)
 at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:285)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:215)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:214)
 at 
org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
 at 

[jira] [Updated] (SPARK-28836) Remove the canonicalize(attributes) method from PlanExpression

2019-08-23 Thread Ali Afroozeh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Afroozeh updated SPARK-28836:
-
Summary: Remove the canonicalize(attributes) method from PlanExpression  
(was: Remove the canonicalize() method )

> Remove the canonicalize(attributes) method from PlanExpression
> --
>
> Key: SPARK-28836
> URL: https://issues.apache.org/jira/browse/SPARK-28836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> The canonicalize(attrs: AttributeSeq) method in PlanExpression is somewhat 
> confusing. 
>  First, it is not clear why `PlanExpression` should have this method, and why 
> the canonicalization is not handled
>  by the canonicalized method of its parent, the Expression class. Second, the 
> QueryPlan.normalizeExpressionId
>  is the only place where PlanExpression.canonicalized is being called.
> This PR removes the canonicalize method from the PlanExpression class and 
> delegates the normalization of expression ids to
>  the QueryPlan.normalizedExpressionId method. Also, the name 
> normalizedExpressions is more suitable for this method,
>  therefore, the method has also been renamed.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28836) Remove the canonicalize() method

2019-08-23 Thread Ali Afroozeh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Afroozeh updated SPARK-28836:
-
Summary: Remove the canonicalize() method   (was: Improve canonicalize API)

> Remove the canonicalize() method 
> -
>
> Key: SPARK-28836
> URL: https://issues.apache.org/jira/browse/SPARK-28836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> The canonicalize(attrs: AttributeSeq) method in PlanExpression is somewhat 
> confusing. 
>  First, it is not clear why `PlanExpression` should have this method, and why 
> the canonicalization is not handled
>  by the canonicalized method of its parent, the Expression class. Second, the 
> QueryPlan.normalizeExpressionId
>  is the only place where PlanExpression.canonicalized is being called.
> This PR removes the canonicalize method from the PlanExpression class and 
> delegates the normalization of expression ids to
>  the QueryPlan.normalizedExpressionId method. Also, the name 
> normalizedExpressions is more suitable for this method,
>  therefore, the method has also been renamed.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28836) Improve canonicalize API

2019-08-23 Thread Ali Afroozeh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Afroozeh updated SPARK-28836:
-
Description: 
The canonicalize(attrs: AttributeSeq) method in PlanExpression is somewhat 
confusing. 
 First, it is not clear why `PlanExpression` should have this method, and why 
the canonicalization is not handled
 by the canonicalized method of its parent, the Expression class. Second, the 
QueryPlan.normalizeExpressionId
 is the only place where PlanExpression.canonicalized is being called.

This PR removes the canonicalize method from the PlanExpression class and 
delegates the normalization of expression ids to
 the QueryPlan.normalizedExpressionId method. Also, the name 
normalizedExpressions is more suitable for this method,
 therefore, the method has also been renamed.

  was:
The canonicalize(attrs: AttributeSeq) method in PlanExpression is somewhat 
confusing. 
First, it is not clear why `PlanExpression` should have this method, and why 
the canonicalization is not handled
by the canonicalized method of its parent, the Expression class. Second, the 
QueryPlan.normalizeExpressionId
is the only place where PlanExpression.canonicalized is being called.

This PR simplifies the canonicalize method on PlanExpression and delegates the 
normalization of expression ids to
the QueryPlan.normalizedExpressionId method. Also, the name 
normalizedExpressions is more suitable for this method,
therefore, the method has also been renamed.


> Improve canonicalize API
> 
>
> Key: SPARK-28836
> URL: https://issues.apache.org/jira/browse/SPARK-28836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> The canonicalize(attrs: AttributeSeq) method in PlanExpression is somewhat 
> confusing. 
>  First, it is not clear why `PlanExpression` should have this method, and why 
> the canonicalization is not handled
>  by the canonicalized method of its parent, the Expression class. Second, the 
> QueryPlan.normalizeExpressionId
>  is the only place where PlanExpression.canonicalized is being called.
> This PR removes the canonicalize method from the PlanExpression class and 
> delegates the normalization of expression ids to
>  the QueryPlan.normalizedExpressionId method. Also, the name 
> normalizedExpressions is more suitable for this method,
>  therefore, the method has also been renamed.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28836) Improve canonicalize API

2019-08-23 Thread Ali Afroozeh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Afroozeh updated SPARK-28836:
-
Description: 
The canonicalize(attrs: AttributeSeq) method in PlanExpression is somewhat 
confusing. 
First, it is not clear why `PlanExpression` should have this method, and why 
the canonicalization is not handled
by the canonicalized method of its parent, the Expression class. Second, the 
QueryPlan.normalizeExpressionId
is the only place where PlanExpression.canonicalized is being called.

This PR simplifies the canonicalize method on PlanExpression and delegates the 
normalization of expression ids to
the QueryPlan.normalizedExpressionId method. Also, the name 
normalizedExpressions is more suitable for this method,
therefore, the method has also been renamed.

  was:This PR improves the `canonicalize` API by removing the method `def 
canonicalize(attrs: AttributeSeq): PlanExpression[T]` in `PlanExpression` and 
taking care of normalizing expressions in `QueryPlan`.


> Improve canonicalize API
> 
>
> Key: SPARK-28836
> URL: https://issues.apache.org/jira/browse/SPARK-28836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> The canonicalize(attrs: AttributeSeq) method in PlanExpression is somewhat 
> confusing. 
> First, it is not clear why `PlanExpression` should have this method, and why 
> the canonicalization is not handled
> by the canonicalized method of its parent, the Expression class. Second, the 
> QueryPlan.normalizeExpressionId
> is the only place where PlanExpression.canonicalized is being called.
> This PR simplifies the canonicalize method on PlanExpression and delegates 
> the normalization of expression ids to
> the QueryPlan.normalizedExpressionId method. Also, the name 
> normalizedExpressions is more suitable for this method,
> therefore, the method has also been renamed.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28851) Connect HBase using Spark SQL in Spark 2.x

2019-08-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914094#comment-16914094
 ] 

Hyukjin Kwon commented on SPARK-28851:
--

That will likely be an issue in the HBase connector. For questions, the mailing list 
is the best place to ask; see 
https://spark.apache.org/community.html.

> Connect HBase using Spark SQL in Spark 2.x
> --
>
> Key: SPARK-28851
> URL: https://issues.apache.org/jira/browse/SPARK-28851
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: ARUN KINDRA
>Priority: Major
>
> Hi,
>  
> I am basically trying a sample Spark SQL code which reads data from 
> Oracle and stores it into HBase. I found a Spark-HBase connector for writing 
> data into HBase where I need to provide a catalog, but it seems that it was 
> only available until Spark 1.6. Now, what is the way to connect to HBase 
> using Spark SqlContext?
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28851) Connect HBase using Spark SQL in Spark 2.x

2019-08-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28851.
--
Resolution: Invalid

> Connect HBase using Spark SQL in Spark 2.x
> --
>
> Key: SPARK-28851
> URL: https://issues.apache.org/jira/browse/SPARK-28851
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: ARUN KINDRA
>Priority: Major
>
> Hi,
>  
> I am basically trying a sample Spark SQL code which reads data from 
> Oracle and stores it into HBase. I found a Spark-HBase connector for writing 
> data into HBase where I need to provide a catalog, but it seems that it was 
> only available until Spark 1.6. Now, what is the way to connect to HBase 
> using Spark SqlContext?
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28854) Zipping iterators in mapPartitions will fail

2019-08-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28854.
--
Resolution: Invalid

> Zipping iterators in mapPartitions will fail
> 
>
> Key: SPARK-28854
> URL: https://issues.apache.org/jira/browse/SPARK-28854
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Hao Yang Ang
>Priority: Minor
>
> {code}
> scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => 
> xs.map(2*).zip(xs)).collect.foreach(println)
> ...
> java.util.NoSuchElementException: next on empty iterator
> {code}
>  
>  
> Workaround - implement zip with mapping to tuple:
> {code}
> scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, 
> x))).collect.foreach(println)
> (2,1)
> (4,2)
> (6,3)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28854) Zipping iterators in mapPartitions will fail

2019-08-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914091#comment-16914091
 ] 

Hyukjin Kwon commented on SPARK-28854:
--

It seems {{xs}} is an iterator and it is all consumed by the first {{xs.map}}. I 
think you can do it via, for instance:

{code}
sc.parallelize(Seq(1, 2, 3)).mapPartitions { xs =>
  val xsa = xs.toArray
  xsa.map(2 *).zip(xsa).toIterator
}.collect.foreach(println)
{code}
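A variant sketch of the same workaround using {{Iterator.duplicate}}, which avoids materializing the whole partition; the two views advance in lockstep here, so the internal buffering stays small:

{code}
// Sketch only: Iterator.duplicate yields two independent views over the
// single-pass partition iterator.
sc.parallelize(Seq(1, 2, 3)).mapPartitions { xs =>
  val (doubled, original) = xs.duplicate
  doubled.map(_ * 2).zip(original)
}.collect.foreach(println)
{code}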

> Zipping iterators in mapPartitions will fail
> 
>
> Key: SPARK-28854
> URL: https://issues.apache.org/jira/browse/SPARK-28854
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Hao Yang Ang
>Priority: Minor
>
> {code}
> scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => 
> xs.map(2*).zip(xs)).collect.foreach(println)
> ...
> java.util.NoSuchElementException: next on empty iterator
> {code}
>  
>  
> Workaround - implement zip with mapping to tuple:
> {code}
> scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, 
> x))).collect.foreach(println)
> (2,1)
> (4,2)
> (6,3)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28854) Zipping iterators in mapPartitions will fail

2019-08-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28854:
-
Description: 
{code}
scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => 
xs.map(2*).zip(xs)).collect.foreach(println)

...

java.util.NoSuchElementException: next on empty iterator
{code}

 

 

Workaround - implement zip with mapping to tuple:

{code}
scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, 
x))).collect.foreach(println)
(2,1)
(4,2)
(6,3)
{code}

 

  was:
scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => 
xs.map(2*).zip(xs)).collect.foreach(println)

warning: there was one feature warning; re-run with -feature for details

19/08/22 21:13:18 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)

java.util.NoSuchElementException: next on empty iterator

 

 

Workaround - implement zip with mapping to tuple:

scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, 
x))).collect.foreach(println)

(2,1)

(4,2)

(6,3)

 


> Zipping iterators in mapPartitions will fail
> 
>
> Key: SPARK-28854
> URL: https://issues.apache.org/jira/browse/SPARK-28854
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Hao Yang Ang
>Priority: Minor
>
> {code}
> scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => 
> xs.map(2*).zip(xs)).collect.foreach(println)
> ...
> java.util.NoSuchElementException: next on empty iterator
> {code}
>  
>  
> Workaround - implement zip with mapping to tuple:
> {code}
> scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, 
> x))).collect.foreach(println)
> (2,1)
> (4,2)
> (6,3)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-28854) Zipping iterators in mapPartitions will fail

2019-08-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28854:
-
Comment: was deleted

(was: Your {{xs.map(2*)}} produces:

{code}
scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => 
xs.map(2*)).collect.foreach(println)
2
4
6
{code}

So it cannot be zipped. The {{zip}} in your code is from the Scala library, not Spark.)

> Zipping iterators in mapPartitions will fail
> 
>
> Key: SPARK-28854
> URL: https://issues.apache.org/jira/browse/SPARK-28854
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Hao Yang Ang
>Priority: Minor
>
> scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => 
> xs.map(2*).zip(xs)).collect.foreach(println)
> warning: there was one feature warning; re-run with -feature for details
> 19/08/22 21:13:18 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
> java.util.NoSuchElementException: next on empty iterator
>  
>  
> Workaround - implement zip with mapping to tuple:
> scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, 
> x))).collect.foreach(println)
> (2,1)
> (4,2)
> (6,3)
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28854) Zipping iterators in mapPartitions will fail

2019-08-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914085#comment-16914085
 ] 

Hyukjin Kwon commented on SPARK-28854:
--

Your {{xs.map(2*)}} produces:

{code}
scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => 
xs.map(2*)).collect.foreach(println)
2
4
6
{code}

So it cannot be zipped. The {{zip}} in your code is from the Scala library, not Spark.

> Zipping iterators in mapPartitions will fail
> 
>
> Key: SPARK-28854
> URL: https://issues.apache.org/jira/browse/SPARK-28854
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Hao Yang Ang
>Priority: Minor
>
> scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => 
> xs.map(2*).zip(xs)).collect.foreach(println)
> warning: there was one feature warning; re-run with -feature for details
> 19/08/22 21:13:18 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
> java.util.NoSuchElementException: next on empty iterator
>  
>  
> Workaround - implement zip with mapping to tuple:
> scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, 
> x))).collect.foreach(println)
> (2,1)
> (4,2)
> (6,3)
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28857) Clean up the comments of PR template during merging

2019-08-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28857:


Assignee: Dongjoon Hyun

> Clean up the comments of PR template during merging
> ---
>
> Key: SPARK-28857
> URL: https://issues.apache.org/jira/browse/SPARK-28857
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28857) Clean up the comments of PR template during merging

2019-08-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28857.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25564
[https://github.com/apache/spark/pull/25564]

> Clean up the comments of PR template during merging
> ---
>
> Key: SPARK-28857
> URL: https://issues.apache.org/jira/browse/SPARK-28857
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28836) Improve canonicalize API

2019-08-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914059#comment-16914059
 ] 

Hyukjin Kwon commented on SPARK-28836:
--

Sure, that's fine but [~afroozeh] please clarify what issue it is in the JIRA 
description. What problem is there with {{canonicalize}} and what does this 
JIRA mean?

> Improve canonicalize API
> 
>
> Key: SPARK-28836
> URL: https://issues.apache.org/jira/browse/SPARK-28836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> This PR improves the `canonicalize` API by removing the method `def 
> canonicalize(attrs: AttributeSeq): PlanExpression[T]` in `PlanExpression` and 
> taking care of normalizing expressions in `QueryPlan`.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28860) Using ColumnStats of join key to get TableAccessCardinality when finding star joins in ReorderJoinRule

2019-08-23 Thread Lai Zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lai Zhou updated SPARK-28860:
-
Description: 
Now the star-schema detection uses TableAccessCardinality to reorder DimTables  
when there is a selectiveStarJoin . 

[StarSchemaDetection.scala#L341|https://github.com/apache/spark/blob/98e1a4cea44d7cb2f6d502c0202ad3cac2a1ad8d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/StarSchemaDetection.scala#L341]
{code:java}
if (isSelectiveStarJoin(dimTables, conditions)) {
  val reorderDimTables = dimTables.map { plan =>
    TableAccessCardinality(plan, getTableAccessCardinality(plan))
  }.sortBy(_.size).map {
    case TableAccessCardinality(p1, _) => p1
  }{code}
 

But the getTableAccessCardinality method doesn't consider the ColumnStats of the 
equi-join-key. I'm not sure if we should compute join cardinality for the 
dimTable based on its join key here.

[~ioana-delaney]

 

 

 

 

  was:
Now the star-schema detection uses TableAccessCardinality to reorder DimTables  
when there is a selectiveStarJoin . 

[StarSchemaDetection.scala#L341|https://github.com/apache/spark/blob/98e1a4cea44d7cb2f6d502c0202ad3cac2a1ad8d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/StarSchemaDetection.scala#L341]

 
{code:java}
if (isSelectiveStarJoin(dimTables, conditions)) { 
val reorderDimTables = dimTables.map { plan => TableAccessCardinality(plan, 
getTableAccessCardinality(plan)) }.sortBy(_.size).map { case 
TableAccessCardinality(p1, _) => p1 }{code}
 

 

But the getTableAccessCardinality method does't consider the ColumnStats of the 
equi-join-key. I'm not sure if we should compute Join cardinality for the 
dimTable based on it's

join key here.

[~ioana-delaney]

 

 

 

 


>  Using ColumnStats of join key to get TableAccessCardinality when finding 
> star joins in ReorderJoinRule
> ---
>
> Key: SPARK-28860
> URL: https://issues.apache.org/jira/browse/SPARK-28860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Lai Zhou
>Priority: Minor
>
> Now the star-schema detection uses TableAccessCardinality to reorder 
> DimTables  when there is a selectiveStarJoin . 
> [StarSchemaDetection.scala#L341|https://github.com/apache/spark/blob/98e1a4cea44d7cb2f6d502c0202ad3cac2a1ad8d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/StarSchemaDetection.scala#L341]
> {code:java}
> if (isSelectiveStarJoin(dimTables, conditions)) {
>   val reorderDimTables = dimTables.map { plan =>
>     TableAccessCardinality(plan, getTableAccessCardinality(plan))
>   }.sortBy(_.size).map {
>     case TableAccessCardinality(p1, _) => p1
>   }{code}
>  
> But the getTableAccessCardinality method doesn't consider the ColumnStats of 
> the equi-join-key. I'm not sure if we should compute join cardinality for the 
> dimTable based on its join key here.
> [~ioana-delaney]
>  
>  
>  
>  
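A rough, self-contained sketch of the suggestion (assumed types, not existing Spark code): when column statistics for the equi-join key are available, its distinct count could serve as the access cardinality used for ordering, with the row count as a fallback.

{code}
// Assumed toy types; the real change would plug into StarSchemaDetection.
case class DimStats(rowCount: BigInt, joinKeyDistinctCount: Option[BigInt])

// Prefer the join key's distinct count when column stats exist.
def accessCardinality(s: DimStats): BigInt =
  s.joinKeyDistinctCount.getOrElse(s.rowCount)

// A large dimension table with a very selective join key can then sort ahead
// of a smaller table whose key has no statistics.
val dims = Seq(DimStats(BigInt(1000000), Some(BigInt(50))), DimStats(BigInt(2000), None))
val ordered = dims.sortBy(accessCardinality)
{code}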



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28860) Using ColumnStats of join key to get TableAccessCardinality when finding star joins in ReorderJoinRule

2019-08-23 Thread Lai Zhou (Jira)
Lai Zhou created SPARK-28860:


 Summary:  Using ColumnStats of join key to get 
TableAccessCardinality when finding star joins in ReorderJoinRule
 Key: SPARK-28860
 URL: https://issues.apache.org/jira/browse/SPARK-28860
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3
Reporter: Lai Zhou


Now the star-schema detection uses TableAccessCardinality to reorder DimTables  
when there is a selectiveStarJoin . 

[StarSchemaDetection.scala#L341|https://github.com/apache/spark/blob/98e1a4cea44d7cb2f6d502c0202ad3cac2a1ad8d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/StarSchemaDetection.scala#L341]

 
{code:java}
if (isSelectiveStarJoin(dimTables, conditions)) { 
val reorderDimTables = dimTables.map { plan => TableAccessCardinality(plan, 
getTableAccessCardinality(plan)) }.sortBy(_.size).map { case 
TableAccessCardinality(p1, _) => p1 }{code}
 

 

But the getTableAccessCardinality method doesn't consider the ColumnStats of the 
equi-join-key. I'm not sure if we should compute join cardinality for the 
dimTable based on its join key here.

[~ioana-delaney]

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28859) Remove value check of MEMORY_OFFHEAP_SIZE in declaration section

2019-08-23 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-28859:
-
Affects Version/s: (was: 2.4.3)
   3.0.0

> Remove value check of MEMORY_OFFHEAP_SIZE in declaration section
> 
>
> Key: SPARK-28859
> URL: https://issues.apache.org/jira/browse/SPARK-28859
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yang Jie
>Priority: Minor
>
> Now MEMORY_OFFHEAP_SIZE has a default value of 0, but it should be greater than 0 
> when MEMORY_OFFHEAP_ENABLED is true; should we check this condition in code?
>  
> SPARK-28577 added this check before requesting memory resources from YARN.
>  
>  
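A minimal sketch of the check being discussed, written as an assumed standalone helper rather than part of the config declaration, since the size can only be validated against the enabled flag where both values are known:

{code}
// Assumed helper, not the actual Spark declaration. The strings are the real
// Spark option names behind MEMORY_OFFHEAP_ENABLED / MEMORY_OFFHEAP_SIZE.
def validateOffHeap(offHeapEnabled: Boolean, offHeapSizeInBytes: Long): Unit = {
  require(!offHeapEnabled || offHeapSizeInBytes > 0,
    "spark.memory.offHeap.size must be > 0 when spark.memory.offHeap.enabled is true")
}
{code}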



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28859) Remove value check of MEMORY_OFFHEAP_SIZE in declaration section

2019-08-23 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914016#comment-16914016
 ] 

Yang Jie commented on SPARK-28859:
--

cc [~tgraves] 

> Remove value check of MEMORY_OFFHEAP_SIZE in declaration section
> 
>
> Key: SPARK-28859
> URL: https://issues.apache.org/jira/browse/SPARK-28859
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Yang Jie
>Priority: Minor
>
> Now MEMORY_OFFHEAP_SIZE has a default value of 0, but it should be greater than 0 
> when MEMORY_OFFHEAP_ENABLED is true; should we check this condition in code?
>  
> SPARK-28577 added this check before requesting memory resources from YARN.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28642) Hide credentials in show create table

2019-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28642:
--
Fix Version/s: 2.4.4

> Hide credentials in show create table
> -
>
> Key: SPARK-28642
> URL: https://issues.apache.org/jira/browse/SPARK-28642
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.4, 3.0.0
>
>
> {code:sql}
> spark-sql> show create table mysql_federated_sample;
> CREATE TABLE `mysql_federated_sample` (`TBL_ID` BIGINT, `CREATE_TIME` INT, 
> `DB_ID` BIGINT, `LAST_ACCESS_TIME` INT, `OWNER` STRING, `RETENTION` INT, 
> `SD_ID` BIGINT, `TBL_NAME` STRING, `TBL_TYPE` STRING, `VIEW_EXPANDED_TEXT` 
> STRING, `VIEW_ORIGINAL_TEXT` STRING, `IS_REWRITE_ENABLED` BOOLEAN)
> USING org.apache.spark.sql.jdbc
> OPTIONS (
> `url` 'jdbc:mysql://localhost/hive?user=root=mypasswd',
> `driver` 'com.mysql.jdbc.Driver',
> `dbtable` 'TBLS'
> )
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28859) Remove value check of MEMORY_OFFHEAP_SIZE in declaration section

2019-08-23 Thread Yang Jie (Jira)
Yang Jie created SPARK-28859:


 Summary: Remove value check of MEMORY_OFFHEAP_SIZE in declaration 
section
 Key: SPARK-28859
 URL: https://issues.apache.org/jira/browse/SPARK-28859
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.3
Reporter: Yang Jie


Now MEMORY_OFFHEAP_SIZE has a default value of 0, but it should be greater than 0 
when MEMORY_OFFHEAP_ENABLED is true; should we check this condition in code?

SPARK-28577 added this check before requesting memory resources from YARN.

 

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28858) add tree-based transformation in the py side

2019-08-23 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-28858:


 Summary: add tree-based transformation in the py side
 Key: SPARK-28858
 URL: https://issues.apache.org/jira/browse/SPARK-28858
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


Expose the newly added tree-based transformation on the Python side.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28836) Improve canonicalize API

2019-08-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913973#comment-16913973
 ] 

Dongjoon Hyun commented on SPARK-28836:
---

This issue content is switched with SPARK-28835.

> Improve canonicalize API
> 
>
> Key: SPARK-28836
> URL: https://issues.apache.org/jira/browse/SPARK-28836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> This PR improves the `canonicalize` API by removing the method `def 
> canonicalize(attrs: AttributeSeq): PlanExpression[T]` in `PlanExpression` and 
> taking care of normalizing expressions in `QueryPlan`.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28835) Introduce TPCDSSchema

2019-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28835.
---
Fix Version/s: 3.0.0
 Assignee: Ali Afroozeh
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/25535

> Introduce TPCDSSchema
> -
>
> Key: SPARK-28835
> URL: https://issues.apache.org/jira/browse/SPARK-28835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Assignee: Ali Afroozeh
>Priority: Minor
> Fix For: 3.0.0
>
>
> This PR extracts the schema information of TPCDS tables into a separate class 
> called `TPCDSSchema` which can be reused for other testing purposes



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28836) Improve canonicalize API

2019-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28836:
--
Description: This PR improves the `canonicalize` API by removing the method 
`def canonicalize(attrs: AttributeSeq): PlanExpression[T]` in `PlanExpression` 
and taking care of normalizing expressions in `QueryPlan`.  (was: This PR 
extracts the schema information of TPCDS tables into a separate class called 
`TPCDSSchema` which can be reused for other testing purposes)

> Improve canonicalize API
> 
>
> Key: SPARK-28836
> URL: https://issues.apache.org/jira/browse/SPARK-28836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> This PR improves the `canonicalize` API by removing the method `def 
> canonicalize(attrs: AttributeSeq): PlanExpression[T]` in `PlanExpression` and 
> taking care of normalizing expressions in `QueryPlan`.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28835) Introduce TPCDSSchema

2019-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28835:
--
Description: This PR extracts the schema information of TPCDS tables into a 
separate class called `TPCDSSchema` which can be reused for other testing 
purposes  (was: This PR improves the `canonicalize` API by removing the method 
`def canonicalize(attrs: AttributeSeq): PlanExpression[T]` in `PlanExpression` 
and taking care of normalizing expressions in `QueryPlan`.)

> Introduce TPCDSSchema
> -
>
> Key: SPARK-28835
> URL: https://issues.apache.org/jira/browse/SPARK-28835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> This PR extracts the schema information of TPCDS tables into a separate class 
> called `TPCDSSchema` which can be reused for other testing purposes



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28836) Improve canonicalize API

2019-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28836:
--
Summary: Improve canonicalize API  (was: Introduce TPCDSSchema)

> Improve canonicalize API
> 
>
> Key: SPARK-28836
> URL: https://issues.apache.org/jira/browse/SPARK-28836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> This PR extracts the schema information of TPCDS tables into a separate class 
> called `TPCDSSchema` which can be reused for other testing purposes



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28835) Introduce TPCDSSchema

2019-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28835:
--
Summary: Introduce TPCDSSchema  (was: Improve canonicalize API)

> Introduce TPCDSSchema
> -
>
> Key: SPARK-28835
> URL: https://issues.apache.org/jira/browse/SPARK-28835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> This PR improves the `canonicalize` API by removing the method `def 
> canonicalize(attrs: AttributeSeq): PlanExpression[T]` in `PlanExpression` and 
> taking care of normalizing expressions in `QueryPlan`.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-28836) Introduce TPCDSSchema

2019-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-28836:
---

> Introduce TPCDSSchema
> -
>
> Key: SPARK-28836
> URL: https://issues.apache.org/jira/browse/SPARK-28836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> This PR extracts the schema information of TPCDS tables into a separate class 
> called `TPCDSSchema` which can be reused for other testing purposes



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28836) Introduce TPCDSSchema

2019-08-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913971#comment-16913971
 ] 

Dongjoon Hyun commented on SPARK-28836:
---

Oops. Sorry, [~hyukjin.kwon]. I merged this because the PR was opened with a wrong 
Jira id.

- https://github.com/apache/spark/pull/25535

> Introduce TPCDSSchema
> -
>
> Key: SPARK-28836
> URL: https://issues.apache.org/jira/browse/SPARK-28836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> This PR extracts the schema information of TPCDS tables into a separate class 
> called `TPCDSSchema` which can be reused for other testing purposes



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28319) DataSourceV2: Support SHOW TABLES

2019-08-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28319.
-
Fix Version/s: 3.0.0
 Assignee: Terry Kim
   Resolution: Fixed

> DataSourceV2: Support SHOW TABLES
> -
>
> Key: SPARK-28319
> URL: https://issues.apache.org/jira/browse/SPARK-28319
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> SHOW TABLES needs to support v2 catalogs.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files

2019-08-23 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-28025:


Assignee: Jungtaek Lim

> HDFSBackedStateStoreProvider should not leak .crc files 
> 
>
> Key: SPARK-28025
> URL: https://issues.apache.org/jira/browse/SPARK-28025
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3
> Kubernetes 1.11(?) (OpenShift)
> StateStore storage on a mounted PVC. Viewed as a local filesystem by the 
> `FileContextBasedCheckpointFileManager` : 
> {noformat}
> scala> glusterfm.isLocal
> res17: Boolean = true{noformat}
>Reporter: Gerard Maas
>Assignee: Jungtaek Lim
>Priority: Major
>
> The HDFSBackedStateStoreProvider when using the default CheckpointFileManager 
> is leaving '.crc' files behind. There's a .crc file created for each 
> `atomicFile` operation of the CheckpointFileManager.
> Over time, the number of files becomes very large. It makes the state store 
> file system constantly increase in size and, in our case, deteriorates the 
> file system performance.
> Here's a sample of one of our spark storage volumes after 2 days of execution 
> (4 stateful streaming jobs, each on a different sub-dir):
>  # 
> {noformat}
> Total files in PVC (used for checkpoints and state store)
> $find . | wc -l
> 431796
> # .crc files
> $find . -name "*.crc" | wc -l
> 418053{noformat}
> With each .crc file taking one storage block, the used storage runs into the 
> GBs of data.
> These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, 
> shows serious performance deterioration with this large number of files:
> {noformat}
> DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files

2019-08-23 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-28025.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

> HDFSBackedStateStoreProvider should not leak .crc files 
> 
>
> Key: SPARK-28025
> URL: https://issues.apache.org/jira/browse/SPARK-28025
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3
> Kubernetes 1.11(?) (OpenShift)
> StateStore storage on a mounted PVC. Viewed as a local filesystem by the 
> `FileContextBasedCheckpointFileManager` : 
> {noformat}
> scala> glusterfm.isLocal
> res17: Boolean = true{noformat}
>Reporter: Gerard Maas
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> The HDFSBackedStateStoreProvider when using the default CheckpointFileManager 
> is leaving '.crc' files behind. There's a .crc file created for each 
> `atomicFile` operation of the CheckpointFileManager.
> Over time, the number of files becomes very large. It makes the state store 
> file system constantly increase in size and, in our case, deteriorates the 
> file system performance.
> Here's a sample of one of our spark storage volumes after 2 days of execution 
> (4 stateful streaming jobs, each on a different sub-dir):
>  # 
> {noformat}
> Total files in PVC (used for checkpoints and state store)
> $find . | wc -l
> 431796
> # .crc files
> $find . -name "*.crc" | wc -l
> 418053{noformat}
> With each .crc file taking one storage block, the used storage runs into the 
> GBs of data.
> These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, 
> shows serious performance deterioration with this large number of files:
> {noformat}
> DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2019-08-23 Thread hemanth meka (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913965#comment-16913965
 ] 

hemanth meka commented on SPARK-23519:
--

I have a fix for this. checkColumnNameDuplication is checking the analyzed 
schema (id, id) whereas it should be checking the aliased schema (int1, int2). I got 
it to work. I will run tests and submit a PR.
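A toy illustration of that point in plain Scala (not the Spark internals): the duplicate check trips on the analyzed output names even though the user-supplied aliases are unique.

{code}
// Names taken from the issue description; the helper is an assumed stand-in
// for checkColumnNameDuplication.
val analyzedOutput = Seq("col1", "col1")   // select col1, col1 from atable
val userAliases    = Seq("int1", "int2")   // create view ... (int1, int2)

def hasDuplicates(names: Seq[String]): Boolean =
  names.map(_.toLowerCase).distinct.size != names.size

assert(hasDuplicates(analyzedOutput))   // what the failing assertion currently sees
assert(!hasDuplicates(userAliases))     // what it should be checking instead
{code}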

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Major
>  Labels: bulk-closed
> Attachments: image-2018-05-10-10-48-57-259.png
>
>
> 1- create and populate a hive table. I did this in a hive cli session [not 
> that this matters].
> create table  atable (col1 int) ;
> insert  into atable values (10 ) , (100)  ;
> 2. create a view from the table.  
> [These actions were performed from a spark shell ]
> spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
> from atable ")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org