[jira] [Resolved] (SPARK-29923) Set `io.netty.tryReflectionSetAccessible` for Arrow on JDK9+
[ https://issues.apache.org/jira/browse/SPARK-29923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29923. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26552 [https://github.com/apache/spark/pull/26552] > Set `io.netty.tryReflectionSetAccessible` for Arrow on JDK9+ > > > Key: SPARK-29923 > URL: https://issues.apache.org/jira/browse/SPARK-29923 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29923) Set `io.netty.tryReflectionSetAccessible` for Arrow on JDK9+
[ https://issues.apache.org/jira/browse/SPARK-29923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29923: - Assignee: Dongjoon Hyun > Set `io.netty.tryReflectionSetAccessible` for Arrow on JDK9+ > > > Key: SPARK-29923 > URL: https://issues.apache.org/jira/browse/SPARK-29923 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26362) Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark contexts
[ https://issues.apache.org/jira/browse/SPARK-26362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26362: -- Labels: release-notes (was: releasenotes) > Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark > contexts > --- > > Key: SPARK-26362 > URL: https://issues.apache.org/jira/browse/SPARK-26362 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > Multiple Spark contexts are discouraged, and there has been a warning against > them for 4 years (see SPARK-4180). > They can cause arbitrary and mysterious error cases. (Honestly, I didn't even > know Spark allowed it.) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
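As a hedged illustration of what this removal means in practice (a minimal sketch assuming a plain local deployment; the exception type is an assumption, not wording from the ticket), creating a second active SparkContext is now expected to fail outright rather than be tolerated:
{code}
import org.apache.spark.{SparkConf, SparkContext, SparkException}

val conf = new SparkConf().setMaster("local[2]").setAppName("first")
val sc1 = new SparkContext(conf)

// With spark.driver.allowMultipleContexts removed, a second active context
// is rejected with an exception instead of merely producing a warning.
try {
  new SparkContext(conf.clone.setAppName("second"))
} catch {
  case e: SparkException => println(s"Second context rejected: ${e.getMessage}")
}

sc1.stop()
{code}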
[jira] [Updated] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26651: -- Labels: release-notes (was: ReleaseNote) > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: release-notes > > Spark 2.4 and previous versions use a hybrid calendar (Julian + Gregorian) in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark to the Proleptic Gregorian calendar, and to use the java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purposes of switching > to the Proleptic Gregorian calendar is to conform to the SQL standard, which assumes > such a calendar. > *Release note:* > Spark 3.0 has switched to the Proleptic Gregorian calendar for parsing, > formatting, and converting dates and timestamps, as well as for extracting > sub-components such as years and days. It uses Java 8 API classes from the > java.time packages that are based on [ISO chronology > |https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. > Previous versions of Spark performed those operations using [the hybrid > calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] > (Julian + Gregorian). The changes might impact the results for dates and > timestamps before October 15, 1582 (Gregorian). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
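A quick, hedged way to observe the behavior described in the release note (a sketch only; the literal values, and whether a given date shifts at all, depend on the Spark version and session time zone) is to run the same query on 2.4 and on 3.0 and compare the output for dates before October 15, 1582:
{code}
// Dates in the Julian-to-Gregorian transition range are the most likely to
// resolve differently under the hybrid calendar (Spark 2.4) and the
// Proleptic Gregorian calendar (Spark 3.0).
val df = spark.sql(
  "SELECT DATE '1582-10-05' AS d, CAST(TIMESTAMP '1000-01-01 00:00:00' AS BIGINT) AS epoch_seconds")
df.show(false)
{code}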
[jira] [Commented] (SPARK-29890) Unable to fill na with 0 with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975609#comment-16975609 ] Terry Kim commented on SPARK-29890: --- Sure. I will take a look. > Unable to fill na with 0 with duplicate columns > --- > > Key: SPARK-29890 > URL: https://issues.apache.org/jira/browse/SPARK-29890 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.3, 2.4.3 >Reporter: sandeshyapuram >Priority: Major > > Trying to fill out na values with 0. > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > val parent = > spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc") > val c1 = parent.filter(lit(true)) > val c2 = parent.filter(lit(true)) > c1.join(c2, Seq("nums"), "left") > .na.fill(0).show{noformat} > {noformat} > 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: > error looking up the name of group 820818257: No such file or directory > org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could > be: abc, abc.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1246) > at > org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134) > ... 54 elided{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
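A possible workaround for the self-join case reproduced above, while the underlying issue remains open, is to disambiguate the duplicated column before calling na.fill; this is only a sketch and the renamed column name is made up for illustration:
{code}
// Rename the conflicting column on one side of the self-join so that
// na.fill does not see two columns both named "abc".
val c2renamed = c2.withColumnRenamed("abc", "abc_right")  // hypothetical name

c1.join(c2renamed, Seq("nums"), "left")
  .na.fill(0)
  .show()
{code}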
[jira] [Commented] (SPARK-29890) Unable to fill na with 0 with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975606#comment-16975606 ] Wenchen Fan commented on SPARK-29890: - seems like another self-join bug. [~imback82] can you take a look? > Unable to fill na with 0 with duplicate columns > --- > > Key: SPARK-29890 > URL: https://issues.apache.org/jira/browse/SPARK-29890 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.3, 2.4.3 >Reporter: sandeshyapuram >Priority: Major > > Trying to fill out na values with 0. > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > val parent = > spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc") > val c1 = parent.filter(lit(true)) > val c2 = parent.filter(lit(true)) > c1.join(c2, Seq("nums"), "left") > .na.fill(0).show{noformat} > {noformat} > 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: > error looking up the name of group 820818257: No such file or directory > org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could > be: abc, abc.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1246) > at > org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134) > ... 54 elided{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29867) add __repr__ in Python ML Models
[ https://issues.apache.org/jira/browse/SPARK-29867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29867. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26489 [https://github.com/apache/spark/pull/26489] > add __repr__ in Python ML Models > > > Key: SPARK-29867 > URL: https://issues.apache.org/jira/browse/SPARK-29867 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > Some Python ML Models have a __repr__ method, others don't. In the > doctest, when calling Model.setXXX, some of the Models print out the > xxxModel... correctly, while others can't because they lack the __repr__ > method. This Jira addresses this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29867) add __repr__ in Python ML Models
[ https://issues.apache.org/jira/browse/SPARK-29867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29867: - Assignee: Huaxin Gao > add __repr__ in Python ML Models > > > Key: SPARK-29867 > URL: https://issues.apache.org/jira/browse/SPARK-29867 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > Some Python ML Models have a __repr__ method, others don't. In the > doctest, when calling Model.setXXX, some of the Models print out the > xxxModel... correctly, while others can't because they lack the __repr__ > method. This Jira addresses this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29890) Unable to fill na with 0 with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandeshyapuram updated SPARK-29890: --- Affects Version/s: 2.4.3 > Unable to fill na with 0 with duplicate columns > --- > > Key: SPARK-29890 > URL: https://issues.apache.org/jira/browse/SPARK-29890 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.3, 2.4.3 >Reporter: sandeshyapuram >Priority: Major > > Trying to fill out na values with 0. > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > val parent = > spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc") > val c1 = parent.filter(lit(true)) > val c2 = parent.filter(lit(true)) > c1.join(c2, Seq("nums"), "left") > .na.fill(0).show{noformat} > {noformat} > 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: > error looking up the name of group 820818257: No such file or directory > org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could > be: abc, abc.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1246) > at > org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134) > ... 54 elided{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29834) DESC DATABASE should look up catalog like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29834: - Assignee: Hu Fuwang > DESC DATABASE should look up catalog like v2 commands > - > > Key: SPARK-29834 > URL: https://issues.apache.org/jira/browse/SPARK-29834 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Assignee: Hu Fuwang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29834) DESC DATABASE should look up catalog like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29834. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26513 [https://github.com/apache/spark/pull/26513] > DESC DATABASE should look up catalog like v2 commands > - > > Key: SPARK-29834 > URL: https://issues.apache.org/jira/browse/SPARK-29834 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Assignee: Hu Fuwang >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29127) Add a Python, Pandas and PyArrow versions in clue at SQL query tests
[ https://issues.apache.org/jira/browse/SPARK-29127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29127. --- Fix Version/s: 3.0.0 Assignee: Hyukjin Kwon Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/26538 > Add a Python, Pandas and PyArrow versions in clue at SQL query tests > > > Key: SPARK-29127 > URL: https://issues.apache.org/jira/browse/SPARK-29127 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > Once Python test cases is failed in integrated UDF test cases, it's difficult > to find out the version informations. See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/113828/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql___Scalar_Pandas_UDF/ > as an example > It might be better to add the version information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29127) Add a Python, Pandas and PyArrow versions in clue at SQL query tests
[ https://issues.apache.org/jira/browse/SPARK-29127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29127: -- Reporter: Hyukjin Kwon (was: Burak Yavuz) > Add a Python, Pandas and PyArrow versions in clue at SQL query tests > > > Key: SPARK-29127 > URL: https://issues.apache.org/jira/browse/SPARK-29127 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Once Python test cases is failed in integrated UDF test cases, it's difficult > to find out the version informations. See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/113828/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql___Scalar_Pandas_UDF/ > as an example > It might be better to add the version information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29908) Add a Python, Pandas and PyArrow versions in clue at SQL query tests
[ https://issues.apache.org/jira/browse/SPARK-29908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29908: -- Priority: Blocker (was: Major) > Add a Python, Pandas and PyArrow versions in clue at SQL query tests > > > Key: SPARK-29908 > URL: https://issues.apache.org/jira/browse/SPARK-29908 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > Once Python test cases is failed in integrated UDF test cases, it's difficult > to find out the version informations. See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/113828/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql___Scalar_Pandas_UDF/ > as an example > It might be better to add the version information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29127) Add a Python, Pandas and PyArrow versions in clue at SQL query tests
[ https://issues.apache.org/jira/browse/SPARK-29127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29127: -- Summary: Add a Python, Pandas and PyArrow versions in clue at SQL query tests (was: Support partitioning for DataSource V2 tables in DataFrameWriter.save) > Add a Python, Pandas and PyArrow versions in clue at SQL query tests > > > Key: SPARK-29127 > URL: https://issues.apache.org/jira/browse/SPARK-29127 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Major > > Currently, any data source that that upgrades to DataSource V2 loses the > partition transform information when using DataFrameWriter.save. The main > reason is the lack of an API for "creating" a table with partitioning and > schema information for V2 tables without a catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29127) Add a Python, Pandas and PyArrow versions in clue at SQL query tests
[ https://issues.apache.org/jira/browse/SPARK-29127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29127: -- Description: Once Python test cases is failed in integrated UDF test cases, it's difficult to find out the version informations. See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/113828/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql___Scalar_Pandas_UDF/ as an example It might be better to add the version information. was:Currently, any data source that that upgrades to DataSource V2 loses the partition transform information when using DataFrameWriter.save. The main reason is the lack of an API for "creating" a table with partitioning and schema information for V2 tables without a catalog. > Add a Python, Pandas and PyArrow versions in clue at SQL query tests > > > Key: SPARK-29127 > URL: https://issues.apache.org/jira/browse/SPARK-29127 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Major > > Once Python test cases is failed in integrated UDF test cases, it's difficult > to find out the version informations. See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/113828/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql___Scalar_Pandas_UDF/ > as an example > It might be better to add the version information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29908) Support partitioning for DataSource V2 tables in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29908: -- Summary: Support partitioning for DataSource V2 tables in DataFrameWriter.save (was: Add a Python, Pandas and PyArrow versions in clue at SQL query tests) > Support partitioning for DataSource V2 tables in DataFrameWriter.save > - > > Key: SPARK-29908 > URL: https://issues.apache.org/jira/browse/SPARK-29908 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > Once Python test cases is failed in integrated UDF test cases, it's difficult > to find out the version informations. See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/113828/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql___Scalar_Pandas_UDF/ > as an example > It might be better to add the version information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29908) Add a Python, Pandas and PyArrow versions in clue at SQL query tests
[ https://issues.apache.org/jira/browse/SPARK-29908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29908: -- Component/s: (was: PySpark) > Add a Python, Pandas and PyArrow versions in clue at SQL query tests > > > Key: SPARK-29908 > URL: https://issues.apache.org/jira/browse/SPARK-29908 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > Once Python test cases is failed in integrated UDF test cases, it's difficult > to find out the version informations. See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/113828/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql___Scalar_Pandas_UDF/ > as an example > It might be better to add the version information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29127) Support partitioning for DataSource V2 tables in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29127: -- Priority: Major (was: Blocker) > Support partitioning for DataSource V2 tables in DataFrameWriter.save > - > > Key: SPARK-29127 > URL: https://issues.apache.org/jira/browse/SPARK-29127 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Major > > Currently, any data source that that upgrades to DataSource V2 loses the > partition transform information when using DataFrameWriter.save. The main > reason is the lack of an API for "creating" a table with partitioning and > schema information for V2 tables without a catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29127) Support partitioning for DataSource V2 tables in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29127: -- Component/s: PySpark > Support partitioning for DataSource V2 tables in DataFrameWriter.save > - > > Key: SPARK-29127 > URL: https://issues.apache.org/jira/browse/SPARK-29127 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Major > > Currently, any data source that that upgrades to DataSource V2 loses the > partition transform information when using DataFrameWriter.save. The main > reason is the lack of an API for "creating" a table with partitioning and > schema information for V2 tables without a catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29908) Support partitioning for DataSource V2 tables in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29908: -- Description: Currently, any data source that that upgrades to DataSource V2 loses the partition transform information when using DataFrameWriter.save. The main reason is the lack of an API for "creating" a table with partitioning and schema information for V2 tables without a catalog. (was: Once Python test cases is failed in integrated UDF test cases, it's difficult to find out the version informations. See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/113828/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql___Scalar_Pandas_UDF/ as an example It might be better to add the version information.) > Support partitioning for DataSource V2 tables in DataFrameWriter.save > - > > Key: SPARK-29908 > URL: https://issues.apache.org/jira/browse/SPARK-29908 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > Currently, any data source that that upgrades to DataSource V2 loses the > partition transform information when using DataFrameWriter.save. The main > reason is the lack of an API for "creating" a table with partitioning and > schema information for V2 tables without a catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29908) Support partitioning for DataSource V2 tables in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29908: -- Reporter: Burak Yavuz (was: Hyukjin Kwon) > Support partitioning for DataSource V2 tables in DataFrameWriter.save > - > > Key: SPARK-29908 > URL: https://issues.apache.org/jira/browse/SPARK-29908 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Blocker > > Currently, any data source that that upgrades to DataSource V2 loses the > partition transform information when using DataFrameWriter.save. The main > reason is the lack of an API for "creating" a table with partitioning and > schema information for V2 tables without a catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29127) Support partitioning for DataSource V2 tables in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975539#comment-16975539 ] Dongjoon Hyun commented on SPARK-29127: --- Hi, [~brkyvz] and [~hyukjin.kwon]. Sorry, but I'll switch the both JIRA issue IDs due to the following. - https://github.com/apache/spark/commit/7720781695d47fe0375f6e1150f6981b886686bd > Support partitioning for DataSource V2 tables in DataFrameWriter.save > - > > Key: SPARK-29127 > URL: https://issues.apache.org/jira/browse/SPARK-29127 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Blocker > > Currently, any data source that that upgrades to DataSource V2 loses the > partition transform information when using DataFrameWriter.save. The main > reason is the lack of an API for "creating" a table with partitioning and > schema information for V2 tables without a catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29924) Document Arrow requirement in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-29924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975537#comment-16975537 ] Dongjoon Hyun commented on SPARK-29924: --- cc [~bryanc] > Document Arrow requirement in JDK9+ > --- > > Key: SPARK-29924 > URL: https://issues.apache.org/jira/browse/SPARK-29924 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > At least, we need to mention that `io.netty.tryReflectionSetAccessible=true` is > required for the Arrow runtime on JDK9+ environments -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29924) Document Arrow requirement in JDK9+
Dongjoon Hyun created SPARK-29924: - Summary: Document Arrow requirement in JDK9+ Key: SPARK-29924 URL: https://issues.apache.org/jira/browse/SPARK-29924 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 3.0.0 Reporter: Dongjoon Hyun At least, we need to mention that `io.netty.tryReflectionSetAccessible=true` is required for the Arrow runtime on JDK9+ environments -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
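As a sketch of what such documentation might show (an assumption about the eventual wording, not content from the ticket; the property names are the standard Spark JVM-option settings):
{code}
// The flag must reach the JVMs as a system property. Executor JVMs can pick
// it up from Spark conf at session build time; the driver JVM must receive it
// at launch (for example via spark-defaults.conf or --driver-java-options),
// since the driver is already running by the time builder config is applied.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("arrow-on-jdk11")  // hypothetical application name
  .config("spark.executor.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true")
  .getOrCreate()
{code}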
[jira] [Updated] (SPARK-29923) Set `io.netty.tryReflectionSetAccessible` for Arrow on JDK9+
[ https://issues.apache.org/jira/browse/SPARK-29923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29923: -- Parent: SPARK-29194 Issue Type: Sub-task (was: Improvement) > Set `io.netty.tryReflectionSetAccessible` for Arrow on JDK9+ > > > Key: SPARK-29923 > URL: https://issues.apache.org/jira/browse/SPARK-29923 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29923) Set `io.netty.tryReflectionSetAccessible` for Arrow on JDK9+
Dongjoon Hyun created SPARK-29923: - Summary: Set `io.netty.tryReflectionSetAccessible` for Arrow on JDK9+ Key: SPARK-29923 URL: https://issues.apache.org/jira/browse/SPARK-29923 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29833) Add FileNotFoundException check for spark.yarn.jars
[ https://issues.apache.org/jira/browse/SPARK-29833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-29833. Fix Version/s: 3.0.0 Assignee: ulysses you Resolution: Fixed > Add FileNotFoundException check for spark.yarn.jars > > > Key: SPARK-29833 > URL: https://issues.apache.org/jira/browse/SPARK-29833 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.4 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.0.0 > > > When `spark.yarn.jars=/xxx/xxx` is set to a path with no scheme, Spark > will throw a NullPointerException. > The reason is that HDFS returns null from pathFs.globStatus(path) if the path does not exist, > and Spark just uses `pathFs.globStatus(path).filter(_.isFile())` without checking > it. > The related Globber code is here > {noformat} > /* > * When the input pattern "looks" like just a simple filename, and we > * can't find it, we return null rather than an empty array. > * This is a special case which the shell relies on. > * > * To be more precise: if there were no results, AND there were no > * groupings (aka brackets), and no wildcards in the input (aka stars), > * we return null. > */ > if ((!sawWildcard) && results.isEmpty() && > (flattenedPatterns.size() <= 1)) { > return null; > } > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
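A minimal sketch of the kind of guard the fix implies, assuming a Hadoop FileSystem handle named pathFs (the helper name is made up for illustration):
{code}
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// globStatus may return null (not an empty array) when a simple,
// wildcard-free pattern matches nothing, so wrap the result in Option and
// fail with a clear FileNotFoundException instead of an NPE.
def listJarStatuses(pathFs: FileSystem, path: Path): Seq[FileStatus] =  // hypothetical helper
  Option(pathFs.globStatus(path)) match {
    case Some(statuses) => statuses.filter(_.isFile()).toSeq
    case None => throw new java.io.FileNotFoundException(s"$path does not exist")
  }
{code}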
[jira] [Resolved] (SPARK-29904) Parse timestamps in microsecond precision by JSON/CSV datasources
[ https://issues.apache.org/jira/browse/SPARK-29904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29904. --- Fix Version/s: 2.4.5 Assignee: Maxim Gekk Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/26507 > Parse timestamps in microsecond precision by JSON/CSV datasources > - > > Key: SPARK-29904 > URL: https://issues.apache.org/jira/browse/SPARK-29904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.4.5 > > > Currently, Spark can parse strings with timestamps from JSON/CSV in > millisecond precision. Internally, timestamps have microsecond precision. The > ticket aims to modify parsing logic in Spark 2.4 to support the microsecond > precision. Porting of DateFormatter/TimestampFormatter from Spark 3.0-preview > is risky, so, need to find another lighter solution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
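An illustrative sketch of the case the ticket targets (the JSON record, schema, and format pattern are assumptions; whether the trailing microseconds survive parsing depends on the Spark 2.4 patch level):
{code}
import org.apache.spark.sql.types.{StructField, StructType, TimestampType}
import spark.implicits._

val schema = StructType(Seq(StructField("ts", TimestampType)))

// A JSON record whose timestamp carries microsecond precision. Before the
// fix, the fractional part could be truncated or mis-parsed at millisecond
// precision even though Spark stores timestamps in microseconds internally.
val ds = Seq("""{"ts": "2019-11-14 12:34:56.123456"}""").toDS()

val df = spark.read
  .schema(schema)
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS")  // assumed pattern
  .json(ds)

df.show(false)
{code}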
[jira] [Updated] (SPARK-29904) Parse timestamps in microsecond precision by JSON/CSV datasources
[ https://issues.apache.org/jira/browse/SPARK-29904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29904: -- Issue Type: Bug (was: Improvement) > Parse timestamps in microsecond precision by JSON/CSV datasources > - > > Key: SPARK-29904 > URL: https://issues.apache.org/jira/browse/SPARK-29904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Maxim Gekk >Priority: Major > > Currently, Spark can parse strings with timestamps from JSON/CSV in > millisecond precision. Internally, timestamps have microsecond precision. The > ticket aims to modify parsing logic in Spark 2.4 to support the microsecond > precision. Porting of DateFormatter/TimestampFormatter from Spark 3.0-preview > is risky, so, need to find another lighter solution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29829) SHOW TABLE EXTENDED should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29829: -- Fix Version/s: (was: 3.1.0) 3.0.0 > SHOW TABLE EXTENDED should look up catalog/table like v2 commands > - > > Key: SPARK-29829 > URL: https://issues.apache.org/jira/browse/SPARK-29829 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Assignee: Pablo Langa Blanco >Priority: Major > Fix For: 3.0.0 > > > SHOW TABLE EXTENDED should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29829) SHOW TABLE EXTENDED should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29829. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 26540 [https://github.com/apache/spark/pull/26540] > SHOW TABLE EXTENDED should look up catalog/table like v2 commands > - > > Key: SPARK-29829 > URL: https://issues.apache.org/jira/browse/SPARK-29829 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Assignee: Pablo Langa Blanco >Priority: Major > Fix For: 3.1.0 > > > SHOW TABLE EXTENDED should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29829) SHOW TABLE EXTENDED should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29829: - Assignee: Pablo Langa Blanco > SHOW TABLE EXTENDED should look up catalog/table like v2 commands > - > > Key: SPARK-29829 > URL: https://issues.apache.org/jira/browse/SPARK-29829 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Assignee: Pablo Langa Blanco >Priority: Major > > SHOW TABLE EXTENDED should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29794) Column level compression
[ https://issues.apache.org/jira/browse/SPARK-29794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Vyas updated SPARK-29794: - Affects Version/s: 3.0.0 > Column level compression > > > Key: SPARK-29794 > URL: https://issues.apache.org/jira/browse/SPARK-29794 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4, 3.0.0 >Reporter: Anirudh Vyas >Priority: Minor > > Currently in spark we do not have capability to specify different > compressions for different columns, however this capability exists in parquet > format for example. > > Not sure if this has been opened before (I am sure it might have been but I > cannot find it), hence opening a lane for potential improvement. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29334) Supported vector operators in scala should have parity with pySpark
[ https://issues.apache.org/jira/browse/SPARK-29334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-29334: - Shepherd: (was: Sean R. Owen) > Supported vector operators in scala should have parity with pySpark > > > Key: SPARK-29334 > URL: https://issues.apache.org/jira/browse/SPARK-29334 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 3.0.0 >Reporter: Patrick Pisciuneri >Priority: Minor > > pySpark supports various overloaded operators for the DenseVector type that > the scala class does not support. > - ML: > https://github.com/apache/spark/blob/master/python/pyspark/ml/linalg/__init__.py#L441-L462 > - MLLIB: > https://github.com/apache/spark/blob/master/python/pyspark/mllib/linalg/__init__.py#L485-L506 > We should be able to leverage the BLAS wrappers to implement these methods on > the scala side. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29911) Cache table may memory leak when session closed
[ https://issues.apache.org/jira/browse/SPARK-29911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975307#comment-16975307 ] Dongjoon Hyun commented on SPARK-29911: --- Hi, [~cltlfcjin]. Since is reported as a memory leakage issue, could you check the older Spark version and update the `Affected Versions` of this JIRA issue please? > Cache table may memory leak when session closed > --- > > Key: SPARK-29911 > URL: https://issues.apache.org/jira/browse/SPARK-29911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > Attachments: Screen Shot 2019-11-15 at 2.03.49 PM.png > > > How to reproduce: > 1. create a local temporary view v1 > 2. cache it in memory > 3. close session without drop v1. > The application will hold the memory forever. In a long running thrift server > scenario. It's worse. > {code} > 0: jdbc:hive2://localhost:1> CACHE TABLE testCacheTable AS SELECT 1; > CACHE TABLE testCacheTable AS SELECT 1; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (1.498 seconds) > 0: jdbc:hive2://localhost:1> !close > !close > Closing: 0: jdbc:hive2://localhost:1 > 0: jdbc:hive2://localhost:1 (closed)> !connect > 'jdbc:hive2://localhost:1' > !connect 'jdbc:hive2://localhost:1' > Connecting to jdbc:hive2://localhost:1 > Enter username for jdbc:hive2://localhost:1: > lajin > Enter password for jdbc:hive2://localhost:1: > *** > Connected to: Spark SQL (version 3.0.0-SNAPSHOT) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 1: jdbc:hive2://localhost:1> select * from testCacheTable; > select * from testCacheTable; > Error: Error running query: org.apache.spark.sql.AnalysisException: Table or > view not found: testCacheTable; line 1 pos 14; > 'Project [*] > +- 'UnresolvedRelation [testCacheTable] (state=,code=0) > {code} > !Screen Shot 2019-11-15 at 2.03.49 PM.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
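Until the root cause is fixed, a hedged mitigation sketch (using the names from the reproduction above) is to release the cache explicitly before the session goes away:
{code}
// Explicitly free the cached data and drop the local temp view before
// closing the session, so the cached blocks are not kept alive.
spark.sql("UNCACHE TABLE IF EXISTS testCacheTable")
spark.catalog.dropTempView("testCacheTable")
{code}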
[jira] [Commented] (SPARK-29918) RecordBinaryComparator should check endianness when compared by long
[ https://issues.apache.org/jira/browse/SPARK-29918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975305#comment-16975305 ] Dongjoon Hyun commented on SPARK-29918: --- Hi, [~EdisonWang]. What about the older Spark versions? > RecordBinaryComparator should check endianness when compared by long > > > Key: SPARK-29918 > URL: https://issues.apache.org/jira/browse/SPARK-29918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: EdisonWang >Priority: Minor > Labels: correctness > > If the architecture supports unaligned access or the offset is 8-byte aligned, > RecordBinaryComparator compares 8 bytes at a time by reading them as a > long. Otherwise, it compares byte by byte. > However, on a little-endian machine, the result of comparing by a long value > and comparing byte by byte may differ. If the architectures in a YARN > cluster differ (some are unaligned-access capable while others are not), then > the order of two records after sorting is nondeterministic, which will result > in the same problem as in https://issues.apache.org/jira/browse/SPARK-23207 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
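A small self-contained sketch of why the two comparison strategies can disagree on a little-endian machine (the byte arrays are made up for illustration):
{code}
import java.nio.{ByteBuffer, ByteOrder}

// Two 8-byte records. Byte-by-byte comparison says a > b because a(0) = 1
// while b(0) = 0. Read as little-endian longs, a = 1 and b = 2^57, so the
// long comparison says a < b: the opposite ordering.
val a = Array[Byte](1, 0, 0, 0, 0, 0, 0, 0)
val b = Array[Byte](0, 0, 0, 0, 0, 0, 0, 2)

def asLittleEndianLong(bytes: Array[Byte]): Long =
  ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong

val byteWise = a.zip(b).collectFirst { case (x, y) if x != y => java.lang.Byte.compare(x, y) }.getOrElse(0)
val longWise = java.lang.Long.compare(asLittleEndianLong(a), asLittleEndianLong(b))

println(s"byte-by-byte: $byteWise, as little-endian long: $longWise")  // the signs differ
{code}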
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975303#comment-16975303 ] Dongjoon Hyun commented on SPARK-29900: --- Thank you for pinging me, [~cloud_fan]. [~imback82]. When you compile the list, please consider `global temp view` together (which is different from a normal temp view). > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are temp view and table > with the same name, but users can easily use qualified table name to > disambiguate. > In postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 rows) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher proirity during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29906) Reading of csv file fails with adaptive execution turned on
[ https://issues.apache.org/jira/browse/SPARK-29906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-29906: -- Labels: correctness (was: ) > Reading of csv file fails with adaptive execution turned on > --- > > Key: SPARK-29906 > URL: https://issues.apache.org/jira/browse/SPARK-29906 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: build from master today nov 14 > commit fca0a6c394990b86304a8f9a64bf4c7ec58abbd6 (HEAD -> master, > upstream/master, upstream/HEAD) > Author: Kevin Yu > Date: Thu Nov 14 14:58:32 2019 -0600 > build using: > $ dev/make-distribution.sh --tgz -Phadoop-2.7 -Dhadoop.version=2.7.4 -Pyarn > deployed on AWS EMR 5.28 with 10 m5.xlarge slaves > in spark-env.sh: > HADOOP_CONF_DIR=/etc/hadoop/conf > in spark-defaults.conf: > spark.master yarn > spark.submit.deployMode client > spark.serializer org.apache.spark.serializer.KryoSerializer > spark.hadoop.yarn.timeline-service.enabled false > spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar > spark.driver.extraLibraryPath > /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native > spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar > spark.executor.extraLibraryPath > /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native >Reporter: koert kuipers >Priority: Major > Labels: correctness > > we observed an issue where spark seems to confuse a data line (not the first > line of the csv file) for the csv header when it creates the schema. > {code} > $ wget http://download.cms.gov/openpayments/PGYR13_P062819.ZIP > $ unzip PGYR13_P062819.ZIP > $ hadoop fs -put OP_DTL_GNRL_PGYR2013_P06282019.csv > $ spark-3.0.0-SNAPSHOT-bin-2.7.4/bin/spark-shell --conf > spark.sql.adaptive.enabled=true --num-executors 10 > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 19/11/15 00:26:47 WARN yarn.Client: Neither spark.yarn.jars nor > spark.yarn.archive is set, falling back to uploading libraries under > SPARK_HOME. > Spark context Web UI available at http://ip-xx-xxx-x-xxx.ec2.internal:4040 > Spark context available as 'sc' (master = yarn, app id = > application_1573772077642_0006). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT > /_/ > > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_222) > Type in expressions to have them evaluated. > Type :help for more information. > scala> spark.read.format("csv").option("header", > true).option("enforceSchema", > false).load("OP_DTL_GNRL_PGYR2013_P06282019.csv").show(1) > 19/11/15 00:27:10 WARN util.package: Truncated the string representation of a > plan since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > [Stage 2:>(0 + 10) / > 17]19/11/15 00:27:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage > 2.0 (TID 35, ip-xx-xxx-x-xxx.ec2.internal, executor 1): > java.lang.IllegalArgumentException: CSV header does not conform to the schema. 
> Header: Change_Type, Covered_Recipient_Type, Teaching_Hospital_CCN, > Teaching_Hospital_ID, Teaching_Hospital_Name, Physician_Profile_ID, > Physician_First_Name, Physician_Middle_Name, Physician_Last_Name, > Physician_Name_Suffix, Recipient_Primary_Business_Street_Address_Line1, > Recipient_Primary_Business_Street_Address_Line2, Recipient_City, > Recipient_State, Recipient_Zip_Code, Recipient_Country, Recipient_Province, > Recipient_Postal_Code, Physician_Primary_Type, Physician_Specialty, > Physician_License_State_code1, Physician_License_State_code2, > Physician_License_State_code3, Physician_License_State_code4, > Physician_License_State_code5, > Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name, > Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID, > Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name, > Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State, > Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country, > Total_Amount_of_Payment_USDollars, Date_of_Payment, > Number_of_Payments_Included_in_Total_Amount, > Form_of_Payment_or_Transfer_of_Value, Nature_of_Payment_or_Transfer_of_Value, > City_of_Travel, State_of_Travel, Country_of_Travel, > Physician_Ownership_Indicator, Third_Party_Payment_Recipient_Indicator, > Name_of_Third_Party_Entity_Receiving_Payment_or_Transfer_of_Value, > Charity_Indicator, Third_Party_Equals_Covered_Recipient_Indicator, > Contextual_Information,
[jira] [Commented] (SPARK-29906) Reading of csv file fails with adaptive execution turned on
[ https://issues.apache.org/jira/browse/SPARK-29906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975286#comment-16975286 ] koert kuipers commented on SPARK-29906: --- note that with the default option for csv being enforceSchema=false this will not fail but produce incorrect results. therefore it is correctness issue. > Reading of csv file fails with adaptive execution turned on > --- > > Key: SPARK-29906 > URL: https://issues.apache.org/jira/browse/SPARK-29906 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: build from master today nov 14 > commit fca0a6c394990b86304a8f9a64bf4c7ec58abbd6 (HEAD -> master, > upstream/master, upstream/HEAD) > Author: Kevin Yu > Date: Thu Nov 14 14:58:32 2019 -0600 > build using: > $ dev/make-distribution.sh --tgz -Phadoop-2.7 -Dhadoop.version=2.7.4 -Pyarn > deployed on AWS EMR 5.28 with 10 m5.xlarge slaves > in spark-env.sh: > HADOOP_CONF_DIR=/etc/hadoop/conf > in spark-defaults.conf: > spark.master yarn > spark.submit.deployMode client > spark.serializer org.apache.spark.serializer.KryoSerializer > spark.hadoop.yarn.timeline-service.enabled false > spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar > spark.driver.extraLibraryPath > /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native > spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar > spark.executor.extraLibraryPath > /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native >Reporter: koert kuipers >Priority: Major > Labels: correctness > > we observed an issue where spark seems to confuse a data line (not the first > line of the csv file) for the csv header when it creates the schema. > {code} > $ wget http://download.cms.gov/openpayments/PGYR13_P062819.ZIP > $ unzip PGYR13_P062819.ZIP > $ hadoop fs -put OP_DTL_GNRL_PGYR2013_P06282019.csv > $ spark-3.0.0-SNAPSHOT-bin-2.7.4/bin/spark-shell --conf > spark.sql.adaptive.enabled=true --num-executors 10 > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 19/11/15 00:26:47 WARN yarn.Client: Neither spark.yarn.jars nor > spark.yarn.archive is set, falling back to uploading libraries under > SPARK_HOME. > Spark context Web UI available at http://ip-xx-xxx-x-xxx.ec2.internal:4040 > Spark context available as 'sc' (master = yarn, app id = > application_1573772077642_0006). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT > /_/ > > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_222) > Type in expressions to have them evaluated. > Type :help for more information. > scala> spark.read.format("csv").option("header", > true).option("enforceSchema", > false).load("OP_DTL_GNRL_PGYR2013_P06282019.csv").show(1) > 19/11/15 00:27:10 WARN util.package: Truncated the string representation of a > plan since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > [Stage 2:>(0 + 10) / > 17]19/11/15 00:27:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage > 2.0 (TID 35, ip-xx-xxx-x-xxx.ec2.internal, executor 1): > java.lang.IllegalArgumentException: CSV header does not conform to the schema. 
> Header: Change_Type, Covered_Recipient_Type, Teaching_Hospital_CCN, > Teaching_Hospital_ID, Teaching_Hospital_Name, Physician_Profile_ID, > Physician_First_Name, Physician_Middle_Name, Physician_Last_Name, > Physician_Name_Suffix, Recipient_Primary_Business_Street_Address_Line1, > Recipient_Primary_Business_Street_Address_Line2, Recipient_City, > Recipient_State, Recipient_Zip_Code, Recipient_Country, Recipient_Province, > Recipient_Postal_Code, Physician_Primary_Type, Physician_Specialty, > Physician_License_State_code1, Physician_License_State_code2, > Physician_License_State_code3, Physician_License_State_code4, > Physician_License_State_code5, > Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name, > Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID, > Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name, > Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State, > Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country, > Total_Amount_of_Payment_USDollars, Date_of_Payment, > Number_of_Payments_Included_in_Total_Amount, > Form_of_Payment_or_Transfer_of_Value, Nature_of_Payment_or_Transfer_of_Value, > City_of_Travel, State_of_Travel, Country_of_Travel, > Physician_Ownership_Indicator, Third_Party_Payment_Recipient_Indicator, >
[jira] [Created] (SPARK-29922) SHOW FUNCTIONS should look up catalog/table like v2 commands
Pablo Langa Blanco created SPARK-29922: -- Summary: SHOW FUNCTIONS should look up catalog/table like v2 commands Key: SPARK-29922 URL: https://issues.apache.org/jira/browse/SPARK-29922 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Pablo Langa Blanco SHOW FUNCTIONS should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29922) SHOW FUNCTIONS should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975272#comment-16975272 ] Pablo Langa Blanco commented on SPARK-29922: I'm working on this > SHOW FUNCTIONS should look up catalog/table like v2 commands > > > Key: SPARK-29922 > URL: https://issues.apache.org/jira/browse/SPARK-29922 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Priority: Major > > SHOW FUNCTIONS should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29921) SparkContext LiveListenerBus
Arun sethia created SPARK-29921: --- Summary: SparkContext LiveListenerBus Key: SPARK-29921 URL: https://issues.apache.org/jira/browse/SPARK-29921 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.4, 2.4.3, 2.4.1 Reporter: Arun sethia Hi, I am not sure what the advantage is of keeping the listenerBus function package-private in org.apache.spark.SparkContext: private[spark] def listenerBus: LiveListenerBus = _listenerBus This prevents anyone from publishing a custom SparkListenerEvent to the LiveListenerBus. Thanks, Arun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
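For context, here is a minimal Scala sketch of the asymmetry described above: receiving arbitrary SparkListenerEvents is already possible through the public listener API, while posting them runs into the package-private listenerBus. MyCustomEvent and MyListener are hypothetical names used only for illustration.
{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

// Hypothetical custom event a user might want to publish on the bus.
case class MyCustomEvent(payload: String) extends SparkListenerEvent

// Receiving such events is already possible through the public listener API.
class MyListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case MyCustomEvent(p) => println(s"received custom event: $p")
    case _ => // ignore built-in events
  }
}

def register(sc: SparkContext): Unit = {
  sc.addSparkListener(new MyListener())        // public API: listening works
  // sc.listenerBus.post(MyCustomEvent("hi"))  // does not compile from user code,
  //                                           // because listenerBus is private[spark]
}
{code}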
[jira] [Created] (SPARK-29920) Parsing failure on interval '20 15' day to hour
Maxim Gekk created SPARK-29920: -- Summary: Parsing failure on interval '20 15' day to hour Key: SPARK-29920 URL: https://issues.apache.org/jira/browse/SPARK-29920 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk {code:sql} spark-sql> select interval '20 15' day to hour; Error in query: requirement failed: Interval string must match day-time format of 'd h:m:s.n': 20 15(line 1, pos 16) == SQL == select interval '20 15' day to hour ^^^ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29919) remove python2 test execution
Shane Knapp created SPARK-29919: --- Summary: remove python2 test execution Key: SPARK-29919 URL: https://issues.apache.org/jira/browse/SPARK-29919 Project: Spark Issue Type: Sub-task Components: PySpark, Tests Affects Versions: 3.0.0 Reporter: Shane Knapp Assignee: Shane Knapp remove python2.7 (including pypy2) test executables from 'python/run-tests.py' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29918) RecordBinaryComparator should check endianness when compared by long
[ https://issues.apache.org/jira/browse/SPARK-29918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-29918: --- Labels: correctness (was: ) > RecordBinaryComparator should check endianness when compared by long > > > Key: SPARK-29918 > URL: https://issues.apache.org/jira/browse/SPARK-29918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: EdisonWang >Priority: Minor > Labels: correctness > > If the architecture supports unaligned access or the offset is 8-byte aligned, > RecordBinaryComparator compares 8 bytes at a time by reading them as a long. > Otherwise, it compares byte by byte. > However, on a little-endian machine, comparing by long values and comparing byte > by byte may give different results. If the architectures in a YARN cluster differ > (some are unaligned-access capable while others are not), then the order of two > records after sorting is undetermined, which leads to the same problem as in > https://issues.apache.org/jira/browse/SPARK-23207 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29918) RecordBinaryComparator should check endianness when compared by long
EdisonWang created SPARK-29918: -- Summary: RecordBinaryComparator should check endianness when compared by long Key: SPARK-29918 URL: https://issues.apache.org/jira/browse/SPARK-29918 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: EdisonWang If the architecture supports unaligned access or the offset is 8-byte aligned, RecordBinaryComparator compares 8 bytes at a time by reading them as a long. Otherwise, it compares byte by byte. However, on a little-endian machine, comparing by long values and comparing byte by byte may give different results. If the architectures in a YARN cluster differ (some are unaligned-access capable while others are not), then the order of two records after sorting is undetermined, which leads to the same problem as in https://issues.apache.org/jira/browse/SPARK-23207 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
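To make the endianness concern concrete, here is a small self-contained Scala sketch (not Spark code) showing that interpreting 8 bytes as a little-endian long can order two records differently than an unsigned byte-by-byte comparison, while a big-endian read agrees with the byte-wise order:
{code:scala}
import java.nio.{ByteBuffer, ByteOrder}

// Two 8-byte records that differ only in which end carries the 0x01 byte.
val a = Array[Byte](1, 0, 0, 0, 0, 0, 0, 0)
val b = Array[Byte](0, 0, 0, 0, 0, 0, 0, 1)

// Unsigned, lexicographic byte-by-byte comparison (the slow path's semantics).
def compareBytes(x: Array[Byte], y: Array[Byte]): Int = {
  var i = 0
  while (i < x.length) {
    val c = (x(i) & 0xff) - (y(i) & 0xff)
    if (c != 0) return c
    i += 1
  }
  0
}

def readLong(bytes: Array[Byte], order: ByteOrder): Long =
  ByteBuffer.wrap(bytes).order(order).getLong

println(compareBytes(a, b) > 0)                       // true: a > b byte by byte
println(java.lang.Long.compareUnsigned(
  readLong(a, ByteOrder.LITTLE_ENDIAN),
  readLong(b, ByteOrder.LITTLE_ENDIAN)) > 0)          // false: little-endian longs disagree
println(java.lang.Long.compareUnsigned(
  readLong(a, ByteOrder.BIG_ENDIAN),
  readLong(b, ByteOrder.BIG_ENDIAN)) > 0)             // true: big-endian longs agree
{code}
One possible direction, suggested by the ticket title, is to normalize the byte order on little-endian hosts (for example with java.lang.Long.reverseBytes) before the long comparison.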
[jira] [Created] (SPARK-29917) Provide functionality to rename Receivers on Spark Streaming Page
Burak KÖSE created SPARK-29917: -- Summary: Provide functionality to rename Receivers on Spark Streaming Page Key: SPARK-29917 URL: https://issues.apache.org/jira/browse/SPARK-29917 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 2.4.4 Reporter: Burak KÖSE In ReceiverSupervisorImpl, the receiver name is hardcoded (via getSimpleName) to the class name of the receiver. Spark should provide functionality for users to set custom names for receivers. It would be especially useful for users running many Receivers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
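As a minimal sketch of the current behaviour (class name and body are illustrative only): the label shown on the Streaming page comes from the receiver's class name, so the only way to get distinct labels today is to define distinct receiver classes.
{code:scala}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A trivial custom receiver; the Streaming page currently labels it with the
// class simple name ("OrdersReceiver"), since ReceiverSupervisorImpl derives
// the displayed name via getSimpleName. There is no setter to override it.
class OrdersReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  override def onStart(): Unit = {
    // Normally: start a background thread that calls store(...) with received data.
  }
  override def onStop(): Unit = ()
}
{code}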
[jira] [Resolved] (SPARK-29902) Add listener event queue capacity configuration to documentation
[ https://issues.apache.org/jira/browse/SPARK-29902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29902. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 26529 [https://github.com/apache/spark/pull/26529] > Add listener event queue capacity configuration to documentation > > > Key: SPARK-29902 > URL: https://issues.apache.org/jira/browse/SPARK-29902 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: shahid >Assignee: shahid >Priority: Minor > Fix For: 3.1.0 > > > Add listener event queue capacity configuration to documentation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29902) Add listener event queue capacity configuration to documentation
[ https://issues.apache.org/jira/browse/SPARK-29902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29902: Assignee: shahid > Add listener event queue capacity configuration to documentation > > > Key: SPARK-29902 > URL: https://issues.apache.org/jira/browse/SPARK-29902 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: shahid >Assignee: shahid >Priority: Minor > > Add listener event queue capacity configuration to documentation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29902) Add listener event queue capacity configuration to documentation
[ https://issues.apache.org/jira/browse/SPARK-29902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-29902: - Fix Version/s: (was: 3.1.0) 3.0.0 > Add listener event queue capacity configuration to documentation > > > Key: SPARK-29902 > URL: https://issues.apache.org/jira/browse/SPARK-29902 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: shahid >Assignee: shahid >Priority: Minor > Fix For: 3.0.0 > > > Add listener event queue capacity configuration to documentation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
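For readers looking for the setting being documented: the capacity of the shared listener event queue can be raised when listeners fall behind and events are dropped. A minimal sketch (the value 20000 is only an example; the default is 10000):
{code:scala}
import org.apache.spark.sql.SparkSession

// A larger listener event queue reduces the chance of dropped events at the
// cost of extra driver memory; tune the value to the workload.
val spark = SparkSession.builder()
  .appName("listener-queue-capacity-example")
  .config("spark.scheduler.listenerbus.eventqueue.capacity", "20000")
  .getOrCreate()
{code}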
[jira] [Created] (SPARK-29916) spark on kubernetes fails with hadoop-3.2 due to the user not existing in executor pod
Michał Wesołowski created SPARK-29916: - Summary: spark on kubernetes fails with hadoop-3.2 due to the user not existing in executor pod Key: SPARK-29916 URL: https://issues.apache.org/jira/browse/SPARK-29916 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.0.0 Reporter: Michał Wesołowski I'm running tests on Kubernetes with the spark-3.0-preview version and hadoop-3.2 libraries. I needed cloud library support (Azure in particular), so this is a build based on the v3.0.0-preview tag with the cloud profile, since the provided binaries don't contain it. I ran a simple computation on AKS (Azure Kubernetes Service) against Azure Data Lake Storage Gen2, and it fails with the following error: {code:java} py4j.protocol.Py4JJavaError: An error occurred while calling o49.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 10.244.2.6, executor 1): java.io.IOException: There is no primary group for UGI localuser(auth:SIMPLE) at org.apache.hadoop.security.UserGroupInformation.getPrimaryGroupName(UserGroupInformation.java:1455) at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.(AzureBlobFileSystemStore.java:136) at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:108) {code} It looks like the Hadoop library expects the user "localuser" to exist in the executor pod. This is the user that invoked spark-submit on my local machine; however, I didn't set it explicitly. I investigated the pods and this user is set in the SPARK_USER environment variable in both the executor and driver pods. Relevant logs from the executor: {code:java} 19/11/15 12:56:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, localuser); groups with view permissions: Set(); users with modify permissions: Set(root, localuser); groups with modify permissions: Set() ... 19/11/15 12:56:53 INFO SecurityManager: Changing view acls to: root,localuser 19/11/15 12:56:53 INFO SecurityManager: Changing modify acls to: root,localuser 19/11/15 12:56:53 INFO SecurityManager: Changing view acls groups to: 19/11/15 12:56:53 INFO SecurityManager: Changing modify acls groups to: ... 19/11/15 12:57:02 WARN ShellBasedUnixGroupsMapping: unable to return groups for user localuser PartialGroupNameException The user name 'localuser' is not found. 
id: ‘localuser’: no such user id: ‘localuser’: no such userat org.apache.hadoop.security.ShellBasedUnixGroupsMapping.resolvePartialGroupNames(ShellBasedUnixGroupsMapping.java:294) at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUnixGroups(ShellBasedUnixGroupsMapping.java:207) at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getGroups(ShellBasedUnixGroupsMapping.java:97) at org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.getGroups(JniBasedUnixGroupsMappingWithFallback.java:51) at org.apache.hadoop.security.Groups$GroupCacheLoader.fetchGroupList(Groups.java:387) at org.apache.hadoop.security.Groups$GroupCacheLoader.load(Groups.java:321) at org.apache.hadoop.security.Groups$GroupCacheLoader.load(Groups.java:270) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257) at com.google.common.cache.LocalCache.get(LocalCache.java:4000) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004) at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) at org.apache.hadoop.security.Groups.getGroups(Groups.java:228) at org.apache.hadoop.security.UserGroupInformation.getGroups(UserGroupInformation.java:1588) at org.apache.hadoop.security.UserGroupInformation.getPrimaryGroupName(UserGroupInformation.java:1453) at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.(AzureBlobFileSystemStore.java:136) at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:108) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479) {code} One woraround for this I've found is
[jira] [Commented] (SPARK-29748) Remove sorting of fields in PySpark SQL Row creation
[ https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975054#comment-16975054 ] Maciej Szymkiewicz commented on SPARK-29748: [~jhereth] {quote}With simply removing sorting we change the semantics, e.g. `Row(a=1, b=2) != Row(b=2, a=1)` (opposed to what we currently have.{quote} It is even more messy. At the moment we adhere to {{tuple}} semantics so {{Row(a=1, b=2) == Row(y=1, z=2)}}. That might be acceptable (namedtuples use the same approach, but I think we should state that explicitly). {quote}I think Maciej Szymkiewicz was thinking about changes for the upcoming 3.0?{quote} Indeed. [~bryanc] Let me clarify things - I am not suggesting that any of these changes should be implemented here. Instead I think we should have clear picture what {{Row}} suppose to be (not only in terms of API, but also intended applications) before we decide on a concrete solution. That's particularly important because we already have special cases that were introduced specifically to target {{**kwargs}} and sorting behavior. That being said, if we want to discuss this case in isolation * Introducing {{LegacyRow}} seems to make little sense if implementation of {{Row}} stays the same otherwise. Sorting or not, depending on the config, should be enough. * {quote} Users with Python < 3.6 will have to create Rows with an OrderedDict or by using the Row class as a factory (explained in the pydoc). {quote} I don't think we should introduce such behavior now, when 3.5 is deprecated. Having yet another way to initialize {{Row}} will be confusing at best (and introduce new problems when using complex structures). Furthermore we already have one mechanism that provides ordered behavior independent of version. Instead I'd suggest we: * Make legacy behavior the only option for Python < 3.6. * For Python 3.6 let's introduce legacy sorting mechanism (keeping only single {{Row}}) class, enabled by default and deprecated. > Remove sorting of fields in PySpark SQL Row creation > > > Key: SPARK-29748 > URL: https://issues.apache.org/jira/browse/SPARK-29748 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Currently, when a PySpark Row is created with keyword arguments, the fields > are sorted alphabetically. This has created a lot of confusion with users > because it is not obvious (although it is stated in the pydocs) that they > will be sorted alphabetically, and then an error can occur later when > applying a schema and the field order does not match. > The original reason for sorting fields is because kwargs in python < 3.6 are > not guaranteed to be in the same order that they were entered. Sorting > alphabetically would ensure a consistent order. Matters are further > complicated with the flag {{__from_dict__}} that allows the {{Row}} fields to > to be referenced by name when made by kwargs, but this flag is not serialized > with the Row and leads to inconsistent behavior. > This JIRA proposes that any sorting of the Fields is removed. Users with > Python 3.6+ creating Rows with kwargs can continue to do so since Python will > ensure the order is the same as entered. Users with Python < 3.6 will have to > create Rows with an OrderedDict or by using the Row class as a factory > (explained in the pydoc). If kwargs are used, an error will be raised or > based on a conf setting it can fall back to a LegacyRow that will sort the > fields as before. 
This LegacyRow will be immediately deprecated and removed > once support for Python < 3.6 is dropped. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29915) spark-py and spark-r images are not created with docker-image-tool.sh
Michał Wesołowski created SPARK-29915: - Summary: spark-py and spark-r images are not created with docker-image-tool.sh Key: SPARK-29915 URL: https://issues.apache.org/jira/browse/SPARK-29915 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.0.0 Reporter: Michał Wesołowski Currently, at version 3.0.0-preview, the docker-image-tool.sh script has the [following lines|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh#L173] defined: {code} local PYDOCKERFILE=${PYDOCKERFILE:-false} local RDOCKERFILE=${RDOCKERFILE:-false} {code} Because of this change, neither the spark-py nor the spark-r image gets created. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29114) Dataset.coalesce(10) throw ChunkFetchFailureException when original Dataset partition size is big
[ https://issues.apache.org/jira/browse/SPARK-29114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhanxiongWang updated SPARK-29114: -- Description: Updated time:15/Nov/19 We saw this blog solved our confusion. [http://www.russellspitzer.com/2018/05/10/SparkPartitions/|http://www.russellspitzer.com/2018/05/10/SparkPartitions/] Updated time:15/Nov/19 I discussed this issue with my colleagues today. We think that spark has caused cross-border problems in the process of doing shuffle. The problem may be in the Sort-based Shuffle stage. When the map task partition is too large, and the storage of the writerIndex variable uses int, writerIndex may cause cross-border problems. If this is the case, the variable writerIndex {color:#de350b}replaces int with long{color} should solve the current problem. I create a Dataset df with 200 partitions. I applied for 100 executors for my task. Each executor with 1 core, and driver memory is 8G executor is 16G. I use df.cache() before df.coalesce(10). When{color:#de350b} Dataset partition{color} {color:#de350b}size is small{color}, the program works well. But when I {color:#de350b}increase{color} the size of the Dataset partition , the function {color:#de350b}df.coalesce(10){color} will throw ChunkFetchFailureException. 19/09/17 08:26:44 INFO CoarseGrainedExecutorBackend: Got assigned task 210 19/09/17 08:26:44 INFO Executor: Running task 0.0 in stage 3.0 (TID 210) 19/09/17 08:26:44 INFO MapOutputTrackerWorker: Updating epoch to 1 and clearing cache 19/09/17 08:26:44 INFO TorrentBroadcast: Started reading broadcast variable 1003 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003_piece0 stored as bytes in memory (estimated size 49.4 KB, free 3.8 GB) 19/09/17 08:26:44 INFO TorrentBroadcast: Reading broadcast variable 1003 took 7 ms 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003 stored as values in memory (estimated size 154.5 KB, free 3.8 GB) 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_0 locally 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_1 locally 19/09/17 08:26:44 INFO TransportClientFactory: Successfully created connection to /100.76.29.130:54238 after 1 ms (0 ms spent in bootstraps) 19/09/17 08:26:46 ERROR RetryingBlockFetcher: Failed to fetch block rdd_1005_18, and will not retry (0 retries) org.apache.spark.network.client.ChunkFetchFailureException: Failure while fetching StreamChunkId\{streamId=69368607002, chunkIndex=0}: readerIndex: 0, writerIndex: -2137154997 (expected: 0 <= readerIndex <= writerIndex <= capacity(-2137154997)) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:182) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:962) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:485) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:399) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:371) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) 19/09/17 08:26:46 WARN BlockManager: Failed to fetch block after 1 fetch failures. Most
[jira] [Updated] (SPARK-29114) Dataset.coalesce(10) throw ChunkFetchFailureException when original Dataset partition size is big
[ https://issues.apache.org/jira/browse/SPARK-29114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhanxiongWang updated SPARK-29114: -- Description: Updated time:15/Nov/19 We saw this blog solved our confusion. [http://www.russellspitzer.com/2018/05/10/SparkPartitions/|http://example.com] Updated time:15/Nov/19 I discussed this issue with my colleagues today. We think that spark has caused cross-border problems in the process of doing shuffle. The problem may be in the Sort-based Shuffle stage. When the map task partition is too large, and the storage of the writerIndex variable uses int, writerIndex may cause cross-border problems. If this is the case, the variable writerIndex {color:#de350b}replaces int with long{color} should solve the current problem. I create a Dataset df with 200 partitions. I applied for 100 executors for my task. Each executor with 1 core, and driver memory is 8G executor is 16G. I use df.cache() before df.coalesce(10). When{color:#de350b} Dataset partition{color} {color:#de350b}size is small{color}, the program works well. But when I {color:#de350b}increase{color} the size of the Dataset partition , the function {color:#de350b}df.coalesce(10){color} will throw ChunkFetchFailureException. 19/09/17 08:26:44 INFO CoarseGrainedExecutorBackend: Got assigned task 210 19/09/17 08:26:44 INFO Executor: Running task 0.0 in stage 3.0 (TID 210) 19/09/17 08:26:44 INFO MapOutputTrackerWorker: Updating epoch to 1 and clearing cache 19/09/17 08:26:44 INFO TorrentBroadcast: Started reading broadcast variable 1003 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003_piece0 stored as bytes in memory (estimated size 49.4 KB, free 3.8 GB) 19/09/17 08:26:44 INFO TorrentBroadcast: Reading broadcast variable 1003 took 7 ms 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003 stored as values in memory (estimated size 154.5 KB, free 3.8 GB) 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_0 locally 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_1 locally 19/09/17 08:26:44 INFO TransportClientFactory: Successfully created connection to /100.76.29.130:54238 after 1 ms (0 ms spent in bootstraps) 19/09/17 08:26:46 ERROR RetryingBlockFetcher: Failed to fetch block rdd_1005_18, and will not retry (0 retries) org.apache.spark.network.client.ChunkFetchFailureException: Failure while fetching StreamChunkId\{streamId=69368607002, chunkIndex=0}: readerIndex: 0, writerIndex: -2137154997 (expected: 0 <= readerIndex <= writerIndex <= capacity(-2137154997)) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:182) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:962) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:485) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:399) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:371) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) 19/09/17 08:26:46 WARN BlockManager: Failed to fetch block after 1 fetch failures. Most recent failure cause:
[jira] [Updated] (SPARK-29114) Dataset.coalesce(10) throw ChunkFetchFailureException when original Dataset partition size is big
[ https://issues.apache.org/jira/browse/SPARK-29114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhanxiongWang updated SPARK-29114: -- Description: Updated time:15/Nov/19 We saw this blog solved our confusion. [[http://www.russellspitzer.com/2018/05/10/SparkPartitions/ ||http://example.com/] [http://www.russellspitzer.com/2018/05/10/SparkPartitions/|http://example.com/] []|http://example.com/] Updated time:15/Nov/19 I discussed this issue with my colleagues today. We think that spark has caused cross-border problems in the process of doing shuffle. The problem may be in the Sort-based Shuffle stage. When the map task partition is too large, and the storage of the writerIndex variable uses int, writerIndex may cause cross-border problems. If this is the case, the variable writerIndex {color:#de350b}replaces int with long{color} should solve the current problem. I create a Dataset df with 200 partitions. I applied for 100 executors for my task. Each executor with 1 core, and driver memory is 8G executor is 16G. I use df.cache() before df.coalesce(10). When{color:#de350b} Dataset partition{color} {color:#de350b}size is small{color}, the program works well. But when I {color:#de350b}increase{color} the size of the Dataset partition , the function {color:#de350b}df.coalesce(10){color} will throw ChunkFetchFailureException. 19/09/17 08:26:44 INFO CoarseGrainedExecutorBackend: Got assigned task 210 19/09/17 08:26:44 INFO Executor: Running task 0.0 in stage 3.0 (TID 210) 19/09/17 08:26:44 INFO MapOutputTrackerWorker: Updating epoch to 1 and clearing cache 19/09/17 08:26:44 INFO TorrentBroadcast: Started reading broadcast variable 1003 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003_piece0 stored as bytes in memory (estimated size 49.4 KB, free 3.8 GB) 19/09/17 08:26:44 INFO TorrentBroadcast: Reading broadcast variable 1003 took 7 ms 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003 stored as values in memory (estimated size 154.5 KB, free 3.8 GB) 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_0 locally 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_1 locally 19/09/17 08:26:44 INFO TransportClientFactory: Successfully created connection to /100.76.29.130:54238 after 1 ms (0 ms spent in bootstraps) 19/09/17 08:26:46 ERROR RetryingBlockFetcher: Failed to fetch block rdd_1005_18, and will not retry (0 retries) org.apache.spark.network.client.ChunkFetchFailureException: Failure while fetching StreamChunkId\{streamId=69368607002, chunkIndex=0}: readerIndex: 0, writerIndex: -2137154997 (expected: 0 <= readerIndex <= writerIndex <= capacity(-2137154997)) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:182) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:962) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:485) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:399) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:371) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) 19/09/17 08:26:46
[jira] [Updated] (SPARK-29114) Dataset.coalesce(10) throw ChunkFetchFailureException when original Dataset partition size is big
[ https://issues.apache.org/jira/browse/SPARK-29114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhanxiongWang updated SPARK-29114: -- Description: Updated time:15/Nov/19 I discussed this issue with my colleagues today. We think that spark has caused cross-border problems in the process of doing shuffle. The problem may be in the Sort-based Shuffle stage. When the map task partition is too large, and the storage of the writerIndex variable uses int, writerIndex may cause cross-border problems. If this is the case, the variable writerIndex {color:#de350b}replaces int with long{color} should solve the current problem. I create a Dataset df with 200 partitions. I applied for 100 executors for my task. Each executor with 1 core, and driver memory is 8G executor is 16G. I use df.cache() before df.coalesce(10). When{color:#de350b} Dataset partition{color} {color:#de350b}size is small{color}, the program works well. But when I {color:#de350b}increase{color} the size of the Dataset partition , the function {color:#de350b}df.coalesce(10){color} will throw ChunkFetchFailureException. 19/09/17 08:26:44 INFO CoarseGrainedExecutorBackend: Got assigned task 210 19/09/17 08:26:44 INFO Executor: Running task 0.0 in stage 3.0 (TID 210) 19/09/17 08:26:44 INFO MapOutputTrackerWorker: Updating epoch to 1 and clearing cache 19/09/17 08:26:44 INFO TorrentBroadcast: Started reading broadcast variable 1003 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003_piece0 stored as bytes in memory (estimated size 49.4 KB, free 3.8 GB) 19/09/17 08:26:44 INFO TorrentBroadcast: Reading broadcast variable 1003 took 7 ms 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003 stored as values in memory (estimated size 154.5 KB, free 3.8 GB) 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_0 locally 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_1 locally 19/09/17 08:26:44 INFO TransportClientFactory: Successfully created connection to /100.76.29.130:54238 after 1 ms (0 ms spent in bootstraps) 19/09/17 08:26:46 ERROR RetryingBlockFetcher: Failed to fetch block rdd_1005_18, and will not retry (0 retries) org.apache.spark.network.client.ChunkFetchFailureException: Failure while fetching StreamChunkId\{streamId=69368607002, chunkIndex=0}: readerIndex: 0, writerIndex: -2137154997 (expected: 0 <= readerIndex <= writerIndex <= capacity(-2137154997)) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:182) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:962) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:485) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:399) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:371) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) 19/09/17 08:26:46 WARN BlockManager: Failed to fetch block after 1 fetch failures. Most recent failure cause: org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205) at
[jira] [Updated] (SPARK-29114) Dataset.coalesce(10) throw ChunkFetchFailureException when original Dataset partition size is big
[ https://issues.apache.org/jira/browse/SPARK-29114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhanxiongWang updated SPARK-29114: -- Description: Updated time:15/Nov/19 I discussed this issue with my colleagues today. We think that spark has caused cross-border problems in the process of doing shuffle. The problem may be in the Sort-based Shuffle stage. When the map task partition is too large, and the storage of the Index variable uses int, Index may cause cross-border problems. If this is the case, the variable index {color:#de350b}replaces int with long{color} should solve the current problem. I create a Dataset df with 200 partitions. I applied for 100 executors for my task. Each executor with 1 core, and driver memory is 8G executor is 16G. I use df.cache() before df.coalesce(10). When{color:#de350b} Dataset partition{color} {color:#de350b}size is small{color}, the program works well. But when I {color:#de350b}increase{color} the size of the Dataset partition , the function {color:#de350b}df.coalesce(10){color} will throw ChunkFetchFailureException. 19/09/17 08:26:44 INFO CoarseGrainedExecutorBackend: Got assigned task 210 19/09/17 08:26:44 INFO Executor: Running task 0.0 in stage 3.0 (TID 210) 19/09/17 08:26:44 INFO MapOutputTrackerWorker: Updating epoch to 1 and clearing cache 19/09/17 08:26:44 INFO TorrentBroadcast: Started reading broadcast variable 1003 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003_piece0 stored as bytes in memory (estimated size 49.4 KB, free 3.8 GB) 19/09/17 08:26:44 INFO TorrentBroadcast: Reading broadcast variable 1003 took 7 ms 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003 stored as values in memory (estimated size 154.5 KB, free 3.8 GB) 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_0 locally 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_1 locally 19/09/17 08:26:44 INFO TransportClientFactory: Successfully created connection to /100.76.29.130:54238 after 1 ms (0 ms spent in bootstraps) 19/09/17 08:26:46 ERROR RetryingBlockFetcher: Failed to fetch block rdd_1005_18, and will not retry (0 retries) org.apache.spark.network.client.ChunkFetchFailureException: Failure while fetching StreamChunkId\{streamId=69368607002, chunkIndex=0}: readerIndex: 0, writerIndex: -2137154997 (expected: 0 <= readerIndex <= writerIndex <= capacity(-2137154997)) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:182) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:962) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:485) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:399) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:371) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) 19/09/17 08:26:46 WARN BlockManager: Failed to fetch block after 1 fetch failures. Most recent failure cause: org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205) at
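If the reporter's overflow hypothesis is right, the negative writerIndex in the exception above is simply a byte count larger than Int.MaxValue wrapping when stored in an Int. A tiny Scala sketch of that arithmetic (the partition size is hypothetical, chosen so the wrapped value matches the log):
{code:scala}
// A partition of roughly 2.15 GB does not fit in a signed 32-bit Int.
val sizeInBytes: Long = 2157812299L

println(Int.MaxValue)        // 2147483647
println(sizeInBytes.toInt)   // -2137154997, the writerIndex reported in the exception
{code}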
[jira] [Created] (SPARK-29914) ML models append metadata in `transform`/`transformSchema`
zhengruifeng created SPARK-29914: Summary: ML models append metadata in `transform`/`transformSchema` Key: SPARK-29914 URL: https://issues.apache.org/jira/browse/SPARK-29914 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng There are many impls (like `Binarizer`/`Bucketizer`/`VectorAssembler`/`OneHotEncoder`/`FeatureHasher`/`HashingTF`/`VectorSlicer`/...) in `.ml` that append appropriate metadata in the `transform`/`transformSchema` methods. However, there are also many impls that return no metadata from the transformation, even when metadata like `vector.size`/`numAttrs`/`attrs` can be easily inferred. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
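For readers unfamiliar with this metadata, a short sketch of what "appending metadata" looks like from the user side, using VectorAssembler (one of the impls listed above) and assuming a local SparkSession:
{code:scala}
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("ml-metadata").getOrCreate()
import spark.implicits._

val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("a", "b")

val assembled = new VectorAssembler()
  .setInputCols(Array("a", "b"))
  .setOutputCol("features")
  .transform(df)

// VectorAssembler attaches ML attribute metadata to its output column, so the
// number of attributes (numAttrs) of the vector can be read from the schema
// alone; transformers that skip this step leave the size unknown here.
val group = AttributeGroup.fromStructField(assembled.schema("features"))
println(group.size)   // 2
{code}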
[jira] [Created] (SPARK-29913) Improve Exception in postgreCastToBoolean
jobit mathew created SPARK-29913: Summary: Improve Exception in postgreCastToBoolean Key: SPARK-29913 URL: https://issues.apache.org/jira/browse/SPARK-29913 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: jobit mathew Improve Exception in postgreCastToBoolean -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29619) Add retry times when reading the daemon port.
[ https://issues.apache.org/jira/browse/SPARK-29619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29619. -- Resolution: Won't Fix > Add retry times when reading the daemon port. > - > > Key: SPARK-29619 > URL: https://issues.apache.org/jira/browse/SPARK-29619 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > This ticket is related to https://issues.apache.org/jira/browse/SPARK-29885 > and adds a retry mechanism. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29894) Add Codegen Stage Id to Spark plan graphs in Web UI SQL Tab
[ https://issues.apache.org/jira/browse/SPARK-29894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali updated SPARK-29894: Attachment: Physical_plan_Annotated.png > Add Codegen Stage Id to Spark plan graphs in Web UI SQL Tab > --- > > Key: SPARK-29894 > URL: https://issues.apache.org/jira/browse/SPARK-29894 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > Attachments: Physical_plan_Annotated.png, > snippet__plan_graph_with_Codegen_Stage_Id_Annotated.png, > snippet_plan_graph_before_patch.png > > > The Web UI SQL Tab provides information on the executed SQL using plan graphs > and SQL execution plans. Both provide useful information. Physical execution > plans report the Codegen Stage Id. It is useful to have Codegen Stage Id also > reported in the plan graphs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
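For reference, the codegen stage id that the ticket wants surfaced in the plan graphs is the number in the `*(n)` prefix of the textual physical plan, for example (output abbreviated and representative, not an exact transcript):
{code:scala}
// A query with two whole-stage-codegen stages separated by an exchange.
spark.range(1000).groupBy("id").count().explain()

// == Physical Plan == (abbreviated)
// *(2) HashAggregate(keys=[id], functions=[count(1)])
// +- Exchange hashpartitioning(id, 200)
//    +- *(1) HashAggregate(keys=[id], functions=[partial_count(1)])
//       +- *(1) Range (0, 1000, step=1, splits=...)
{code}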
[jira] [Updated] (SPARK-29912) Pruning shuffle exchange and coalesce when input and output both are one partition
[ https://issues.apache.org/jira/browse/SPARK-29912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-29912: Description: It is meaningless to do `repartition(1)` or `coalesce(1)` when a child plan just output one partition. Now, we can not get the output numPartitions during logic plan, so this issue pruning the operation in physical plan. was: It is meaningless to do `repartition(1)` or `coalesce(1)` when a plan child just output one partition. Now, we can not get the output numPartitions during logic plan, so this issue pruning the operation in physical plan. > Pruning shuffle exchange and coalesce when input and output both are one > partition > -- > > Key: SPARK-29912 > URL: https://issues.apache.org/jira/browse/SPARK-29912 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > It is meaningless to do `repartition(1)` or `coalesce(1)` when a child plan > just output one partition. > Now, we can not get the output numPartitions during logic plan, so this issue > pruning the operation in physical plan. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29912) Pruning shuffle exchange and coalesce when input and output both are one partition
[ https://issues.apache.org/jira/browse/SPARK-29912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-29912: Description: It is meaningless to do `repartition(1)` or `coalesce(1)` when a plan child just have one partition. Now, we can not get the output numPartitions during logic plan, so this issue pruning the operation in physical plan. was: It is meaningless to do `repartition(1)` or `coalesce(1)` when a plan child just have one partition. Now, we can not get the output numPartitions during logic plan, so the issue pruning the operation in physical plan. > Pruning shuffle exchange and coalesce when input and output both are one > partition > -- > > Key: SPARK-29912 > URL: https://issues.apache.org/jira/browse/SPARK-29912 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > It is meaningless to do `repartition(1)` or `coalesce(1)` when a plan child > just have one partition. > Now, we can not get the output numPartitions during logic plan, so this issue > pruning the operation in physical plan. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29912) Pruning shuffle exchange and coalesce when input and output both are one partition
[ https://issues.apache.org/jira/browse/SPARK-29912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-29912: Description: It is meaningless to do `repartition(1)` or `coalesce(1)` when a plan child just output one partition. Now, we can not get the output numPartitions during logic plan, so this issue pruning the operation in physical plan. was: It is meaningless to do `repartition(1)` or `coalesce(1)` when a plan child just have one partition. Now, we can not get the output numPartitions during logic plan, so this issue pruning the operation in physical plan. > Pruning shuffle exchange and coalesce when input and output both are one > partition > -- > > Key: SPARK-29912 > URL: https://issues.apache.org/jira/browse/SPARK-29912 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > It is meaningless to do `repartition(1)` or `coalesce(1)` when a plan child > just output one partition. > Now, we can not get the output numPartitions during logic plan, so this issue > pruning the operation in physical plan. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29912) Pruning shuffle exchange and coalesce when input and output both are one partition
[ https://issues.apache.org/jira/browse/SPARK-29912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-29912: Summary: Pruning shuffle exchange and coalesce when input and output both are one partition (was: Pruning shuffle exchange when input and output both are one partition) > Pruning shuffle exchange and coalesce when input and output both are one > partition > -- > > Key: SPARK-29912 > URL: https://issues.apache.org/jira/browse/SPARK-29912 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > It is meaningless to do `repartition(1)` or `coalesce(1)` when a plan child > just have one partition. > Now, we can not get the output numPartitions during logic plan, so the issue > pruning the operation in physical plan. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29912) Pruning shuffle exchange when input and output both are one partition
ulysses you created SPARK-29912: --- Summary: Pruning shuffle exchange when input and output both are one partition Key: SPARK-29912 URL: https://issues.apache.org/jira/browse/SPARK-29912 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: ulysses you It is meaningless to do `repartition(1)` or `coalesce(1)` when a plan child just have one partition. Now, we can not get the output numPartitions during logic plan, so the issue pruning the operation in physical plan. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
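To illustrate the redundancy being targeted, a small sketch assuming a local SparkSession named `spark`: a DataFrame that already has exactly one partition still gets an extra exchange or coalesce node when `repartition(1)` / `coalesce(1)` is applied.
{code:scala}
// spark.range(start, end, step, numPartitions): force a single input partition.
val onePartition = spark.range(0L, 10L, 1L, 1)
println(onePartition.rdd.getNumPartitions)   // 1

// Even though the child already has one partition, repartition(1) still inserts
// an Exchange (and coalesce(1) a Coalesce node) into the physical plan; that is
// the operation this ticket proposes to prune.
onePartition.repartition(1).explain()
onePartition.coalesce(1).explain()
{code}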