[jira] [Resolved] (SPARK-29923) Set `io.netty.tryReflectionSetAccessible` for Arrow on JDK9+
[ https://issues.apache.org/jira/browse/SPARK-29923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29923. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26552 [https://github.com/apache/spark/pull/26552] > Set `io.netty.tryReflectionSetAccessible` for Arrow on JDK9+ > > > Key: SPARK-29923 > URL: https://issues.apache.org/jira/browse/SPARK-29923 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29923) Set `io.netty.tryReflectionSetAccessible` for Arrow on JDK9+
[ https://issues.apache.org/jira/browse/SPARK-29923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29923: - Assignee: Dongjoon Hyun > Set `io.netty.tryReflectionSetAccessible` for Arrow on JDK9+ > > > Key: SPARK-29923 > URL: https://issues.apache.org/jira/browse/SPARK-29923 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26362) Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark contexts
[ https://issues.apache.org/jira/browse/SPARK-26362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26362: -- Labels: release-notes (was: releasenotes) > Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark > contexts > --- > > Key: SPARK-26362 > URL: https://issues.apache.org/jira/browse/SPARK-26362 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > Multiple Spark contexts are discouraged, and there has been a warning against > them for 4 years (see SPARK-4180). > They can cause arbitrary and mysterious error cases. (Honestly, I didn't even > know Spark allowed it.) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
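As a hedged illustration of what this removal means in practice (a minimal sketch assuming a plain local deployment; the exception type is an assumption, not wording from the ticket), creating a second active SparkContext is now expected to fail outright rather than be tolerated:
{code}
import org.apache.spark.{SparkConf, SparkContext, SparkException}

val conf = new SparkConf().setMaster("local[2]").setAppName("first")
val sc1 = new SparkContext(conf)

// With spark.driver.allowMultipleContexts removed, a second active context
// is rejected with an exception instead of merely producing a warning.
try {
  new SparkContext(conf.clone.setAppName("second"))
} catch {
  case e: SparkException => println(s"Second context rejected: ${e.getMessage}")
}

sc1.stop()
{code}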
[jira] [Updated] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26651: -- Labels: release-notes (was: ReleaseNote) > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: release-notes > > Spark 2.4 and previous versions use a hybrid calendar (Julian + Gregorian) in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark to the Proleptic Gregorian calendar, and to use the java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purposes of switching > to the Proleptic Gregorian calendar is to conform to the SQL standard, which assumes > such a calendar. > *Release note:* > Spark 3.0 has switched to the Proleptic Gregorian calendar for parsing, > formatting, and converting dates and timestamps, as well as for extracting > sub-components such as years and days. It uses Java 8 API classes from the > java.time packages that are based on [ISO chronology > |https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. > Previous versions of Spark performed those operations using [the hybrid > calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] > (Julian + Gregorian). The changes might impact the results for dates and > timestamps before October 15, 1582 (Gregorian). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
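A quick, hedged way to observe the behavior described in the release note (a sketch only; the literal values, and whether a given date shifts at all, depend on the Spark version and session time zone) is to run the same query on 2.4 and on 3.0 and compare the output for dates before October 15, 1582:
{code}
// Dates in the Julian-to-Gregorian transition range are the most likely to
// resolve differently under the hybrid calendar (Spark 2.4) and the
// Proleptic Gregorian calendar (Spark 3.0).
val df = spark.sql(
  "SELECT DATE '1582-10-05' AS d, CAST(TIMESTAMP '1000-01-01 00:00:00' AS BIGINT) AS epoch_seconds")
df.show(false)
{code}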
[jira] [Commented] (SPARK-29890) Unable to fill na with 0 with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975609#comment-16975609 ] Terry Kim commented on SPARK-29890: --- Sure. I will take a look. > Unable to fill na with 0 with duplicate columns > --- > > Key: SPARK-29890 > URL: https://issues.apache.org/jira/browse/SPARK-29890 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.3, 2.4.3 >Reporter: sandeshyapuram >Priority: Major > > Trying to fill out na values with 0. > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > val parent = > spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc") > val c1 = parent.filter(lit(true)) > val c2 = parent.filter(lit(true)) > c1.join(c2, Seq("nums"), "left") > .na.fill(0).show{noformat} > {noformat} > 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: > error looking up the name of group 820818257: No such file or directory > org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could > be: abc, abc.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1246) > at > org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134) > ... 54 elided{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
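A possible workaround for the self-join case reproduced above, while the underlying issue remains open, is to disambiguate the duplicated column before calling na.fill; this is only a sketch and the renamed column name is made up for illustration:
{code}
// Rename the conflicting column on one side of the self-join so that
// na.fill does not see two columns both named "abc".
val c2renamed = c2.withColumnRenamed("abc", "abc_right")  // hypothetical name

c1.join(c2renamed, Seq("nums"), "left")
  .na.fill(0)
  .show()
{code}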
[jira] [Commented] (SPARK-29890) Unable to fill na with 0 with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975606#comment-16975606 ] Wenchen Fan commented on SPARK-29890: - seems like another self-join bug. [~imback82] can you take a look? > Unable to fill na with 0 with duplicate columns > --- > > Key: SPARK-29890 > URL: https://issues.apache.org/jira/browse/SPARK-29890 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.3, 2.4.3 >Reporter: sandeshyapuram >Priority: Major > > Trying to fill out na values with 0. > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > val parent = > spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc") > val c1 = parent.filter(lit(true)) > val c2 = parent.filter(lit(true)) > c1.join(c2, Seq("nums"), "left") > .na.fill(0).show{noformat} > {noformat} > 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: > error looking up the name of group 820818257: No such file or directory > org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could > be: abc, abc.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1246) > at > org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134) > ... 54 elided{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29867) add __repr__ in Python ML Models
[ https://issues.apache.org/jira/browse/SPARK-29867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29867. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26489 [https://github.com/apache/spark/pull/26489] > add __repr__ in Python ML Models > > > Key: SPARK-29867 > URL: https://issues.apache.org/jira/browse/SPARK-29867 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > Some Python ML Models have a __repr__ method, others don't. In the > doctest, when calling Model.setXXX, some of the Models print out the > xxxModel... correctly, while others can't because they lack the __repr__ > method. This Jira addresses this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29867) add __repr__ in Python ML Models
[ https://issues.apache.org/jira/browse/SPARK-29867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29867: - Assignee: Huaxin Gao > add __repr__ in Python ML Models > > > Key: SPARK-29867 > URL: https://issues.apache.org/jira/browse/SPARK-29867 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > Some Python ML Models have a __repr__ method, others don't. In the > doctest, when calling Model.setXXX, some of the Models print out the > xxxModel... correctly, while others can't because they lack the __repr__ > method. This Jira addresses this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29890) Unable to fill na with 0 with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandeshyapuram updated SPARK-29890: --- Affects Version/s: 2.4.3 > Unable to fill na with 0 with duplicate columns > --- > > Key: SPARK-29890 > URL: https://issues.apache.org/jira/browse/SPARK-29890 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.3, 2.4.3 >Reporter: sandeshyapuram >Priority: Major > > Trying to fill out na values with 0. > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > val parent = > spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc") > val c1 = parent.filter(lit(true)) > val c2 = parent.filter(lit(true)) > c1.join(c2, Seq("nums"), "left") > .na.fill(0).show{noformat} > {noformat} > 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: > error looking up the name of group 820818257: No such file or directory > org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could > be: abc, abc.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1246) > at > org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134) > ... 54 elided{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29834) DESC DATABASE should look up catalog like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29834: - Assignee: Hu Fuwang > DESC DATABASE should look up catalog like v2 commands > - > > Key: SPARK-29834 > URL: https://issues.apache.org/jira/browse/SPARK-29834 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Assignee: Hu Fuwang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29834) DESC DATABASE should look up catalog like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29834. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26513 [https://github.com/apache/spark/pull/26513] > DESC DATABASE should look up catalog like v2 commands > - > > Key: SPARK-29834 > URL: https://issues.apache.org/jira/browse/SPARK-29834 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Assignee: Hu Fuwang >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29127) Add a Python, Pandas and PyArrow versions in clue at SQL query tests
[ https://issues.apache.org/jira/browse/SPARK-29127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29127. --- Fix Version/s: 3.0.0 Assignee: Hyukjin Kwon Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/26538 > Add a Python, Pandas and PyArrow versions in clue at SQL query tests > > > Key: SPARK-29127 > URL: https://issues.apache.org/jira/browse/SPARK-29127 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > Once Python test cases is failed in integrated UDF test cases, it's difficult > to find out the version informations. See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/113828/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql___Scalar_Pandas_UDF/ > as an example > It might be better to add the version information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29127) Add a Python, Pandas and PyArrow versions in clue at SQL query tests
[ https://issues.apache.org/jira/browse/SPARK-29127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29127: -- Reporter: Hyukjin Kwon (was: Burak Yavuz) > Add a Python, Pandas and PyArrow versions in clue at SQL query tests > > > Key: SPARK-29127 > URL: https://issues.apache.org/jira/browse/SPARK-29127 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Once Python test cases is failed in integrated UDF test cases, it's difficult > to find out the version informations. See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/113828/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql___Scalar_Pandas_UDF/ > as an example > It might be better to add the version information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29908) Add a Python, Pandas and PyArrow versions in clue at SQL query tests
[ https://issues.apache.org/jira/browse/SPARK-29908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29908: -- Priority: Blocker (was: Major) > Add a Python, Pandas and PyArrow versions in clue at SQL query tests > > > Key: SPARK-29908 > URL: https://issues.apache.org/jira/browse/SPARK-29908 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > Once Python test cases is failed in integrated UDF test cases, it's difficult > to find out the version informations. See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/113828/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql___Scalar_Pandas_UDF/ > as an example > It might be better to add the version information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29127) Add a Python, Pandas and PyArrow versions in clue at SQL query tests
[ https://issues.apache.org/jira/browse/SPARK-29127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29127: -- Summary: Add a Python, Pandas and PyArrow versions in clue at SQL query tests (was: Support partitioning for DataSource V2 tables in DataFrameWriter.save) > Add a Python, Pandas and PyArrow versions in clue at SQL query tests > > > Key: SPARK-29127 > URL: https://issues.apache.org/jira/browse/SPARK-29127 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Major > > Currently, any data source that that upgrades to DataSource V2 loses the > partition transform information when using DataFrameWriter.save. The main > reason is the lack of an API for "creating" a table with partitioning and > schema information for V2 tables without a catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29127) Add a Python, Pandas and PyArrow versions in clue at SQL query tests
[ https://issues.apache.org/jira/browse/SPARK-29127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29127: -- Description: Once Python test cases is failed in integrated UDF test cases, it's difficult to find out the version informations. See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/113828/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql___Scalar_Pandas_UDF/ as an example It might be better to add the version information. was:Currently, any data source that that upgrades to DataSource V2 loses the partition transform information when using DataFrameWriter.save. The main reason is the lack of an API for "creating" a table with partitioning and schema information for V2 tables without a catalog. > Add a Python, Pandas and PyArrow versions in clue at SQL query tests > > > Key: SPARK-29127 > URL: https://issues.apache.org/jira/browse/SPARK-29127 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Major > > Once Python test cases is failed in integrated UDF test cases, it's difficult > to find out the version informations. See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/113828/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql___Scalar_Pandas_UDF/ > as an example > It might be better to add the version information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29908) Support partitioning for DataSource V2 tables in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29908: -- Summary: Support partitioning for DataSource V2 tables in DataFrameWriter.save (was: Add a Python, Pandas and PyArrow versions in clue at SQL query tests) > Support partitioning for DataSource V2 tables in DataFrameWriter.save > - > > Key: SPARK-29908 > URL: https://issues.apache.org/jira/browse/SPARK-29908 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > Once Python test cases is failed in integrated UDF test cases, it's difficult > to find out the version informations. See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/113828/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql___Scalar_Pandas_UDF/ > as an example > It might be better to add the version information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29908) Add a Python, Pandas and PyArrow versions in clue at SQL query tests
[ https://issues.apache.org/jira/browse/SPARK-29908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29908: -- Component/s: (was: PySpark) > Add a Python, Pandas and PyArrow versions in clue at SQL query tests > > > Key: SPARK-29908 > URL: https://issues.apache.org/jira/browse/SPARK-29908 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > Once Python test cases is failed in integrated UDF test cases, it's difficult > to find out the version informations. See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/113828/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql___Scalar_Pandas_UDF/ > as an example > It might be better to add the version information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29127) Support partitioning for DataSource V2 tables in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29127: -- Priority: Major (was: Blocker) > Support partitioning for DataSource V2 tables in DataFrameWriter.save > - > > Key: SPARK-29127 > URL: https://issues.apache.org/jira/browse/SPARK-29127 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Major > > Currently, any data source that that upgrades to DataSource V2 loses the > partition transform information when using DataFrameWriter.save. The main > reason is the lack of an API for "creating" a table with partitioning and > schema information for V2 tables without a catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29127) Support partitioning for DataSource V2 tables in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29127: -- Component/s: PySpark > Support partitioning for DataSource V2 tables in DataFrameWriter.save > - > > Key: SPARK-29127 > URL: https://issues.apache.org/jira/browse/SPARK-29127 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Major > > Currently, any data source that that upgrades to DataSource V2 loses the > partition transform information when using DataFrameWriter.save. The main > reason is the lack of an API for "creating" a table with partitioning and > schema information for V2 tables without a catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29908) Support partitioning for DataSource V2 tables in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29908: -- Description: Currently, any data source that that upgrades to DataSource V2 loses the partition transform information when using DataFrameWriter.save. The main reason is the lack of an API for "creating" a table with partitioning and schema information for V2 tables without a catalog. (was: Once Python test cases is failed in integrated UDF test cases, it's difficult to find out the version informations. See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/113828/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql___Scalar_Pandas_UDF/ as an example It might be better to add the version information.) > Support partitioning for DataSource V2 tables in DataFrameWriter.save > - > > Key: SPARK-29908 > URL: https://issues.apache.org/jira/browse/SPARK-29908 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > Currently, any data source that that upgrades to DataSource V2 loses the > partition transform information when using DataFrameWriter.save. The main > reason is the lack of an API for "creating" a table with partitioning and > schema information for V2 tables without a catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29908) Support partitioning for DataSource V2 tables in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29908: -- Reporter: Burak Yavuz (was: Hyukjin Kwon) > Support partitioning for DataSource V2 tables in DataFrameWriter.save > - > > Key: SPARK-29908 > URL: https://issues.apache.org/jira/browse/SPARK-29908 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Blocker > > Currently, any data source that that upgrades to DataSource V2 loses the > partition transform information when using DataFrameWriter.save. The main > reason is the lack of an API for "creating" a table with partitioning and > schema information for V2 tables without a catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29127) Support partitioning for DataSource V2 tables in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975539#comment-16975539 ] Dongjoon Hyun commented on SPARK-29127: --- Hi, [~brkyvz] and [~hyukjin.kwon]. Sorry, but I'll switch the both JIRA issue IDs due to the following. - https://github.com/apache/spark/commit/7720781695d47fe0375f6e1150f6981b886686bd > Support partitioning for DataSource V2 tables in DataFrameWriter.save > - > > Key: SPARK-29127 > URL: https://issues.apache.org/jira/browse/SPARK-29127 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Blocker > > Currently, any data source that that upgrades to DataSource V2 loses the > partition transform information when using DataFrameWriter.save. The main > reason is the lack of an API for "creating" a table with partitioning and > schema information for V2 tables without a catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29924) Document Arrow requirement in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-29924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975537#comment-16975537 ] Dongjoon Hyun commented on SPARK-29924: --- cc [~bryanc] > Document Arrow requirement in JDK9+ > --- > > Key: SPARK-29924 > URL: https://issues.apache.org/jira/browse/SPARK-29924 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > At least, we need to mention that `io.netty.tryReflectionSetAccessible=true` is > required for the Arrow runtime on JDK9+ environments -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29924) Document Arrow requirement in JDK9+
Dongjoon Hyun created SPARK-29924: - Summary: Document Arrow requirement in JDK9+ Key: SPARK-29924 URL: https://issues.apache.org/jira/browse/SPARK-29924 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 3.0.0 Reporter: Dongjoon Hyun At least, we need to mention that `io.netty.tryReflectionSetAccessible=true` is required for the Arrow runtime on JDK9+ environments -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
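As a sketch of what such documentation might show (an assumption about the eventual wording, not content from the ticket; the property names are the standard Spark JVM-option settings):
{code}
// The flag must reach the JVMs as a system property. Executor JVMs can pick
// it up from Spark conf at session build time; the driver JVM must receive it
// at launch (for example via spark-defaults.conf or --driver-java-options),
// since the driver is already running by the time builder config is applied.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("arrow-on-jdk11")  // hypothetical application name
  .config("spark.executor.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true")
  .getOrCreate()
{code}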
[jira] [Updated] (SPARK-29923) Set `io.netty.tryReflectionSetAccessible` for Arrow on JDK9+
[ https://issues.apache.org/jira/browse/SPARK-29923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29923: -- Parent: SPARK-29194 Issue Type: Sub-task (was: Improvement) > Set `io.netty.tryReflectionSetAccessible` for Arrow on JDK9+ > > > Key: SPARK-29923 > URL: https://issues.apache.org/jira/browse/SPARK-29923 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29923) Set `io.netty.tryReflectionSetAccessible` for Arrow on JDK9+
Dongjoon Hyun created SPARK-29923: - Summary: Set `io.netty.tryReflectionSetAccessible` for Arrow on JDK9+ Key: SPARK-29923 URL: https://issues.apache.org/jira/browse/SPARK-29923 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29833) Add FileNotFoundException check for spark.yarn.jars
[ https://issues.apache.org/jira/browse/SPARK-29833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-29833. Fix Version/s: 3.0.0 Assignee: ulysses you Resolution: Fixed > Add FileNotFoundException check for spark.yarn.jars > > > Key: SPARK-29833 > URL: https://issues.apache.org/jira/browse/SPARK-29833 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.4 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.0.0 > > > When `spark.yarn.jars=/xxx/xxx` is set to a path with no scheme, Spark > will throw a NullPointerException. > The reason is that HDFS returns null from pathFs.globStatus(path) if the path does not exist, > and Spark just uses `pathFs.globStatus(path).filter(_.isFile())` without checking > it. > The related Globber code is here > {noformat} > /* > * When the input pattern "looks" like just a simple filename, and we > * can't find it, we return null rather than an empty array. > * This is a special case which the shell relies on. > * > * To be more precise: if there were no results, AND there were no > * groupings (aka brackets), and no wildcards in the input (aka stars), > * we return null. > */ > if ((!sawWildcard) && results.isEmpty() && > (flattenedPatterns.size() <= 1)) { > return null; > } > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
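A minimal sketch of the kind of guard the fix implies, assuming a Hadoop FileSystem handle named pathFs (the helper name is made up for illustration):
{code}
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// globStatus may return null (not an empty array) when a simple,
// wildcard-free pattern matches nothing, so wrap the result in Option and
// fail with a clear FileNotFoundException instead of an NPE.
def listJarStatuses(pathFs: FileSystem, path: Path): Seq[FileStatus] =  // hypothetical helper
  Option(pathFs.globStatus(path)) match {
    case Some(statuses) => statuses.filter(_.isFile()).toSeq
    case None => throw new java.io.FileNotFoundException(s"$path does not exist")
  }
{code}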
[jira] [Resolved] (SPARK-29904) Parse timestamps in microsecond precision by JSON/CSV datasources
[ https://issues.apache.org/jira/browse/SPARK-29904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29904. --- Fix Version/s: 2.4.5 Assignee: Maxim Gekk Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/26507 > Parse timestamps in microsecond precision by JSON/CSV datasources > - > > Key: SPARK-29904 > URL: https://issues.apache.org/jira/browse/SPARK-29904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.4.5 > > > Currently, Spark can parse strings with timestamps from JSON/CSV in > millisecond precision. Internally, timestamps have microsecond precision. The > ticket aims to modify parsing logic in Spark 2.4 to support the microsecond > precision. Porting of DateFormatter/TimestampFormatter from Spark 3.0-preview > is risky, so, need to find another lighter solution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
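An illustrative sketch of the case the ticket targets (the JSON record, schema, and format pattern are assumptions; whether the trailing microseconds survive parsing depends on the Spark 2.4 patch level):
{code}
import org.apache.spark.sql.types.{StructField, StructType, TimestampType}
import spark.implicits._

val schema = StructType(Seq(StructField("ts", TimestampType)))

// A JSON record whose timestamp carries microsecond precision. Before the
// fix, the fractional part could be truncated or mis-parsed at millisecond
// precision even though Spark stores timestamps in microseconds internally.
val ds = Seq("""{"ts": "2019-11-14 12:34:56.123456"}""").toDS()

val df = spark.read
  .schema(schema)
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS")  // assumed pattern
  .json(ds)

df.show(false)
{code}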
[jira] [Updated] (SPARK-29904) Parse timestamps in microsecond precision by JSON/CSV datasources
[ https://issues.apache.org/jira/browse/SPARK-29904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29904: -- Issue Type: Bug (was: Improvement) > Parse timestamps in microsecond precision by JSON/CSV datasources > - > > Key: SPARK-29904 > URL: https://issues.apache.org/jira/browse/SPARK-29904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Maxim Gekk >Priority: Major > > Currently, Spark can parse strings with timestamps from JSON/CSV in > millisecond precision. Internally, timestamps have microsecond precision. The > ticket aims to modify parsing logic in Spark 2.4 to support the microsecond > precision. Porting of DateFormatter/TimestampFormatter from Spark 3.0-preview > is risky, so, need to find another lighter solution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29829) SHOW TABLE EXTENDED should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29829: -- Fix Version/s: (was: 3.1.0) 3.0.0 > SHOW TABLE EXTENDED should look up catalog/table like v2 commands > - > > Key: SPARK-29829 > URL: https://issues.apache.org/jira/browse/SPARK-29829 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Assignee: Pablo Langa Blanco >Priority: Major > Fix For: 3.0.0 > > > SHOW TABLE EXTENDED should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29829) SHOW TABLE EXTENDED should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29829. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 26540 [https://github.com/apache/spark/pull/26540] > SHOW TABLE EXTENDED should look up catalog/table like v2 commands > - > > Key: SPARK-29829 > URL: https://issues.apache.org/jira/browse/SPARK-29829 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Assignee: Pablo Langa Blanco >Priority: Major > Fix For: 3.1.0 > > > SHOW TABLE EXTENDED should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29829) SHOW TABLE EXTENDED should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29829: - Assignee: Pablo Langa Blanco > SHOW TABLE EXTENDED should look up catalog/table like v2 commands > - > > Key: SPARK-29829 > URL: https://issues.apache.org/jira/browse/SPARK-29829 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Assignee: Pablo Langa Blanco >Priority: Major > > SHOW TABLE EXTENDED should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29794) Column level compression
[ https://issues.apache.org/jira/browse/SPARK-29794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Vyas updated SPARK-29794: - Affects Version/s: 3.0.0 > Column level compression > > > Key: SPARK-29794 > URL: https://issues.apache.org/jira/browse/SPARK-29794 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4, 3.0.0 >Reporter: Anirudh Vyas >Priority: Minor > > Currently in spark we do not have capability to specify different > compressions for different columns, however this capability exists in parquet > format for example. > > Not sure if this has been opened before (I am sure it might have been but I > cannot find it), hence opening a lane for potential improvement. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29334) Supported vector operators in scala should have parity with pySpark
[ https://issues.apache.org/jira/browse/SPARK-29334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-29334: - Shepherd: (was: Sean R. Owen) > Supported vector operators in scala should have parity with pySpark > > > Key: SPARK-29334 > URL: https://issues.apache.org/jira/browse/SPARK-29334 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 3.0.0 >Reporter: Patrick Pisciuneri >Priority: Minor > > pySpark supports various overloaded operators for the DenseVector type that > the scala class does not support. > - ML: > https://github.com/apache/spark/blob/master/python/pyspark/ml/linalg/__init__.py#L441-L462 > - MLLIB: > https://github.com/apache/spark/blob/master/python/pyspark/mllib/linalg/__init__.py#L485-L506 > We should be able to leverage the BLAS wrappers to implement these methods on > the scala side. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29911) Cache table may memory leak when session closed
[ https://issues.apache.org/jira/browse/SPARK-29911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975307#comment-16975307 ] Dongjoon Hyun commented on SPARK-29911: --- Hi, [~cltlfcjin]. Since is reported as a memory leakage issue, could you check the older Spark version and update the `Affected Versions` of this JIRA issue please? > Cache table may memory leak when session closed > --- > > Key: SPARK-29911 > URL: https://issues.apache.org/jira/browse/SPARK-29911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > Attachments: Screen Shot 2019-11-15 at 2.03.49 PM.png > > > How to reproduce: > 1. create a local temporary view v1 > 2. cache it in memory > 3. close session without drop v1. > The application will hold the memory forever. In a long running thrift server > scenario. It's worse. > {code} > 0: jdbc:hive2://localhost:1> CACHE TABLE testCacheTable AS SELECT 1; > CACHE TABLE testCacheTable AS SELECT 1; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (1.498 seconds) > 0: jdbc:hive2://localhost:1> !close > !close > Closing: 0: jdbc:hive2://localhost:1 > 0: jdbc:hive2://localhost:1 (closed)> !connect > 'jdbc:hive2://localhost:1' > !connect 'jdbc:hive2://localhost:1' > Connecting to jdbc:hive2://localhost:1 > Enter username for jdbc:hive2://localhost:1: > lajin > Enter password for jdbc:hive2://localhost:1: > *** > Connected to: Spark SQL (version 3.0.0-SNAPSHOT) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 1: jdbc:hive2://localhost:1> select * from testCacheTable; > select * from testCacheTable; > Error: Error running query: org.apache.spark.sql.AnalysisException: Table or > view not found: testCacheTable; line 1 pos 14; > 'Project [*] > +- 'UnresolvedRelation [testCacheTable] (state=,code=0) > {code} > !Screen Shot 2019-11-15 at 2.03.49 PM.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
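Until the root cause is fixed, a hedged mitigation sketch (using the names from the reproduction above) is to release the cache explicitly before the session goes away:
{code}
// Explicitly free the cached data and drop the local temp view before
// closing the session, so the cached blocks are not kept alive.
spark.sql("UNCACHE TABLE IF EXISTS testCacheTable")
spark.catalog.dropTempView("testCacheTable")
{code}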
[jira] [Commented] (SPARK-29918) RecordBinaryComparator should check endianness when compared by long
[ https://issues.apache.org/jira/browse/SPARK-29918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975305#comment-16975305 ] Dongjoon Hyun commented on SPARK-29918: --- Hi, [~EdisonWang]. What about the older Spark versions? > RecordBinaryComparator should check endianness when compared by long > > > Key: SPARK-29918 > URL: https://issues.apache.org/jira/browse/SPARK-29918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: EdisonWang >Priority: Minor > Labels: correctness > > If the architecture supports unaligned access or the offset is 8-byte aligned, > RecordBinaryComparator compares 8 bytes at a time by reading them as a > long. Otherwise, it compares byte by byte. > However, on a little-endian machine, the result of comparing by a long value > and comparing byte by byte may differ. If the architectures in a YARN > cluster differ (some are unaligned-access capable while others are not), then > the order of two records after sorting is nondeterministic, which will result > in the same problem as in https://issues.apache.org/jira/browse/SPARK-23207 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
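A small self-contained sketch of why the two comparison strategies can disagree on a little-endian machine (the byte arrays are made up for illustration):
{code}
import java.nio.{ByteBuffer, ByteOrder}

// Two 8-byte records. Byte-by-byte comparison says a > b because a(0) = 1
// while b(0) = 0. Read as little-endian longs, a = 1 and b = 2^57, so the
// long comparison says a < b: the opposite ordering.
val a = Array[Byte](1, 0, 0, 0, 0, 0, 0, 0)
val b = Array[Byte](0, 0, 0, 0, 0, 0, 0, 2)

def asLittleEndianLong(bytes: Array[Byte]): Long =
  ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong

val byteWise = a.zip(b).collectFirst { case (x, y) if x != y => java.lang.Byte.compare(x, y) }.getOrElse(0)
val longWise = java.lang.Long.compare(asLittleEndianLong(a), asLittleEndianLong(b))

println(s"byte-by-byte: $byteWise, as little-endian long: $longWise")  // the signs differ
{code}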
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975303#comment-16975303 ] Dongjoon Hyun commented on SPARK-29900: --- Thank you for pinging me, [~cloud_fan]. [~imback82]. When you compile the list, please consider `global temp view` together (which is different from a normal temp view). > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are temp view and table > with the same name, but users can easily use qualified table name to > disambiguate. > In postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 rows) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher proirity during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29906) Reading of csv file fails with adaptive execution turned on
[ https://issues.apache.org/jira/browse/SPARK-29906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-29906: -- Labels: correctness (was: ) > Reading of csv file fails with adaptive execution turned on > --- > > Key: SPARK-29906 > URL: https://issues.apache.org/jira/browse/SPARK-29906 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: build from master today nov 14 > commit fca0a6c394990b86304a8f9a64bf4c7ec58abbd6 (HEAD -> master, > upstream/master, upstream/HEAD) > Author: Kevin Yu > Date: Thu Nov 14 14:58:32 2019 -0600 > build using: > $ dev/make-distribution.sh --tgz -Phadoop-2.7 -Dhadoop.version=2.7.4 -Pyarn > deployed on AWS EMR 5.28 with 10 m5.xlarge slaves > in spark-env.sh: > HADOOP_CONF_DIR=/etc/hadoop/conf > in spark-defaults.conf: > spark.master yarn > spark.submit.deployMode client > spark.serializer org.apache.spark.serializer.KryoSerializer > spark.hadoop.yarn.timeline-service.enabled false > spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar > spark.driver.extraLibraryPath > /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native > spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar > spark.executor.extraLibraryPath > /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native >Reporter: koert kuipers >Priority: Major > Labels: correctness > > we observed an issue where spark seems to confuse a data line (not the first > line of the csv file) for the csv header when it creates the schema. > {code} > $ wget http://download.cms.gov/openpayments/PGYR13_P062819.ZIP > $ unzip PGYR13_P062819.ZIP > $ hadoop fs -put OP_DTL_GNRL_PGYR2013_P06282019.csv > $ spark-3.0.0-SNAPSHOT-bin-2.7.4/bin/spark-shell --conf > spark.sql.adaptive.enabled=true --num-executors 10 > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 19/11/15 00:26:47 WARN yarn.Client: Neither spark.yarn.jars nor > spark.yarn.archive is set, falling back to uploading libraries under > SPARK_HOME. > Spark context Web UI available at http://ip-xx-xxx-x-xxx.ec2.internal:4040 > Spark context available as 'sc' (master = yarn, app id = > application_1573772077642_0006). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT > /_/ > > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_222) > Type in expressions to have them evaluated. > Type :help for more information. > scala> spark.read.format("csv").option("header", > true).option("enforceSchema", > false).load("OP_DTL_GNRL_PGYR2013_P06282019.csv").show(1) > 19/11/15 00:27:10 WARN util.package: Truncated the string representation of a > plan since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > [Stage 2:>(0 + 10) / > 17]19/11/15 00:27:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage > 2.0 (TID 35, ip-xx-xxx-x-xxx.ec2.internal, executor 1): > java.lang.IllegalArgumentException: CSV header does not conform to the schema. 
> Header: Change_Type, Covered_Recipient_Type, Teaching_Hospital_CCN, > Teaching_Hospital_ID, Teaching_Hospital_Name, Physician_Profile_ID, > Physician_First_Name, Physician_Middle_Name, Physician_Last_Name, > Physician_Name_Suffix, Recipient_Primary_Business_Street_Address_Line1, > Recipient_Primary_Business_Street_Address_Line2, Recipient_City, > Recipient_State, Recipient_Zip_Code, Recipient_Country, Recipient_Province, > Recipient_Postal_Code, Physician_Primary_Type, Physician_Specialty, > Physician_License_State_code1, Physician_License_State_code2, > Physician_License_State_code3, Physician_License_State_code4, > Physician_License_State_code5, > Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name, > Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID, > Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name, > Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State, > Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country, > Total_Amount_of_Payment_USDollars, Date_of_Payment, > Number_of_Payments_Included_in_Total_Amount, > Form_of_Payment_or_Transfer_of_Value, Nature_of_Payment_or_Transfer_of_Value, > City_of_Travel, State_of_Travel, Country_of_Travel, > Physician_Ownership_Indicator, Third_Party_Payment_Recipient_Indicator, > Name_of_Third_Party_Entity_Receiving_Payment_or_Transfer_of_Value, > Charity_Indicator, Third_Party_Equals_Covered_Recipient_Indicator, > Contextual_Information,
[jira] [Commented] (SPARK-29906) Reading of csv file fails with adaptive execution turned on
[ https://issues.apache.org/jira/browse/SPARK-29906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975286#comment-16975286 ] koert kuipers commented on SPARK-29906: --- note that with the default option for csv being enforceSchema=false this will not fail but produce incorrect results. therefore it is correctness issue. > Reading of csv file fails with adaptive execution turned on > --- > > Key: SPARK-29906 > URL: https://issues.apache.org/jira/browse/SPARK-29906 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: build from master today nov 14 > commit fca0a6c394990b86304a8f9a64bf4c7ec58abbd6 (HEAD -> master, > upstream/master, upstream/HEAD) > Author: Kevin Yu > Date: Thu Nov 14 14:58:32 2019 -0600 > build using: > $ dev/make-distribution.sh --tgz -Phadoop-2.7 -Dhadoop.version=2.7.4 -Pyarn > deployed on AWS EMR 5.28 with 10 m5.xlarge slaves > in spark-env.sh: > HADOOP_CONF_DIR=/etc/hadoop/conf > in spark-defaults.conf: > spark.master yarn > spark.submit.deployMode client > spark.serializer org.apache.spark.serializer.KryoSerializer > spark.hadoop.yarn.timeline-service.enabled false > spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar > spark.driver.extraLibraryPath > /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native > spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar > spark.executor.extraLibraryPath > /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native >Reporter: koert kuipers >Priority: Major > Labels: correctness > > we observed an issue where spark seems to confuse a data line (not the first > line of the csv file) for the csv header when it creates the schema. > {code} > $ wget http://download.cms.gov/openpayments/PGYR13_P062819.ZIP > $ unzip PGYR13_P062819.ZIP > $ hadoop fs -put OP_DTL_GNRL_PGYR2013_P06282019.csv > $ spark-3.0.0-SNAPSHOT-bin-2.7.4/bin/spark-shell --conf > spark.sql.adaptive.enabled=true --num-executors 10 > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 19/11/15 00:26:47 WARN yarn.Client: Neither spark.yarn.jars nor > spark.yarn.archive is set, falling back to uploading libraries under > SPARK_HOME. > Spark context Web UI available at http://ip-xx-xxx-x-xxx.ec2.internal:4040 > Spark context available as 'sc' (master = yarn, app id = > application_1573772077642_0006). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT > /_/ > > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_222) > Type in expressions to have them evaluated. > Type :help for more information. > scala> spark.read.format("csv").option("header", > true).option("enforceSchema", > false).load("OP_DTL_GNRL_PGYR2013_P06282019.csv").show(1) > 19/11/15 00:27:10 WARN util.package: Truncated the string representation of a > plan since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > [Stage 2:>(0 + 10) / > 17]19/11/15 00:27:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage > 2.0 (TID 35, ip-xx-xxx-x-xxx.ec2.internal, executor 1): > java.lang.IllegalArgumentException: CSV header does not conform to the schema. 
> Header: Change_Type, Covered_Recipient_Type, Teaching_Hospital_CCN, > Teaching_Hospital_ID, Teaching_Hospital_Name, Physician_Profile_ID, > Physician_First_Name, Physician_Middle_Name, Physician_Last_Name, > Physician_Name_Suffix, Recipient_Primary_Business_Street_Address_Line1, > Recipient_Primary_Business_Street_Address_Line2, Recipient_City, > Recipient_State, Recipient_Zip_Code, Recipient_Country, Recipient_Province, > Recipient_Postal_Code, Physician_Primary_Type, Physician_Specialty, > Physician_License_State_code1, Physician_License_State_code2, > Physician_License_State_code3, Physician_License_State_code4, > Physician_License_State_code5, > Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name, > Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID, > Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name, > Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State, > Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country, > Total_Amount_of_Payment_USDollars, Date_of_Payment, > Number_of_Payments_Included_in_Total_Amount, > Form_of_Payment_or_Transfer_of_Value, Nature_of_Payment_or_Transfer_of_Value, > City_of_Travel, State_of_Travel, Country_of_Travel, > Physician_Ownership_Indicator, Third_Party_Payment_Recipient_Indicator, >
[jira] [Created] (SPARK-29922) SHOW FUNCTIONS should look up catalog/table like v2 commands
Pablo Langa Blanco created SPARK-29922: -- Summary: SHOW FUNCTIONS should look up catalog/table like v2 commands Key: SPARK-29922 URL: https://issues.apache.org/jira/browse/SPARK-29922 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Pablo Langa Blanco SHOW FUNCTIONS should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29922) SHOW FUNCTIONS should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975272#comment-16975272 ] Pablo Langa Blanco commented on SPARK-29922: I'm working on this > SHOW FUNCTIONS should look up catalog/table like v2 commands > > > Key: SPARK-29922 > URL: https://issues.apache.org/jira/browse/SPARK-29922 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Priority: Major > > SHOW FUNCTIONS should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29921) SparkContext LiveListenerBus
Arun sethia created SPARK-29921: --- Summary: SparkContext LiveListenerBus Key: SPARK-29921 URL: https://issues.apache.org/jira/browse/SPARK-29921 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.4, 2.4.3, 2.4.1 Reporter: Arun sethia Hi, I am not sure what the advantage is of keeping the listenerBus function package-private in org.apache.spark.SparkContext: private[spark] def listenerBus: LiveListenerBus = _listenerBus This prevents anyone from publishing a custom SparkListenerEvent to the LiveListenerBus. Thanks, Arun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
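For context, here is a minimal Scala sketch of the asymmetry described above: receiving arbitrary SparkListenerEvents is already possible through the public listener API, while posting them runs into the package-private listenerBus. MyCustomEvent and MyListener are hypothetical names used only for illustration.
{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

// Hypothetical custom event a user might want to publish on the bus.
case class MyCustomEvent(payload: String) extends SparkListenerEvent

// Receiving such events is already possible through the public listener API.
class MyListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case MyCustomEvent(p) => println(s"received custom event: $p")
    case _ => // ignore built-in events
  }
}

def register(sc: SparkContext): Unit = {
  sc.addSparkListener(new MyListener())        // public API: listening works
  // sc.listenerBus.post(MyCustomEvent("hi"))  // does not compile from user code,
  //                                           // because listenerBus is private[spark]
}
{code}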
[jira] [Created] (SPARK-29920) Parsing failure on interval '20 15' day to hour
Maxim Gekk created SPARK-29920: -- Summary: Parsing failure on interval '20 15' day to hour Key: SPARK-29920 URL: https://issues.apache.org/jira/browse/SPARK-29920 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk {code:sql} spark-sql> select interval '20 15' day to hour; Error in query: requirement failed: Interval string must match day-time format of 'd h:m:s.n': 20 15(line 1, pos 16) == SQL == select interval '20 15' day to hour ^^^ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29919) remove python2 test execution
Shane Knapp created SPARK-29919: --- Summary: remove python2 test execution Key: SPARK-29919 URL: https://issues.apache.org/jira/browse/SPARK-29919 Project: Spark Issue Type: Sub-task Components: PySpark, Tests Affects Versions: 3.0.0 Reporter: Shane Knapp Assignee: Shane Knapp remove python2.7 (including pypy2) test executables from 'python/run-tests.py' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29918) RecordBinaryComparator should check endianness when compared by long
[ https://issues.apache.org/jira/browse/SPARK-29918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-29918: --- Labels: correctness (was: ) > RecordBinaryComparator should check endianness when compared by long > > > Key: SPARK-29918 > URL: https://issues.apache.org/jira/browse/SPARK-29918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: EdisonWang >Priority: Minor > Labels: correctness > > If the architecture supports unaligned access or the offset is 8-byte aligned, > RecordBinaryComparator compares 8 bytes at a time by reading them as a long. > Otherwise, it compares byte by byte. > However, on a little-endian machine, comparing by long values and comparing byte > by byte may give different results. If the architectures in a YARN cluster differ > (some are unaligned-access capable while others are not), then the order of two > records after sorting is undetermined, which leads to the same problem as in > https://issues.apache.org/jira/browse/SPARK-23207 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29918) RecordBinaryComparator should check endianness when compared by long
EdisonWang created SPARK-29918: -- Summary: RecordBinaryComparator should check endianness when compared by long Key: SPARK-29918 URL: https://issues.apache.org/jira/browse/SPARK-29918 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: EdisonWang If the architecture supports unaligned access or the offset is 8-byte aligned, RecordBinaryComparator compares 8 bytes at a time by reading them as a long. Otherwise, it compares byte by byte. However, on a little-endian machine, comparing by long values and comparing byte by byte may give different results. If the architectures in a YARN cluster differ (some are unaligned-access capable while others are not), then the order of two records after sorting is undetermined, which leads to the same problem as in https://issues.apache.org/jira/browse/SPARK-23207 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
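To make the endianness concern concrete, here is a small self-contained Scala sketch (not Spark code) showing that interpreting 8 bytes as a little-endian long can order two records differently than an unsigned byte-by-byte comparison, while a big-endian read agrees with the byte-wise order:
{code:scala}
import java.nio.{ByteBuffer, ByteOrder}

// Two 8-byte records that differ only in which end carries the 0x01 byte.
val a = Array[Byte](1, 0, 0, 0, 0, 0, 0, 0)
val b = Array[Byte](0, 0, 0, 0, 0, 0, 0, 1)

// Unsigned, lexicographic byte-by-byte comparison (the slow path's semantics).
def compareBytes(x: Array[Byte], y: Array[Byte]): Int = {
  var i = 0
  while (i < x.length) {
    val c = (x(i) & 0xff) - (y(i) & 0xff)
    if (c != 0) return c
    i += 1
  }
  0
}

def readLong(bytes: Array[Byte], order: ByteOrder): Long =
  ByteBuffer.wrap(bytes).order(order).getLong

println(compareBytes(a, b) > 0)                       // true: a > b byte by byte
println(java.lang.Long.compareUnsigned(
  readLong(a, ByteOrder.LITTLE_ENDIAN),
  readLong(b, ByteOrder.LITTLE_ENDIAN)) > 0)          // false: little-endian longs disagree
println(java.lang.Long.compareUnsigned(
  readLong(a, ByteOrder.BIG_ENDIAN),
  readLong(b, ByteOrder.BIG_ENDIAN)) > 0)             // true: big-endian longs agree
{code}
One possible direction, suggested by the ticket title, is to normalize the byte order on little-endian hosts (for example with java.lang.Long.reverseBytes) before the long comparison.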
[jira] [Created] (SPARK-29917) Provide functionality to rename Receivers on Spark Streaming Page
Burak KÖSE created SPARK-29917: -- Summary: Provide functionality to rename Receivers on Spark Streaming Page Key: SPARK-29917 URL: https://issues.apache.org/jira/browse/SPARK-29917 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 2.4.4 Reporter: Burak KÖSE In ReceiverSupervisorImpl, the receiver name is hardcoded (via getSimpleName) to the class name of the receiver. Spark should provide functionality for users to set custom names for receivers. It would be especially useful for users running many Receivers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
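As a minimal sketch of the current behaviour (class name and body are illustrative only): the label shown on the Streaming page comes from the receiver's class name, so the only way to get distinct labels today is to define distinct receiver classes.
{code:scala}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A trivial custom receiver; the Streaming page currently labels it with the
// class simple name ("OrdersReceiver"), since ReceiverSupervisorImpl derives
// the displayed name via getSimpleName. There is no setter to override it.
class OrdersReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  override def onStart(): Unit = {
    // Normally: start a background thread that calls store(...) with received data.
  }
  override def onStop(): Unit = ()
}
{code}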
[jira] [Resolved] (SPARK-29902) Add listener event queue capacity configuration to documentation
[ https://issues.apache.org/jira/browse/SPARK-29902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29902. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 26529 [https://github.com/apache/spark/pull/26529] > Add listener event queue capacity configuration to documentation > > > Key: SPARK-29902 > URL: https://issues.apache.org/jira/browse/SPARK-29902 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: shahid >Assignee: shahid >Priority: Minor > Fix For: 3.1.0 > > > Add listener event queue capacity configuration to documentation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29902) Add listener event queue capacity configuration to documentation
[ https://issues.apache.org/jira/browse/SPARK-29902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29902: Assignee: shahid > Add listener event queue capacity configuration to documentation > > > Key: SPARK-29902 > URL: https://issues.apache.org/jira/browse/SPARK-29902 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: shahid >Assignee: shahid >Priority: Minor > > Add listener event queue capacity configuration to documentation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29902) Add listener event queue capacity configuration to documentation
[ https://issues.apache.org/jira/browse/SPARK-29902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-29902: - Fix Version/s: (was: 3.1.0) 3.0.0 > Add listener event queue capacity configuration to documentation > > > Key: SPARK-29902 > URL: https://issues.apache.org/jira/browse/SPARK-29902 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: shahid >Assignee: shahid >Priority: Minor > Fix For: 3.0.0 > > > Add listener event queue capacity configuration to documentation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
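For readers looking for the setting being documented: the capacity of the shared listener event queue can be raised when listeners fall behind and events are dropped. A minimal sketch (the value 20000 is only an example; the default is 10000):
{code:scala}
import org.apache.spark.sql.SparkSession

// A larger listener event queue reduces the chance of dropped events at the
// cost of extra driver memory; tune the value to the workload.
val spark = SparkSession.builder()
  .appName("listener-queue-capacity-example")
  .config("spark.scheduler.listenerbus.eventqueue.capacity", "20000")
  .getOrCreate()
{code}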
[jira] [Created] (SPARK-29916) spark on kubernetes fails with hadoop-3.2 due to the user not existing in executor pod
Michał Wesołowski created SPARK-29916: - Summary: spark on kubernetes fails with hadoop-3.2 due to the user not existing in executor pod Key: SPARK-29916 URL: https://issues.apache.org/jira/browse/SPARK-29916 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.0.0 Reporter: Michał Wesołowski I'm running tests on Kubernetes with the spark-3.0-preview version and hadoop-3.2 libraries. I needed cloud library support (Azure in particular), so this is a build based on the v3.0.0-preview tag with the cloud profile, since the provided binaries don't contain it. I ran a simple computation on AKS (Azure Kubernetes Service) against Azure Data Lake Storage Gen2, and it fails with the following error: {code:java} py4j.protocol.Py4JJavaError: An error occurred while calling o49.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 10.244.2.6, executor 1): java.io.IOException: There is no primary group for UGI localuser(auth:SIMPLE) at org.apache.hadoop.security.UserGroupInformation.getPrimaryGroupName(UserGroupInformation.java:1455) at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.(AzureBlobFileSystemStore.java:136) at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:108) {code} It looks like the Hadoop library expects the user "localuser" to exist in the executor pod. This is the user that invoked spark-submit on my local machine; however, I didn't set it explicitly. I investigated the pods and this user is set in the SPARK_USER environment variable in both the executor and driver pods. Relevant logs from the executor: {code:java} 19/11/15 12:56:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, localuser); groups with view permissions: Set(); users with modify permissions: Set(root, localuser); groups with modify permissions: Set() ... 19/11/15 12:56:53 INFO SecurityManager: Changing view acls to: root,localuser 19/11/15 12:56:53 INFO SecurityManager: Changing modify acls to: root,localuser 19/11/15 12:56:53 INFO SecurityManager: Changing view acls groups to: 19/11/15 12:56:53 INFO SecurityManager: Changing modify acls groups to: ... 19/11/15 12:57:02 WARN ShellBasedUnixGroupsMapping: unable to return groups for user localuser PartialGroupNameException The user name 'localuser' is not found. 
id: ‘localuser’: no such user id: ‘localuser’: no such userat org.apache.hadoop.security.ShellBasedUnixGroupsMapping.resolvePartialGroupNames(ShellBasedUnixGroupsMapping.java:294) at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUnixGroups(ShellBasedUnixGroupsMapping.java:207) at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getGroups(ShellBasedUnixGroupsMapping.java:97) at org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.getGroups(JniBasedUnixGroupsMappingWithFallback.java:51) at org.apache.hadoop.security.Groups$GroupCacheLoader.fetchGroupList(Groups.java:387) at org.apache.hadoop.security.Groups$GroupCacheLoader.load(Groups.java:321) at org.apache.hadoop.security.Groups$GroupCacheLoader.load(Groups.java:270) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257) at com.google.common.cache.LocalCache.get(LocalCache.java:4000) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004) at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) at org.apache.hadoop.security.Groups.getGroups(Groups.java:228) at org.apache.hadoop.security.UserGroupInformation.getGroups(UserGroupInformation.java:1588) at org.apache.hadoop.security.UserGroupInformation.getPrimaryGroupName(UserGroupInformation.java:1453) at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.(AzureBlobFileSystemStore.java:136) at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:108) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479) {code} One woraround for this I've found is
[jira] [Commented] (SPARK-29748) Remove sorting of fields in PySpark SQL Row creation
[ https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975054#comment-16975054 ] Maciej Szymkiewicz commented on SPARK-29748: [~jhereth] {quote}With simply removing sorting we change the semantics, e.g. `Row(a=1, b=2) != Row(b=2, a=1)` (opposed to what we currently have.{quote} It is even more messy. At the moment we adhere to {{tuple}} semantics so {{Row(a=1, b=2) == Row(y=1, z=2)}}. That might be acceptable (namedtuples use the same approach, but I think we should state that explicitly). {quote}I think Maciej Szymkiewicz was thinking about changes for the upcoming 3.0?{quote} Indeed. [~bryanc] Let me clarify things - I am not suggesting that any of these changes should be implemented here. Instead I think we should have clear picture what {{Row}} suppose to be (not only in terms of API, but also intended applications) before we decide on a concrete solution. That's particularly important because we already have special cases that were introduced specifically to target {{**kwargs}} and sorting behavior. That being said, if we want to discuss this case in isolation * Introducing {{LegacyRow}} seems to make little sense if implementation of {{Row}} stays the same otherwise. Sorting or not, depending on the config, should be enough. * {quote} Users with Python < 3.6 will have to create Rows with an OrderedDict or by using the Row class as a factory (explained in the pydoc). {quote} I don't think we should introduce such behavior now, when 3.5 is deprecated. Having yet another way to initialize {{Row}} will be confusing at best (and introduce new problems when using complex structures). Furthermore we already have one mechanism that provides ordered behavior independent of version. Instead I'd suggest we: * Make legacy behavior the only option for Python < 3.6. * For Python 3.6 let's introduce legacy sorting mechanism (keeping only single {{Row}}) class, enabled by default and deprecated. > Remove sorting of fields in PySpark SQL Row creation > > > Key: SPARK-29748 > URL: https://issues.apache.org/jira/browse/SPARK-29748 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Currently, when a PySpark Row is created with keyword arguments, the fields > are sorted alphabetically. This has created a lot of confusion with users > because it is not obvious (although it is stated in the pydocs) that they > will be sorted alphabetically, and then an error can occur later when > applying a schema and the field order does not match. > The original reason for sorting fields is because kwargs in python < 3.6 are > not guaranteed to be in the same order that they were entered. Sorting > alphabetically would ensure a consistent order. Matters are further > complicated with the flag {{__from_dict__}} that allows the {{Row}} fields to > to be referenced by name when made by kwargs, but this flag is not serialized > with the Row and leads to inconsistent behavior. > This JIRA proposes that any sorting of the Fields is removed. Users with > Python 3.6+ creating Rows with kwargs can continue to do so since Python will > ensure the order is the same as entered. Users with Python < 3.6 will have to > create Rows with an OrderedDict or by using the Row class as a factory > (explained in the pydoc). If kwargs are used, an error will be raised or > based on a conf setting it can fall back to a LegacyRow that will sort the > fields as before. 
This LegacyRow will be immediately deprecated and removed > once support for Python < 3.6 is dropped. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29915) spark-py and spark-r images are not created with docker-image-tool.sh
Michał Wesołowski created SPARK-29915: - Summary: spark-py and spark-r images are not created with docker-image-tool.sh Key: SPARK-29915 URL: https://issues.apache.org/jira/browse/SPARK-29915 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.0.0 Reporter: Michał Wesołowski Currently, at version 3.0.0-preview, the docker-image-tool.sh script has the [following lines|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh#L173] defined: {code} local PYDOCKERFILE=${PYDOCKERFILE:-false} local RDOCKERFILE=${RDOCKERFILE:-false} {code} Because of this change, neither the spark-py nor the spark-r image gets created. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29114) Dataset.coalesce(10) throw ChunkFetchFailureException when original Dataset partition size is big
[ https://issues.apache.org/jira/browse/SPARK-29114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhanxiongWang updated SPARK-29114: -- Description: Updated time:15/Nov/19 We saw this blog solved our confusion. [http://www.russellspitzer.com/2018/05/10/SparkPartitions/|http://www.russellspitzer.com/2018/05/10/SparkPartitions/] Updated time:15/Nov/19 I discussed this issue with my colleagues today. We think that spark has caused cross-border problems in the process of doing shuffle. The problem may be in the Sort-based Shuffle stage. When the map task partition is too large, and the storage of the writerIndex variable uses int, writerIndex may cause cross-border problems. If this is the case, the variable writerIndex {color:#de350b}replaces int with long{color} should solve the current problem. I create a Dataset df with 200 partitions. I applied for 100 executors for my task. Each executor with 1 core, and driver memory is 8G executor is 16G. I use df.cache() before df.coalesce(10). When{color:#de350b} Dataset partition{color} {color:#de350b}size is small{color}, the program works well. But when I {color:#de350b}increase{color} the size of the Dataset partition , the function {color:#de350b}df.coalesce(10){color} will throw ChunkFetchFailureException. 19/09/17 08:26:44 INFO CoarseGrainedExecutorBackend: Got assigned task 210 19/09/17 08:26:44 INFO Executor: Running task 0.0 in stage 3.0 (TID 210) 19/09/17 08:26:44 INFO MapOutputTrackerWorker: Updating epoch to 1 and clearing cache 19/09/17 08:26:44 INFO TorrentBroadcast: Started reading broadcast variable 1003 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003_piece0 stored as bytes in memory (estimated size 49.4 KB, free 3.8 GB) 19/09/17 08:26:44 INFO TorrentBroadcast: Reading broadcast variable 1003 took 7 ms 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003 stored as values in memory (estimated size 154.5 KB, free 3.8 GB) 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_0 locally 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_1 locally 19/09/17 08:26:44 INFO TransportClientFactory: Successfully created connection to /100.76.29.130:54238 after 1 ms (0 ms spent in bootstraps) 19/09/17 08:26:46 ERROR RetryingBlockFetcher: Failed to fetch block rdd_1005_18, and will not retry (0 retries) org.apache.spark.network.client.ChunkFetchFailureException: Failure while fetching StreamChunkId\{streamId=69368607002, chunkIndex=0}: readerIndex: 0, writerIndex: -2137154997 (expected: 0 <= readerIndex <= writerIndex <= capacity(-2137154997)) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:182) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:962) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:485) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:399) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:371) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) 19/09/17 08:26:46 WARN BlockManager: Failed to fetch block after 1 fetch failures. Most
[jira] [Updated] (SPARK-29114) Dataset.coalesce(10) throw ChunkFetchFailureException when original Dataset partition size is big
[ https://issues.apache.org/jira/browse/SPARK-29114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhanxiongWang updated SPARK-29114: -- Description: Updated time:15/Nov/19 We saw this blog solved our confusion. [http://www.russellspitzer.com/2018/05/10/SparkPartitions/|http://example.com] Updated time:15/Nov/19 I discussed this issue with my colleagues today. We think that spark has caused cross-border problems in the process of doing shuffle. The problem may be in the Sort-based Shuffle stage. When the map task partition is too large, and the storage of the writerIndex variable uses int, writerIndex may cause cross-border problems. If this is the case, the variable writerIndex {color:#de350b}replaces int with long{color} should solve the current problem. I create a Dataset df with 200 partitions. I applied for 100 executors for my task. Each executor with 1 core, and driver memory is 8G executor is 16G. I use df.cache() before df.coalesce(10). When{color:#de350b} Dataset partition{color} {color:#de350b}size is small{color}, the program works well. But when I {color:#de350b}increase{color} the size of the Dataset partition , the function {color:#de350b}df.coalesce(10){color} will throw ChunkFetchFailureException. 19/09/17 08:26:44 INFO CoarseGrainedExecutorBackend: Got assigned task 210 19/09/17 08:26:44 INFO Executor: Running task 0.0 in stage 3.0 (TID 210) 19/09/17 08:26:44 INFO MapOutputTrackerWorker: Updating epoch to 1 and clearing cache 19/09/17 08:26:44 INFO TorrentBroadcast: Started reading broadcast variable 1003 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003_piece0 stored as bytes in memory (estimated size 49.4 KB, free 3.8 GB) 19/09/17 08:26:44 INFO TorrentBroadcast: Reading broadcast variable 1003 took 7 ms 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003 stored as values in memory (estimated size 154.5 KB, free 3.8 GB) 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_0 locally 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_1 locally 19/09/17 08:26:44 INFO TransportClientFactory: Successfully created connection to /100.76.29.130:54238 after 1 ms (0 ms spent in bootstraps) 19/09/17 08:26:46 ERROR RetryingBlockFetcher: Failed to fetch block rdd_1005_18, and will not retry (0 retries) org.apache.spark.network.client.ChunkFetchFailureException: Failure while fetching StreamChunkId\{streamId=69368607002, chunkIndex=0}: readerIndex: 0, writerIndex: -2137154997 (expected: 0 <= readerIndex <= writerIndex <= capacity(-2137154997)) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:182) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:962) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:485) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:399) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:371) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) 19/09/17 08:26:46 WARN BlockManager: Failed to fetch block after 1 fetch failures. Most recent failure cause:
[jira] [Updated] (SPARK-29114) Dataset.coalesce(10) throw ChunkFetchFailureException when original Dataset partition size is big
[ https://issues.apache.org/jira/browse/SPARK-29114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhanxiongWang updated SPARK-29114: -- Description: Updated time:15/Nov/19 We saw this blog solved our confusion. [[http://www.russellspitzer.com/2018/05/10/SparkPartitions/ ||http://example.com/] [http://www.russellspitzer.com/2018/05/10/SparkPartitions/|http://example.com/] []|http://example.com/] Updated time:15/Nov/19 I discussed this issue with my colleagues today. We think that spark has caused cross-border problems in the process of doing shuffle. The problem may be in the Sort-based Shuffle stage. When the map task partition is too large, and the storage of the writerIndex variable uses int, writerIndex may cause cross-border problems. If this is the case, the variable writerIndex {color:#de350b}replaces int with long{color} should solve the current problem. I create a Dataset df with 200 partitions. I applied for 100 executors for my task. Each executor with 1 core, and driver memory is 8G executor is 16G. I use df.cache() before df.coalesce(10). When{color:#de350b} Dataset partition{color} {color:#de350b}size is small{color}, the program works well. But when I {color:#de350b}increase{color} the size of the Dataset partition , the function {color:#de350b}df.coalesce(10){color} will throw ChunkFetchFailureException. 19/09/17 08:26:44 INFO CoarseGrainedExecutorBackend: Got assigned task 210 19/09/17 08:26:44 INFO Executor: Running task 0.0 in stage 3.0 (TID 210) 19/09/17 08:26:44 INFO MapOutputTrackerWorker: Updating epoch to 1 and clearing cache 19/09/17 08:26:44 INFO TorrentBroadcast: Started reading broadcast variable 1003 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003_piece0 stored as bytes in memory (estimated size 49.4 KB, free 3.8 GB) 19/09/17 08:26:44 INFO TorrentBroadcast: Reading broadcast variable 1003 took 7 ms 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003 stored as values in memory (estimated size 154.5 KB, free 3.8 GB) 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_0 locally 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_1 locally 19/09/17 08:26:44 INFO TransportClientFactory: Successfully created connection to /100.76.29.130:54238 after 1 ms (0 ms spent in bootstraps) 19/09/17 08:26:46 ERROR RetryingBlockFetcher: Failed to fetch block rdd_1005_18, and will not retry (0 retries) org.apache.spark.network.client.ChunkFetchFailureException: Failure while fetching StreamChunkId\{streamId=69368607002, chunkIndex=0}: readerIndex: 0, writerIndex: -2137154997 (expected: 0 <= readerIndex <= writerIndex <= capacity(-2137154997)) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:182) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:962) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:485) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:399) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:371) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) 19/09/17 08:26:46
[jira] [Updated] (SPARK-29114) Dataset.coalesce(10) throw ChunkFetchFailureException when original Dataset partition size is big
[ https://issues.apache.org/jira/browse/SPARK-29114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhanxiongWang updated SPARK-29114: -- Description: Updated time:15/Nov/19 I discussed this issue with my colleagues today. We think that spark has caused cross-border problems in the process of doing shuffle. The problem may be in the Sort-based Shuffle stage. When the map task partition is too large, and the storage of the writerIndex variable uses int, writerIndex may cause cross-border problems. If this is the case, the variable writerIndex {color:#de350b}replaces int with long{color} should solve the current problem. I create a Dataset df with 200 partitions. I applied for 100 executors for my task. Each executor with 1 core, and driver memory is 8G executor is 16G. I use df.cache() before df.coalesce(10). When{color:#de350b} Dataset partition{color} {color:#de350b}size is small{color}, the program works well. But when I {color:#de350b}increase{color} the size of the Dataset partition , the function {color:#de350b}df.coalesce(10){color} will throw ChunkFetchFailureException. 19/09/17 08:26:44 INFO CoarseGrainedExecutorBackend: Got assigned task 210 19/09/17 08:26:44 INFO Executor: Running task 0.0 in stage 3.0 (TID 210) 19/09/17 08:26:44 INFO MapOutputTrackerWorker: Updating epoch to 1 and clearing cache 19/09/17 08:26:44 INFO TorrentBroadcast: Started reading broadcast variable 1003 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003_piece0 stored as bytes in memory (estimated size 49.4 KB, free 3.8 GB) 19/09/17 08:26:44 INFO TorrentBroadcast: Reading broadcast variable 1003 took 7 ms 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003 stored as values in memory (estimated size 154.5 KB, free 3.8 GB) 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_0 locally 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_1 locally 19/09/17 08:26:44 INFO TransportClientFactory: Successfully created connection to /100.76.29.130:54238 after 1 ms (0 ms spent in bootstraps) 19/09/17 08:26:46 ERROR RetryingBlockFetcher: Failed to fetch block rdd_1005_18, and will not retry (0 retries) org.apache.spark.network.client.ChunkFetchFailureException: Failure while fetching StreamChunkId\{streamId=69368607002, chunkIndex=0}: readerIndex: 0, writerIndex: -2137154997 (expected: 0 <= readerIndex <= writerIndex <= capacity(-2137154997)) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:182) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:962) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:485) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:399) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:371) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) 19/09/17 08:26:46 WARN BlockManager: Failed to fetch block after 1 fetch failures. Most recent failure cause: org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205) at
[jira] [Updated] (SPARK-29114) Dataset.coalesce(10) throw ChunkFetchFailureException when original Dataset partition size is big
[ https://issues.apache.org/jira/browse/SPARK-29114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhanxiongWang updated SPARK-29114: -- Description: Updated time:15/Nov/19 I discussed this issue with my colleagues today. We think that spark has caused cross-border problems in the process of doing shuffle. The problem may be in the Sort-based Shuffle stage. When the map task partition is too large, and the storage of the Index variable uses int, Index may cause cross-border problems. If this is the case, the variable index {color:#de350b}replaces int with long{color} should solve the current problem. I create a Dataset df with 200 partitions. I applied for 100 executors for my task. Each executor with 1 core, and driver memory is 8G executor is 16G. I use df.cache() before df.coalesce(10). When{color:#de350b} Dataset partition{color} {color:#de350b}size is small{color}, the program works well. But when I {color:#de350b}increase{color} the size of the Dataset partition , the function {color:#de350b}df.coalesce(10){color} will throw ChunkFetchFailureException. 19/09/17 08:26:44 INFO CoarseGrainedExecutorBackend: Got assigned task 210 19/09/17 08:26:44 INFO Executor: Running task 0.0 in stage 3.0 (TID 210) 19/09/17 08:26:44 INFO MapOutputTrackerWorker: Updating epoch to 1 and clearing cache 19/09/17 08:26:44 INFO TorrentBroadcast: Started reading broadcast variable 1003 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003_piece0 stored as bytes in memory (estimated size 49.4 KB, free 3.8 GB) 19/09/17 08:26:44 INFO TorrentBroadcast: Reading broadcast variable 1003 took 7 ms 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003 stored as values in memory (estimated size 154.5 KB, free 3.8 GB) 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_0 locally 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_1 locally 19/09/17 08:26:44 INFO TransportClientFactory: Successfully created connection to /100.76.29.130:54238 after 1 ms (0 ms spent in bootstraps) 19/09/17 08:26:46 ERROR RetryingBlockFetcher: Failed to fetch block rdd_1005_18, and will not retry (0 retries) org.apache.spark.network.client.ChunkFetchFailureException: Failure while fetching StreamChunkId\{streamId=69368607002, chunkIndex=0}: readerIndex: 0, writerIndex: -2137154997 (expected: 0 <= readerIndex <= writerIndex <= capacity(-2137154997)) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:182) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:962) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:485) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:399) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:371) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) 19/09/17 08:26:46 WARN BlockManager: Failed to fetch block after 1 fetch failures. Most recent failure cause: org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205) at
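If the reporter's overflow hypothesis is right, the negative writerIndex in the exception above is simply a byte count larger than Int.MaxValue wrapping when stored in an Int. A tiny Scala sketch of that arithmetic (the partition size is hypothetical, chosen so the wrapped value matches the log):
{code:scala}
// A partition of roughly 2.15 GB does not fit in a signed 32-bit Int.
val sizeInBytes: Long = 2157812299L

println(Int.MaxValue)        // 2147483647
println(sizeInBytes.toInt)   // -2137154997, the writerIndex reported in the exception
{code}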
[jira] [Created] (SPARK-29914) ML models append metadata in `transform`/`transformSchema`
zhengruifeng created SPARK-29914: Summary: ML models append metadata in `transform`/`transformSchema` Key: SPARK-29914 URL: https://issues.apache.org/jira/browse/SPARK-29914 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng There are many impls (like `Binarizer`/`Bucketizer`/`VectorAssembler`/`OneHotEncoder`/`FeatureHasher`/`HashingTF`/`VectorSlicer`/...) in `.ml` that append appropriate metadata in the `transform`/`transformSchema` methods. However, there are also many impls that return no metadata from the transformation, even when metadata like `vector.size`/`numAttrs`/`attrs` can be easily inferred. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
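For readers unfamiliar with this metadata, a short sketch of what "appending metadata" looks like from the user side, using VectorAssembler (one of the impls listed above) and assuming a local SparkSession:
{code:scala}
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("ml-metadata").getOrCreate()
import spark.implicits._

val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("a", "b")

val assembled = new VectorAssembler()
  .setInputCols(Array("a", "b"))
  .setOutputCol("features")
  .transform(df)

// VectorAssembler attaches ML attribute metadata to its output column, so the
// number of attributes (numAttrs) of the vector can be read from the schema
// alone; transformers that skip this step leave the size unknown here.
val group = AttributeGroup.fromStructField(assembled.schema("features"))
println(group.size)   // 2
{code}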
[jira] [Created] (SPARK-29913) Improve Exception in postgreCastToBoolean
jobit mathew created SPARK-29913: Summary: Improve Exception in postgreCastToBoolean Key: SPARK-29913 URL: https://issues.apache.org/jira/browse/SPARK-29913 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: jobit mathew Improve Exception in postgreCastToBoolean -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29619) Add retry times when reading the daemon port.
[ https://issues.apache.org/jira/browse/SPARK-29619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29619. -- Resolution: Won't Fix > Add retry times when reading the daemon port. > - > > Key: SPARK-29619 > URL: https://issues.apache.org/jira/browse/SPARK-29619 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > This ticket is related to https://issues.apache.org/jira/browse/SPARK-29885 > and adds a retry mechanism. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29894) Add Codegen Stage Id to Spark plan graphs in Web UI SQL Tab
[ https://issues.apache.org/jira/browse/SPARK-29894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali updated SPARK-29894: Attachment: Physical_plan_Annotated.png > Add Codegen Stage Id to Spark plan graphs in Web UI SQL Tab > --- > > Key: SPARK-29894 > URL: https://issues.apache.org/jira/browse/SPARK-29894 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > Attachments: Physical_plan_Annotated.png, > snippet__plan_graph_with_Codegen_Stage_Id_Annotated.png, > snippet_plan_graph_before_patch.png > > > The Web UI SQL Tab provides information on the executed SQL using plan graphs > and SQL execution plans. Both provide useful information. Physical execution > plans report the Codegen Stage Id. It is useful to have Codegen Stage Id also > reported in the plan graphs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
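For reference, the codegen stage id that the ticket wants surfaced in the plan graphs is the number in the `*(n)` prefix of the textual physical plan, for example (output abbreviated and representative, not an exact transcript):
{code:scala}
// A query with two whole-stage-codegen stages separated by an exchange.
spark.range(1000).groupBy("id").count().explain()

// == Physical Plan == (abbreviated)
// *(2) HashAggregate(keys=[id], functions=[count(1)])
// +- Exchange hashpartitioning(id, 200)
//    +- *(1) HashAggregate(keys=[id], functions=[partial_count(1)])
//       +- *(1) Range (0, 1000, step=1, splits=...)
{code}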
[jira] [Updated] (SPARK-29912) Pruning shuffle exchange and coalesce when input and output both are one partition
[ https://issues.apache.org/jira/browse/SPARK-29912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-29912: Description: It is meaningless to do `repartition(1)` or `coalesce(1)` when a child plan just output one partition. Now, we can not get the output numPartitions during logic plan, so this issue pruning the operation in physical plan. was: It is meaningless to do `repartition(1)` or `coalesce(1)` when a plan child just output one partition. Now, we can not get the output numPartitions during logic plan, so this issue pruning the operation in physical plan. > Pruning shuffle exchange and coalesce when input and output both are one > partition > -- > > Key: SPARK-29912 > URL: https://issues.apache.org/jira/browse/SPARK-29912 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > It is meaningless to do `repartition(1)` or `coalesce(1)` when a child plan > just output one partition. > Now, we can not get the output numPartitions during logic plan, so this issue > pruning the operation in physical plan. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29912) Pruning shuffle exchange and coalesce when input and output both are one partition
[ https://issues.apache.org/jira/browse/SPARK-29912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-29912: Description: It is meaningless to do `repartition(1)` or `coalesce(1)` when a plan child just have one partition. Now, we can not get the output numPartitions during logic plan, so this issue pruning the operation in physical plan. was: It is meaningless to do `repartition(1)` or `coalesce(1)` when a plan child just have one partition. Now, we can not get the output numPartitions during logic plan, so the issue pruning the operation in physical plan. > Pruning shuffle exchange and coalesce when input and output both are one > partition > -- > > Key: SPARK-29912 > URL: https://issues.apache.org/jira/browse/SPARK-29912 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > It is meaningless to do `repartition(1)` or `coalesce(1)` when a plan child > just have one partition. > Now, we can not get the output numPartitions during logic plan, so this issue > pruning the operation in physical plan. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29912) Pruning shuffle exchange and coalesce when input and output both are one partition
[ https://issues.apache.org/jira/browse/SPARK-29912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-29912: Description: It is meaningless to do `repartition(1)` or `coalesce(1)` when a plan child just output one partition. Now, we can not get the output numPartitions during logic plan, so this issue pruning the operation in physical plan. was: It is meaningless to do `repartition(1)` or `coalesce(1)` when a plan child just have one partition. Now, we can not get the output numPartitions during logic plan, so this issue pruning the operation in physical plan. > Pruning shuffle exchange and coalesce when input and output both are one > partition > -- > > Key: SPARK-29912 > URL: https://issues.apache.org/jira/browse/SPARK-29912 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > It is meaningless to do `repartition(1)` or `coalesce(1)` when a plan child > just output one partition. > Now, we can not get the output numPartitions during logic plan, so this issue > pruning the operation in physical plan. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29912) Pruning shuffle exchange and coalesce when input and output both are one partition
[ https://issues.apache.org/jira/browse/SPARK-29912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-29912: Summary: Pruning shuffle exchange and coalesce when input and output both are one partition (was: Pruning shuffle exchange when input and output both are one partition) > Pruning shuffle exchange and coalesce when input and output both are one > partition > -- > > Key: SPARK-29912 > URL: https://issues.apache.org/jira/browse/SPARK-29912 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > It is meaningless to do `repartition(1)` or `coalesce(1)` when a plan child > just have one partition. > Now, we can not get the output numPartitions during logic plan, so the issue > pruning the operation in physical plan. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29912) Pruning shuffle exchange when input and output both are one partition
ulysses you created SPARK-29912: --- Summary: Pruning shuffle exchange when input and output both are one partition Key: SPARK-29912 URL: https://issues.apache.org/jira/browse/SPARK-29912 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: ulysses you It is meaningless to do `repartition(1)` or `coalesce(1)` when a plan child just have one partition. Now, we can not get the output numPartitions during logic plan, so the issue pruning the operation in physical plan. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
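To illustrate the redundancy being targeted, a small sketch assuming a local SparkSession named `spark`: a DataFrame that already has exactly one partition still gets an extra exchange or coalesce node when `repartition(1)` / `coalesce(1)` is applied.
{code:scala}
// spark.range(start, end, step, numPartitions): force a single input partition.
val onePartition = spark.range(0L, 10L, 1L, 1)
println(onePartition.rdd.getNumPartitions)   // 1

// Even though the child already has one partition, repartition(1) still inserts
// an Exchange (and coalesce(1) a Coalesce node) into the physical plan; that is
// the operation this ticket proposes to prune.
onePartition.repartition(1).explain()
onePartition.coalesce(1).explain()
{code}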