[jira] [Assigned] (SPARK-31521) The fetch size is not correct when merging blocks into a merged block

2020-04-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31521:
-

Assignee: wuyi

> The fetch size is not correct when merging blocks into a merged block
> -
>
> Key: SPARK-31521
> URL: https://issues.apache.org/jira/browse/SPARK-31521
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> When merging blocks into a merged block, we should count the size of that 
> merged block as well.
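For illustration only, a minimal hedged sketch of the intended accounting, using a made-up block type and helper (this is not Spark's actual shuffle-fetch code):

{code:scala}
// Hypothetical, simplified model of merging contiguous shuffle blocks.
// The point: the running fetch-size counter must include the size of the
// merged block, not only the sizes of blocks that stay unmerged.
case class Block(id: String, size: Long)

def mergeContiguous(blocks: Seq[Block]): (Block, Long) = {
  val mergedSize = blocks.map(_.size).sum                      // size of the merged block
  val merged = Block(s"${blocks.head.id}-merged", mergedSize)
  (merged, mergedSize)                                         // this size must be counted in the fetch size
}

// Example: three contiguous 100-byte blocks become one 300-byte merged block,
// and 300 bytes must be added to the total fetch size.
val (merged, counted) = mergeContiguous(Seq(Block("b0", 100), Block("b1", 100), Block("b2", 100)))
{code}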



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31521) The fetch size is not correct when merging blocks into a merged block

2020-04-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31521.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28301
[https://github.com/apache/spark/pull/28301]

> The fetch size is not correct when merging blocks into a merged block
> -
>
> Key: SPARK-31521
> URL: https://issues.apache.org/jira/browse/SPARK-31521
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> When merging blocks into a merged block, we should count the size of that 
> merged block as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31516) Non-existent metric hiveClientCalls.count of CodeGenerator in Monitoring Doc

2020-04-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31516.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28292
[https://github.com/apache/spark/pull/28292]

> Non-existent metric hiveClientCalls.count of CodeGenerator in Monitoring Doc
> ---
>
> Key: SPARK-31516
> URL: https://issues.apache.org/jira/browse/SPARK-31516
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Affects Versions: 3.0.0
>Reporter: ZHANG Wei
>Assignee: ZHANG Wei
>Priority: Minor
> Fix For: 3.0.0
>
>
> There is a duplicated `hiveClientCalls.count` metric in both 
> `namespace=HiveExternalCatalog` and  `namespace=CodeGenerator` bullet lists 
> of [Spark Monitoring 
> doc|https://spark.apache.org/docs/3.0.0-preview2/monitoring.html#component-instance--executor],
>  but there is only one inside object HiveCatalogMetrics in [source 
> code|https://github.com/apache/spark/blob/6550d0d5283efdbbd838f3aeaf0476c7f52a0fb1/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala#L85].
> {quote} * namespace=HiveExternalCatalog
>  ** *note:*: these metrics are conditional to a configuration parameter: 
> {{spark.metrics.staticSources.enabled}} (default is true)
>  ** fileCacheHits.count
>  ** filesDiscovered.count
>  ** +{color:#ff}*hiveClientCalls.count*{color}+
>  ** parallelListingJobCount.count
>  ** partitionsFetched.count
>  * namespace=CodeGenerator
>  ** *note:*: these metrics are conditional to a configuration parameter: 
> {{spark.metrics.staticSources.enabled}} (default is true)
>  ** compilationTime (histogram)
>  ** generatedClassSize (histogram)
>  ** generatedMethodSize (histogram)
>  ** *{color:#ff}+hiveClientCalls.count+{color}*
>  ** sourceCodeSize (histogram){quote}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31516) Non-existent metric hiveClientCalls.count of CodeGenerator in Monitoring Doc

2020-04-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31516:
-

Assignee: ZHANG Wei

> Non-existent metric hiveClientCalls.count of CodeGenerator in Monitoring Doc
> ---
>
> Key: SPARK-31516
> URL: https://issues.apache.org/jira/browse/SPARK-31516
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Affects Versions: 3.0.0
>Reporter: ZHANG Wei
>Assignee: ZHANG Wei
>Priority: Minor
>
> There is a duplicated `hiveClientCalls.count` metric in both 
> `namespace=HiveExternalCatalog` and  `namespace=CodeGenerator` bullet lists 
> of [Spark Monitoring 
> doc|https://spark.apache.org/docs/3.0.0-preview2/monitoring.html#component-instance--executor],
>  but there is only one inside object HiveCatalogMetrics in [source 
> code|https://github.com/apache/spark/blob/6550d0d5283efdbbd838f3aeaf0476c7f52a0fb1/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala#L85].
> {quote} * namespace=HiveExternalCatalog
>  ** *note:*: these metrics are conditional to a configuration parameter: 
> {{spark.metrics.staticSources.enabled}} (default is true)
>  ** fileCacheHits.count
>  ** filesDiscovered.count
>  ** +{color:#ff}*hiveClientCalls.count*{color}+
>  ** parallelListingJobCount.count
>  ** partitionsFetched.count
>  * namespace=CodeGenerator
>  ** *note:*: these metrics are conditional to a configuration parameter: 
> {{spark.metrics.staticSources.enabled}} (default is true)
>  ** compilationTime (histogram)
>  ** generatedClassSize (histogram)
>  ** generatedMethodSize (histogram)
>  ** *{color:#ff}+hiveClientCalls.count+{color}*
>  ** sourceCodeSize (histogram){quote}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31560) Add V1/V2 tests for TextSuite and WholeTextFileSuite

2020-04-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31560.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28335
[https://github.com/apache/spark/pull/28335]

> Add V1/V2 tests for TextSuite and WholeTextFileSuite
> 
>
> Key: SPARK-31560
> URL: https://issues.apache.org/jira/browse/SPARK-31560
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20732) Copy cache data when node is being shut down

2020-04-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-20732:
---

> Copy cache data when node is being shut down
> 
>
> Key: SPARK-20732
> URL: https://issues.apache.org/jira/browse/SPARK-20732
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Assignee: Prakhar Jain
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20732) Copy cache data when node is being shut down

2020-04-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20732:
--
Fix Version/s: (was: 3.1.0)

> Copy cache data when node is being shut down
> 
>
> Key: SPARK-20732
> URL: https://issues.apache.org/jira/browse/SPARK-20732
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Assignee: Prakhar Jain
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31554) Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite

2020-04-24 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092012#comment-17092012
 ] 

Jungtaek Lim commented on SPARK-31554:
--

There're two existing PRs addressing the test suite:

https://github.com/apache/spark/pull/28156
https://github.com/apache/spark/pull/28055


> Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite
> 
>
> Key: SPARK-31554
> URL: https://issues.apache.org/jira/browse/SPARK-31554
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The test org.apache.spark.sql.hive.thriftserver.CliSuite fails very often, 
> for example:
> * https://github.com/apache/spark/pull/28328#issuecomment-618992335
> The error message:
> {code}
> org.apache.spark.sql.hive.thriftserver.CliSuite.SPARK-11188 Analysis error 
> reporting
> Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Failed with 
> error line 'Exception in thread "main" 
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
> Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
>   at 
> org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$4(CliSuite.scala:138)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at 
> org.apache.spark.sql.hive.thriftserver.CliSuite.captureOutput$1(CliSuite.scala:135)
>   at 
> org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6(CliSuite.scala:152)
>   at 
> org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6$adapted(CliSuite.scala:152)
>   at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:188)
>   at 
> scala.sys.process.BasicIO$.$anonfun$processFully$1$adapted(BasicIO.scala:192)
>   at 
> org.apache.spark.sql.test.ProcessTestUtils$ProcessOutputCapturer.run(ProcessTestUtils.scala:30)
> {code}
> * https://github.com/apache/spark/pull/28261#issuecomment-618950225
> * https://github.com/apache/spark/pull/27617#issuecomment-614318644



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31561) Add QUALIFY Clause

2020-04-24 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-31561:
---

 Summary: Add QUALIFY Clause
 Key: SPARK-31561
 URL: https://issues.apache.org/jira/browse/SPARK-31561
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang


In a SELECT statement, the QUALIFY clause filters the results of window 
functions.

QUALIFY does with window functions what HAVING does with aggregate functions 
and GROUP BY clauses.

In the execution order of a query, QUALIFY is therefore evaluated after window 
functions are computed.

Examples:
https://docs.snowflake.com/en/sql-reference/constructs/qualify.html#examples

More details:
https://docs.snowflake.com/en/sql-reference/constructs/qualify.html
https://docs.teradata.com/reader/2_MC9vCtAJRlKle2Rpb0mA/19NnI91neorAi7LX6SJXBw
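For illustration, a minimal sketch (using the existing Spark DataFrame API; the table and column names are made up) of the window-function-plus-filter pattern that the proposed QUALIFY clause would let users express directly:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("qualify-sketch").getOrCreate()
import spark.implicits._

val orders = Seq(("alice", 10), ("alice", 20), ("bob", 5)).toDF("customer", "amount")

// Today: compute the window function, then filter on its result in a separate step.
val win = Window.partitionBy($"customer").orderBy($"amount".desc)
orders.withColumn("rn", row_number().over(win))
  .filter($"rn" === 1)
  .show()

// With the proposed clause, the same query could be written in one statement:
//   SELECT customer, amount,
//          row_number() OVER (PARTITION BY customer ORDER BY amount DESC) AS rn
//   FROM orders
//   QUALIFY rn = 1
{code}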




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31560) Add V1/V2 tests for TextSuite and WholeTextFileSuite

2020-04-24 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-31560:
--

 Summary: Add V1/V2 tests for TextSuite and WholeTextFileSuite
 Key: SPARK-31560
 URL: https://issues.apache.org/jira/browse/SPARK-31560
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Tests
Affects Versions: 3.0.0, 3.0.1, 3.1.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31559) AM starts with initial fetched tokens in any attempt

2020-04-24 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-31559:


 Summary: AM starts with initial fetched tokens in any attempt
 Key: SPARK-31559
 URL: https://issues.apache.org/jira/browse/SPARK-31559
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


The issue only occurs in yarn-cluster mode.

In yarn-cluster mode, the submitter obtains delegation tokens and adds these 
credentials to the launch context. The AM is launched with these credentials, 
and both the AM and the driver are able to leverage these tokens.

The driver is launched inside the AM, which in turn initializes the token 
manager (while initializing SparkContext) and obtains delegation tokens (and 
schedules their renewal) if both principal and keytab are available.

That said, even if we provide a principal and keytab when running the 
application in yarn-cluster mode, the AM always starts with the initial tokens 
from the launch context until the token manager runs and obtains fresh 
delegation tokens.

So there is a "gap": if user code (the driver) accesses an external system that 
requires delegation tokens (e.g. HDFS) before initializing SparkContext, it 
cannot leverage the tokens the token manager will obtain. This makes the 
application fail if the AM is killed and relaunched after the initial tokens 
have expired.
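For illustration, a hedged sketch of a driver pattern that hits this gap (the HDFS path and application details are hypothetical):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object TokenGapDemo {
  def main(args: Array[String]): Unit = {
    // In yarn-cluster mode this runs inside the AM. HDFS is accessed here with
    // the initial tokens from the launch context, because the token manager
    // only starts when SparkContext is initialized below.
    val fs = FileSystem.get(new Configuration())
    val ready = fs.exists(new Path("/user/app/_READY"))   // hypothetical path
    println(s"marker exists: $ready")

    // Only at this point does the token manager obtain and renew delegation tokens.
    val spark = SparkSession.builder().appName("token-gap-demo").getOrCreate()
    // ... rest of the application ...
    spark.stop()
  }
}
{code}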



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31552) Fix potential ClassCastException in ScalaReflection arrayClassFor

2020-04-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31552.
---
Fix Version/s: 3.0.0
 Assignee: Kent Yao
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/28324

> Fix potential ClassCastException in ScalaReflection arrayClassFor
> -
>
> Key: SPARK-31552
> URL: https://issues.apache.org/jira/browse/SPARK-31552
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> arrayClassFor and dataTypeFor in ScalaReflection call each other circularly, 
> the cases in dataTypeFor are not fully handled in arrayClassFor
> For example:
> {code:java}
> scala> import scala.reflect.runtime.universe.TypeTag
> scala> import org.apache.spark.sql._
> scala> import org.apache.spark.sql.catalyst.encoders._
> scala> import org.apache.spark.sql.types._
> scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = 
> ExpressionEncoder()
> newArrayEncoder: [T <: Array[_]](implicit evidence$1: 
> reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]
> scala> val decOne = Decimal(1, 38, 18)
> decOne: org.apache.spark.sql.types.Decimal = 1E-18
> scala> val decTwo = Decimal(2, 38, 18)
> decTwo: org.apache.spark.sql.types.Decimal = 2E-18
> scala> val decSpark = Array(decOne, decTwo)
> decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)
> scala> Seq(decSpark).toDF()
> java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot 
> be cast to org.apache.spark.sql.types.ObjectType
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:399)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:393)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:57)
>   at newArrayEncoder(<console>:57)
>   ... 53 elided
> scala>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31533) Enable DB2IntegrationSuite test and upgrade the DB2 docker inside

2020-04-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31533:
-

Assignee: Gabor Somogyi

> Enable DB2IntegrationSuite test and upgrade the DB2 docker inside
> -
>
> Key: SPARK-31533
> URL: https://issues.apache.org/jira/browse/SPARK-31533
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31533) Enable DB2IntegrationSuite test and upgrade the DB2 docker inside

2020-04-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31533.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28325
[https://github.com/apache/spark/pull/28325]

> Enable DB2IntegrationSuite test and upgrade the DB2 docker inside
> -
>
> Key: SPARK-31533
> URL: https://issues.apache.org/jira/browse/SPARK-31533
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31546) Backport SPARK-25595 Ignore corrupt Avro file if flag IGNORE_CORRUPT_FILES enabled

2020-04-24 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091991#comment-17091991
 ] 

Gengliang Wang commented on SPARK-31546:


I have created backport PR for this: https://github.com/apache/spark/pull/28334

> Backport SPARK-25595   Ignore corrupt Avro file if flag 
> IGNORE_CORRUPT_FILES enabled
> 
>
> Key: SPARK-31546
> URL: https://issues.apache.org/jira/browse/SPARK-31546
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Backport SPARK-25595 Ignore corrupt Avro file if flag 
> IGNORE_CORRUPT_FILES enabled
> cc [~Gengliang.Wang] & [~hyukjin.kwon] for comments



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31558) Code cleanup in spark-sql-viz.js

2020-04-24 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-31558:
---
Summary: Code cleanup in spark-sql-viz.js  (was: Code clean up in 
spark-sql-viz.js)

> Code cleanup in spark-sql-viz.js
> 
>
> Key: SPARK-31558
> URL: https://issues.apache.org/jira/browse/SPARK-31558
> Project: Spark
>  Issue Type: Task
>  Components: Web UI
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> 1. Remove console.log() calls, which are unnecessary in a release build.
> 2. Replace double equals (==) with triple equals (===).
> 3. Reuse the jQuery selector.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31558) Code clean up in spark-sql-viz.js

2020-04-24 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-31558:
--

 Summary: Code clean up in spark-sql-viz.js
 Key: SPARK-31558
 URL: https://issues.apache.org/jira/browse/SPARK-31558
 Project: Spark
  Issue Type: Task
  Components: Web UI
Affects Versions: 3.0.0, 3.1.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


1. Remove console.log() calls, which are unnecessary in a release build.
2. Replace double equals (==) with triple equals (===).
3. Reuse the jQuery selector.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31557) Legacy parser incorrectly interprets pre-Gregorian dates

2020-04-24 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-31557:
-

 Summary: Legacy parser incorrectly interprets pre-Gregorian dates
 Key: SPARK-31557
 URL: https://issues.apache.org/jira/browse/SPARK-31557
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Bruce Robbins


With CSV:
{noformat}
scala> sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
res0: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> val seq = Seq("0002-01-01", "1000-01-01", "1500-01-01", 
"1800-01-01").map(x => s"$x,$x")
seq: Seq[String] = List(0002-01-01,0002-01-01, 1000-01-01,1000-01-01, 
1500-01-01,1500-01-01, 1800-01-01,1800-01-01)

scala> val ds = seq.toDF("value").as[String]
ds: org.apache.spark.sql.Dataset[String] = [value: string]

scala> spark.read.schema("expected STRING, actual DATE").csv(ds).show
+----------+----------+
|  expected|    actual|
+----------+----------+
|0002-01-01|0001-12-30|
|1000-01-01|1000-01-06|
|1500-01-01|1500-01-10|
|1800-01-01|1800-01-01|
+----------+----------+

scala> 
{noformat}
Similarly, with JSON:
{noformat}
scala> sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
res0: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> val seq = Seq("0002-01-01", "1000-01-01", "1500-01-01", 
"1800-01-01").map { x =>
  s"""{"expected": "$x", "actual": "$x"}"""
}

 |  | seq: Seq[String] = List({"expected": "0002-01-01", "actual": 
"0002-01-01"}, {"expected": "1000-01-01", "actual": "1000-01-01"}, {"expected": 
"1500-01-01", "actual": "1500-01-01"}, {"expected": "1800-01-01", "actual": 
"1800-01-01"})

scala> 
scala> val ds = seq.toDF("value").as[String]
ds: org.apache.spark.sql.Dataset[String] = [value: string]

scala> spark.read.schema("expected STRING, actual DATE").json(ds).show
+----------+----------+
|  expected|    actual|
+----------+----------+
|0002-01-01|0001-12-30|
|1000-01-01|1000-01-06|
|1500-01-01|1500-01-10|
|1800-01-01|1800-01-01|
+----------+----------+

scala> 
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31491) Re-arrange Data Types page to document Floating Point Special Values

2020-04-24 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-31491.
--
Fix Version/s: 3.0.0
 Assignee: Huaxin Gao
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/28264

> Re-arrange Data Types page to document Floating Point Special Values
> 
>
> Key: SPARK-31491
> URL: https://issues.apache.org/jira/browse/SPARK-31491
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31532) SparkSessionBuilder should not propagate static sql configurations to the existing active/default SparkSession

2020-04-24 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-31532.
--
Fix Version/s: 2.4.6
 Assignee: Kent Yao
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/28316

> SparkSessionBuilder should not propagate static sql configurations to the 
> existing active/default SparkSession
> -
>
> Key: SPARK-31532
> URL: https://issues.apache.org/jira/browse/SPARK-31532
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 2.4.6
>
>
> Clearly, this is a bug.
> {code:java}
> scala> spark.sql("set spark.sql.warehouse.dir").show
> +--------------------+--------------------+
> |                 key|               value|
> +--------------------+--------------------+
> |spark.sql.warehou...|file:/Users/kenty...|
> +--------------------+--------------------+
> scala> spark.sql("set spark.sql.warehouse.dir=2");
> org.apache.spark.sql.AnalysisException: Cannot modify the value of a static 
> config: spark.sql.warehouse.dir;
>   at 
> org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:154)
>   at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42)
>   at 
> org.apache.spark.sql.execution.command.SetCommand.$anonfun$x$7$6(SetCommand.scala:100)
>   at 
> org.apache.spark.sql.execution.command.SetCommand.run(SetCommand.scala:156)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
>   ... 47 elided
> scala> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.SparkSession
> scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").get
> getClass   getOrCreate
> scala> SparkSession.builder.config("spark.sql.warehouse.dir", 
> "xyz").getOrCreate
> 20/04/23 23:49:13 WARN SparkSession$Builder: Using an existing SparkSession; 
> some configuration may not take effect.
> res7: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@6403d574
> scala> spark.sql("set spark.sql.warehouse.dir").show
> +--------------------+-----+
> |                 key|value|
> +--------------------+-----+
> |spark.sql.warehou...|  xyz|
> +--------------------+-----+
> scala>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31556) Document LIKE clause in SQL Reference

2020-04-24 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091965#comment-17091965
 ] 

Huaxin Gao commented on SPARK-31556:


https://github.com/apache/spark/pull/28332

> Document LIKE clause in SQL Reference
> -
>
> Key: SPARK-31556
> URL: https://issues.apache.org/jira/browse/SPARK-31556
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Document LIKE clause in SQL Reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31556) Document LIKE clause in SQL Reference

2020-04-24 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31556:
--

 Summary: Document LIKE clause in SQL Reference
 Key: SPARK-31556
 URL: https://issues.apache.org/jira/browse/SPARK-31556
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Affects Versions: 3.0.0
Reporter: Huaxin Gao


Document LIKE clause in SQL Reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31364) Benchmark Nested Parquet Predicate Pushdown

2020-04-24 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-31364.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28319
[https://github.com/apache/spark/pull/28319]

> Benchmark Nested Parquet Predicate Pushdown
> ---
>
> Key: SPARK-31364
> URL: https://issues.apache.org/jira/browse/SPARK-31364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>
> We would like to benchmark the best and worst scenarios, such as when no record 
> matches the predicate, and measure how much extra overhead is added.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31364) Benchmark Nested Parquet Predicate Pushdown

2020-04-24 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-31364:

Summary: Benchmark Nested Parquet Predicate Pushdown  (was: Benchmark 
Parquet Predicate Pushdown)

> Benchmark Nested Parquet Predicate Pushdown
> ---
>
> Key: SPARK-31364
> URL: https://issues.apache.org/jira/browse/SPARK-31364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Priority: Major
>
> We would like to benchmark the best and worst scenarios, such as when no record 
> matches the predicate, and measure how much extra overhead is added.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31377) Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite

2020-04-24 Thread Srinivas Rishindra Pothireddi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinivas Rishindra Pothireddi updated SPARK-31377:
--
Description: 
For some combinations of join algorithm and join types there are no unit tests 
for the "number of output rows" metric.

A list of missing unit tests includes the following.
 * ShuffledHashJoin: leftOuter, RightOuter, LeftAnti, LeftSemi
 * BroadcastNestedLoopJoin: RightOuter
 * BroadcastHashJoin: LeftAnti

  was:
For some combinations of join algorithm and join types there are no unit tests 
for the "number of output rows" metric.

A list of missing unit tests includes the following.
 * SortMergeJoin: ExistenceJoin
 * ShuffledHashJoin: leftOuter, RightOuter, LeftAnti, LeftSemi, ExistenseJoin
 * BroadcastNestedLoopJoin: RightOuter, InnerJoin, ExistenceJoin
 * BroadcastHashJoin: LeftAnti, ExistenceJoin


> Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite
> --
>
> Key: SPARK-31377
> URL: https://issues.apache.org/jira/browse/SPARK-31377
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Srinivas Rishindra Pothireddi
>Priority: Minor
>
> For some combinations of join algorithm and join types there are no unit 
> tests for the "number of output rows" metric.
> A list of missing unit tests includes the following.
>  * ShuffledHashJoin: leftOuter, RightOuter, LeftAnti, LeftSemi
>  * BroadcastNestedLoopJoin: RightOuter
>  * BroadcastHashJoin: LeftAnti



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31500) collect_set() of BinaryType returns duplicate elements

2020-04-24 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091833#comment-17091833
 ] 

Pablo Langa Blanco commented on SPARK-31500:


Hi [~ewasserman],

This is a Scala base-library problem: equality between arrays does not behave 
as expected.

[https://blog.bruchez.name/2013/05/scala-array-comparison-without-phd.html]

I'm going to work on finding a solution, but here is a workaround: change the 
definition of the case class to use Seq instead of Array and it will work as 
expected.
{code:java}
case class R(id: String, value: String, bytes: Seq[Byte]){code}
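As a minimal illustration of the underlying Scala behavior (plain Scala, not Spark-specific):

{code:scala}
// Arrays use reference equality (Java array semantics), so two arrays with the
// same elements are not "equal"; Seq uses structural (element-wise) equality.
val a1 = Array(1, 2) == Array(1, 2)            // false: two distinct array objects
val s1 = Seq(1, 2) == Seq(1, 2)                // true: element-wise comparison
val a2 = Array(1, 2).sameElements(Array(1, 2)) // true: explicit element-wise comparison
{code}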
 

> collect_set() of BinaryType returns duplicate elements
> --
>
> Key: SPARK-31500
> URL: https://issues.apache.org/jira/browse/SPARK-31500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Eric Wasserman
>Priority: Major
>
> The collect_set() aggregate function should produce a set of distinct 
> elements. When the column argument's type is BinaryType this is not the case.
>  
> Example:
> {code:java}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.expressions.Window
> 
> case class R(id: String, value: String, bytes: Array[Byte])
> def makeR(id: String, value: String) = R(id, value, value.getBytes)
> val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
>   makeR("b", "fish")).toDF()
> 
> // In the example below "bytesSet" erroneously has duplicates but 
> // "stringSet" does not (as expected).
> df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
>   "byteSet").show(truncate=false)
> 
> // The same problem is displayed when using window functions.
> val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
>   Window.unboundedFollowing)
> val result = df.select(
>     collect_set('value).over(win) as "stringSet",
>     collect_set('bytes).over(win) as "bytesSet"
>   )
>   .select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
>     size('bytesSet) as "bytesSetSize")
>   .show()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20732) Copy cache data when node is being shut down

2020-04-24 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-20732.
--
   Fix Version/s: 3.1.0
Target Version/s: 3.1.0
  Resolution: Fixed

Fixed, thank you!

> Copy cache data when node is being shut down
> 
>
> Key: SPARK-20732
> URL: https://issues.apache.org/jira/browse/SPARK-20732
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Assignee: Prakhar Jain
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20732) Copy cache data when node is being shut down

2020-04-24 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau reassigned SPARK-20732:


Assignee: Prakhar Jain

> Copy cache data when node is being shut down
> 
>
> Key: SPARK-20732
> URL: https://issues.apache.org/jira/browse/SPARK-20732
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Assignee: Prakhar Jain
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31555) Improve cache block migration

2020-04-24 Thread Holden Karau (Jira)
Holden Karau created SPARK-31555:


 Summary: Improve cache block migration
 Key: SPARK-31555
 URL: https://issues.apache.org/jira/browse/SPARK-31555
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Holden Karau


We should explore the following improvements to cache block migration:

1) Peer selection (right now it may overbalance on certain peers)

2) Do we need to make the number of blocks migrated at the same time configurable?

3) Do we want to prioritize migrating blocks with no replicas?

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31007) KMeans optimization based on triangle-inequality

2020-04-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31007.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/27758

> KMeans optimization based on triangle-inequality
> 
>
> Key: SPARK-31007
> URL: https://issues.apache.org/jira/browse/SPARK-31007
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: ICML03-022.pdf
>
>
> In the current implementation, the following lemma is used in KMeans:
> 0. Let x be a point, let c be a center, and let o be the origin; then 
> d(x,c) >= |d(x,o) - d(c,o)| = |norm(x) - norm(c)|.
> This can be applied to {{EuclideanDistance}}, but not to {{CosineDistance}}.
> According to [Using the Triangle Inequality to Accelerate 
> K-Means|https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf], we can go 
> further, and there are two more lemmas that can be used:
> 1. Let x be a point, and let b and c be centers. If d(b,c) >= 2*d(x,b), then 
> d(x,c) >= d(x,b).
> This can be applied to {{EuclideanDistance}}, but not to {{CosineDistance}}. 
> However, luckily, for CosineDistance we can get a variant in the space of 
> radians/angles.
> 2. Let x be a point, and let b and c be centers. Then d(x,c) >= max\{0, 
> d(x,b) - d(b,c)}.
> This can be applied to {{EuclideanDistance}}, but not to {{CosineDistance}}.
> Applying Lemma 2 is a little complex: it needs to cache/update the 
> distances/lower bounds to previous centers, and thus can only be applied 
> during training; it is not usable in prediction.
> So this ticket is mainly for Lemma 1. Its idea is quite simple: if point x is 
> close enough to its center b (within a pre-computed radius), then we can 
> conclude that point x belongs to center b without computing the distances 
> between x and the other centers. It can be used in both training and 
> prediction.
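For illustration, a hedged sketch of the Lemma 1 shortcut for Euclidean distance (names and structure are made up; this is not the actual MLlib implementation):

{code:scala}
// Precompute, for each center b, radius(b) = 0.5 * min over other centers c of d(b, c).
// If d(x, b) <= radius(b), then Lemma 1 gives d(x, c) >= d(x, b) for every other
// center c, so x keeps its assignment to b without any further distance computation.
def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (u, v) => (u - v) * (u - v) }.sum)

def assign(x: Array[Double],
           centers: Array[Array[Double]],
           current: Int,
           radius: Array[Double]): Int = {
  val dCurrent = euclidean(x, centers(current))
  if (dCurrent <= radius(current)) current                      // Lemma 1: skip the other centers
  else centers.indices.minBy(i => euclidean(x, centers(i)))     // fall back to a full search
}
{code}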



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31539) Backport SPARK-27138 Remove AdminUtils calls (fixes deprecation)

2020-04-24 Thread Dylan Guedes (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091705#comment-17091705
 ] 

Dylan Guedes commented on SPARK-31539:
--

Agreed, I think it is not worth it.

> Backport SPARK-27138   Remove AdminUtils calls (fixes deprecation)
> --
>
> Key: SPARK-31539
> URL: https://issues.apache.org/jira/browse/SPARK-31539
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> SPARK-27138       Remove AdminUtils calls (fixes deprecation)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31538) Backport SPARK-25338 Ensure to call super.beforeAll() and super.afterAll() in test cases

2020-04-24 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091699#comment-17091699
 ] 

Kazuaki Ishizaki commented on SPARK-31538:
--

We could backport this. On the other hand, this is not a bug fix, and as far as 
I know, this change would not immediately surface any new issues.
If problems related to this had already been found, the corresponding changes 
would have been backported to the 2.4 branch already.

I think this is a nice-to-have for the maintenance branch.

> Backport SPARK-25338   Ensure to call super.beforeAll() and 
> super.afterAll() in test cases
> --
>
> Key: SPARK-31538
> URL: https://issues.apache.org/jira/browse/SPARK-31538
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Backport SPARK-25338       Ensure to call super.beforeAll() and 
> super.afterAll() in test cases



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31553) Wrong result of isInCollection for large collections

2020-04-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31553:
--
Labels: correctness  (was: )

> Wrong result of isInCollection for large collections
> 
>
> Key: SPARK-31553
> URL: https://issues.apache.org/jira/browse/SPARK-31553
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>  Labels: correctness
>
> If the size of a collection passed to isInCollection is bigger than 
> spark.sql.optimizer.inSetConversionThreshold, the method can return wrong 
> results for some inputs. For example:
> {code:scala}
> val set = (0 to 20).map(_.toString).toSet
> val data = Seq("1").toDF("x")
> println(set.contains("1"))
> data.select($"x".isInCollection(set).as("isInCollection")).show()
> {code}
> {code}
> true
> +--------------+
> |isInCollection|
> +--------------+
> |         false|
> +--------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31554) Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite

2020-04-24 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091636#comment-17091636
 ] 

Wenchen Fan commented on SPARK-31554:
-

[~Qin Yao] do you have any clue?

> Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite
> 
>
> Key: SPARK-31554
> URL: https://issues.apache.org/jira/browse/SPARK-31554
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The test org.apache.spark.sql.hive.thriftserver.CliSuite fails very often, 
> for example:
> * https://github.com/apache/spark/pull/28328#issuecomment-618992335
> The error message:
> {code}
> org.apache.spark.sql.hive.thriftserver.CliSuite.SPARK-11188 Analysis error 
> reporting
> Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Failed with 
> error line 'Exception in thread "main" 
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
> Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
>   at 
> org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$4(CliSuite.scala:138)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at 
> org.apache.spark.sql.hive.thriftserver.CliSuite.captureOutput$1(CliSuite.scala:135)
>   at 
> org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6(CliSuite.scala:152)
>   at 
> org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6$adapted(CliSuite.scala:152)
>   at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:188)
>   at 
> scala.sys.process.BasicIO$.$anonfun$processFully$1$adapted(BasicIO.scala:192)
>   at 
> org.apache.spark.sql.test.ProcessTestUtils$ProcessOutputCapturer.run(ProcessTestUtils.scala:30)
> {code}
> * https://github.com/apache/spark/pull/28261#issuecomment-618950225
> * https://github.com/apache/spark/pull/27617#issuecomment-614318644



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31552) Fix potential ClassCastException in ScalaReflection arrayClassFor

2020-04-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091633#comment-17091633
 ] 

Dongjoon Hyun commented on SPARK-31552:
---

Hi, [~Qin Yao]. I updated the Affected Version by adding 2.0.2 ~ 2.4.5.

> Fix potential ClassCastException in ScalaReflection arrayClassFor
> -
>
> Key: SPARK-31552
> URL: https://issues.apache.org/jira/browse/SPARK-31552
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> arrayClassFor and dataTypeFor in ScalaReflection call each other circularly, 
> the cases in dataTypeFor are not fully handled in arrayClassFor
> For example:
> {code:java}
> scala> import scala.reflect.runtime.universe.TypeTag
> scala> import org.apache.spark.sql._
> scala> import org.apache.spark.sql.catalyst.encoders._
> scala> import org.apache.spark.sql.types._
> scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = 
> ExpressionEncoder()
> newArrayEncoder: [T <: Array[_]](implicit evidence$1: 
> reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]
> scala> val decOne = Decimal(1, 38, 18)
> decOne: org.apache.spark.sql.types.Decimal = 1E-18
> scala> val decTwo = Decimal(2, 38, 18)
> decTwo: org.apache.spark.sql.types.Decimal = 2E-18
> scala> val decSpark = Array(decOne, decTwo)
> decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)
> scala> Seq(decSpark).toDF()
> java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot 
> be cast to org.apache.spark.sql.types.ObjectType
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:399)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:393)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:57)
>   at newArrayEncoder(<console>:57)
>   ... 53 elided
> scala>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31552) Fix potential ClassCastException in ScalaReflection arrayClassFor

2020-04-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31552:
--
Affects Version/s: 2.0.2
   2.1.3

> Fix potential ClassCastException in ScalaReflection arrayClassFor
> -
>
> Key: SPARK-31552
> URL: https://issues.apache.org/jira/browse/SPARK-31552
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> arrayClassFor and dataTypeFor in ScalaReflection call each other circularly, 
> the cases in dataTypeFor are not fully handled in arrayClassFor
> For example:
> {code:java}
> scala> import scala.reflect.runtime.universe.TypeTag
> scala> import org.apache.spark.sql._
> scala> import org.apache.spark.sql.catalyst.encoders._
> scala> import org.apache.spark.sql.types._
> scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = 
> ExpressionEncoder()
> newArrayEncoder: [T <: Array[_]](implicit evidence$1: 
> reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]
> scala> val decOne = Decimal(1, 38, 18)
> decOne: org.apache.spark.sql.types.Decimal = 1E-18
> scala> val decTwo = Decimal(2, 38, 18)
> decTwo: org.apache.spark.sql.types.Decimal = 2E-18
> scala> val decSpark = Array(decOne, decTwo)
> decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)
> scala> Seq(decSpark).toDF()
> java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot 
> be cast to org.apache.spark.sql.types.ObjectType
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:399)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:393)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:57)
>   at newArrayEncoder(<console>:57)
>   ... 53 elided
> scala>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31552) Fix potential ClassCastException in ScalaReflection arrayClassFor

2020-04-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31552:
--
Affects Version/s: 2.2.3

> Fix potential ClassCastException in ScalaReflection arrayClassFor
> -
>
> Key: SPARK-31552
> URL: https://issues.apache.org/jira/browse/SPARK-31552
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.5, 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> arrayClassFor and dataTypeFor in ScalaReflection call each other circularly, 
> but the cases handled in dataTypeFor are not all handled in arrayClassFor
> For example:
> {code:java}
> scala> import scala.reflect.runtime.universe.TypeTag
> scala> import org.apache.spark.sql._
> scala> import org.apache.spark.sql.catalyst.encoders._
> scala> import org.apache.spark.sql.types._
> scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = 
> ExpressionEncoder()
> newArrayEncoder: [T <: Array[_]](implicit evidence$1: 
> reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]
> scala> val decOne = Decimal(1, 38, 18)
> decOne: org.apache.spark.sql.types.Decimal = 1E-18
> scala> val decTwo = Decimal(2, 38, 18)
> decTwo: org.apache.spark.sql.types.Decimal = 2E-18
> scala> val decSpark = Array(decOne, decTwo)
> decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)
> scala> Seq(decSpark).toDF()
> java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot 
> be cast to org.apache.spark.sql.types.ObjectType
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:399)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:393)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:57)
>   at newArrayEncoder(:57)
>   ... 53 elided
> scala>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31552) Fix potential ClassCastException in ScalaReflection arrayClassFor

2020-04-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31552:
--
Affects Version/s: 2.3.4

> Fix potential ClassCastException in ScalaReflection arrayClassFor
> -
>
> Key: SPARK-31552
> URL: https://issues.apache.org/jira/browse/SPARK-31552
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.5, 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> arrayClassFor and dataTypeFor in ScalaReflection call each other circularly, 
> but the cases handled in dataTypeFor are not all handled in arrayClassFor
> For example:
> {code:java}
> scala> import scala.reflect.runtime.universe.TypeTag
> scala> import org.apache.spark.sql._
> scala> import org.apache.spark.sql.catalyst.encoders._
> scala> import org.apache.spark.sql.types._
> scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = 
> ExpressionEncoder()
> newArrayEncoder: [T <: Array[_]](implicit evidence$1: 
> reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]
> scala> val decOne = Decimal(1, 38, 18)
> decOne: org.apache.spark.sql.types.Decimal = 1E-18
> scala> val decTwo = Decimal(2, 38, 18)
> decTwo: org.apache.spark.sql.types.Decimal = 2E-18
> scala> val decSpark = Array(decOne, decTwo)
> decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)
> scala> Seq(decSpark).toDF()
> java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot 
> be cast to org.apache.spark.sql.types.ObjectType
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:399)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:393)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:57)
>   at newArrayEncoder(:57)
>   ... 53 elided
> scala>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31552) Fix potential ClassCastException in ScalaReflection arrayClassFor

2020-04-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31552:
--
Description: 
arrayClassFor and dataTypeFor in ScalaReflection call each other circularly, 
but the cases handled in dataTypeFor are not all handled in arrayClassFor

For example:

{code:java}
scala> import scala.reflect.runtime.universe.TypeTag
scala> import org.apache.spark.sql._
scala> import org.apache.spark.sql.catalyst.encoders._
scala> import org.apache.spark.sql.types._
scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = 
ExpressionEncoder()
newArrayEncoder: [T <: Array[_]](implicit evidence$1: 
reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]

scala> val decOne = Decimal(1, 38, 18)
decOne: org.apache.spark.sql.types.Decimal = 1E-18

scala> val decTwo = Decimal(2, 38, 18)
decTwo: org.apache.spark.sql.types.Decimal = 2E-18

scala> val decSpark = Array(decOne, decTwo)
decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)

scala> Seq(decSpark).toDF()
java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot be 
cast to org.apache.spark.sql.types.ObjectType
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
  at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
  at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:399)
  at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:393)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:57)
  at newArrayEncoder(:57)
  ... 53 elided

scala>
{code}


  was:
arrayClassFor and dataTypeFor in ScalaReflection call each other circularly, 
but the cases handled in dataTypeFor are not all handled in arrayClassFor

For example:

{code:java}
scala> import scala.reflect.runtime.universe.TypeTag
scala> import org.apache.spark.sql._
scala> import org.apache.spark.sql.catalyst.encoders._
scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = 
ExpressionEncoder()
newArrayEncoder: [T <: Array[_]](implicit evidence$1: 
reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]

scala> val decOne = Decimal(1, 38, 18)
decOne: org.apache.spark.sql.types.Decimal = 1E-18

scala> val decTwo = Decimal(2, 38, 18)
decTwo: org.apache.spark.sql.types.Decimal = 2E-18

scala> val decSpark = Array(decOne, decTwo)
decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)

scala> Seq(decSpark).toDF()
java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot be 
cast to org.apache.spark.sql.types.ObjectType
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
  at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
  at 

[jira] [Updated] (SPARK-31552) Fix potential ClassCastException in ScalaReflection arrayClassFor

2020-04-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31552:
--
Affects Version/s: 2.4.5

> Fix potential ClassCastException in ScalaReflection arrayClassFor
> -
>
> Key: SPARK-31552
> URL: https://issues.apache.org/jira/browse/SPARK-31552
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> arrayClassFor and dataTypeFor in ScalaReflection call each other circularly, 
> but the cases handled in dataTypeFor are not all handled in arrayClassFor
> For example:
> {code:java}
> scala> import scala.reflect.runtime.universe.TypeTag
> scala> import org.apache.spark.sql._
> scala> import org.apache.spark.sql.catalyst.encoders._
> scala> import org.apache.spark.sql.types._
> scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = 
> ExpressionEncoder()
> newArrayEncoder: [T <: Array[_]](implicit evidence$1: 
> reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]
> scala> val decOne = Decimal(1, 38, 18)
> decOne: org.apache.spark.sql.types.Decimal = 1E-18
> scala> val decTwo = Decimal(2, 38, 18)
> decTwo: org.apache.spark.sql.types.Decimal = 2E-18
> scala> val decSpark = Array(decOne, decTwo)
> decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)
> scala> Seq(decSpark).toDF()
> java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot 
> be cast to org.apache.spark.sql.types.ObjectType
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:399)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:393)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:57)
>   at newArrayEncoder(:57)
>   ... 53 elided
> scala>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31552) Fix potential ClassCastException in ScalaReflection arrayClassFor

2020-04-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31552:
--
Description: 
arrayClassFor and dataTypeFor in ScalaReflection call each other circularly, 
but the cases handled in dataTypeFor are not all handled in arrayClassFor

For example:

{code:java}
scala> import scala.reflect.runtime.universe.TypeTag
scala> import org.apache.spark.sql._
scala> import org.apache.spark.sql.catalyst.encoders._
scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = 
ExpressionEncoder()
newArrayEncoder: [T <: Array[_]](implicit evidence$1: 
reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]

scala> val decOne = Decimal(1, 38, 18)
decOne: org.apache.spark.sql.types.Decimal = 1E-18

scala> val decTwo = Decimal(2, 38, 18)
decTwo: org.apache.spark.sql.types.Decimal = 2E-18

scala> val decSpark = Array(decOne, decTwo)
decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)

scala> Seq(decSpark).toDF()
java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot be 
cast to org.apache.spark.sql.types.ObjectType
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
  at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
  at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:399)
  at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:393)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:57)
  at newArrayEncoder(:57)
  ... 53 elided

scala>
{code}


  was:
arrayClassFor and dataTypeFor in ScalaReflection call each other circularly, 
but the cases handled in dataTypeFor are not all handled in arrayClassFor

For example:

{code:java}
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.encoders._
{code:java}

{code:java}
scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = 
ExpressionEncoder()
newArrayEncoder: [T <: Array[_]](implicit evidence$1: 
reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]

scala> val decOne = Decimal(1, 38, 18)
decOne: org.apache.spark.sql.types.Decimal = 1E-18

scala> val decTwo = Decimal(2, 38, 18)
decTwo: org.apache.spark.sql.types.Decimal = 2E-18

scala> val decSpark = Array(decOne, decTwo)
decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)

scala> Seq(decSpark).toDF()
java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot be 
cast to org.apache.spark.sql.types.ObjectType
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
  at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
  at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at 

[jira] [Updated] (SPARK-31552) Fix potential ClassCastException in ScalaReflection arrayClassFor

2020-04-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31552:
--
Description: 
arrayClassFor and dataTypeFor in ScalaReflection call each other circularly, 
but the cases handled in dataTypeFor are not all handled in arrayClassFor

For example:

{code:java}
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.encoders._
{code:java}

{code:java}
scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = 
ExpressionEncoder()
newArrayEncoder: [T <: Array[_]](implicit evidence$1: 
reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]

scala> val decOne = Decimal(1, 38, 18)
decOne: org.apache.spark.sql.types.Decimal = 1E-18

scala> val decTwo = Decimal(2, 38, 18)
decTwo: org.apache.spark.sql.types.Decimal = 2E-18

scala> val decSpark = Array(decOne, decTwo)
decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)

scala> Seq(decSpark).toDF()
java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot be 
cast to org.apache.spark.sql.types.ObjectType
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
  at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
  at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:399)
  at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:393)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:57)
  at newArrayEncoder(:57)
  ... 53 elided

scala>
{code}


  was:
arrayClassFor and dataTypeFor in ScalaReflection call each other circularly, 
but the cases handled in dataTypeFor are not all handled in arrayClassFor

For example:


{code:java}
scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = 
ExpressionEncoder()
newArrayEncoder: [T <: Array[_]](implicit evidence$1: 
reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]

scala> val decOne = Decimal(1, 38, 18)
decOne: org.apache.spark.sql.types.Decimal = 1E-18

scala> val decTwo = Decimal(2, 38, 18)
decTwo: org.apache.spark.sql.types.Decimal = 2E-18

scala> val decSpark = Array(decOne, decTwo)
decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)

scala> Seq(decSpark).toDF()
java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot be 
cast to org.apache.spark.sql.types.ObjectType
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
  at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
  at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at 

[jira] [Commented] (SPARK-31554) Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite

2020-04-24 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091614#comment-17091614
 ] 

Maxim Gekk commented on SPARK-31554:


[~cloud_fan] [~hyukjin.kwon] Can we disable the flaky test until someone makes 
it stable?
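
For reference, a hedged sketch of what temporarily disabling the test could look
like in ScalaTest; the test name is taken from the failure above, and whether
skipping it is acceptable is left to the maintainers:

{code:scala}
// In CliSuite, switching test(...) to ignore(...) keeps the body compiling but
// skips execution until the flakiness is understood (illustrative only).
ignore("SPARK-11188 Analysis error reporting") {
  // existing test body unchanged
}
{code}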

> Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite
> 
>
> Key: SPARK-31554
> URL: https://issues.apache.org/jira/browse/SPARK-31554
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The test org.apache.spark.sql.hive.thriftserver.CliSuite fails very often, 
> for example:
> * https://github.com/apache/spark/pull/28328#issuecomment-618992335
> The error message:
> {code}
> org.apache.spark.sql.hive.thriftserver.CliSuite.SPARK-11188 Analysis error 
> reporting
> Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Failed with 
> error line 'Exception in thread "main" 
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
> Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
>   at 
> org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$4(CliSuite.scala:138)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at 
> org.apache.spark.sql.hive.thriftserver.CliSuite.captureOutput$1(CliSuite.scala:135)
>   at 
> org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6(CliSuite.scala:152)
>   at 
> org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6$adapted(CliSuite.scala:152)
>   at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:188)
>   at 
> scala.sys.process.BasicIO$.$anonfun$processFully$1$adapted(BasicIO.scala:192)
>   at 
> org.apache.spark.sql.test.ProcessTestUtils$ProcessOutputCapturer.run(ProcessTestUtils.scala:30)
> {code}
> * https://github.com/apache/spark/pull/28261#issuecomment-618950225
> * https://github.com/apache/spark/pull/27617#issuecomment-614318644



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31554) Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite

2020-04-24 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31554:
--

 Summary: Flaky test suite 
org.apache.spark.sql.hive.thriftserver.CliSuite
 Key: SPARK-31554
 URL: https://issues.apache.org/jira/browse/SPARK-31554
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


The test org.apache.spark.sql.hive.thriftserver.CliSuite fails very often, for 
example:
* https://github.com/apache/spark/pull/28328#issuecomment-618992335
The error message:
{code}
org.apache.spark.sql.hive.thriftserver.CliSuite.SPARK-11188 Analysis error 
reporting
Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Failed with 
error line 'Exception in thread "main" org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
at 
org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$4(CliSuite.scala:138)
at scala.collection.immutable.List.foreach(List.scala:392)
at 
org.apache.spark.sql.hive.thriftserver.CliSuite.captureOutput$1(CliSuite.scala:135)
at 
org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6(CliSuite.scala:152)
at 
org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6$adapted(CliSuite.scala:152)
at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:188)
at 
scala.sys.process.BasicIO$.$anonfun$processFully$1$adapted(BasicIO.scala:192)
at 
org.apache.spark.sql.test.ProcessTestUtils$ProcessOutputCapturer.run(ProcessTestUtils.scala:30)
{code}
* https://github.com/apache/spark/pull/28261#issuecomment-618950225
* https://github.com/apache/spark/pull/27617#issuecomment-614318644



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-31550) nondeterministic configurations with general meanings in sql configuration doc

2020-04-24 Thread JinxinTang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JinxinTang updated SPARK-31550:
---
Comment: was deleted

(was: Try specifying the conf in spark-defaults.conf:

spark.sql.warehouse.dir /tmp
spark.sql.session.timeZone America/New_York

It does not seem to be a bug.)

> nondeterministic configurations with general meanings in sql configuration doc
> --
>
> Key: SPARK-31550
> URL: https://issues.apache.org/jira/browse/SPARK-31550
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> spark.sql.session.timeZone
> spark.sql.warehouse.dir
>  
> These 2 configs are nondeterministic and vary across environments.
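
Independently of how the generated doc is fixed, a minimal sketch of pinning both
configs explicitly so they do not depend on the environment (the values below are
illustrative, not Spark defaults):

{code:scala}
import org.apache.spark.sql.SparkSession

// spark.sql.warehouse.dir is a static conf and must be set before the session is
// created; spark.sql.session.timeZone can also be changed later via spark.conf.set.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")
  .config("spark.sql.session.timeZone", "UTC")
  .getOrCreate()
{code}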



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-31532) SparkSessionBuilder should not propagate static sql configurations to the existing active/default SparkSession

2020-04-24 Thread JinxinTang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JinxinTang updated SPARK-31532:
---
Comment: was deleted

(was: Thanks for your issue. The following configs are not allowed to be modified 
after SparkSession startup by design:

[spark.sql.codegen.comments, spark.sql.queryExecutionListeners, 
spark.sql.catalogImplementation, spark.sql.subquery.maxThreadThreshold, 
spark.sql.globalTempDatabase, spark.sql.codegen.cache.maxEntries, 
spark.sql.filesourceTableRelationCacheSize, 
spark.sql.streaming.streamingQueryListeners, spark.sql.ui.retainedExecutions, 
spark.sql.hive.thriftServer.singleSession, spark.sql.extensions, 
spark.sql.debug, spark.sql.sources.schemaStringLengthThreshold, 
spark.sql.warehouse.dir] 

So this might not be a bug.)

> SparkSessionBuilder should not propagate static sql configurations to the 
> existing active/default SparkSession
> -
>
> Key: SPARK-31532
> URL: https://issues.apache.org/jira/browse/SPARK-31532
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Clearly, this is a bug.
> {code:java}
> scala> spark.sql("set spark.sql.warehouse.dir").show
> +++
> | key|   value|
> +++
> |spark.sql.warehou...|file:/Users/kenty...|
> +++
> scala> spark.sql("set spark.sql.warehouse.dir=2");
> org.apache.spark.sql.AnalysisException: Cannot modify the value of a static 
> config: spark.sql.warehouse.dir;
>   at 
> org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:154)
>   at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42)
>   at 
> org.apache.spark.sql.execution.command.SetCommand.$anonfun$x$7$6(SetCommand.scala:100)
>   at 
> org.apache.spark.sql.execution.command.SetCommand.run(SetCommand.scala:156)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
>   ... 47 elided
> scala> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.SparkSession
> scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").get
> getClass   getOrCreate
> scala> SparkSession.builder.config("spark.sql.warehouse.dir", 
> "xyz").getOrCreate
> 20/04/23 23:49:13 WARN SparkSession$Builder: Using an existing SparkSession; 
> some configuration may not take effect.
> res7: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@6403d574
> scala> spark.sql("set spark.sql.warehouse.dir").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.warehou...|  xyz|
> ++-+
> scala>
> {code}
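
Until the builder behaviour is settled, one hedged workaround sketch is to stop the
existing session before building a new one, so the static conf applies to a
genuinely new SparkSession (illustrative only; acceptable only if recreating the
underlying SparkContext is an option for the application):

{code:scala}
import org.apache.spark.sql.SparkSession

// Static SQL confs are only read when a SparkSession is created, so stop the
// active one first instead of relying on the builder to patch it in place.
SparkSession.getActiveSession.foreach(_.stop())
val fresh = SparkSession.builder()
  .config("spark.sql.warehouse.dir", "/tmp/other-warehouse") // illustrative path
  .getOrCreate()
{code}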



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30724) Support 'like any' and 'like all' operators

2020-04-24 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-30724.
--
Fix Version/s: 3.1.0
 Assignee: Yuming Wang
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/27477

> Support 'like any' and 'like all' operators
> ---
>
> Key: SPARK-30724
> URL: https://issues.apache.org/jira/browse/SPARK-30724
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> In Teradata/Hive and PostgreSQL, the 'like any' and 'like all' operators are 
> mostly used when matching a text field against a number of patterns. For 
> example:
> Teradata / Hive 3.0:
> {code:sql}
> --like any
> select 'foo' LIKE ANY ('%foo%','%bar%');
> --like all
> select 'foo' LIKE ALL ('%foo%','%bar%');
> {code}
> PostgreSQL:
> {code:sql}
> -- like any
> select 'foo' LIKE ANY (array['%foo%','%bar%']);
> -- like all
> select 'foo' LIKE ALL (array['%foo%','%bar%']);
> {code}
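
For comparison, once this is available in Spark (Fix Version 3.1.0), the same
predicates should be expressible directly in Spark SQL; a hedged sketch from a
Scala session (the exact accepted syntax is whatever the resolving PR implements):

{code:scala}
// LIKE ANY matches if at least one pattern matches (true here); LIKE ALL only if
// every pattern matches (false here).
spark.sql("SELECT 'foo' LIKE ANY ('%foo%', '%bar%') AS any_match").show()
spark.sql("SELECT 'foo' LIKE ALL ('%foo%', '%bar%') AS all_match").show()
{code}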



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31553) Wrong result of isInCollection for large collections

2020-04-24 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091490#comment-17091490
 ] 

Maxim Gekk commented on SPARK-31553:


I am working on the issue

> Wrong result of isInCollection for large collections
> 
>
> Key: SPARK-31553
> URL: https://issues.apache.org/jira/browse/SPARK-31553
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> If the size of a collection passed to isInCollection is bigger than 
> spark.sql.optimizer.inSetConversionThreshold, the method can return wrong 
> results for some inputs. For example:
> {code:scala}
> val set = (0 to 20).map(_.toString).toSet
> val data = Seq("1").toDF("x")
> println(set.contains("1"))
> data.select($"x".isInCollection(set).as("isInCollection")).show()
> {code}
> {code}
> true
> +--+
> |isInCollection|
> +--+
> | false|
> +--+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31553) Wrong result of isInCollection for large collections

2020-04-24 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31553:
--

 Summary: Wrong result of isInCollection for large collections
 Key: SPARK-31553
 URL: https://issues.apache.org/jira/browse/SPARK-31553
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Maxim Gekk


If the size of a collection passed to isInCollection is bigger than 
spark.sql.optimizer.inSetConversionThreshold, the method can return wrong 
results for some inputs. For example:
{code:scala}
val set = (0 to 20).map(_.toString).toSet
val data = Seq("1").toDF("x")
println(set.contains("1"))
data.select($"x".isInCollection(set).as("isInCollection")).show()
{code}
{code}
true
+--+
|isInCollection|
+--+
| false|
+--+
{code}
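
Until the fix lands, a hedged mitigation sketch, assuming the wrong results only
appear once the collection size exceeds the conversion threshold (as the
description states): raise the threshold above the collection size so the
predicate stays on the small-collection path. Reuses {{set}} and {{data}} from the
snippet above; illustrative only, and it trades away the InSet optimization.

{code:scala}
// Raising the threshold above the 21-element collection keeps the predicate on
// the code path used for small collections.
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", 100)
data.select($"x".isInCollection(set).as("isInCollection")).show()
{code}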



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson

2020-04-24 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091389#comment-17091389
 ] 

Maxim Gekk commented on SPARK-31463:


Parsing itself takes 10-20% of the time. The JSON datasource spends significant 
time converting values to the desired types according to the schema. Even if you 
improve parsing performance by a few times, the total impact will not be 
significant.

> Enhance JsonDataSource by replacing jackson with simdjson
> -
>
> Key: SPARK-31463
> URL: https://issues.apache.org/jira/browse/SPARK-31463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Steven Moy
>Priority: Minor
>
> I came across this VLDB paper: [https://arxiv.org/pdf/1902.08318.pdf] on how 
> to improve json reading speed. We use Spark to process terabytes of JSON, so 
> we try to find ways to improve JSON parsing speed. 
>  
> [https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-the-fastest-json-parser-in-the-world-is-even-better/]
>  
> [https://github.com/simdjson/simdjson/issues/93]
>  
> Is anyone in the open-source community interested in leading this effort to 
> integrate simdjson into the Spark JSON data source API?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31552) Fix potential ClassCastException in ScalaReflection arrayClassFor

2020-04-24 Thread Kent Yao (Jira)
Kent Yao created SPARK-31552:


 Summary: Fix potential ClassCastException in ScalaReflection 
arrayClassFor
 Key: SPARK-31552
 URL: https://issues.apache.org/jira/browse/SPARK-31552
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Kent Yao


arrayClassFor and dataTypeFor in ScalaReflection call each other circularly, 
but the cases handled in dataTypeFor are not all handled in arrayClassFor

For example:


{code:java}
scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = 
ExpressionEncoder()
newArrayEncoder: [T <: Array[_]](implicit evidence$1: 
reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]

scala> val decOne = Decimal(1, 38, 18)
decOne: org.apache.spark.sql.types.Decimal = 1E-18

scala> val decTwo = Decimal(2, 38, 18)
decTwo: org.apache.spark.sql.types.Decimal = 2E-18

scala> val decSpark = Array(decOne, decTwo)
decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)

scala> Seq(decSpark).toDF()
java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot be 
cast to org.apache.spark.sql.types.ObjectType
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
  at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
  at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:399)
  at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:393)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:57)
  at newArrayEncoder(:57)
  ... 53 elided

scala>
{code}
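
For illustration, a minimal standalone sketch of the failure mode: DecimalType is a
Catalyst DataType but not an ObjectType, so an unconditional cast of the kind done
in arrayClassFor's fallback branch throws the ClassCastException shown above; a fix
would need to handle such non-ObjectType results instead of casting.

{code:scala}
import org.apache.spark.sql.types.{DataType, DecimalType, ObjectType}

// dataTypeFor can yield a plain Catalyst type such as DecimalType for an array
// element; casting that result to ObjectType is what blows up.
val elementType: DataType = DecimalType(38, 18)
val cls = elementType.asInstanceOf[ObjectType] // throws java.lang.ClassCastException
{code}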




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31502) document identifier in SQL Reference

2020-04-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31502:
---

Assignee: Huaxin Gao

> document identifier in SQL Reference
> 
>
> Key: SPARK-31502
> URL: https://issues.apache.org/jira/browse/SPARK-31502
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> document identifier in SQL Reference



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31502) document identifier in SQL Reference

2020-04-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31502.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28277
[https://github.com/apache/spark/pull/28277]

> document identifier in SQL Reference
> 
>
> Key: SPARK-31502
> URL: https://issues.apache.org/jira/browse/SPARK-31502
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> document identifier in SQL Reference



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31449) Investigate the difference between JDK and Spark's time zone offset calculation

2020-04-24 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-31449:
---
Summary: Investigate the difference between JDK and Spark's time zone 
offset calculation  (was: Is there a difference between JDK and Spark's time 
zone offset calculation)

> Investigate the difference between JDK and Spark's time zone offset 
> calculation
> ---
>
> Key: SPARK-31449
> URL: https://issues.apache.org/jira/browse/SPARK-31449
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Maxim Gekk
>Priority: Major
>
> Spark 2.4 calculates time zone offsets from wall clock timestamp using 
> `DateTimeUtils.getOffsetFromLocalMillis()` (see 
> https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1088-L1118):
> {code:scala}
>   private[sql] def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): 
> Long = {
> var guess = tz.getRawOffset
> // the actual offset should be calculated based on milliseconds in UTC
> val offset = tz.getOffset(millisLocal - guess)
> if (offset != guess) {
>   guess = tz.getOffset(millisLocal - offset)
>   if (guess != offset) {
> // fallback to do the reverse lookup using java.sql.Timestamp
> // this should only happen near the start or end of DST
> val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
> val year = getYear(days)
> val month = getMonth(days)
> val day = getDayOfMonth(days)
> var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt
> if (millisOfDay < 0) {
>   millisOfDay += MILLIS_PER_DAY.toInt
> }
> val seconds = (millisOfDay / 1000L).toInt
> val hh = seconds / 3600
> val mm = seconds / 60 % 60
> val ss = seconds % 60
> val ms = millisOfDay % 1000
> val calendar = Calendar.getInstance(tz)
> calendar.set(year, month - 1, day, hh, mm, ss)
> calendar.set(Calendar.MILLISECOND, ms)
> guess = (millisLocal - calendar.getTimeInMillis()).toInt
>   }
> }
> guess
>   }
> {code}
> Meanwhile, JDK's GregorianCalendar uses special methods of ZoneInfo, see 
> https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2795-L2801:
> {code:java}
> if (zone instanceof ZoneInfo) {
> ((ZoneInfo)zone).getOffsetsByWall(millis, zoneOffsets);
> } else {
> int gmtOffset = isFieldSet(fieldMask, ZONE_OFFSET) ?
> internalGet(ZONE_OFFSET) : 
> zone.getRawOffset();
> zone.getOffsets(millis - gmtOffset, zoneOffsets);
> }
> {code}
> Need to investigate whether there are any differences in results between the 2 approaches.
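
One way to run that investigation, as a rough sketch: compare the two-step "guess"
above with the offset java.time assigns to the same wall-clock time around a DST
transition. The zone, date, and one-minute step below are illustrative, and the
Calendar fallback branch of getOffsetFromLocalMillis is deliberately omitted.

{code:scala}
import java.time.{LocalDateTime, ZoneId, ZoneOffset}
import java.util.TimeZone

// Two-step "guess" as in Spark 2.4 (without the Calendar fallback branch).
def sparkGuessOffset(millisLocal: Long, tz: TimeZone): Long = {
  var guess: Long = tz.getRawOffset
  val offset = tz.getOffset(millisLocal - guess)
  if (offset != guess) guess = tz.getOffset(millisLocal - offset)
  guess
}

// Offset java.time assigns to the same wall-clock time.
def javaTimeOffset(millisLocal: Long, zone: ZoneId): Long = {
  val wallClock = LocalDateTime.ofEpochSecond(millisLocal / 1000, 0, ZoneOffset.UTC)
  zone.getRules.getOffset(wallClock).getTotalSeconds * 1000L
}

val zoneName = "America/New_York" // DST-sensitive zone, illustrative
val tz = TimeZone.getTimeZone(zoneName)
val zone = ZoneId.of(zoneName)
// one-minute steps across a day containing a DST transition (2019-03-10)
val dayStart = LocalDateTime.of(2019, 3, 10, 0, 0).toEpochSecond(ZoneOffset.UTC) * 1000L
val disagreements = (0 until 24 * 60)
  .map(min => dayStart + min * 60000L)
  .filter(m => sparkGuessOffset(m, tz) != javaTimeOffset(m, zone))
println(s"local minutes where the two offsets disagree: ${disagreements.size}")
{code}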



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31449) Investigate the difference between JDK and Spark's time zone offset calculation

2020-04-24 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-31449:
---
Issue Type: Improvement  (was: Question)

> Investigate the difference between JDK and Spark's time zone offset 
> calculation
> ---
>
> Key: SPARK-31449
> URL: https://issues.apache.org/jira/browse/SPARK-31449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Maxim Gekk
>Priority: Major
>
> Spark 2.4 calculates time zone offsets from wall clock timestamp using 
> `DateTimeUtils.getOffsetFromLocalMillis()` (see 
> https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1088-L1118):
> {code:scala}
>   private[sql] def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): 
> Long = {
> var guess = tz.getRawOffset
> // the actual offset should be calculated based on milliseconds in UTC
> val offset = tz.getOffset(millisLocal - guess)
> if (offset != guess) {
>   guess = tz.getOffset(millisLocal - offset)
>   if (guess != offset) {
> // fallback to do the reverse lookup using java.sql.Timestamp
> // this should only happen near the start or end of DST
> val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
> val year = getYear(days)
> val month = getMonth(days)
> val day = getDayOfMonth(days)
> var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt
> if (millisOfDay < 0) {
>   millisOfDay += MILLIS_PER_DAY.toInt
> }
> val seconds = (millisOfDay / 1000L).toInt
> val hh = seconds / 3600
> val mm = seconds / 60 % 60
> val ss = seconds % 60
> val ms = millisOfDay % 1000
> val calendar = Calendar.getInstance(tz)
> calendar.set(year, month - 1, day, hh, mm, ss)
> calendar.set(Calendar.MILLISECOND, ms)
> guess = (millisLocal - calendar.getTimeInMillis()).toInt
>   }
> }
> guess
>   }
> {code}
> Meanwhile, JDK's GregorianCalendar uses special methods of ZoneInfo, see 
> https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2795-L2801:
> {code:java}
> if (zone instanceof ZoneInfo) {
> ((ZoneInfo)zone).getOffsetsByWall(millis, zoneOffsets);
> } else {
> int gmtOffset = isFieldSet(fieldMask, ZONE_OFFSET) ?
> internalGet(ZONE_OFFSET) : 
> zone.getRawOffset();
> zone.getOffsets(millis - gmtOffset, zoneOffsets);
> }
> {code}
> Need to investigate whether there are any differences in results between the 2 approaches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31535) Fix nested CTE substitution

2020-04-24 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091299#comment-17091299
 ] 

Peter Toth commented on SPARK-31535:


Hmm, for some reason my PR ([https://github.com/apache/spark/pull/28318]) 
didn't get linked to this ticket automatically.

> Fix nested CTE substitution
> ---
>
> Key: SPARK-31535
> URL: https://issues.apache.org/jira/browse/SPARK-31535
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Peter Toth
>Priority: Blocker
>  Labels: correctness
>
> The following nested CTE should return an empty result instead of {{1}}
> {noformat}
> WITH t(c) AS (SELECT 1)
> SELECT * FROM t
> WHERE c IN (
>   WITH t(c) AS (SELECT 2)
>   SELECT * FROM t
> )
> {noformat}
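
A hedged repro sketch from a Scala session; an empty result is expected because the
inner {{t}} should shadow the outer one inside the IN subquery:

{code:scala}
// Running the query from the description; the outer t only contains c = 1, and
// the inner t (SELECT 2) should be the one visible inside the subquery.
spark.sql("""
  WITH t(c) AS (SELECT 1)
  SELECT * FROM t
  WHERE c IN (
    WITH t(c) AS (SELECT 2)
    SELECT * FROM t
  )
""").show()
{code}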



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials

2020-04-24 Thread Yuqi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated SPARK-31551:
--
Description: 
See current 
*[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
{code:java}
   def createSparkUser(): UserGroupInformation = {
val user = Utils.getCurrentUserName()
logDebug("creating UGI for user: " + user)
val ugi = UserGroupInformation.createRemoteUser(user)
transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
ugi
  }

  def transferCredentials(source: UserGroupInformation, dest: 
UserGroupInformation): Unit = {
dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
Option(System.getenv("SPARK_USER"))
  .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* func can only transfer Hadoop creds such as 
Delegation Tokens.
 However, other creds stored in UGI.subject.getPrivateCredentials will be lost 
here, such as:
 # Non-Hadoop creds:
 Such as, [Kafka creds 
|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
 # Newly supported or 3rd party supported Hadoop creds:
 For example, to support OAuth/JWT token authn on Hadoop, we need to store the 
OAuth/JWT token into UGI.subject.getPrivateCredentials. However, these tokens 
are not supposed to be managed by Hadoop Credentials (currently it is only for 
Hadoop secret keys and delegation tokens)

Another issue is that the *SPARK_USER* only gets the 
UserGroupInformation.getCurrentUser().getShortUserName() of the user, which may 
lose the user's fully qualified user name. We should use the 
*getUserName* to get the fully qualified user name on our client side, which is 
aligned with 
*[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L716-L720]*.

Related to https://issues.apache.org/jira/browse/SPARK-1051

  was:
See current 
*[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
{code:java}
  def createSparkUser(): UserGroupInformation = {
    val user = Utils.getCurrentUserName()
    logDebug("creating UGI for user: " + user)
    val ugi = UserGroupInformation.createRemoteUser(user)
    transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
    ugi
  }

  def transferCredentials(source: UserGroupInformation,
                          dest: UserGroupInformation): Unit = {
    dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
    Option(System.getenv("SPARK_USER"))
      .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* function only transfers Hadoop credentials such as 
delegation tokens.
 However, other credentials stored in UGI.subject.getPrivateCredentials are lost 
here, for example:
 # Non-Hadoop credentials:
 Such as [Kafka 
credentials|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
 # Newly supported or third-party Hadoop credentials:
 For example, to support OAuth/JWT token authentication on Hadoop, we need to 
store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, such 
tokens are not supposed to be managed by Hadoop Credentials, which currently 
holds only Hadoop secret keys and delegation tokens.

Another issue is that *SPARK_USER* only returns the getShortUserName of the 
user, which may lose the user's fully qualified user name that needs to be 
passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use 
*getUserName* to get the fully qualified user name on the client side, in line 
with 
*[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L716-L720]*.

Related to https://issues.apache.org/jira/browse/SPARK-1051


> createSparkUser lost user's non-Hadoop credentials
> --
>
> Key: SPARK-31551
> URL: https://issues.apache.org/jira/browse/SPARK-31551
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Yuqi Wang
>Priority: Major
>
> See current 
> 

[jira] [Updated] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials and fully qualified user name

2020-04-24 Thread Yuqi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated SPARK-31551:
--
Description: 
See current 
*[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
{code:java}
  def createSparkUser(): UserGroupInformation = {
    val user = Utils.getCurrentUserName()
    logDebug("creating UGI for user: " + user)
    val ugi = UserGroupInformation.createRemoteUser(user)
    transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
    ugi
  }

  def transferCredentials(source: UserGroupInformation,
                          dest: UserGroupInformation): Unit = {
    dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
    Option(System.getenv("SPARK_USER"))
      .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* function only transfers Hadoop credentials such as 
delegation tokens.
 However, other credentials stored in UGI.subject.getPrivateCredentials are lost 
here, for example:
 # Non-Hadoop credentials:
 Such as [Kafka 
credentials|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
 # Newly supported or third-party Hadoop credentials:
 For example, to support OAuth/JWT token authentication on Hadoop, we need to 
store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, such 
tokens are not supposed to be managed by Hadoop Credentials, which currently 
holds only Hadoop secret keys and delegation tokens.

Another issue is that *SPARK_USER* only returns the getShortUserName of the 
user, which may lose the user's fully qualified user name that needs to be 
passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use 
*getUserName* to get the fully qualified user name on the client side, in line 
with 
*[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L716-L720]*.

Related to https://issues.apache.org/jira/browse/SPARK-1051

  was:
See current 
*[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
{code:java}
  def createSparkUser(): UserGroupInformation = {
    val user = Utils.getCurrentUserName()
    logDebug("creating UGI for user: " + user)
    val ugi = UserGroupInformation.createRemoteUser(user)
    transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
    ugi
  }

  def transferCredentials(source: UserGroupInformation,
                          dest: UserGroupInformation): Unit = {
    dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
    Option(System.getenv("SPARK_USER"))
      .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* function only transfers Hadoop credentials such as 
delegation tokens.
 However, other credentials stored in UGI.subject.getPrivateCredentials are lost 
here, for example:
 # Non-Hadoop credentials:
 Such as [Kafka 
credentials|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
 # Newly supported or third-party Hadoop credentials:
 For example, to support OAuth/JWT token authentication on Hadoop, we need to 
store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, such 
tokens are not supposed to be managed by Hadoop Credentials, which currently 
holds only Hadoop secret keys and delegation tokens.

Another issue is that *getCurrentUserName* only returns the getShortUserName of 
the user, which may lose the user's fully qualified user name that needs to be 
passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use 
*getUserName* to get the fully qualified user name on the client side, in line 
with 
*[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L716-L720]*.

Related to https://issues.apache.org/jira/browse/SPARK-1051


> createSparkUser lost user's non-Hadoop credentials and fully qualified user 
> name
> 
>
> Key: SPARK-31551
> URL: https://issues.apache.org/jira/browse/SPARK-31551
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Yuqi Wang
>Priority: Major
>
> See current 
> 

[jira] [Updated] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials

2020-04-24 Thread Yuqi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated SPARK-31551:
--
Summary: createSparkUser lost user's non-Hadoop credentials  (was: 
createSparkUser lost user's non-Hadoop credentials and fully qualified user 
name)

> createSparkUser lost user's non-Hadoop credentials
> --
>
> Key: SPARK-31551
> URL: https://issues.apache.org/jira/browse/SPARK-31551
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Yuqi Wang
>Priority: Major
>
> See current 
> *[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
> {code:java}
>   def createSparkUser(): UserGroupInformation = {
>     val user = Utils.getCurrentUserName()
>     logDebug("creating UGI for user: " + user)
>     val ugi = UserGroupInformation.createRemoteUser(user)
>     transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
>     ugi
>   }
>   def transferCredentials(source: UserGroupInformation,
>                           dest: UserGroupInformation): Unit = {
>     dest.addCredentials(source.getCredentials())
>   }
>   def getCurrentUserName(): String = {
>     Option(System.getenv("SPARK_USER"))
>       .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
>   }
> {code}
> The *transferCredentials* function only transfers Hadoop credentials such as 
> delegation tokens.
>  However, other credentials stored in UGI.subject.getPrivateCredentials are 
> lost here, for example:
>  # Non-Hadoop credentials:
>  Such as [Kafka 
> credentials|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
>  # Newly supported or third-party Hadoop credentials:
>  For example, to support OAuth/JWT token authentication on Hadoop, we need to 
> store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, such 
> tokens are not supposed to be managed by Hadoop Credentials, which currently 
> holds only Hadoop secret keys and delegation tokens.
> Another issue is that *SPARK_USER* only returns the getShortUserName of the 
> user, which may lose the user's fully qualified user name that needs to be 
> passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use 
> *getUserName* to get the fully qualified user name on the client side, in line 
> with 
> *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L716-L720]*.
> Related to https://issues.apache.org/jira/browse/SPARK-1051



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson

2020-04-24 Thread Shashanka Balakuntala Srinivasa (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091232#comment-17091232
 ] 

Shashanka Balakuntala Srinivasa commented on SPARK-31463:
-

Hi [~hyukjin.kwon], I will start looking into this. Thanks.

> Enhance JsonDataSource by replacing jackson with simdjson
> -
>
> Key: SPARK-31463
> URL: https://issues.apache.org/jira/browse/SPARK-31463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Steven Moy
>Priority: Minor
>
> I came across this VLDB paper: [https://arxiv.org/pdf/1902.08318.pdf] on how 
> to improve JSON reading speed. We use Spark to process terabytes of JSON, so 
> we are looking for ways to improve JSON parsing speed. 
>  
> [https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-the-fastest-json-parser-in-the-world-is-even-better/]
>  
> [https://github.com/simdjson/simdjson/issues/93]
>  
> Is anyone in the open-source community interested in leading this effort to 
> integrate simdjson into the Spark JSON data source API?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson

2020-04-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091230#comment-17091230
 ] 

Hyukjin Kwon commented on SPARK-31463:
--

A separate source might be ideal. We could start it as a separate project and 
gradually move it into Apache Spark once it has proven very useful.
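
For reference, a separate data source would typically be consumed like the 
sketch below; the package coordinates, format name, schema and path are all 
hypothetical, and a SparkSession named spark is assumed:
{code:scala}
// Hypothetical external JSON source shipped as its own package, e.g.
//   spark-submit --packages org.example:spark-simdjson_2.12:0.1.0 ...
// The short format name "simdjson" is an assumption, not an existing source.
val df = spark.read
  .format("simdjson")
  .schema("id LONG, payload STRING")
  .load("/path/to/json")
df.show()
{code}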

> Enhance JsonDataSource by replacing jackson with simdjson
> -
>
> Key: SPARK-31463
> URL: https://issues.apache.org/jira/browse/SPARK-31463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Steven Moy
>Priority: Minor
>
> I came across this VLDB paper: [https://arxiv.org/pdf/1902.08318.pdf] on how 
> to improve JSON reading speed. We use Spark to process terabytes of JSON, so 
> we are looking for ways to improve JSON parsing speed. 
>  
> [https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-the-fastest-json-parser-in-the-world-is-even-better/]
>  
> [https://github.com/simdjson/simdjson/issues/93]
>  
> Is anyone in the open-source community interested in leading this effort to 
> integrate simdjson into the Spark JSON data source API?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31438) Support JobCleaned Status in SparkListener

2020-04-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091225#comment-17091225
 ] 

Hyukjin Kwon commented on SPARK-31438:
--

PR https://github.com/apache/spark/pull/28280

> Support JobCleaned Status in SparkListener
> --
>
> Key: SPARK-31438
> URL: https://issues.apache.org/jira/browse/SPARK-31438
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Jackey Lee
>Priority: Major
>
> In Spark, we need to run hooks after a job is cleaned, such as cleaning Hive 
> external temporary paths. This has already been discussed in SPARK-31346 and 
> [GitHub Pull Request #28129.|https://github.com/apache/spark/pull/28129]
>  The JobEnd status is not suitable for this: JobEnd marks job completion, so 
> once all results have been generated the job is finished. After that, the 
> scheduler leaves still-running tasks as zombie tasks and deletes abnormal 
> tasks asynchronously.
>  Thus, we add a JobCleaned status to let users run hooks after all tasks of a 
> job have been cleaned. The JobCleaned status can be derived from the 
> TaskSetManagers, each of which corresponds to a stage; once all stages of the 
> job have been cleaned, the job is cleaned.
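
For context, cleanup hooks today can only key off the existing JobEnd event; a 
minimal sketch with the current listener API is shown below (the proposed 
JobCleaned callback is not shown, since its signature is not yet part of Spark):
{code:scala}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

// Sketch: cleanup driven by the existing JobEnd event. The proposal in this
// ticket is a later JobCleaned event that would fire only after all tasks of
// the job, including zombie tasks, have been cleaned up.
class CleanupListener extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    // e.g. delete temporary paths registered for this job id (hypothetical)
    println(s"Job ${jobEnd.jobId} ended at ${jobEnd.time}: ${jobEnd.jobResult}")
  }
}

// Registration, assuming an existing SparkContext named sc:
//   sc.addSparkListener(new CleanupListener)
{code}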



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31453) Error while converting JavaRDD to Dataframe

2020-04-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31453.
--
Resolution: Duplicate

It duplicates SPARK-23862. See SPARK-21255 for the workaround.


> Error while converting JavaRDD to Dataframe
> ---
>
> Key: SPARK-31453
> URL: https://issues.apache.org/jira/browse/SPARK-31453
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.5
>Reporter: Sachit Sharma
>Priority: Trivial
>
> Please refer to this: 
> [https://stackoverflow.com/questions/61172007/error-while-converting-javardd-to-dataframe]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org