[jira] [Assigned] (SPARK-28285) Convert and port 'outer-join.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28285: Assignee: Huaxin Gao (was: Apache Spark) > Convert and port 'outer-join.sql' into UDF test base > > > Key: SPARK-28285 > URL: https://issues.apache.org/jira/browse/SPARK-28285 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28277) Convert and port 'except.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28277. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25101 [https://github.com/apache/spark/pull/25101] > Convert and port 'except.sql' into UDF test base > > > Key: SPARK-28277 > URL: https://issues.apache.org/jira/browse/SPARK-28277 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28277) Convert and port 'except.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28277: Assignee: Huaxin Gao > Convert and port 'except.sql' into UDF test base > > > Key: SPARK-28277 > URL: https://issues.apache.org/jira/browse/SPARK-28277 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Huaxin Gao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence
[ https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888495#comment-16888495 ] Mathew Wicks commented on SPARK-24632: -- I have an elegant solution for this: you can include a separate Python package which mirrors the class address of the Java objects you wrap. For example, in the PySpark API for XGBoost I created the following package for objects under *ml.dmlc.xgboost4j.scala.spark._* {code:java} ml/__init__.py ml/dmlc/__init__.py ml/dmlc/xgboost4j/__init__.py ml/dmlc/xgboost4j/scala/__init__.py ml/dmlc/xgboost4j/scala/spark/__init__.py {code} With every __init__.py empty except the final one, which contained: {code:java} import sys from sparkxgb import xgboost # Allows Pipeline()/PipelineModel() with XGBoost stages to be loaded from disk. # Needed because they try to import Python objects from their Java location. sys.modules['ml.dmlc.xgboost4j.scala.spark'] = xgboost {code} My actual Python wrapper classes live under *sparkxgb.xgboost*. This works because PySpark will try to import from the Java address of the class, even though it's in Python. For more context, you can find [the initial PR here|https://github.com/dmlc/xgboost/pull/4656]. > Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers > for persistence > -- > > Key: SPARK-24632 > URL: https://issues.apache.org/jira/browse/SPARK-24632 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Joseph K. Bradley >Priority: Major > > This is a follow-up for [SPARK-17025], which allowed users to implement > Python PipelineStages in 3rd-party libraries, include them in Pipelines, and > use Pipeline persistence. This task is to make it easier for 3rd-party > libraries to have PipelineStages written in Java and then to use pyspark.ml > abstractions to create wrappers around those Java classes. This is currently > possible, except that users hit bugs around persistence. > I spent a bit of time thinking about this and wrote up thoughts and a proposal in the > doc linked below. Summary of proposal: > Require that 3rd-party libraries whose Java classes have Python wrappers > implement a trait which provides the corresponding Python classpath in some > field: > {code} > trait PythonWrappable { > def pythonClassPath: String = … > } > MyJavaType extends PythonWrappable > {code} > This will not be required for MLlib wrappers, which we can handle specially. > One issue for this task will be that we may have trouble writing unit tests. > They would ideally test a Java class + Python wrapper class pair sitting > outside of pyspark. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
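For reference, the trait proposed in the description might look like the following in a 3rd-party library — a sketch only (the trait does not exist in Spark; the class name and Python path are illustrative):
{code:scala}
// Hypothetical trait from the proposal: a Java/Scala PipelineStage advertises
// where its Python wrapper lives, so pyspark.ml persistence can import it on load.
trait PythonWrappable {
  // Fully-qualified Python class path of the wrapper, e.g. the sparkxgb case above.
  def pythonClassPath: String
}

// Illustrative 3rd-party stage pointing at its (hypothetical) Python wrapper.
class MyJavaTransformer extends PythonWrappable {
  override def pythonClassPath: String = "mylib.wrappers.MyTransformer"
}
{code}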
[jira] [Updated] (SPARK-28436) Throw better exception when datasource's schema is not equal to user-specific schema
[ https://issues.apache.org/jira/browse/SPARK-28436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28436: - Summary: Throw better exception when datasource's schema is not equal to user-specific schema (was: [SQL] Throw better exception when datasource's schema is not equal to user-specific schema) > Throw better exception when datasource's schema is not equal to user-specific > schema > --- > > Key: SPARK-28436 > URL: https://issues.apache.org/jira/browse/SPARK-28436 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.3 >Reporter: ShuMing Li >Priority: Minor > > When this exception is thrown, users cannot tell the difference > between the datasource's original schema and the user-specified schema, and may be very > confused when they meet the exception below. > {code:java} > org.apache.spark.sql.AnalysisException: org.apache.spark.odps.datasource does > not allow user-specified schemas. > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:347) > at > org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:78) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3270) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:83) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3269) > at org.apache.spark.sql.Dataset.(Dataset.scala:190) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:653) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:714) > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
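A minimal sketch of the clearer error this ticket asks for (mirroring the snippet quoted in the related SPARK-28438 report below, not the actual fix): include both schemas in the message so the difference is visible.
{code:scala}
// DataSource.scala (sketch): report both schemas instead of only the class name.
case (dataSource: RelationProvider, Some(schema)) =>
  val baseRelation =
    dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
  if (baseRelation.schema != schema) {
    throw new AnalysisException(
      s"$className does not allow user-specified schemas. " +
      s"Source schema: ${baseRelation.schema}; user-specified schema: $schema")
  }
{code}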
[jira] [Resolved] (SPARK-28438) Ignore metadata's(comments) difference when comparing datasource's schema and user-specific schema
[ https://issues.apache.org/jira/browse/SPARK-28438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28438. -- Resolution: Duplicate > Ignore metadata's(comments) difference when comparing datasource's schema and > user-specific schema > -- > > Key: SPARK-28438 > URL: https://issues.apache.org/jira/browse/SPARK-28438 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: ShuMing Li >Priority: Minor > > When users register a datasource table in Spark, Spark currently only supports exact > schema equality between the datasource's original schema and the user-specified schema. > However, the datasource's original schema may be slightly different from > the user-specified schema: the diff may be a `column's comment` or other metadata > info. > Can we ignore a column's comment or metadata info when comparing? > {code:java} > // DataSource.scala > case (dataSource: RelationProvider, Some(schema)) => > val baseRelation = > dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions) > if (baseRelation.schema != schema) { > throw new AnalysisException(s"$className does not allow user-specified > schemas, " + > s"source schema: ${baseRelation.schema}, user-specific schema: ${schema}") > } > // StructType.scala > override def equals(that: Any): Boolean = that match { > case StructType(otherFields) => java.util.Arrays.equals( > fields.asInstanceOf[Array[AnyRef]], otherFields.asInstanceOf[Array[AnyRef]]) > case _ => false > } > {code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28438) Ignore metadata's(comments) difference when comparing datasource's schema and user-specific schema
[ https://issues.apache.org/jira/browse/SPARK-28438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28438: - Summary: Ignore metadata's(comments) difference when comparing datasource's schema and user-specific schema (was: [SQL] Ignore metadata's(comments) difference when comparing datasource's schema and user-specific schema) > Ignore metadata's(comments) difference when comparing datasource's schema and > user-specific schema > -- > > Key: SPARK-28438 > URL: https://issues.apache.org/jira/browse/SPARK-28438 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: ShuMing Li >Priority: Minor > > When users register a datasource table in Spark, Spark currently only supports exact > schema equality between the datasource's original schema and the user-specified schema. > However, the datasource's original schema may be slightly different from > the user-specified schema: the diff may be a `column's comment` or other metadata > info. > Can we ignore a column's comment or metadata info when comparing? > {code:java} > // DataSource.scala > case (dataSource: RelationProvider, Some(schema)) => > val baseRelation = > dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions) > if (baseRelation.schema != schema) { > throw new AnalysisException(s"$className does not allow user-specified > schemas, " + > s"source schema: ${baseRelation.schema}, user-specific schema: ${schema}") > } > // StructType.scala > override def equals(that: Any): Boolean = that match { > case StructType(otherFields) => java.util.Arrays.equals( > fields.asInstanceOf[Array[AnyRef]], otherFields.asInstanceOf[Array[AnyRef]]) > case _ => false > } > {code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28442) Potentially persist data without leadership
[ https://issues.apache.org/jira/browse/SPARK-28442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888478#comment-16888478 ] Hyukjin Kwon commented on SPARK-28442: -- [~Tison], please provide a reproducer and/or console output if you have faced this problem. It's difficult to see what the issue is from the current JIRA description alone. > Potentially persist data without leadership > --- > > Key: SPARK-28442 > URL: https://issues.apache.org/jira/browse/SPARK-28442 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.4.3 >Reporter: TisonKun >Priority: Major > > Spark Master could potentially persist data via > {{ZooKeeperPersistenceEngine}} even if it is not the leader. See the > execution order below. > 1. master-1 became the leader. > 2. master-1 received a message and wanted to addApplication (or addWorker). > 3. master-1 got stuck because of a full GC. > 4. master-1 lost leadership on ZK. master-2 became the leader. > 5. master-1 received the {{RevokedLeadership}} message, but the message was > pending. > 6. master-1 finished persisting data. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28285) Convert and port 'outer-join.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28285. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25103 [https://github.com/apache/spark/pull/25103] > Convert and port 'outer-join.sql' into UDF test base > > > Key: SPARK-28285 > URL: https://issues.apache.org/jira/browse/SPARK-28285 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28443) Spark sql add exception when create field type NullType
[ https://issues.apache.org/jira/browse/SPARK-28443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-28443: Environment: (was: ) > Spark sql add exception when create field type NullType > > > Key: SPARK-28443 > URL: https://issues.apache.org/jira/browse/SPARK-28443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: ulysses you >Priority: Major > > [SPARK-28313|https://issues.apache.org/jira/browse/SPARK-28313] changed a behavior: > `Add rule to throw exception when catalog.create NullType StructField`. > This ticket is to discuss the details. > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28443) Spark sql add exception when create field type NullType
[ https://issues.apache.org/jira/browse/SPARK-28443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-28443: Description: [SPARK-28313|https://issues.apache.org/jira/browse/SPARK-28313] changed a behavior: `Add rule to throw exception when catalog.create NullType StructField`. This ticket is to discuss the details. was: SPARK-28313 changed a behavior: `Add rule to throw exception when catalog.create NullType StructField`. This ticket is to discuss the details. > Spark sql add exception when create field type NullType > > > Key: SPARK-28443 > URL: https://issues.apache.org/jira/browse/SPARK-28443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 > Environment: > >Reporter: ulysses you >Priority: Major > > [SPARK-28313|https://issues.apache.org/jira/browse/SPARK-28313] changed a behavior: > `Add rule to throw exception when catalog.create NullType StructField`. > This ticket is to discuss the details. > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28443) Spark sql add exception when create field type NullType
[ https://issues.apache.org/jira/browse/SPARK-28443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-28443: Environment: [SPARK-28313|https://issues.apache.org/jira/browse/SPARK-28313] changed a behavior: `Add rule to throw exception when catalog.create NullType StructField`. This ticket is to discuss the details. was: SPARK-28313 changed a behavior: `Add rule to throw exception when catalog.create NullType StructField`. This ticket is to discuss the details. > Spark sql add exception when create field type NullType > > > Key: SPARK-28443 > URL: https://issues.apache.org/jira/browse/SPARK-28443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 > Environment: [SPARK-28313|https://issues.apache.org/jira/browse/SPARK-28313] > changed a behavior: > `Add rule to throw exception when catalog.create NullType StructField`. > This ticket is to discuss the details. > >Reporter: ulysses you >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28443) Spark sql add exception when create field type NullType
[ https://issues.apache.org/jira/browse/SPARK-28443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-28443: Description: SPARK-28313 changed a behavior: `Add rule to throw exception when catalog.create NullType StructField`. This ticket is to discuss the details. > Spark sql add exception when create field type NullType > > > Key: SPARK-28443 > URL: https://issues.apache.org/jira/browse/SPARK-28443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 > Environment: > >Reporter: ulysses you >Priority: Major > > SPARK-28313 > changed a behavior: > `Add rule to throw exception when catalog.create NullType StructField`. > This ticket is to discuss the details. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28443) Spark sql add exception when create field type NullType
[ https://issues.apache.org/jira/browse/SPARK-28443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-28443: Environment: was: [SPARK-28313|https://issues.apache.org/jira/browse/SPARK-28313] changed a behavior: `Add rule to throw exception when catalog.create NullType StructField`. This ticket is to discuss the details. > Spark sql add exception when create field type NullType > > > Key: SPARK-28443 > URL: https://issues.apache.org/jira/browse/SPARK-28443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 > Environment: > >Reporter: ulysses you >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28443) Spark sql add exception when create field type NullType
ulysses you created SPARK-28443: --- Summary: Spark sql add exception when create field type NullType Key: SPARK-28443 URL: https://issues.apache.org/jira/browse/SPARK-28443 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Environment: [SPARK-28313|https://issues.apache.org/jira/browse/SPARK-28313] changed a behavior when creating a table with a NullType column, so this ticket is to discuss the details. Reporter: ulysses you -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28443) Spark sql add exception when create field type NullType
[ https://issues.apache.org/jira/browse/SPARK-28443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-28443: Environment: SPARK-28313 changed a behavior: `Add rule to throw exception when catalog.create NullType StructField`. This ticket is to discuss the details. was: [SPARK-28313|https://issues.apache.org/jira/browse/SPARK-28313] changed a behavior when creating a table with a NullType column, so this ticket is to discuss the details. > Spark sql add exception when create field type NullType > > > Key: SPARK-28443 > URL: https://issues.apache.org/jira/browse/SPARK-28443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 > Environment: SPARK-28313 > changed a behavior: > `Add rule to throw exception when catalog.create NullType StructField`. > This ticket is to discuss the details. >Reporter: ulysses you >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
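To make the behavior under discussion concrete, here is an illustration of the rule referenced above (SPARK-28313) as this ticket understands it — expected behavior, not verified output:
{code:scala}
// Creating a table with an untyped NULL column should now raise an exception...
spark.sql("CREATE TABLE t_bad AS SELECT null AS v")
// ...while giving the column a real type via an explicit cast keeps working.
spark.sql("CREATE TABLE t_ok AS SELECT CAST(null AS STRING) AS v")
{code}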
[jira] [Resolved] (SPARK-28287) Convert and port 'udaf.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28287. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25194 [https://github.com/apache/spark/pull/25194] > Convert and port 'udaf.sql' into UDF test base > -- > > Key: SPARK-28287 > URL: https://issues.apache.org/jira/browse/SPARK-28287 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Vinod KC >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28287) Convert and port 'udaf.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28287: Assignee: Vinod KC > Convert and port 'udaf.sql' into UDF test base > -- > > Key: SPARK-28287 > URL: https://issues.apache.org/jira/browse/SPARK-28287 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Vinod KC >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28442) Potentially persist data without leadership
[ https://issues.apache.org/jira/browse/SPARK-28442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] TisonKun updated SPARK-28442: - Description: Spark Master could potentially persist data via {{ZooKeeperPersistenceEngine}} even if it is not the leader. See the execution order below. 1. master-1 became the leader. 2. master-1 received a message and wanted to addApplication (or addWorker). 3. master-1 got stuck because of a full GC. 4. master-1 lost leadership on ZK. master-2 became the leader. 5. master-1 received the {{RevokedLeadership}} message, but the message was pending. 6. master-1 finished persisting data. was: Spark Master could potentially persist data via {{ZooKeeperPersistenceEngine}} even if it is not the leader. See the execution order below. 1. master-1 became the leader. 2. master-1 received a message and wanted to addApplication (or addWorker). 3. master-1 got stuck because of a full GC. 4. master-1 lost leadership on ZK. master-2 became the leader. master-1 received the {{RevokedLeadership}} message, but the message was pending. 5. master-1 finished persisting data. > Potentially persist data without leadership > --- > > Key: SPARK-28442 > URL: https://issues.apache.org/jira/browse/SPARK-28442 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.4.3 >Reporter: TisonKun >Priority: Major > > Spark Master could potentially persist data via > {{ZooKeeperPersistenceEngine}} even if it is not the leader. See the > execution order below. > 1. master-1 became the leader. > 2. master-1 received a message and wanted to addApplication (or addWorker). > 3. master-1 got stuck because of a full GC. > 4. master-1 lost leadership on ZK. master-2 became the leader. > 5. master-1 received the {{RevokedLeadership}} message, but the message was > pending. > 6. master-1 finished persisting data. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28442) Potentially persist data without leadership
[ https://issues.apache.org/jira/browse/SPARK-28442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] TisonKun updated SPARK-28442: - Description: Spark Master could potentially persist data via {{ZooKeeperPersistenceEngine}} even if it is not the leader. See the execution order below. 1. master-1 became the leader. 2. master-1 received a message and wanted to addApplication (or addWorker). 3. master-1 got stuck because of a full GC. 4. master-1 lost leadership on ZK. master-2 became the leader. master-1 received the {{RevokedLeadership}} message, but the message was pending. 5. master-1 finished persisting data. was: Spark Master could potentially persist data via {{ZooKeeperPersistenceEngine}} even if it is not the leader. See the execution order below. 1. master-1 became the leader. 2. master-1 received a message and wanted to addApplication (or addWorker). 3. master-1 got stuck because of a full GC. 4. master-1 lost leadership on ZK, received the {{RevokedLeadership}} message, but it was pending. 5. master-1 finished persisting data. > Potentially persist data without leadership > --- > > Key: SPARK-28442 > URL: https://issues.apache.org/jira/browse/SPARK-28442 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.4.3 >Reporter: TisonKun >Priority: Major > > Spark Master could potentially persist data via > {{ZooKeeperPersistenceEngine}} even if it is not the leader. See the > execution order below. > 1. master-1 became the leader. > 2. master-1 received a message and wanted to addApplication (or addWorker). > 3. master-1 got stuck because of a full GC. > 4. master-1 lost leadership on ZK. master-2 became the leader. master-1 > received the {{RevokedLeadership}} message, but the message was pending. > 5. master-1 finished persisting data. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28442) Potentially persist data without leadership
[ https://issues.apache.org/jira/browse/SPARK-28442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] TisonKun updated SPARK-28442: - Component/s: (was: Documentation) Deploy > Potentially persist data without leadership > --- > > Key: SPARK-28442 > URL: https://issues.apache.org/jira/browse/SPARK-28442 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.4.3 >Reporter: TisonKun >Priority: Major > > Spark Master could potentially persist data via > {{ZooKeeperPersistenceEngine}} even if it is not the leader. See the > execution order below. > 1. master-1 became the leader. > 2. master-1 received a message and wanted to addApplication (or addWorker). > 3. master-1 got stuck because of a full GC. > 4. master-1 lost leadership on ZK, received the {{RevokedLeadership}} message, but > it was pending. > 5. master-1 finished persisting data. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28442) Potentially persist data without leadership
TisonKun created SPARK-28442: Summary: Potentially persist data without leadership Key: SPARK-28442 URL: https://issues.apache.org/jira/browse/SPARK-28442 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 2.4.3 Reporter: TisonKun Spark Master could potentially persist data via {{ZooKeeperPersistenceEngine}} even if it is not the leader. See the execution order below. 1. master-1 became the leader. 2. master-1 received a message and wanted to addApplication (or addWorker). 3. master-1 got stuck because of a full GC. 4. master-1 lost leadership on ZK, received the {{RevokedLeadership}} message, but it was pending. 5. master-1 finished persisting data. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
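One mitigation sketch, with assumed names for the Master internals ({{state}}, {{persistenceEngine}}, {{RecoveryState}}): re-check leadership immediately before writing. Note this only narrows the window; the race above can still occur because the check itself may be stale after a GC pause.
{code:scala}
def persistApplication(app: ApplicationInfo): Unit = {
  if (state == RecoveryState.ALIVE) {  // this master still believes it is the leader
    persistenceEngine.addApplication(app)
  } else {
    logWarning(s"Skipping persistence for ${app.id}: leadership was revoked")
  }
}
{code}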
[jira] [Created] (SPARK-28441) udf(max(udf(column))) throws java.lang.UnsupportedOperationException: Cannot evaluate expression: udf(null)
Huaxin Gao created SPARK-28441: -- Summary: udf(max(udf(column))) throws java.lang.UnsupportedOperationException: Cannot evaluate expression: udf(null) Key: SPARK-28441 URL: https://issues.apache.org/jira/browse/SPARK-28441 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Huaxin Gao I found this when doing https://issues.apache.org/jira/browse/SPARK-28277 {code:java} >>> from pyspark.sql.functions import pandas_udf, PandasUDFType >>> @pandas_udf("string", PandasUDFType.SCALAR) ... def noop(x): ... return x.apply(str) ... >>> spark.udf.register("udf", noop) >>> spark.sql("CREATE OR REPLACE TEMPORARY VIEW t1 as select * from values (\"one\", 1), (\"two\", 2), (\"three\", 3), (\"one\", NULL) as t1(k, v)") DataFrame[] >>> spark.sql("CREATE OR REPLACE TEMPORARY VIEW t2 as select * from values (\"one\", 1), (\"two\", 22), (\"one\", 5), (\"one\", NULL), (NULL, 5) as t2(k, v)") DataFrame[] >>> spark.sql("SELECT t1.k FROM t1 WHERE t1.v <= (SELECT udf(max(udf(t2.v))) FROM t2 WHERE udf(t2.k) = udf(t1.k))").show() py4j.protocol.Py4JJavaError: An error occurred while calling o65.showString. : java.lang.UnsupportedOperationException: Cannot evaluate expression: udf(null) at org.apache.spark.sql.catalyst.expressions.Unevaluable.eval(Expression.scala:296) at org.apache.spark.sql.catalyst.expressions.Unevaluable.eval$(Expression.scala:295) at org.apache.spark.sql.catalyst.expressions.PythonUDF.eval(PythonUDF.scala:52) {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28416) Use java.time API in timestampAddInterval
[ https://issues.apache.org/jira/browse/SPARK-28416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-28416. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25173 [https://github.com/apache/spark/pull/25173] > Use java.time API in timestampAddInterval > - > > Key: SPARK-28416 > URL: https://issues.apache.org/jira/browse/SPARK-28416 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.3 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Implement the timestampAddInterval method of DateTimeUtils by using the > plusMonths() and plus() methods of ZonedDateTime from the Java 8 time API. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
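A sketch of the described approach (simplified signature, not the exact DateTimeUtils code): apply the months component of the interval via plusMonths() and the sub-month remainder via plus(), in the session time zone.
{code:scala}
import java.time.{Instant, ZoneId, ZonedDateTime}
import java.time.temporal.ChronoUnit

def timestampAddInterval(ts: Instant, months: Int, micros: Long, zone: ZoneId): Instant =
  ZonedDateTime.ofInstant(ts, zone)
    .plusMonths(months)               // calendar-aware month arithmetic
    .plus(micros, ChronoUnit.MICROS)  // exact sub-month duration
    .toInstant
{code}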
[jira] [Assigned] (SPARK-28416) Use java.time API in timestampAddInterval
[ https://issues.apache.org/jira/browse/SPARK-28416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-28416: - Assignee: Maxim Gekk > Use java.time API in timestampAddInterval > - > > Key: SPARK-28416 > URL: https://issues.apache.org/jira/browse/SPARK-28416 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.3 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > Implement the timestampAddInterval method of DateTimeUtils by using the > plusMonths() and plus() methods of ZonedDateTime from the Java 8 time API. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28312) Add numeric.sql
[ https://issues.apache.org/jira/browse/SPARK-28312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28312: - Assignee: Yuming Wang > Add numeric.sql > --- > > Key: SPARK-28312 > URL: https://issues.apache.org/jira/browse/SPARK-28312 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > In this ticket, we plan to add the regression test cases of > https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/numeric.sql. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28312) Add numeric.sql
[ https://issues.apache.org/jira/browse/SPARK-28312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28312. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25092 [https://github.com/apache/spark/pull/25092] > Add numeric.sql > --- > > Key: SPARK-28312 > URL: https://issues.apache.org/jira/browse/SPARK-28312 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > In this ticket, we plan to add the regression test cases of > https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/numeric.sql. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28432) Date/Time Functions: make_date/make_timestamp
[ https://issues.apache.org/jira/browse/SPARK-28432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888324#comment-16888324 ] Maxim Gekk commented on SPARK-28432: I am working on this > Date/Time Functions: make_date/make_timestamp > - > > Key: SPARK-28432 > URL: https://issues.apache.org/jira/browse/SPARK-28432 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > ||Function||Return Type||Description||Example||Result|| > |{{make_date(_year_ }}{{int}}{{, _month_ }}{{int}}{{, _day_ > }}{{int}}{{)}}|{{date}}|Create date from year, month and day > fields|{{make_date(2013, 7, 15)}}|{{2013-07-15}}| > |{{make_timestamp(_year_ }}{{int}}{{, _month_ }}{{int}}{{, _day_ }}{{int}}{{, > _hour_ }}{{int}}{{, _min_ }}{{int}}{{, _sec_}}{{double > precision}}{{)}}|{{timestamp}}|Create timestamp from year, month, day, hour, > minute and seconds fields|{{make_timestamp(2013, 7, 15, 8, 15, > 23.5)}}|{{2013-07-15 08:15:23.5}}| > https://www.postgresql.org/docs/11/functions-datetime.html -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
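If these are implemented with the PostgreSQL semantics above, usage would look like this sketch (the exact output formatting is an assumption):
{code:scala}
// Expected behavior per the PostgreSQL reference table above.
spark.sql("SELECT make_date(2013, 7, 15)").show()                    // 2013-07-15
spark.sql("SELECT make_timestamp(2013, 7, 15, 8, 15, 23.5)").show()  // 2013-07-15 08:15:23.5
{code}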
[jira] [Resolved] (SPARK-28430) Some stage table rows render wrong number of columns if tasks are missing metrics
[ https://issues.apache.org/jira/browse/SPARK-28430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28430. --- Resolution: Fixed Fix Version/s: 2.4.4 2.3.4 3.0.0 Issue resolved by pull request 25183 [https://github.com/apache/spark/pull/25183] > Some stage table rows render wrong number of columns if tasks are missing > metrics > -- > > Key: SPARK-28430 > URL: https://issues.apache.org/jira/browse/SPARK-28430 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0, 3.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Fix For: 3.0.0, 2.3.4, 2.4.4 > > Attachments: ui-screenshot.png > > > The Spark UI's stages table renders too few columns for some tasks if a > subset of the tasks are missing their metrics. This is due to an > inconsistency in how we render certain columns: some columns gracefully > handle this case, but others do not. See attached screenshot below > !ui-screenshot.png! -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28439) pyspark.sql.functions.array_repeat should support Column as count argument
[ https://issues.apache.org/jira/browse/SPARK-28439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28439: - Assignee: Maciej Szymkiewicz > pyspark.sql.functions.array_repeat should support Column as count argument > -- > > Key: SPARK-28439 > URL: https://issues.apache.org/jira/browse/SPARK-28439 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Minor > > In Scala, Spark supports > (https://github.com/apache/spark/blob/c3e32bf06c35ba2580d46150923abfa795b4446a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3777) > > {code:java} > (Column, Column) => Column > {code} > variant of array_repeat, however PySpark doesn't > {code:java} > >>> import pyspark > >>> from pyspark.sql import functions as f > >>> pyspark.__version__ > '3.0.0.dev0' > > >>> f.array_repeat(f.col("foo"), f.col("bar")) > ... > TypeError: Column is not iterable > {code} > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28439) pyspark.sql.functions.array_repeat should support Column as count argument
[ https://issues.apache.org/jira/browse/SPARK-28439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28439. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25193 [https://github.com/apache/spark/pull/25193] > pyspark.sql.functions.array_repeat should support Column as count argument > -- > > Key: SPARK-28439 > URL: https://issues.apache.org/jira/browse/SPARK-28439 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Minor > Fix For: 3.0.0 > > > In Scala, Spark supports > (https://github.com/apache/spark/blob/c3e32bf06c35ba2580d46150923abfa795b4446a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3777) > > {code:java} > (Column, Column) => Column > {code} > variant of array_repeat, however PySpark doesn't > {code:java} > >>> import pyspark > >>> from pyspark.sql import functions as f > >>> pyspark.__version__ > '3.0.0.dev0' > > >>> f.array_repeat(f.col("foo"), f.col("bar")) > ... > TypeError: Column is not iterable > {code} > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
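As the linked functions.scala shows, the Scala API already accepts a Column count; a quick sketch of the Scala call, plus the usual PySpark workaround of routing through SQL until the Python signature is fixed:
{code:scala}
import org.apache.spark.sql.functions.{array_repeat, col}

// array_repeat requires an integer count column, hence the cast.
val df = spark.range(3).toDF("foo").withColumn("bar", (col("foo") + 1).cast("int"))
df.select(array_repeat(col("foo"), col("bar"))).show()
// In PySpark, an equivalent workaround is: df.select(expr("array_repeat(foo, bar)"))
{code}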
[jira] [Commented] (SPARK-28427) Support more Postgres JSON functions
[ https://issues.apache.org/jira/browse/SPARK-28427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888258#comment-16888258 ] Maxim Gekk commented on SPARK-28427: Probably, we can switch the flag spark.sql.legacy.sizeOfNull, or even remove it in Spark 3.0? > Support more Postgres JSON functions > > > Key: SPARK-28427 > URL: https://issues.apache.org/jira/browse/SPARK-28427 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Priority: Major > > Postgres features a number of JSON functions that are missing in Spark: > https://www.postgresql.org/docs/9.3/functions-json.html > Redshift's JSON functions > (https://docs.aws.amazon.com/redshift/latest/dg/json-functions.html) have > partial overlap with the Postgres list. > Some of these functions can be expressed in terms of compositions of existing > Spark functions. For example, I think that {{json_array_length}} can be > expressed with {{cardinality}} and {{from_json}}, but there's a caveat > related to legacy Hive compatibility (see the demo notebook at > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5796212617691211/45530874214710/4901752417050771/latest.html > for more details). > I'm filing this ticket so that we can triage the list of Postgres JSON > features and decide which ones make sense to support in Spark. After we've > done that, we can create individual tickets for specific functions and > features. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
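For instance, the {{json_array_length}} composition mentioned above can be sketched as follows; note the caveat from the description that with the legacy setting {{spark.sql.legacy.sizeOfNull=true}}, the size/cardinality of a null array is -1 rather than null:
{code:scala}
// json_array_length('[1, 2, 3]') expressed with existing Spark functions.
spark.sql("SELECT cardinality(from_json('[1, 2, 3]', 'array<int>'))").show()  // 3
{code}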
[jira] [Commented] (SPARK-28382) Array Functions: unnest
[ https://issues.apache.org/jira/browse/SPARK-28382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888248#comment-16888248 ] Maxim Gekk commented on SPARK-28382: Is it just explode()? > Array Functions: unnest > --- > > Key: SPARK-28382 > URL: https://issues.apache.org/jira/browse/SPARK-28382 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > ||Function||Return Type||Description||Example||Result|| > |{{unnest}}({{anyarray}})|set of anyelement|expand an array to a set of > rows|unnest(ARRAY[1,2])|1 > 2 > (2 rows)| > > https://www.postgresql.org/docs/11/functions-array.html > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
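For the common one-dimensional case, that seems right — a sketch below (though Postgres {{unnest}} also fully flattens multidimensional arrays, which {{explode}} does not):
{code:scala}
// Spark's closest built-in to Postgres unnest(ARRAY[1,2]).
spark.sql("SELECT explode(array(1, 2))").show()  // two rows: 1 and 2
{code}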
[jira] [Commented] (SPARK-14543) SQL/Hive insertInto has unexpected results
[ https://issues.apache.org/jira/browse/SPARK-14543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888246#comment-16888246 ] Alexander Tronchin-James commented on SPARK-14543: -- OK, thanks! > SQL/Hive insertInto has unexpected results > -- > > Key: SPARK-14543 > URL: https://issues.apache.org/jira/browse/SPARK-14543 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > > *Updated description* > There should be an option to match input data to output columns by name. The > API allows operations on tables, which hide the column resolution problem. > It's easy to copy from one table to another without listing the columns, and > in the API it is common to work with columns by name rather than by position. > I think the API should add a way to match columns by name, which is closer to > what users expect. I propose adding something like this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} > *Original description* > The Hive write path adds a pre-insertion cast (projection) to reconcile > incoming data columns with the outgoing table schema. Columns are matched by > position and casts are inserted to reconcile the two column schemas. > When columns aren't correctly aligned, this causes unexpected results. I ran > into this by not using a correct {{partitionBy}} call (addressed by > SPARK-14459), which caused an error message that an int could not be cast to > an array. However, if the columns are vaguely compatible, for example string > and float, then no error or warning is produced and data is written to the > wrong columns using unexpected casts (string -> bigint -> float). > A real-world use case that will hit this is when a table definition changes > by adding a column in the middle of a table. Spark SQL statements that copied > from that table to a destination table will then map the columns differently > but insert casts that mask the problem. The last column's data will be > dropped without a reliable warning for the user. > This highlights a few problems: > * Too many or too few incoming data columns should cause an AnalysisException > to be thrown > * Only "safe" casts should be inserted automatically, like int -> long, using > UpCast > * Pre-insertion casts currently ignore extra columns by using zip > * The pre-insertion cast logic differs between Hive's MetastoreRelation and > LogicalRelation > Also, I think there should be an option to match input data to output columns > by name. The API allows operations on tables, which hide the column > resolution problem. It's easy to copy from one table to another without > listing the columns, and in the API it is common to work with columns by name > rather than by position. I think the API should add a way to match columns by > name, which is closer to what users expect. I propose adding something like > this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28440) Use TestingUtils to compare floating point values
[ https://issues.apache.org/jira/browse/SPARK-28440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28440: -- Component/s: (was: ML) MLlib > Use TestingUtils to compare floating point values > - > > Key: SPARK-28440 > URL: https://issues.apache.org/jira/browse/SPARK-28440 > Project: Spark > Issue Type: Improvement > Components: MLlib, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28440) Use TestingUtils to compare floating point values
Dongjoon Hyun created SPARK-28440: - Summary: Use TestingUtils to compare floating point values Key: SPARK-28440 URL: https://issues.apache.org/jira/browse/SPARK-28440 Project: Spark Issue Type: Improvement Components: ML, Tests Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14543) SQL/Hive insertInto has unexpected results
[ https://issues.apache.org/jira/browse/SPARK-14543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888222#comment-16888222 ] Ryan Blue commented on SPARK-14543: --- {{byName}} was never added to Apache Spark. The change was rejected, so it is only available in Netflix's Spark branch. I resolved this with "later" because we are including by-name resolution in the DSv2 work. The replacement for {{DataFrameWriter}} will default to name-based resolution. > SQL/Hive insertInto has unexpected results > -- > > Key: SPARK-14543 > URL: https://issues.apache.org/jira/browse/SPARK-14543 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > > *Updated description* > There should be an option to match input data to output columns by name. The > API allows operations on tables, which hide the column resolution problem. > It's easy to copy from one table to another without listing the columns, and > in the API it is common to work with columns by name rather than by position. > I think the API should add a way to match columns by name, which is closer to > what users expect. I propose adding something like this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} > *Original description* > The Hive write path adds a pre-insertion cast (projection) to reconcile > incoming data columns with the outgoing table schema. Columns are matched by > position and casts are inserted to reconcile the two column schemas. > When columns aren't correctly aligned, this causes unexpected results. I ran > into this by not using a correct {{partitionBy}} call (addressed by > SPARK-14459), which caused an error message that an int could not be cast to > an array. However, if the columns are vaguely compatible, for example string > and float, then no error or warning is produced and data is written to the > wrong columns using unexpected casts (string -> bigint -> float). > A real-world use case that will hit this is when a table definition changes > by adding a column in the middle of a table. Spark SQL statements that copied > from that table to a destination table will then map the columns differently > but insert casts that mask the problem. The last column's data will be > dropped without a reliable warning for the user. > This highlights a few problems: > * Too many or too few incoming data columns should cause an AnalysisException > to be thrown > * Only "safe" casts should be inserted automatically, like int -> long, using > UpCast > * Pre-insertion casts currently ignore extra columns by using zip > * The pre-insertion cast logic differs between Hive's MetastoreRelation and > LogicalRelation > Also, I think there should be an option to match input data to output columns > by name. The API allows operations on tables, which hide the column > resolution problem. It's easy to copy from one table to another without > listing the columns, and in the API it is common to work with columns by name > rather than by position. I think the API should add a way to match columns by > name, which is closer to what users expect. I propose adding something like > this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
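Until by-name resolution is available, a common user-side workaround is to reorder the input columns into the target table's column order before the positional {{insertInto}} — a sketch using the src/dst tables from the proposal above:
{code:scala}
import org.apache.spark.sql.functions.col

// insertInto matches columns by position, so select dst's columns in dst's order first.
val dstCols = spark.table("dst").columns.map(col)
spark.table("src").select(dstCols: _*).write.insertInto("dst")
{code}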
[jira] [Commented] (SPARK-14948) Exception when joining DataFrames derived form the same DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888146#comment-16888146 ] Wenchen Fan commented on SPARK-14948: - There is an ongoing effort to detect this case and fail instead of fixing it: https://github.com/apache/spark/pull/25107 > Exception when joining DataFrames derived form the same DataFrame > - > > Key: SPARK-14948 > URL: https://issues.apache.org/jira/browse/SPARK-14948 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Saurabh Santhosh >Priority: Major > > h2. Spark Analyser is throwing the following exception in a specific scenario > : > h2. Exception : > org.apache.spark.sql.AnalysisException: resolved attribute(s) F1#3 missing > from asd#5,F2#4,F1#6,F2#7 in operator !Project [asd#5,F1#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > h2. Code : > {code:title=SparkClient.java|borderStyle=solid} > StructField[] fields = new StructField[2]; > fields[0] = new StructField("F1", DataTypes.StringType, true, > Metadata.empty()); > fields[1] = new StructField("F2", DataTypes.StringType, true, > Metadata.empty()); > JavaRDD rdd = > > sparkClient.getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("a", > "b"))); > DataFrame df = sparkClient.getSparkHiveContext().createDataFrame(rdd, new > StructType(fields)); > sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t1"); > DataFrame aliasedDf = sparkClient.getSparkHiveContext().sql("select F1 as > asd, F2 from t1"); > sparkClient.getSparkHiveContext().registerDataFrameAsTable(aliasedDf, > "t2"); > sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t3"); > > DataFrame join = aliasedDf.join(df, > aliasedDf.col("F2").equalTo(df.col("F2")), "inner"); > DataFrame select = join.select(aliasedDf.col("asd"), df.col("F1")); > select.collect(); > {code} > h2. Observations : > * This issue is related to the Data Type of Fields of the initial Data > Frame.(If the Data Type is not String, it will work.) > * It works fine if the data frame is registered as a temporary table and an > sql (select a.asd,b.F1 from t2 a inner join t3 b on a.F2=b.F2) is written. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
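A frequently used workaround for this resolution failure is to alias both sides of the join so the shared lineage yields distinct attribute references — a sketch in Scala (the report's code is Java), reusing {{aliasedDf}} and {{df}} from the reproduction above:
{code:scala}
import org.apache.spark.sql.functions.col

val left = aliasedDf.alias("l")
val right = df.alias("r")
left.join(right, col("l.F2") === col("r.F2"), "inner")
  .select(col("l.asd"), col("r.F1"))
  .collect()
{code}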
[jira] [Commented] (SPARK-27416) UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines have different Oops size
[ https://issues.apache.org/jira/browse/SPARK-27416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888130#comment-16888130 ] Wenchen Fan commented on SPARK-27416: - yea let's backport it! > UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines > have different Oops size > > > Key: SPARK-27416 > URL: https://issues.apache.org/jira/browse/SPARK-27416 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1 >Reporter: peng bo >Assignee: peng bo >Priority: Major > Fix For: 3.0.0 > > > Actually this is a follow-up for > https://issues.apache.org/jira/browse/SPARK-27406 and > https://issues.apache.org/jira/browse/SPARK-10914. > This issue is to fix the UnsafeMapData & UnsafeArrayData Kryo serialization > issue when two machines have different Oops sizes. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28388) Port select_implicit.sql
[ https://issues.apache.org/jira/browse/SPARK-28388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28388. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25152 [https://github.com/apache/spark/pull/25152] > Port select_implicit.sql > > > Key: SPARK-28388 > URL: https://issues.apache.org/jira/browse/SPARK-28388 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > In this ticket, we plan to add the regression test cases of > https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/select_implicit.sql. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28388) Port select_implicit.sql
[ https://issues.apache.org/jira/browse/SPARK-28388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28388: - Assignee: Yuming Wang > Port select_implicit.sql > > > Key: SPARK-28388 > URL: https://issues.apache.org/jira/browse/SPARK-28388 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > In this ticket, we plan to add the regression test cases of > https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/select_implicit.sql. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28138) Add timestamp.sql
[ https://issues.apache.org/jira/browse/SPARK-28138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28138: - Assignee: Yuming Wang > Add timestamp.sql > - > > Key: SPARK-28138 > URL: https://issues.apache.org/jira/browse/SPARK-28138 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > In this ticket, we plan to add the regression test cases of > [https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/timestamp.sql]. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28138) Add timestamp.sql
[ https://issues.apache.org/jira/browse/SPARK-28138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28138. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25181 [https://github.com/apache/spark/pull/25181] > Add timestamp.sql > - > > Key: SPARK-28138 > URL: https://issues.apache.org/jira/browse/SPARK-28138 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > In this ticket, we plan to add the regression test cases of > [https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/timestamp.sql]. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28286) Convert and port 'pivot.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28286. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25122 [https://github.com/apache/spark/pull/25122] > Convert and port 'pivot.sql' into UDF test base > --- > > Key: SPARK-28286 > URL: https://issues.apache.org/jira/browse/SPARK-28286 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Chitral Verma >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28286) Convert and port 'pivot.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28286: Assignee: Chitral Verma > Convert and port 'pivot.sql' into UDF test base > --- > > Key: SPARK-28286 > URL: https://issues.apache.org/jira/browse/SPARK-28286 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Chitral Verma >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28438) [SQL] Ignore metadata (comments) differences when comparing a datasource's schema and a user-specified schema
[ https://issues.apache.org/jira/browse/SPARK-28438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ShuMing Li updated SPARK-28438: --- Description: When users register a datasource table with Spark, Spark currently requires complete schema equality between the datasource's original schema and the user-specified schema. However, the datasource's original schema may differ slightly from the user-specified one: the difference may be only a column's comment or other metadata. Can we ignore a column's comment and other metadata when comparing?
{code:java}
// DataSource.scala
case (dataSource: RelationProvider, Some(schema)) =>
  val baseRelation =
    dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
  if (baseRelation.schema != schema) {
    throw new AnalysisException(s"$className does not allow user-specified schemas, " +
      s"source schema: ${baseRelation.schema}, user-specific schema: ${schema}")
  }

// StructType.scala
override def equals(that: Any): Boolean = {
  that match {
    case StructType(otherFields) =>
      java.util.Arrays.equals(
        fields.asInstanceOf[Array[AnyRef]], otherFields.asInstanceOf[Array[AnyRef]])
    case _ => false
  }
}
{code}
> [SQL] Ignore metadata (comments) differences when comparing a datasource's > schema and a user-specified schema > > > Key: SPARK-28438 > URL: https://issues.apache.org/jira/browse/SPARK-28438 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: ShuMing Li >Priority: Minor > > When users register a datasource table with Spark, Spark currently requires > complete schema equality between the datasource's original schema and the > user-specified schema. However, the datasource's original schema may differ > slightly from the user-specified one: the difference may be only a column's > comment or other metadata. Can we ignore a column's comment and other metadata > when comparing? > {code:java} > // DataSource.scala > case (dataSource: RelationProvider, Some(schema)) => > val baseRelation = > dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions) > if (baseRelation.schema != schema) { > throw new AnalysisException(s"$className does not allow user-specified schemas, " + > s"source schema: ${baseRelation.schema}, user-specific schema: ${schema}") > } > // StructType.scala > override def equals(that: Any): Boolean = { > that match { > case StructType(otherFields) => > java.util.Arrays.equals( > fields.asInstanceOf[Array[AnyRef]], otherFields.asInstanceOf[Array[AnyRef]]) > case _ => false > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28438) [SQL] Ignore metadata (comments) differences when comparing a datasource's schema and a user-specified schema
[ https://issues.apache.org/jira/browse/SPARK-28438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ShuMing Li updated SPARK-28438: --- Description: When users register a datasource table with Spark, Spark currently requires complete schema equality between the datasource's original schema and the user-specified schema. However, the datasource's original schema may differ slightly from the user-specified one: the difference may be only a column's comment or other metadata. Can we ignore a column's comment and other metadata when comparing?
{code:java}
// DataSource.scala
case (dataSource: RelationProvider, Some(schema)) =>
  val baseRelation =
    dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
  if (baseRelation.schema != schema) {
    throw new AnalysisException(s"$className does not allow user-specified schemas, " +
      s"source schema: ${baseRelation.schema}, user-specific schema: ${schema}")
  }

// StructType.scala
override def equals(that: Any): Boolean = {
  that match {
    case StructType(otherFields) =>
      java.util.Arrays.equals(
        fields.asInstanceOf[Array[AnyRef]], otherFields.asInstanceOf[Array[AnyRef]])
    case _ => false
  }
}
{code}
was: (same description, with the code snippets left unfenced) > [SQL] Ignore metadata (comments) differences when comparing a datasource's > schema and a user-specified schema > > > Key: SPARK-28438 > URL: https://issues.apache.org/jira/browse/SPARK-28438 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: ShuMing Li >Priority: Minor > > When users register a datasource table with Spark, Spark currently requires > complete schema equality between the datasource's original schema and the > user-specified schema. However, the datasource's original schema may differ > slightly from the user-specified one: the difference may be only a column's > comment or other metadata. Can we ignore a column's comment and other metadata > when comparing? 
> {code:java} > // DataSource.scala > case (dataSource: RelationProvider, Some(schema)) => > val baseRelation = > dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions) > if (baseRelation.schema != schema) { > throw new AnalysisException(s"$className does not allow user-specified > schemas, " + > s"source schema: ${baseRelation.schema}, user-specific schema: ${schema}") > } > // StructType.scala > override def equals(that: Any): Boolean = { > that match > { case StructType(otherFields) => java.util.Arrays.equals( > fields.asInstanceOf[Array[AnyRef]], otherFields.asInstanceOf[Array[AnyRef]]) > case _ => false } > } > {code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
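For readers following the SPARK-28438 proposal above, here is a minimal sketch (my illustration, not the patch under review; SchemaCompare is a hypothetical name) of an equality check that strips per-column metadata, which is where column comments live, before comparing. Note it is shallow: fields of nested structs keep their metadata, so a complete solution would need to recurse.
{code:scala}
import org.apache.spark.sql.types.{Metadata, StructType}

object SchemaCompare {
  // Compare two schemas while ignoring each top-level column's metadata
  // (comments are stored in StructField.metadata).
  def sameIgnoringMetadata(a: StructType, b: StructType): Boolean = {
    def strip(s: StructType): StructType =
      StructType(s.fields.map(_.copy(metadata = Metadata.empty)))
    strip(a) == strip(b)
  }
}
{code}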
[jira] [Updated] (SPARK-28439) pyspark.sql.functions.array_repeat should support Column as count argument
[ https://issues.apache.org/jira/browse/SPARK-28439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-28439: --- Description: In Scala, Spark supports (https://github.com/apache/spark/blob/c3e32bf06c35ba2580d46150923abfa795b4446a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3777) {code:java} (Column, Column) => Column {code} variant of array_repeat, however PySpark doesn't {code:java} >>> import pyspark >>> from pyspark.sql import functions as f >>> pyspark.__version__ '3.0.0.dev0' >>> f.array_repeat(f.col("foo"), f.col("bar")) ... TypeError: Column is not iterable {code} was: In Scala Spark supports {code:java} (Column, Column) => Column {code} variant of array_repeat, however PySpark doesn't {code:java} >>> import pyspark >>> from pyspark.sql import functions as f >>> pyspark.__version__ '3.0.0.dev0' >>> f.array_repeat(f.col("foo"), f.col("bar")) ... TypeError: Column is not iterable {code} > pyspark.sql.functions.array_repeat should support Column as count argument > -- > > Key: SPARK-28439 > URL: https://issues.apache.org/jira/browse/SPARK-28439 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > In Scala, Spark supports > (https://github.com/apache/spark/blob/c3e32bf06c35ba2580d46150923abfa795b4446a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3777) > > {code:java} > (Column, Column) => Column > {code} > variant of array_repeat, however PySpark doesn't > {code:java} > >>> import pyspark > >>> from pyspark.sql import functions as f > >>> pyspark.__version__ > '3.0.0.dev0' > > >>> f.array_repeat(f.col("foo"), f.col("bar")) > ... > TypeError: Column is not iterable > {code} > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28439) pyspark.sql.functions.array_repeat should support Column as count argument
[ https://issues.apache.org/jira/browse/SPARK-28439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-28439: --- Description: In Scala Spark supports {code:java} (Column, Column) => Column {code} variant of array_repeat, however PySpark doesn't {code:java} >>> import pyspark >>> from pyspark.sql import functions as f >>> pyspark.__version__ '3.0.0.dev0' >>> f.array_repeat(f.col("foo"), f.col("bar")) ... TypeError: Column is not iterable {code} was: In Scala Spark supports {code:java} (Column, Column) => Column {code} variant of array_repeat, however PySpark doesn't {code:java} >>> import pyspark >>> from pyspark.sql import functions as f >>> pyspark.__version__ '3.0.0.dev0' >>> f.array_repeat(f.col("foo"), f.col("bar")) ... TypeError: Column is not iterable {code} > pyspark.sql.functions.array_repeat should support Column as count argument > -- > > Key: SPARK-28439 > URL: https://issues.apache.org/jira/browse/SPARK-28439 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > In Scala Spark supports > > {code:java} > (Column, Column) => Column > {code} > variant of array_repeat, however PySpark doesn't > {code:java} > >>> import pyspark > >>> from pyspark.sql import functions as f > >>> pyspark.__version__ > '3.0.0.dev0' > > >>> f.array_repeat(f.col("foo"), f.col("bar")) > ... > TypeError: Column is not iterable > {code} > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28439) pyspark.sql.functions.array_repeat should support Column as count argument
[ https://issues.apache.org/jira/browse/SPARK-28439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-28439: --- Description: In Scala Spark supports {code:java} (Column, Column) => Column {code} variant of array_repeat, however PySpark doesn't {code:java} >>> import pyspark >>> from pyspark.sql import functions as f >>> pyspark.__version__ '3.0.0.dev0' >>> f.array_repeat(f.col("foo"), f.col("bar")) ... TypeError: Column is not iterable {code} was: (same description; the earlier paste of the REPL transcript contained duplicated '>>>' prompt lines) > pyspark.sql.functions.array_repeat should support Column as count argument > -- > > Key: SPARK-28439 > URL: https://issues.apache.org/jira/browse/SPARK-28439 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > In Scala Spark supports > > {code:java} > (Column, Column) => Column > {code} > variant of array_repeat, however PySpark doesn't > {code:java} > >>> import pyspark > >>> from pyspark.sql import functions as f > >>> pyspark.__version__ > '3.0.0.dev0' > > >>> f.array_repeat(f.col("foo"), f.col("bar")) > ... > TypeError: Column is not iterable > {code} > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28439) pyspark.sql.functions.array_repeat should support Column as count argument
[ https://issues.apache.org/jira/browse/SPARK-28439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-28439: --- Description: In Scala Spark supports {code:java} (Column, Column) => Column {code} variant of array_repeat, however PySpark doesn't {code:java} >>> import pyspark >>> from pyspark.sql import functions as f >>> pyspark.__version__ '3.0.0.dev0' >>> f.array_repeat(f.col("foo"), f.col("bar")) Traceback (most recent call last): ... TypeError: Column is not iterable {code} was: (same description, with additional duplicated '>>>' prompt lines in the pasted REPL transcript) > pyspark.sql.functions.array_repeat should support Column as count argument > -- > > Key: SPARK-28439 > URL: https://issues.apache.org/jira/browse/SPARK-28439 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > In Scala Spark supports > > {code:java} > (Column, Column) => Column > {code} > variant of array_repeat, however PySpark doesn't > {code:java} > >>> import pyspark > >>> from pyspark.sql import functions as f > >>> pyspark.__version__ > '3.0.0.dev0' > >>> f.array_repeat(f.col("foo"), f.col("bar")) > Traceback (most recent call last): > ... > TypeError: Column is not iterable > {code} > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28439) pyspark.sql.functions.array_repeat should support Column as count argument
Maciej Szymkiewicz created SPARK-28439: -- Summary: pyspark.sql.functions.array_repeat should support Column as count argument Key: SPARK-28439 URL: https://issues.apache.org/jira/browse/SPARK-28439 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 2.4.0, 3.0.0 Reporter: Maciej Szymkiewicz In Scala Spark supports {code:java} (Column, Column) => Column {code} variant of array_repeat, however PySpark doesn't
{code:java}
>>> import pyspark
>>> from pyspark.sql import functions as f
>>> pyspark.__version__
'3.0.0.dev0'
>>> f.array_repeat(f.col("foo"), f.col("bar"))
Traceback (most recent call last):
...
TypeError: Column is not iterable
{code}
-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
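As the report notes, the two-Column overload already exists on the Scala side; the sketch below (hypothetical ArrayRepeatDemo object, local SparkSession assumed) shows it working. Until PySpark exposes the same signature, one workaround, my suggestion rather than anything from the ticket, is to route through SQL with f.expr("array_repeat(foo, bar)").
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.array_repeat

object ArrayRepeatDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("array-repeat").getOrCreate()
    import spark.implicits._

    // Repeat the value in `foo` a per-row number of times taken from `bar`.
    val df = Seq((1, 2), (7, 3)).toDF("foo", "bar")
    df.select(array_repeat($"foo", $"bar").as("repeated")).show(false)
    // [1, 1]
    // [7, 7, 7]
    spark.stop()
  }
}
{code}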
[jira] [Created] (SPARK-28438) [SQL] Ignore metadata (comments) differences when comparing a datasource's schema and a user-specified schema
ShuMing Li created SPARK-28438: -- Summary: [SQL] Ignore metadata (comments) differences when comparing a datasource's schema and a user-specified schema Key: SPARK-28438 URL: https://issues.apache.org/jira/browse/SPARK-28438 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: ShuMing Li -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25788) Elastic net penalties for GLMs
[ https://issues.apache.org/jira/browse/SPARK-25788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887874#comment-16887874 ] shahid commented on SPARK-25788: [~pralabhkumar] yeah. Please go ahead. I don't think I have enough bandwidth to look into the issue. > Elastic net penalties for GLMs > --- > > Key: SPARK-25788 > URL: https://issues.apache.org/jira/browse/SPARK-25788 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.3.2 >Reporter: Christian Lorentzen >Priority: Major > > Currently, both LinearRegression and LogisticRegression support an elastic > net penalty (setElasticNetParam), i.e. L1 and L2 penalties. This feature > could and should also be added to GeneralizedLinearRegression. > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
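For readers unfamiliar with the request, the sketch below (hypothetical ElasticNetToday object) contrasts what exists today with what the ticket asks for: the elastic net mixing parameter is available on LinearRegression and LogisticRegression, while GeneralizedLinearRegression currently exposes only plain L2 regularization through setRegParam.
{code:scala}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.regression.GeneralizedLinearRegression

object ElasticNetToday {
  // Available today: mix L1 and L2 penalties on (logistic) linear models.
  val logit = new LogisticRegression()
    .setRegParam(0.1)        // overall regularization strength
    .setElasticNetParam(0.5) // 0.0 = pure L2 (ridge), 1.0 = pure L1 (lasso)

  // GLMs today: only L2 via setRegParam; a setElasticNetParam counterpart
  // is exactly what SPARK-25788 requests.
  val glr = new GeneralizedLinearRegression()
    .setFamily("poisson")
    .setRegParam(0.1)
}
{code}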
[jira] [Assigned] (SPARK-28278) Convert and port 'except-all.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28278: Assignee: Terry Kim > Convert and port 'except-all.sql' into UDF test base > > > Key: SPARK-28278 > URL: https://issues.apache.org/jira/browse/SPARK-28278 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Terry Kim >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28278) Convert and port 'except-all.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28278. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25090 [https://github.com/apache/spark/pull/25090] > Convert and port 'except-all.sql' into UDF test base > > > Key: SPARK-28278 > URL: https://issues.apache.org/jira/browse/SPARK-28278 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28283) Convert and port 'intersect-all.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28283: Assignee: Terry Kim > Convert and port 'intersect-all.sql' into UDF test base > --- > > Key: SPARK-28283 > URL: https://issues.apache.org/jira/browse/SPARK-28283 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Terry Kim >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28283) Convert and port 'intersect-all.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28283. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25119 [https://github.com/apache/spark/pull/25119] > Convert and port 'intersect-all.sql' into UDF test base > --- > > Key: SPARK-28283 > URL: https://issues.apache.org/jira/browse/SPARK-28283 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28276) Convert and port 'cross-join.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28276: Assignee: Liang-Chi Hsieh (was: Hyukjin Kwon) > Convert and port 'cross-join.sql' into UDF test base > > > Key: SPARK-28276 > URL: https://issues.apache.org/jira/browse/SPARK-28276 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28276) Convert and port 'cross-join.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28276. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25168 [https://github.com/apache/spark/pull/25168] > Convert and port 'cross-join.sql' into UDF test base > > > Key: SPARK-28276 > URL: https://issues.apache.org/jira/browse/SPARK-28276 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28276) Convert and port 'cross-join.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28276: Assignee: Hyukjin Kwon > Convert and port 'cross-join.sql' into UDF test base > > > Key: SPARK-28276 > URL: https://issues.apache.org/jira/browse/SPARK-28276 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25788) Elastic net penalties for GLMs
[ https://issues.apache.org/jira/browse/SPARK-25788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887858#comment-16887858 ] pralabhkumar commented on SPARK-25788: -- [~shahid] I can work on this. Please let me know if that's OK. > Elastic net penalties for GLMs > --- > > Key: SPARK-25788 > URL: https://issues.apache.org/jira/browse/SPARK-25788 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.3.2 >Reporter: Christian Lorentzen >Priority: Major > > Currently, both LinearRegression and LogisticRegression support an elastic > net penalty (setElasticNetParam), i.e. L1 and L2 penalties. This feature > could and should also be added to GeneralizedLinearRegression. > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28437) Different format when casting interval type to string type
Yuming Wang created SPARK-28437: --- Summary: Different format when casting interval type to string type Key: SPARK-28437 URL: https://issues.apache.org/jira/browse/SPARK-28437 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang *Spark SQL*:
{code:sql}
spark-sql> select cast(INTERVAL '10' SECOND as string);
interval 10 seconds
{code}
*PostgreSQL*:
{code:sql}
postgres=# select substr(version(), 0, 16), cast(INTERVAL '10' SECOND as text);
     substr      |   text
-----------------+----------
 PostgreSQL 11.3 | 00:00:10
(1 row)
{code}
*Vertica*:
{code:sql}
dbadmin=> select version(), cast(INTERVAL '10' SECOND as varchar(255));
              version               | ?column?
------------------------------------+----------
 Vertica Analytic Database v9.1.1-0 | 10
(1 row)
{code}
*Presto*:
{code:sql}
presto> select cast(INTERVAL '10' SECOND as varchar(255));
     _col0
----------------
 0 00:00:10.000
(1 row)
{code}
*Oracle*:
{code:sql}
SQL> select cast(INTERVAL '10' SECOND as varchar(255)) from dual;
CAST(INTERVAL'10'SECONDASVARCHAR(255))
--------------------------------------
INTERVAL'+00 00:00:10.00'DAY TO SECOND
{code}
-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-26796) Testcases failing with "org.apache.hadoop.fs.ChecksumException" error
[ https://issues.apache.org/jira/browse/SPARK-26796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anuja Jakhade reopened SPARK-26796: --- > Testcases failing with "org.apache.hadoop.fs.ChecksumException" error > - > > Key: SPARK-26796 > URL: https://issues.apache.org/jira/browse/SPARK-26796 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.2, 2.4.0 > Environment: Ubuntu 16.04 > Java Version > openjdk version "1.8.0_192" > OpenJDK Runtime Environment (build 1.8.0_192-b12_openj9) > Eclipse OpenJ9 VM (build openj9-0.11.0, JRE 1.8.0 Compressed References > 20181107_80 (JIT enabled, AOT enabled) > OpenJ9 - 090ff9dcd > OMR - ea548a66 > JCL - b5a3affe73 based on jdk8u192-b12) > > Hadoop Version > Hadoop 2.7.1 > Subversion Unknown -r Unknown > Compiled by test on 2019-01-29T09:09Z > Compiled with protoc 2.5.0 > From source with checksum 5e94a235f9a71834e2eb73fb36ee873f > This command was run using > /home/test/hadoop-release-2.7.1/hadoop-dist/target/hadoop-2.7.1/share/hadoop/common/hadoop-common-2.7.1.jar > > > >Reporter: Anuja Jakhade >Priority: Major > > Observing test case failures due to a checksum error. Below is the error log: > [ERROR] checkpointAndComputation(test.org.apache.spark.JavaAPISuite) Time > elapsed: 1.232 s <<< ERROR! > org.apache.spark.SparkException: > Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor > driver): org.apache.hadoop.fs.ChecksumException: Checksum error: > file:/home/test/spark/core/target/tmp/1548319689411-0/fd0ba388-539c-49aa-bf76-e7d50aa2d1fc/rdd-0/part-0 > at 0 exp: 222499834 got: 1400184476 > at org.apache.hadoop.fs.FSInputChecker.verifySums(FSInputChecker.java:323) > at > org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:279) > at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:214) > at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:232) > at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:196) > at java.io.DataInputStream.read(DataInputStream.java:149) > at > java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2769) > at > java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2785) > at > java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3262) > at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:968) > at java.io.ObjectInputStream.<init>(ObjectInputStream.java:390) > at > org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:63) > at > org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:63) > at > org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:122) > at > org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:300) > at > org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:813) > Driver stacktrace: > at > test.org.apache.spark.JavaAPISuite.checkpointAndComputation(JavaAPISuite.java:1243) > Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error: > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28436) [SQL] Throw better exception when datasource's schema is not equal to user-specified schema
[ https://issues.apache.org/jira/browse/SPARK-28436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ShuMing Li updated SPARK-28436: --- Affects Version/s: 2.3.0 > [SQL] Throw better exception when datasource's schema is not equal to > user-specified schema > - > > Key: SPARK-28436 > URL: https://issues.apache.org/jira/browse/SPARK-28436 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.3 >Reporter: ShuMing Li >Priority: Minor > > When this exception is thrown, users cannot tell what the difference is > between the datasource's original schema and the user-specified schema, and > may be very confused when they meet the exception below. > {code:java} > org.apache.spark.sql.AnalysisException: org.apache.spark.odps.datasource does > not allow user-specified schemas. > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:347) > at > org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:78) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3270) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:83) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3269) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:190) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:653) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:714) > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28436) [SQL] Throw better exception when datasource's schema is not equal to user-specified schema
ShuMing Li created SPARK-28436: -- Summary: [SQL] Throw better exception when datasource's schema is not equal to user-specified schema Key: SPARK-28436 URL: https://issues.apache.org/jira/browse/SPARK-28436 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Reporter: ShuMing Li When this exception is thrown, users cannot tell what the difference is between the datasource's original schema and the user-specified schema, and may be very confused when they meet the exception below.
{code:java}
org.apache.spark.sql.AnalysisException: org.apache.spark.odps.datasource does not allow user-specified schemas.;
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:347)
  at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:78)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
  at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3270)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:83)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3269)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:190)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:653)
  at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:714)
{code}
-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28436) [SQL] Throw better exception when datasource's schema is not equal to user-specified schema
[ https://issues.apache.org/jira/browse/SPARK-28436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ShuMing Li updated SPARK-28436: --- Description: When this exception is thrown, users cannot tell what the difference is between the datasource's original schema and the user-specified schema, and may be very confused when they meet the exception below.
{code:java}
org.apache.spark.sql.AnalysisException: org.apache.spark.odps.datasource does not allow user-specified schemas.
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:347)
  at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:78)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
  at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3270)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:83)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3269)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:190)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:653)
  at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:714)
{code}
was: (same description, with the stack trace in markdown-style ``` fences instead of {code} fences) > [SQL] Throw better exception when datasource's schema is not equal to > user-specified schema > - > > Key: SPARK-28436 > URL: https://issues.apache.org/jira/browse/SPARK-28436 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: ShuMing Li >Priority: Minor > > When this exception is thrown, users cannot tell what the difference is > between the datasource's original schema and the user-specified schema, and > may be very confused when they meet the exception below. 
> {code:java} > org.apache.spark.sql.AnalysisException: org.apache.spark.odps.datasource does > not allow user-specified schemas. > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:347) > at > org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:78) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3270) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:83) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3269) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:190) > at
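A minimal sketch of the kind of message improvement SPARK-28436 asks for (SchemaDiff and describeMismatch are hypothetical names; the real change would live around DataSource.resolveRelation): report which fields differ instead of only rejecting the user-specified schema outright.
{code:scala}
import org.apache.spark.sql.types.StructType

object SchemaDiff {
  // Summarize how two schemas differ so the AnalysisException can say more
  // than "does not allow user-specified schemas".
  def describeMismatch(source: StructType, userSpecified: StructType): String = {
    val src = source.fields.map(f => f.name -> f.dataType).toMap
    val usr = userSpecified.fields.map(f => f.name -> f.dataType).toMap
    val onlyInSource = (src.keySet -- usr.keySet).mkString(", ")
    val onlyInUser = (usr.keySet -- src.keySet).mkString(", ")
    val typeDiffs = (src.keySet & usr.keySet)
      .filter(k => src(k) != usr(k))
      .map(k => s"$k: ${src(k)} vs ${usr(k)}")
      .mkString(", ")
    s"only in source schema: [$onlyInSource]; only in user-specified schema: " +
      s"[$onlyInUser]; type mismatches: [$typeDiffs]"
  }
}
{code}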
[jira] [Commented] (SPARK-28424) Improve interval input
[ https://issues.apache.org/jira/browse/SPARK-28424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887754#comment-16887754 ] Yuming Wang commented on SPARK-28424: - I'm working on it. > Improve interval input > --- > > Key: SPARK-28424 > URL: https://issues.apache.org/jira/browse/SPARK-28424 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Example: > {code:sql} > INTERVAL '1 day 2:03:04' > {code} > https://www.postgresql.org/docs/11/datatype-datetime.html -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28411) insertInto with overwrite inconsistent behaviour Python/Scala
[ https://issues.apache.org/jira/browse/SPARK-28411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887725#comment-16887725 ] Maria Rebelka commented on SPARK-28411: --- Great, thank you! > insertInto with overwrite inconsistent behaviour Python/Scala > - > > Key: SPARK-28411 > URL: https://issues.apache.org/jira/browse/SPARK-28411 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.1, 2.4.0 >Reporter: Maria Rebelka >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > df.write.mode("overwrite").insertInto("table") behaves inconsistently > between Scala and Python. In Python, insertInto ignores the "mode" parameter > and appends by default; only when the syntax is changed to > df.write.insertInto("table", overwrite=True) do we get the expected > behaviour. > This is native Spark syntax and is expected to behave the same across > languages... > Also, in other write methods, like saveAsTable or write.parquet, "mode" seems > to be respected. > Reproduce, Python, "overwrite" ignored: > {code:java} > df = spark.createDataFrame(sc.parallelize([(1, 2),(3,4)]),['i','j']) > # create the table and load data > df.write.saveAsTable("spark_overwrite_issue") > # insert overwrite, expected result - 2 rows > df.write.mode("overwrite").insertInto("spark_overwrite_issue") > spark.sql("select * from spark_overwrite_issue").count() > # result - 4 rows, insertInto appended the data instead of overwriting{code} > Reproduce, Scala, works as expected: > {code:java} > val df = Seq((1, 2),(3,4)).toDF("i","j") > df.write.mode("overwrite").insertInto("spark_overwrite_issue") > spark.sql("select * from spark_overwrite_issue").count() > // result - 2 rows{code} > Tested on Spark 2.2.1 (EMR) and 2.4.0 (Databricks) -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-28429) SQL Datetime util function being cast to double instead of timestamp
[ https://issues.apache.org/jira/browse/SPARK-28429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28429: Comment: was deleted (was: I'm working on it.) > SQL Datetime util function being cast to double instead of timestamp > -- > > Key: SPARK-28429 > URL: https://issues.apache.org/jira/browse/SPARK-28429 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > > In the code below, the '100 days' in now() + '100 days' is cast to double > and then an error is thrown: > {code:sql} > CREATE TEMP VIEW v_window AS > SELECT i, min(i) over (order by i range between '1 day' preceding and '10 > days' following) as min_i > FROM range(now(), now()+'100 days', '1 hour') i; > {code} > Error: > {code:sql} > cannot resolve '(current_timestamp() + CAST('100 days' AS DOUBLE))' due to > data type mismatch: differing types in '(current_timestamp() + CAST('100 > days' AS DOUBLE))' (timestamp and double).;{code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28288) Convert and port 'window.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887707#comment-16887707 ] Liang-Chi Hsieh commented on SPARK-28288: - Those errors can be found in the original window.sql, so this seems fine. > Convert and port 'window.sql' into UDF test base > > > Key: SPARK-28288 > URL: https://issues.apache.org/jira/browse/SPARK-28288 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28194) [SQL] A NoSuchElementException may be thrown during EnsureRequirements
[ https://issues.apache.org/jira/browse/SPARK-28194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28194. -- Resolution: Cannot Reproduce So, this is likely a duplicate of another JIRA. Let's find the JIRA that fixed this issue and see if we can backport. > [SQL] A NoSuchElementException may be thrown during EnsureRequirements > -- > > Key: SPARK-28194 > URL: https://issues.apache.org/jira/browse/SPARK-28194 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: feiwang >Priority: Major > > {code:java} > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$reorder$1.apply(EnsureRequirements.scala:239) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$reorder$1.apply(EnsureRequirements.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.reorder(EnsureRequirements.scala:234) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.reorderJoinKeys(EnsureRequirements.scala:257) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$reorderJoinPredicates(EnsureRequirements.scala:297) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:312) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28429) SQL Datetime util function being cast to double instead of timestamp
[ https://issues.apache.org/jira/browse/SPARK-28429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887673#comment-16887673 ] Yuming Wang commented on SPARK-28429: - I'm working on it. > SQL Datetime util function being cast to double instead of timestamp > -- > > Key: SPARK-28429 > URL: https://issues.apache.org/jira/browse/SPARK-28429 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > > In the code below, the '100 days' in now() + '100 days' is cast to double > and then an error is thrown: > {code:sql} > CREATE TEMP VIEW v_window AS > SELECT i, min(i) over (order by i range between '1 day' preceding and '10 > days' following) as min_i > FROM range(now(), now()+'100 days', '1 hour') i; > {code} > Error: > {code:sql} > cannot resolve '(current_timestamp() + CAST('100 days' AS DOUBLE))' due to > data type mismatch: differing types in '(current_timestamp() + CAST('100 > days' AS DOUBLE))' (timestamp and double).;{code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
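The failure above comes from implicit coercion: there is no rule for adding a plain string to a timestamp, so Spark falls back to casting '100 days' to double, which cannot be added to a timestamp either. An explicit interval literal avoids the fallback; a sketch (hypothetical IntervalAddDemo object, local SparkSession assumed, and a workaround for the addition only, not for the range() call in the report):
{code:scala}
import org.apache.spark.sql.SparkSession

object IntervalAddDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("interval-add").getOrCreate()
    // A string offset would be cast to double and fail; an explicit
    // interval literal adds cleanly to a timestamp.
    spark.sql("SELECT current_timestamp() + INTERVAL 100 DAYS AS later").show(false)
    spark.stop()
  }
}
{code}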
[jira] [Updated] (SPARK-28435) Support cast StringType to IntervalType for SQL interface
[ https://issues.apache.org/jira/browse/SPARK-28435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28435: Summary: Support cast StringType to IntervalType for SQL interface (was: Support cast string to interval for SQL interface) > Support cast StringType to IntervalType for SQL interface > - > > Key: SPARK-28435 > URL: https://issues.apache.org/jira/browse/SPARK-28435 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > The Scala interface supports casting a string to an interval: > {code:scala} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.catalyst.expressions._ > Cast(Literal("interval 3 month 1 hours"), CalendarIntervalType).eval() > res0: Any = interval 3 months 1 hours > {code} > But the SQL interface does not support it: > {code:sql} > scala> spark.sql("SELECT CAST('interval 3 month 1 hour' AS interval)").show > org.apache.spark.sql.catalyst.parser.ParseException: > DataType interval is not supported.(line 1, pos 41) > == SQL == > SELECT CAST('interval 3 month 1 hour' AS interval) > -----------------------------------------^^^ > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitPrimitiveDataType$1(AstBuilder.scala:1931) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:108) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitPrimitiveDataType(AstBuilder.scala:1909) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitPrimitiveDataType(AstBuilder.scala:52) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$PrimitiveDataTypeContext.accept(SqlBaseParser.java:15397) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:58) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitSparkDataType(AstBuilder.scala:1903) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitCast$1(AstBuilder.scala:1334) > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28435) Support cast string to interval for SQL interface
Yuming Wang created SPARK-28435: --- Summary: Support cast string to interval for SQL interface Key: SPARK-28435 URL: https://issues.apache.org/jira/browse/SPARK-28435 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang The Scala interface supports casting a string to an interval:
{code:scala}
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.expressions._

Cast(Literal("interval 3 month 1 hours"), CalendarIntervalType).eval()
res0: Any = interval 3 months 1 hours
{code}
But the SQL interface does not support it:
{code:sql}
scala> spark.sql("SELECT CAST('interval 3 month 1 hour' AS interval)").show
org.apache.spark.sql.catalyst.parser.ParseException:
DataType interval is not supported.(line 1, pos 41)

== SQL ==
SELECT CAST('interval 3 month 1 hour' AS interval)
-----------------------------------------^^^

  at org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitPrimitiveDataType$1(AstBuilder.scala:1931)
  at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:108)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.visitPrimitiveDataType(AstBuilder.scala:1909)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.visitPrimitiveDataType(AstBuilder.scala:52)
  at org.apache.spark.sql.catalyst.parser.SqlBaseParser$PrimitiveDataTypeContext.accept(SqlBaseParser.java:15397)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:58)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSparkDataType(AstBuilder.scala:1903)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitCast$1(AstBuilder.scala:1334)
{code}
-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
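Until CAST(... AS interval) parses, interval literals are already accepted by the SQL interface, so the value in the example can be produced there today; a sketch under the assumption of a local SparkSession (IntervalLiteralDemo is a hypothetical name):
{code:scala}
import org.apache.spark.sql.SparkSession

object IntervalLiteralDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("interval-literal").getOrCreate()
    // The parser rejects CAST('...' AS interval) but accepts interval literals.
    spark.sql("SELECT INTERVAL 3 MONTHS 1 HOUR AS iv").show(false)
    spark.stop()
  }
}
{code}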