[jira] [Assigned] (SPARK-25391) Make behaviors consistent when converting parquet hive table to parquet data source
[ https://issues.apache.org/jira/browse/SPARK-25391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25391: Assignee: Apache Spark > Make behaviors consistent when converting parquet hive table to parquet data > source > --- > > Key: SPARK-25391 > URL: https://issues.apache.org/jira/browse/SPARK-25391 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Assignee: Apache Spark >Priority: Major > > parquet data source tables and hive parquet tables have different behaviors > about parquet field resolution. So, when > {{spark.sql.hive.convertMetastoreParquet}} is true, users might face > inconsistent behaviors. The differences are: > * Whether respect {{spark.sql.caseSensitive}}. Without SPARK-25132, both > data source tables and hive tables do NOT respect > {{spark.sql.caseSensitive}}. However data source tables always do > case-sensitive parquet field resolution, while hive tables always do > case-insensitive parquet field resolution no matter whether > {{spark.sql.caseSensitive}} is set to true or false. SPARK-25132 let data > source tables respect {{spark.sql.caseSensitive}} while hive serde table > behavior is not changed. > * How to resolve ambiguity in case-insensitive mode. Without SPARK-25132, > data source tables do case-sensitive resolution and return columns with the > corresponding letter cases, while hive tables always return the first matched > column ignoring cases. SPARK-25132 let data source tables throw exception > when there is ambiguity while hive table behavior is not changed. > This ticket aims to make behaviors consistent when converting hive table to > data source table. > * The behavior must be consistent to do the conversion, so we skip the > conversion in case-sensitive mode because hive parquet table always do > case-insensitive field resolution. 
> * In case-insensitive mode, when converting hive parquet table to parquet > data source, we switch the duplicated fields resolution mode to ask parquet > data source to pick the first matched field - the same behavior as hive > parquet table - to keep behaviors consistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25391) Make behaviors consistent when converting parquet hive table to parquet data source
[ https://issues.apache.org/jira/browse/SPARK-25391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25391: Assignee: (was: Apache Spark) > Make behaviors consistent when converting parquet hive table to parquet data > source > --- > > Key: SPARK-25391 > URL: https://issues.apache.org/jira/browse/SPARK-25391 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Major > > parquet data source tables and hive parquet tables have different behaviors > about parquet field resolution. So, when > {{spark.sql.hive.convertMetastoreParquet}} is true, users might face > inconsistent behaviors. The differences are: > * Whether respect {{spark.sql.caseSensitive}}. Without SPARK-25132, both > data source tables and hive tables do NOT respect > {{spark.sql.caseSensitive}}. However data source tables always do > case-sensitive parquet field resolution, while hive tables always do > case-insensitive parquet field resolution no matter whether > {{spark.sql.caseSensitive}} is set to true or false. SPARK-25132 let data > source tables respect {{spark.sql.caseSensitive}} while hive serde table > behavior is not changed. > * How to resolve ambiguity in case-insensitive mode. Without SPARK-25132, > data source tables do case-sensitive resolution and return columns with the > corresponding letter cases, while hive tables always return the first matched > column ignoring cases. SPARK-25132 let data source tables throw exception > when there is ambiguity while hive table behavior is not changed. > This ticket aims to make behaviors consistent when converting hive table to > data source table. > * The behavior must be consistent to do the conversion, so we skip the > conversion in case-sensitive mode because hive parquet table always do > case-insensitive field resolution. 
> * In case-insensitive mode, when converting hive parquet table to parquet > data source, we switch the duplicated fields resolution mode to ask parquet > data source to pick the first matched field - the same behavior as hive > parquet table - to keep behaviors consistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25391) Make behaviors consistent when converting parquet hive table to parquet data source
[ https://issues.apache.org/jira/browse/SPARK-25391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608750#comment-16608750 ] Apache Spark commented on SPARK-25391: -- User 'seancxmao' has created a pull request for this issue: https://github.com/apache/spark/pull/22343 > Make behaviors consistent when converting parquet hive table to parquet data > source > --- > > Key: SPARK-25391 > URL: https://issues.apache.org/jira/browse/SPARK-25391 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Major > > parquet data source tables and hive parquet tables have different behaviors > about parquet field resolution. So, when > {{spark.sql.hive.convertMetastoreParquet}} is true, users might face > inconsistent behaviors. The differences are: > * Whether respect {{spark.sql.caseSensitive}}. Without SPARK-25132, both > data source tables and hive tables do NOT respect > {{spark.sql.caseSensitive}}. However data source tables always do > case-sensitive parquet field resolution, while hive tables always do > case-insensitive parquet field resolution no matter whether > {{spark.sql.caseSensitive}} is set to true or false. SPARK-25132 let data > source tables respect {{spark.sql.caseSensitive}} while hive serde table > behavior is not changed. > * How to resolve ambiguity in case-insensitive mode. Without SPARK-25132, > data source tables do case-sensitive resolution and return columns with the > corresponding letter cases, while hive tables always return the first matched > column ignoring cases. SPARK-25132 let data source tables throw exception > when there is ambiguity while hive table behavior is not changed. > This ticket aims to make behaviors consistent when converting hive table to > data source table. 
> * The behavior must be consistent to do the conversion, so we skip the > conversion in case-sensitive mode because hive parquet table always do > case-insensitive field resolution. > * In case-insensitive mode, when converting hive parquet table to parquet > data source, we switch the duplicated fields resolution mode to ask parquet > data source to pick the first matched field - the same behavior as hive > parquet table - to keep behaviors consistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25391) Make behaviors consistent when converting parquet hive table to parquet data source
Chenxiao Mao created SPARK-25391:
------------------------------------
             Summary: Make behaviors consistent when converting parquet hive table to parquet data source
                 Key: SPARK-25391
                 URL: https://issues.apache.org/jira/browse/SPARK-25391
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Chenxiao Mao

Parquet data source tables and Hive Parquet tables behave differently during Parquet field resolution, so when {{spark.sql.hive.convertMetastoreParquet}} is true, users may see inconsistent behavior. The differences are:
* Whether {{spark.sql.caseSensitive}} is respected. Without SPARK-25132, neither data source tables nor Hive tables respect {{spark.sql.caseSensitive}}: data source tables always do case-sensitive Parquet field resolution, while Hive tables always do case-insensitive resolution, regardless of whether {{spark.sql.caseSensitive}} is set to true or false. SPARK-25132 made data source tables respect {{spark.sql.caseSensitive}}; Hive SerDe table behavior is unchanged.
* How ambiguity is resolved in case-insensitive mode. Without SPARK-25132, data source tables do case-sensitive resolution and return columns with their exact letter case, while Hive tables always return the first matching column, ignoring case. SPARK-25132 made data source tables throw an exception when there is ambiguity; Hive table behavior is unchanged.

This ticket aims to make behaviors consistent when converting a Hive table to a data source table:
* The conversion must preserve behavior, so we skip it in case-sensitive mode, because Hive Parquet tables always do case-insensitive field resolution.
* In case-insensitive mode, when converting a Hive Parquet table to a Parquet data source, we switch the duplicate-field resolution mode so the Parquet data source picks the first matching field, the same behavior as a Hive Parquet table.
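The two resolution modes described in the ticket can be sketched outside Spark. This is a hypothetical toy model, not Spark's actual implementation: `resolve_field` is an invented helper illustrating case sensitivity and the two ambiguity policies (raise vs. pick-first).

```python
def resolve_field(requested, parquet_fields, case_sensitive, pick_first=False):
    """Resolve a requested column against the field names in a Parquet file.

    case_sensitive=True: exact match only (data source table behavior).
    case_sensitive=False: case-insensitive match; on ambiguity either
    raise (data source tables after SPARK-25132) or pick the first
    match (Hive Parquet tables, and converted tables after this ticket).
    """
    if case_sensitive:
        return requested if requested in parquet_fields else None
    matches = [f for f in parquet_fields if f.lower() == requested.lower()]
    if not matches:
        return None
    if len(matches) > 1 and not pick_first:
        raise ValueError("ambiguous Parquet fields for '%s': %s" % (requested, matches))
    return matches[0]
```

With `pick_first=True` the conversion keeps the Hive behavior; without it, the data source behavior after SPARK-25132 applies.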
[jira] [Created] (SPARK-25390) finalize the abstraction of data source V2 API
Wenchen Fan created SPARK-25390:
-----------------------------------
             Summary: finalize the abstraction of data source V2 API
                 Key: SPARK-25390
                 URL: https://issues.apache.org/jira/browse/SPARK-25390
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Wenchen Fan

Currently it's not very clear how we should abstract the data source V2 API. The abstraction should either be unified between batch and streaming, or be similar with a well-defined difference between the two. The abstraction should also include catalog/table.

An example of the abstraction:
{code}
batch: catalog -> table -> scan
streaming: catalog -> table -> stream -> scan
{code}
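The proposed hierarchy can be mocked up to show where batch and streaming share nodes and where they diverge. This is a toy sketch of the shape described above; all class names are illustrative stand-ins, not the actual Spark V2 interfaces.

```python
from abc import ABC, abstractmethod


class Scan:
    """Leaf of both hierarchies: one concrete read of the data."""
    def __init__(self, rows):
        self.rows = rows


class Table(ABC):
    """Node shared by batch and streaming; loaded from a Catalog by name."""


class BatchTable(Table):
    @abstractmethod
    def new_scan(self):
        """batch: catalog -> table -> scan"""


class Stream:
    """Streaming-only node between table and scan: tracks progress and
    hands out one Scan per micro-batch."""
    def __init__(self, batches):
        self._batches = list(batches)

    def next_scan(self):
        return Scan(self._batches.pop(0)) if self._batches else None


class StreamingTable(Table):
    @abstractmethod
    def new_stream(self):
        """streaming: catalog -> table -> stream -> scan"""


class Catalog:
    """Entry point of both hierarchies."""
    def __init__(self, tables):
        self._tables = tables

    def load_table(self, name):
        return self._tables[name]
```

The design point is that catalog and table are shared, and streaming only inserts one extra node (the stream) before reaching the same scan leaf.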
[jira] [Updated] (SPARK-25389) INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields
[ https://issues.apache.org/jira/browse/SPARK-25389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-25389:
----------------------------------
    Description:
Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY STORED AS` should not generate files with duplicate fields because Spark cannot read those files.

*INSERT OVERWRITE DIRECTORY USING*
{code}
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet SELECT 'id', 'id2' id")
... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ...
org.apache.spark.sql.AnalysisException: Found duplicate column(s) when inserting into file:/tmp/parquet: `id`;
{code}

*INSERT OVERWRITE DIRECTORY STORED AS*
{code}
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS parquet SELECT 'id', 'id2' id")

scala> spark.read.parquet("/tmp/parquet").show
18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data schema and the partition schema: `id`;
{code}

was:
Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY STORED AS` should not generate files with duplicate fields because Spark cannot read those files.
*INSERT OVERWRITE DIRECTORY USING* {code} scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet SELECT 'id', 'id2' id") 18/09/09 22:11:29 ERROR InsertIntoDataSourceDirCommand: Failed to write to directory Some(file:///tmp/parquet) org.apache.spark.sql.AnalysisException: Found duplicate column(s) when inserting into file:/tmp/parquet: `id`; {code} *INSERT OVERWRITE DIRECTORY STORED AS* {code} scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS parquet SELECT 'id', 'id2' id") scala> spark.read.parquet("/tmp/parquet").show 18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data schema and the partition schema: `id`; {code} > INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields > > > Key: SPARK-25389 > URL: https://issues.apache.org/jira/browse/SPARK-25389 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Dongjoon Hyun >Priority: Major > > Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY > STORED AS` should not generate files with duplicate fields because Spark > cannot read those files. > *INSERT OVERWRITE DIRECTORY USING* > {code} > scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet > SELECT 'id', 'id2' id") > ... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ... 
> org.apache.spark.sql.AnalysisException: Found duplicate column(s) when > inserting into file:/tmp/parquet: `id`; > {code} > *INSERT OVERWRITE DIRECTORY STORED AS* > {code} > scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS > parquet SELECT 'id', 'id2' id") > scala> spark.read.parquet("/tmp/parquet").show > 18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data > schema and the partition schema: `id`; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25389) INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields
[ https://issues.apache.org/jira/browse/SPARK-25389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25389: Assignee: (was: Apache Spark) > INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields > > > Key: SPARK-25389 > URL: https://issues.apache.org/jira/browse/SPARK-25389 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Dongjoon Hyun >Priority: Major > > Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY > STORED AS` should not generate files with duplicate fields because Spark > cannot read those files. > *INSERT OVERWRITE DIRECTORY USING* > {code} > scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet > SELECT 'id', 'id2' id") > ... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ... > org.apache.spark.sql.AnalysisException: Found duplicate column(s) when > inserting into file:/tmp/parquet: `id`; > {code} > *INSERT OVERWRITE DIRECTORY STORED AS* > {code} > scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS > parquet SELECT 'id', 'id2' id") > scala> spark.read.parquet("/tmp/parquet").show > 18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data > schema and the partition schema: `id`; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25389) INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields
[ https://issues.apache.org/jira/browse/SPARK-25389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608741#comment-16608741 ] Apache Spark commented on SPARK-25389: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/22378 > INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields > > > Key: SPARK-25389 > URL: https://issues.apache.org/jira/browse/SPARK-25389 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Dongjoon Hyun >Priority: Major > > Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY > STORED AS` should not generate files with duplicate fields because Spark > cannot read those files. > *INSERT OVERWRITE DIRECTORY USING* > {code} > scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet > SELECT 'id', 'id2' id") > ... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ... > org.apache.spark.sql.AnalysisException: Found duplicate column(s) when > inserting into file:/tmp/parquet: `id`; > {code} > *INSERT OVERWRITE DIRECTORY STORED AS* > {code} > scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS > parquet SELECT 'id', 'id2' id") > scala> spark.read.parquet("/tmp/parquet").show > 18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data > schema and the partition schema: `id`; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25389) INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields
[ https://issues.apache.org/jira/browse/SPARK-25389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25389: Assignee: Apache Spark > INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields > > > Key: SPARK-25389 > URL: https://issues.apache.org/jira/browse/SPARK-25389 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY > STORED AS` should not generate files with duplicate fields because Spark > cannot read those files. > *INSERT OVERWRITE DIRECTORY USING* > {code} > scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet > SELECT 'id', 'id2' id") > ... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ... > org.apache.spark.sql.AnalysisException: Found duplicate column(s) when > inserting into file:/tmp/parquet: `id`; > {code} > *INSERT OVERWRITE DIRECTORY STORED AS* > {code} > scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS > parquet SELECT 'id', 'id2' id") > scala> spark.read.parquet("/tmp/parquet").show > 18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data > schema and the partition schema: `id`; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25389) INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields
[ https://issues.apache.org/jira/browse/SPARK-25389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25389: -- Summary: INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields (was: INSERT OVERWRITE DIRECTORY STORED AS should not generate files with duplicate fields) > INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields > > > Key: SPARK-25389 > URL: https://issues.apache.org/jira/browse/SPARK-25389 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Dongjoon Hyun >Priority: Major > > Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY > STORED AS` should not generate files with duplicate fields because Spark > cannot read those files. > *INSERT OVERWRITE DIRECTORY USING* > {code} > scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet > SELECT 'id', 'id2' id") > 18/09/09 22:11:29 ERROR InsertIntoDataSourceDirCommand: Failed to write to > directory Some(file:///tmp/parquet) > org.apache.spark.sql.AnalysisException: Found duplicate column(s) when > inserting into file:/tmp/parquet: `id`; > {code} > *INSERT OVERWRITE DIRECTORY STORED AS* > {code} > scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS > parquet SELECT 'id', 'id2' id") > scala> spark.read.parquet("/tmp/parquet").show > 18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data > schema and the partition schema: `id`; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25389) INSERT OVERWRITE DIRECTORY STORED AS should not generate files with duplicate fields
Dongjoon Hyun created SPARK-25389:
-------------------------------------
             Summary: INSERT OVERWRITE DIRECTORY STORED AS should not generate files with duplicate fields
                 Key: SPARK-25389
                 URL: https://issues.apache.org/jira/browse/SPARK-25389
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.1, 2.3.0
            Reporter: Dongjoon Hyun

Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY STORED AS` should not generate files with duplicate fields because Spark cannot read those files.

*INSERT OVERWRITE DIRECTORY USING*
{code}
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet SELECT 'id', 'id2' id")
18/09/09 22:11:29 ERROR InsertIntoDataSourceDirCommand: Failed to write to directory Some(file:///tmp/parquet)
org.apache.spark.sql.AnalysisException: Found duplicate column(s) when inserting into file:/tmp/parquet: `id`;
{code}

*INSERT OVERWRITE DIRECTORY STORED AS*
{code}
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS parquet SELECT 'id', 'id2' id")

scala> spark.read.parquet("/tmp/parquet").show
18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data schema and the partition schema: `id`;
{code}
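The USING path rejects the write because it checks the output schema for duplicate column names before writing; the STORED AS path should do the same. A minimal sketch of such a check (a hypothetical helper, not Spark's code, which raises AnalysisException from its own schema utilities):

```python
from collections import Counter


def check_column_name_duplication(names, case_sensitive=False):
    """Raise if the output schema contains duplicate column names,
    mirroring the error the USING syntax already raises."""
    keys = names if case_sensitive else [n.lower() for n in names]
    dupes = sorted({k for k, c in Counter(keys).items() if c > 1})
    if dupes:
        raise ValueError("Found duplicate column(s): " + ", ".join(dupes))
```

Running this check on the SELECT output of `'id', 'id2' id` (two columns both named `id`) would fail the STORED AS write up front instead of producing an unreadable file.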
[jira] [Commented] (SPARK-24911) SHOW CREATE TABLE drops escaping of nested column names
[ https://issues.apache.org/jira/browse/SPARK-24911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608701#comment-16608701 ] Apache Spark commented on SPARK-24911: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/22377 > SHOW CREATE TABLE drops escaping of nested column names > --- > > Key: SPARK-24911 > URL: https://issues.apache.org/jira/browse/SPARK-24911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.4.0 > > > Create a table with quoted nested column - *`b`*: > {code:sql} > create table `test` (`a` STRUCT<`b`:STRING>); > {code} > and show how the table was created: > {code:sql} > SHOW CREATE TABLE `test` > {code} > {code} > CREATE TABLE `test`(`a` struct) > {code} > The column *b* becomes unquoted. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
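The fix for the escaping bug amounts to quoting nested field names the same way top-level names are quoted when the SHOW CREATE TABLE string is generated. A sketch of the intended rendering, using hypothetical helper names:

```python
def quote_identifier(name):
    # Escape embedded backticks by doubling them, then wrap in backticks.
    return "`" + name.replace("`", "``") + "`"


def struct_type_sql(fields):
    """Render struct<...> with every nested field name quoted, so the
    SHOW CREATE TABLE output round-trips through the parser."""
    inner = ",".join("%s:%s" % (quote_identifier(n), t) for n, t in fields)
    return "struct<" + inner + ">"
```

With this, the nested column of `test` renders as struct<`b`:STRING> rather than the unquoted struct<b:STRING> the bug produces.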
[jira] [Commented] (SPARK-24849) Convert StructType to DDL string
[ https://issues.apache.org/jira/browse/SPARK-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608696#comment-16608696 ] Apache Spark commented on SPARK-24849: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/22377 > Convert StructType to DDL string > > > Key: SPARK-24849 > URL: https://issues.apache.org/jira/browse/SPARK-24849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > Need to add new methods which should convert a value of StructType to a > schema in DDL format . It should be possible to use the former string in new > table creation by just copy-pasting of new method results. The existing > methods simpleString(), catalogString() and sql() put ':' between top level > field name and its type, and wrap by the *struct* word > {code} > ds.schema.catalogString > struct {code} > Output of new method should be > {code} > metaData struct {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
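The difference between the existing renderings and the requested one is small: catalogString emits struct<name:type,...>, while the DDL form emits "name type" pairs that can be pasted directly into a CREATE TABLE column list. A sketch of both forms over a plain field list (the field names here are illustrative, not from the ticket):

```python
def catalog_string(fields):
    """Existing rendering: struct<name:type,...> with ':' separators
    and a struct<> wrapper."""
    return "struct<" + ",".join("%s:%s" % (n, t) for n, t in fields) + ">"


def to_ddl(fields):
    """Requested rendering: 'name type' pairs, comma-separated, with no
    struct<> wrapper -- copy-pastable into a CREATE TABLE column list."""
    return ",".join("%s %s" % (n, t) for n, t in fields)
```

If memory serves, this landed in Spark 2.4 as a method on StructType itself, but the transformation is exactly the separator and wrapper change shown above.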
[jira] [Updated] (SPARK-20918) Use FunctionIdentifier as function identifiers in FunctionRegistry
[ https://issues.apache.org/jira/browse/SPARK-20918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-20918:
--------------------------------
    Description:
Currently, the unquoted string of a function identifier is being used as the function identifier in the function registry. This could cause incorrect behavior when users use `.` in function names. As an example, Spark can resolve a function like this:
{code}
SELECT `d100.udf100`(`emp`.`name`) FROM `emp`;
{code}
Although the function name is wrapped with backticks, Spark still resolves it as database name + function name, which is wrong.

was:
Currently, the unquoted string of a function identifier is being used as the function identifier in the function registry. This could cause incorrect behavior when users use `.` in function names.

> Use FunctionIdentifier as function identifiers in FunctionRegistry
> ------------------------------------------------------------------
>
>                 Key: SPARK-20918
>                 URL: https://issues.apache.org/jira/browse/SPARK-20918
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Xiao Li
>            Assignee: Xiao Li
>            Priority: Major
>             Fix For: 2.3.0
>
> Currently, the unquoted string of a function identifier is being used as the function identifier in the function registry. This could cause incorrect behavior when users use `.` in function names.
> As an example, Spark can resolve a function like this:
> {code}
> SELECT `d100.udf100`(`emp`.`name`) FROM `emp`;
> {code}
> Although the function name is wrapped with backticks, Spark still resolves it as database name + function name, which is wrong.
[jira] [Updated] (SPARK-20918) Use FunctionIdentifier as function identifiers in FunctionRegistry
[ https://issues.apache.org/jira/browse/SPARK-20918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-20918: Issue Type: Bug (was: Improvement) > Use FunctionIdentifier as function identifiers in FunctionRegistry > -- > > Key: SPARK-20918 > URL: https://issues.apache.org/jira/browse/SPARK-20918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 2.3.0 > > > Currently, the unquoted string of a function identifier is being used as the > function identifier in the function registry. This could cause the incorrect > the behavior when users use `.` in the function names. > As an example, Spark can resolve a function like this > {code} > SELECT `d100.udf100`(`emp`.`name`) FROM `emp`; > {code} > Although the function name is wrapped with backticks, Spark still resolves it > as database name + function name, which is wrong. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
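The bug is that the registry is keyed by the unquoted string, so the single quoted name `d100.udf100` and the qualified name d100.udf100 collapse to the same key. Keying the registry by a structured identifier keeps them distinct. A toy Python stand-in for the Scala FunctionIdentifier, with a simplified parser for illustration only:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FunctionIdentifier:
    """Structured registry key: function name plus optional database."""
    name: str
    database: Optional[str] = None


def parse_function_name(text):
    # Backquoted: the whole string is one function name; dots are literal.
    if len(text) >= 2 and text.startswith("`") and text.endswith("`"):
        return FunctionIdentifier(text[1:-1])
    # Unquoted: a dot separates the database from the function name.
    if "." in text:
        db, name = text.split(".", 1)
        return FunctionIdentifier(name, db)
    return FunctionIdentifier(text)
```

Because the dataclass is frozen (hashable), it can serve directly as a dict key, so the two spellings land on different registry entries instead of colliding.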
[jira] [Updated] (SPARK-25367) The column attributes obtained by Spark sql are inconsistent with hive
[ https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yy updated SPARK-25367:
-----------------------
    Description: 
We save the dataframe object as a hive table in orc/parquet format in the spark shell.
After we modified the column type (int to double) of this table through hive jdbc, the column type queried in spark-shell didn't change, but it did change in hive jdbc. After we restarted the spark-shell, the table's column type still did not match the one shown by hive jdbc.
The steps to reproduce are as follows:
spark-shell:
{code:java}
val df = spark.read.json("examples/src/main/resources/people.json");
df.write.format("orc").saveAsTable("people_test");
spark.sql("desc people_test").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|     age|   bigint|   null|
|    name|   string|   null|
+--------+---------+-------+
{code}
hive:
{code:java}
hive> desc people_test;
OK
age bigint
name string
Time taken: 0.454 seconds, Fetched: 2 row(s)
hive> alter table people_test change column age age double;
OK
Time taken: 0.68 seconds
hive> desc people_test;
OK
age double
name string
Time taken: 0.358 seconds, Fetched: 2 row(s)
{code}
spark-shell:
{code:java}
spark.catalog.refreshTable("people_test")
spark.sql("desc people_test").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|     age|   bigint|   null|
|    name|   string|   null|
+--------+---------+-------+
{code}
We also tested in spark-shell by creating a table using spark.sql("create table XXX()"); for such tables the modified columns are consistent.

  was:
We save the dataframe object as a hive table in orc/parquet format in the spark shell.
After we modified the column type (int to double) of this table through hive jdbc, the column type queried in spark-shell didn't change, but it did change in hive jdbc. After we restarted the spark-shell, the table's column type still did not match the one shown by hive jdbc.
The steps to reproduce are as follows:
spark-shell:
{code:java}
val df = spark.read.json("examples/src/main/resources/people.json");
df.write.format("orc").saveAsTable("people_test");
spark.sql("desc people_test").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|     age|   bigint|   null|
|    name|   string|   null|
+--------+---------+-------+
{code}
hive:
{code:java}
hive> desc people_test;
OK
age bigint
name string
Time taken: 0.454 seconds, Fetched: 2 row(s)
hive> alter table people_test change column age age1 double;
OK
Time taken: 0.68 seconds
hive> desc people_test;
OK
age1 double
name string
Time taken: 0.358 seconds, Fetched: 2 row(s)
{code}
spark-shell:
{code:java}
spark.catalog.refreshTable("people_test")
spark.sql("desc people_test").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|     age|   bigint|   null|
|    name|   string|   null|
+--------+---------+-------+
{code}
We also tested in spark-shell by creating a table using spark.sql("create table XXX()"); for such tables the modified columns are consistent.

> The column attributes obtained by Spark sql are inconsistent with hive
> ----------------------------------------------------------------------
>
>                 Key: SPARK-25367
>                 URL: https://issues.apache.org/jira/browse/SPARK-25367
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell, SQL
>    Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
>         Environment: spark2.2.1-hadoop-2.6.0-chd-5.4.2
> hive-1.2.1
>            Reporter: yy
>            Priority: Major
>              Labels: sparksql
>
[jira] [Updated] (SPARK-25367) The column attributes obtained by Spark sql are inconsistent with hive
[ https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yy updated SPARK-25367:
-----------------------
    Priority: Critical  (was: Major)
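The likely mechanism behind the report can be sketched as follows. This is a hypothetical illustration, not Spark's internals: for a table created with `saveAsTable`, Spark keeps its own serialized copy of the schema in the table properties and prefers it over the Hive column metadata, so an ALTER issued through Hive updates only the Hive side.

```python
# Hypothetical sketch of the SPARK-25367 symptom (illustration only). Two
# copies of the schema exist; Hive DDL touches one, Spark DESC reads the other.

class TableMeta:
    def __init__(self, hive_columns, spark_schema_prop):
        self.hive_columns = dict(hive_columns)            # what Hive `desc` shows
        self.spark_schema_prop = dict(spark_schema_prop)  # Spark's cached copy

def hive_alter_column_type(t, col, new_type):
    """Hive ALTER ... CHANGE COLUMN: updates only the Hive column metadata."""
    t.hive_columns[col] = new_type

def spark_describe(t):
    """Spark prefers its cached schema property when present; a table created
    through plain SQL DDL has no such property and falls through to Hive."""
    return t.spark_schema_prop if t.spark_schema_prop else t.hive_columns

# Table created by DataFrame.saveAsTable: both copies start out identical.
t = TableMeta({"age": "bigint", "name": "string"},
              {"age": "bigint", "name": "string"})
hive_alter_column_type(t, "age", "double")

print(t.hive_columns["age"])     # double  (Hive sees the change)
print(spark_describe(t)["age"])  # bigint  (Spark still reports the old type)
```

In this sketch a table created with spark.sql("create table ...") carries no cached copy, so both sides stay consistent, which matches the reporter's last observation.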
[jira] [Commented] (SPARK-25021) Add spark.executor.pyspark.memory support to Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-25021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608676#comment-16608676 ]

Apache Spark commented on SPARK-25021:
--------------------------------------
User 'ifilonenko' has created a pull request for this issue:
https://github.com/apache/spark/pull/22376

> Add spark.executor.pyspark.memory support to Kubernetes
> -------------------------------------------------------
>
>                 Key: SPARK-25021
>                 URL: https://issues.apache.org/jira/browse/SPARK-25021
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.3.0
>            Reporter: Ryan Blue
>            Assignee: Ilan Filonenko
>            Priority: Major
>             Fix For: 3.0.0
>
> SPARK-25004 adds {{spark.executor.pyspark.memory}} to control the memory
> allocation for PySpark and updates YARN to add this memory to its container
> requests. Kubernetes should do something similar to account for the python
> memory allocation.
[jira] [Commented] (SPARK-25378) ArrayData.toArray assume UTF8String
[ https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608675#comment-16608675 ]

Wenchen Fan commented on SPARK-25378:
-------------------------------------
ArrayData is not a public interface, so I won't treat it as a breaking change. I'd say there was a bug in 2.3.1 that allowed it, and the current behavior is expected. I understand how hard it is to write a connector without using internal APIs; hopefully connectors won't need to rely on internal APIs anymore after we stabilize data source v2. Shall we close it as "not a problem"?

> ArrayData.toArray assume UTF8String
> -----------------------------------
>
>                 Key: SPARK-25378
>                 URL: https://issues.apache.org/jira/browse/SPARK-25378
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Xiangrui Meng
>            Priority: Critical
>
> The following code works in 2.3.1 but fails in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}
> java.lang.ClassCastException: java.lang.String cannot be cast to
> org.apache.spark.unsafe.types.UTF8String
>   at org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]
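The failure mode in the stack trace above can be sketched with stand-in classes. This is a hypothetical illustration (the class names mirror Spark's but the code is not Spark's): internal rows store strings in an internal UTF8-encoded wrapper, so an array built from plain strings fails as soon as an accessor reads an element back as the internal type.

```python
# Hypothetical sketch of the SPARK-25378 error (illustration only, not Spark
# code). Accessors assume elements are already the internal string type.

class UTF8String:
    """Stand-in for org.apache.spark.unsafe.types.UTF8String."""
    def __init__(self, data: bytes):
        self.data = data

class GenericArrayData:
    def __init__(self, values):
        self.values = values

    def get_utf8_string(self, i):
        v = self.values[i]
        if not isinstance(v, UTF8String):
            # stands in for Java's ClassCastException in the trace above
            raise TypeError(f"{type(v).__name__} cannot be cast to UTF8String")
        return v

ok = GenericArrayData([UTF8String(b"a"), UTF8String(b"b")])
bad = GenericArrayData(["a", "b"])  # raw strings, as in the failing snippet

print(ok.get_utf8_string(0).data)   # b'a'
try:
    bad.get_utf8_string(0)
except TypeError as e:
    print(e)                        # str cannot be cast to UTF8String
```

On this reading, 2.3.1 happened to tolerate raw strings on the way in, while 2.4.0 enforces the internal representation, which is the position taken in the comment above.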
[jira] [Updated] (SPARK-25367) The column attributes obtained by Spark sql are inconsistent with hive
[ https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yy updated SPARK-25367:
-----------------------
    Summary: The column attributes obtained by Spark sql are inconsistent with hive  (was: Spark sql get incompatiable column schema as in hiveThe column attributes obtained by Spark sql are inconsistent with hive)
[jira] [Updated] (SPARK-25367) Spark sql get incompatiable column schema as in hiveThe column attributes obtained by Spark sql are inconsistent with hive
[ https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yy updated SPARK-25367:
-----------------------
    Summary: Spark sql get incompatiable column schema as in hiveThe column attributes obtained by Spark sql are inconsistent with hive  (was: Spark sql get incompatiable column schema as in hive)
[jira] [Updated] (SPARK-25367) Spark sql get incompatiable column schema as in hive
[ https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yy updated SPARK-25367:
-----------------------
    Summary: Spark sql get incompatiable column schema as in hive  (was: Spark sql get incompatiable schema as in hive)
[jira] [Updated] (SPARK-25367) Spark sql get incompatiable schema as in hive
[ https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yy updated SPARK-25367:
-----------------------
    Summary: Spark sql get incompatiable schema as in hive  (was: Spark sql got incompatiable schema as in hive)
[jira] [Updated] (SPARK-25367) Spark sql got incompatiable schema as in hive
[ https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yy updated SPARK-25367:
-----------------------
    Summary: Spark sql got incompatiable schema as in hive  (was: Hive table created by Spark dataFrame has incompatiable schema in spark and hive)
[jira] [Updated] (SPARK-25367) Hive table created by Spark dataFrame has incompatiable schema in spark and hive
[ https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yy updated SPARK-25367:
-----------------------
    Fix Version/s:     (was: 2.3.2)
[jira] [Updated] (SPARK-25367) Hive table created by Spark dataFrame has incompatiable schema in spark and hive
[ https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yy updated SPARK-25367:
-----------------------
    Description: 
We save the dataframe object as a hive table in orc/parquet format in the spark shell.
After we modified the column type (int to double) of this table through hive jdbc, the column type queried in spark-shell didn't change, but it did change in hive jdbc. After we restarted the spark-shell, the table's column type still did not match the one shown by hive jdbc.
The steps to reproduce are as follows:
spark-shell:
{code:java}
val df = spark.read.json("examples/src/main/resources/people.json");
df.write.format("orc").saveAsTable("people_test");
spark.sql("desc people_test").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|     age|   bigint|   null|
|    name|   string|   null|
+--------+---------+-------+
{code}
hive:
{code:java}
hive> desc people_test;
OK
age bigint
name string
Time taken: 0.454 seconds, Fetched: 2 row(s)
hive> alter table people_test change column age age1 double;
OK
Time taken: 0.68 seconds
hive> desc people_test;
OK
age1 double
name string
Time taken: 0.358 seconds, Fetched: 2 row(s)
{code}
spark-shell:
{code:java}
spark.catalog.refreshTable("people_test")
spark.sql("desc people_test").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|     age|   bigint|   null|
|    name|   string|   null|
+--------+---------+-------+
{code}
We also tested in spark-shell by creating a table using spark.sql("create table XXX()"); for such tables the modified columns are consistent.

  was:
We save the dataframe object as a hive table in orc/parquet format in the spark shell.
After we modified the column type (int to double) of this table through hive jdbc, the column type queried in spark-shell didn't change, but it did change in hive jdbc. After we restarted the spark-shell, the table's column type still did not match the one shown by hive jdbc.
The steps to reproduce are as follows:
spark-shell:
{code:java}
val df = spark.read.json("examples/src/main/resources/people.json");
df.write.format("orc").saveAsTable("people_test");
spark.catalog.refreshTable("people_test")
spark.sql("desc people_test").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|     age|   bigint|   null|
|    name|   string|   null|
+--------+---------+-------+
{code}
hive:
{code:java}
hive> desc people_test;
OK
age bigint
name string
Time taken: 0.454 seconds, Fetched: 2 row(s)
hive> alter table people_test change column age age1 double;
OK
Time taken: 0.68 seconds
hive> desc people_test;
OK
age1 double
name string
Time taken: 0.358 seconds, Fetched: 2 row(s)
{code}
spark-shell:
{code:java}
spark.sql("desc people_test").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|     age|   bigint|   null|
|    name|   string|   null|
+--------+---------+-------+
{code}
We also tested in spark-shell by creating a table using spark.sql("create table XXX()"); for such tables the modified columns are consistent.
[jira] [Updated] (SPARK-25367) Hive table created by Spark dataFrame has incompatiable schema in spark and hive
[ https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yy updated SPARK-25367:
-----------------------
    Description: 
We save the dataframe object as a hive table in orc/parquet format in the spark shell.
After we modified the column type (int to double) of this table through hive jdbc, the column type queried in spark-shell didn't change, but it did change in hive jdbc. After we restarted the spark-shell, the table's column type still did not match the one shown by hive jdbc.
The steps to reproduce are as follows:
spark-shell:
{code:java}
val df = spark.read.json("examples/src/main/resources/people.json");
df.write.format("orc").saveAsTable("people_test");
spark.catalog.refreshTable("people_test")
spark.sql("desc people_test").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|     age|   bigint|   null|
|    name|   string|   null|
+--------+---------+-------+
{code}
hive:
{code:java}
hive> desc people_test;
OK
age bigint
name string
Time taken: 0.454 seconds, Fetched: 2 row(s)
hive> alter table people_test change column age age1 double;
OK
Time taken: 0.68 seconds
hive> desc people_test;
OK
age1 double
name string
Time taken: 0.358 seconds, Fetched: 2 row(s)
{code}
spark-shell:
{code:java}
spark.sql("desc people_test").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|     age|   bigint|   null|
|    name|   string|   null|
+--------+---------+-------+
{code}
We also tested in spark-shell by creating a table using spark.sql("create table XXX()"); for such tables the modified columns are consistent.

  was:
We save the dataframe object as a hive table in orc/parquet format in the spark shell.
After we modified the column type (int to double) of this table through hive jdbc, the column type queried in spark-shell didn't change, but it did change in hive jdbc. After we restarted the spark-shell, the table's column type still did not match the one shown by hive jdbc.
The steps to reproduce are as follows:
spark-shell:
{code:java}
val df = spark.read.json("examples/src/main/resources/people.json");
df.write.format("orc").saveAsTable("people_test");
spark.catalog.refreshTable("people_test")
spark.sql("desc people").show()
{code}
hive:
{code:java}
alter table people_test change column age age1 double;
desc people_test;
{code}
spark-shell:
{code:java}
spark.sql("desc people").show()
{code}
We also tested in spark-shell by creating a table using spark.sql("create table XXX()"); for such tables the modified columns are consistent.
[jira] [Updated] (SPARK-25367) Hive table created by Spark dataFrame has incompatiable schema in spark and hive
[ https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yy updated SPARK-25367: --- Description: We save the dataframe object as a hive table in orc/parquet format in the spark shell. After we modified the column type (int to double) of this table in hive jdbc, we found the column type queried in spark-shell didn't change, but changed in hive jdbc. After we restarted the spark-shell, this table's column type is still incompatible as showed in hive jdbc. The coding process are as follows: spark-shell: {code:java} val df = spark.read.json("examples/src/main/resources/people.json"); df.write.format("orc").saveAsTable("people_test"); spark.catalog.refreshTable("people_test") spark.sql("desc people").show() {code} hive: {code:java} alter table people_test change column age age1 double; desc people_test;{code} spark-shell: {code:java} spark.sql("desc people").show() {code} We also tested in spark-shell by creating a table using spark.sql("create table XXX()"), the modified columns are consistent. was: We save the dataframe object as a hive table in orc/parquet format in the spark shell. After we modified the column type (int to double) of this table in hive jdbc, we found the column type queried in spark-shell didn't change, but changed in hive jdbc. After we restarted the spark-shell, this table's column type is still incompatible as showed in hive jdbc. The coding process are as follows: spark-shell: val df = spark.read.json("examples/src/main/resources/people.json"); df.write.format("orc").saveAsTable("people_test"); spark.catalog.refreshTable("people_test") spark.sql("desc people").show() hive: alter table people_test change column age age1 double; desc people_test; spark-shell: spark.sql("desc people").show() We also tested in spark-shell by creating a table using spark.sql("create table XXX()"), the modified columns are consistent. 
> Hive table created by Spark dataFrame has incompatiable schema in spark and > hive > > > Key: SPARK-25367 > URL: https://issues.apache.org/jira/browse/SPARK-25367 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1 > Environment: spark2.2.1-hadoop-2.6.0-chd-5.4.2 > hive-1.2.1 >Reporter: yy >Priority: Major > Labels: sparksql > Fix For: 2.3.2 > > > We save the dataframe object as a hive table in orc/parquet format in the > spark shell. > After we modified the column type (int to double) of this table in hive > jdbc, we found the column type queried in spark-shell didn't change, but > changed in hive jdbc. After we restarted the spark-shell, this table's column > type is still incompatible as showed in hive jdbc. > The coding process are as follows: > spark-shell: > {code:java} > val df = spark.read.json("examples/src/main/resources/people.json"); > df.write.format("orc").saveAsTable("people_test"); > spark.catalog.refreshTable("people_test") > spark.sql("desc people").show() > {code} > > hive: > > {code:java} > alter table people_test change column age age1 double; > desc people_test;{code} > spark-shell: > {code:java} > spark.sql("desc people").show() > {code} > > We also tested in spark-shell by creating a table using spark.sql("create > table XXX()"), the modified columns are consistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25367) Hive table created by Spark dataFrame has incompatiable schema in spark and hive
[ https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yy updated SPARK-25367: --- Affects Version/s: 2.2.0 2.2.2 2.3.0 2.3.1 Target Version/s: 2.3.1 Environment: spark2.2.1-hadoop-2.6.0-chd-5.4.2 hive-1.2.1 was: spark2.2.1 hive1.2.1 Fix Version/s: 2.3.2 Component/s: SQL > Hive table created by Spark dataFrame has incompatiable schema in spark and > hive > > > Key: SPARK-25367 > URL: https://issues.apache.org/jira/browse/SPARK-25367 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1 > Environment: spark2.2.1-hadoop-2.6.0-chd-5.4.2 > hive-1.2.1 >Reporter: yy >Priority: Major > Labels: sparksql > Fix For: 2.3.2 > > > We save the dataframe object as a hive table in orc/parquet format in the > spark shell. > After we modified the column type (int to double) of this table in hive > jdbc, we found the column type queried in spark-shell didn't change, but > changed in hive jdbc. After we restarted the spark-shell, this table's column > type is still incompatible as showed in hive jdbc. > The coding process are as follows: > spark-shell: > val df = spark.read.json("examples/src/main/resources/people.json"); > df.write.format("orc").saveAsTable("people_test"); > spark.catalog.refreshTable("people_test") > spark.sql("desc people").show() > hive: > alter table people_test change column age age1 double; > desc people_test; > spark-shell: > spark.sql("desc people").show() > > We also tested in spark-shell by creating a table using spark.sql("create > table XXX()"), the modified columns are consistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25175) Field resolution should fail if there's ambiguity for ORC native reader
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25175: -- Fix Version/s: (was: 3.0.0) > Field resolution should fail if there's ambiguity for ORC native reader > --- > > Key: SPARK-25175 > URL: https://issues.apache.org/jira/browse/SPARK-25175 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chenxiao Mao >Assignee: Chenxiao Mao >Priority: Major > Fix For: 2.4.0 > > > SPARK-25132 adds support for case-insensitive field resolution when reading > from Parquet files. We found ORC files have similar issues, but not identical > to Parquet. Spark has two OrcFileFormat. > * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive > dependency. This hive OrcFileFormat always do case-insensitive field > resolution regardless of case sensitivity mode. When there is ambiguity, hive > OrcFileFormat always returns the first matched field, rather than failing the > reading operation. > * SPARK-20682 adds a new ORC data source inside sql/core. This native > OrcFileFormat supports case-insensitive field resolution, however it cannot > handle duplicate fields. > Besides data source tables, hive serde tables also have issues. If ORC data > file has more fields than table schema, we just can't read hive serde tables. > If ORC data file does not have more fields, hive serde tables always do field > resolution by ordinal, rather than by name. > Both ORC data source hive impl and hive serde table rely on the hive orc > InputFormat/SerDe to read table. I'm not sure whether we can change > underlying hive classes to make all orc read behaviors consistent. > This ticket aims to make read behavior of ORC data source native impl > consistent with Parquet data source. 
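The three resolution behaviors contrasted in the description can be sketched in a few lines. The following is a minimal Python illustration (hypothetical `resolve_field` helper, not the Spark or Hive implementation): case-sensitive exact match, case-insensitive match that fails on ambiguity (the behavior this ticket asks for), and case-insensitive "first match wins" (the hive OrcFileFormat behavior).

```python
def resolve_field(file_fields, name, case_sensitive=True, first_match=False):
    """Resolve a requested column name against the physical file fields."""
    if case_sensitive:
        # Exact match only; a different-cased field is simply not found.
        return name if name in file_fields else None
    matches = [f for f in file_fields if f.lower() == name.lower()]
    if not matches:
        return None
    if len(matches) > 1 and not first_match:
        # The behavior this ticket asks for: fail loudly on ambiguity.
        raise ValueError(f"ambiguous reference: {name!r} matches {matches}")
    # first_match=True silently picks the first field, hive-style.
    return matches[0]

fields = ["Duration", "duration", "name"]

resolve_field(fields, "duration")                    # exact: "duration"
resolve_field(fields, "name", case_sensitive=False)  # unique match: "name"
resolve_field(fields, "duration", case_sensitive=False,
              first_match=True)                      # hive-like: "Duration"
# case_sensitive=False without first_match raises ValueError on "duration"
```

Under this framing, the native OrcFileFormat's inability to "handle duplicate fields" corresponds to the ambiguous branch, and the fix is to raise there rather than return an arbitrary field.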
[jira] [Assigned] (SPARK-25175) Field resolution should fail if there's ambiguity for ORC native reader
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25175: - Assignee: Chenxiao Mao > Field resolution should fail if there's ambiguity for ORC native reader > --- > > Key: SPARK-25175 > URL: https://issues.apache.org/jira/browse/SPARK-25175 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chenxiao Mao >Assignee: Chenxiao Mao >Priority: Major > Fix For: 2.4.0, 3.0.0 > > > SPARK-25132 adds support for case-insensitive field resolution when reading > from Parquet files. We found ORC files have similar issues, but not identical > to Parquet. Spark has two OrcFileFormat. > * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive > dependency. This hive OrcFileFormat always do case-insensitive field > resolution regardless of case sensitivity mode. When there is ambiguity, hive > OrcFileFormat always returns the first matched field, rather than failing the > reading operation. > * SPARK-20682 adds a new ORC data source inside sql/core. This native > OrcFileFormat supports case-insensitive field resolution, however it cannot > handle duplicate fields. > Besides data source tables, hive serde tables also have issues. If ORC data > file has more fields than table schema, we just can't read hive serde tables. > If ORC data file does not have more fields, hive serde tables always do field > resolution by ordinal, rather than by name. > Both ORC data source hive impl and hive serde table rely on the hive orc > InputFormat/SerDe to read table. I'm not sure whether we can change > underlying hive classes to make all orc read behaviors consistent. > This ticket aims to make read behavior of ORC data source native impl > consistent with Parquet data source. 
[jira] [Resolved] (SPARK-25175) Field resolution should fail if there's ambiguity for ORC native reader
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25175. --- Resolution: Fixed Fix Version/s: 2.4.0 3.0.0 Issue resolved by pull request 22262 [https://github.com/apache/spark/pull/22262] > Field resolution should fail if there's ambiguity for ORC native reader > --- > > Key: SPARK-25175 > URL: https://issues.apache.org/jira/browse/SPARK-25175 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chenxiao Mao >Assignee: Chenxiao Mao >Priority: Major > Fix For: 3.0.0, 2.4.0 > > > SPARK-25132 adds support for case-insensitive field resolution when reading > from Parquet files. We found ORC files have similar issues, but not identical > to Parquet. Spark has two OrcFileFormat. > * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive > dependency. This hive OrcFileFormat always do case-insensitive field > resolution regardless of case sensitivity mode. When there is ambiguity, hive > OrcFileFormat always returns the first matched field, rather than failing the > reading operation. > * SPARK-20682 adds a new ORC data source inside sql/core. This native > OrcFileFormat supports case-insensitive field resolution, however it cannot > handle duplicate fields. > Besides data source tables, hive serde tables also have issues. If ORC data > file has more fields than table schema, we just can't read hive serde tables. > If ORC data file does not have more fields, hive serde tables always do field > resolution by ordinal, rather than by name. > Both ORC data source hive impl and hive serde table rely on the hive orc > InputFormat/SerDe to read table. I'm not sure whether we can change > underlying hive classes to make all orc read behaviors consistent. > This ticket aims to make read behavior of ORC data source native impl > consistent with Parquet data source. 
[jira] [Assigned] (SPARK-25388) checkEvaluation may miss incorrect nullable of DataType in the result
[ https://issues.apache.org/jira/browse/SPARK-25388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25388: Assignee: (was: Apache Spark) > checkEvaluation may miss incorrect nullable of DataType in the result > - > > Key: SPARK-25388 > URL: https://issues.apache.org/jira/browse/SPARK-25388 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Priority: Minor > > Current {{checkEvalution}} may miss incorrect nullable of {{DataType}} in > {{checkEvaluationWithUnsafeProjection}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25388) checkEvaluation may miss incorrect nullable of DataType in the result
[ https://issues.apache.org/jira/browse/SPARK-25388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25388: Assignee: Apache Spark > checkEvaluation may miss incorrect nullable of DataType in the result > - > > Key: SPARK-25388 > URL: https://issues.apache.org/jira/browse/SPARK-25388 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark >Priority: Minor > > Current {{checkEvalution}} may miss incorrect nullable of {{DataType}} in > {{checkEvaluationWithUnsafeProjection}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25388) checkEvaluation may miss incorrect nullable of DataType in the result
[ https://issues.apache.org/jira/browse/SPARK-25388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608624#comment-16608624 ] Apache Spark commented on SPARK-25388: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/22375 > checkEvaluation may miss incorrect nullable of DataType in the result > - > > Key: SPARK-25388 > URL: https://issues.apache.org/jira/browse/SPARK-25388 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Priority: Minor > > Current {{checkEvalution}} may miss incorrect nullable of {{DataType}} in > {{checkEvaluationWithUnsafeProjection}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25388) checkEvaluation may miss incorrect nullable of DataType in the result
Kazuaki Ishizaki created SPARK-25388: Summary: checkEvaluation may miss incorrect nullable of DataType in the result Key: SPARK-25388 URL: https://issues.apache.org/jira/browse/SPARK-25388 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 3.0.0 Reporter: Kazuaki Ishizaki Current {{checkEvalution}} may miss incorrect nullable of {{DataType}} in {{checkEvaluationWithUnsafeProjection}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
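The gap described above can be made concrete with a small sketch. Below is a minimal Python illustration (hypothetical `Result` type and check functions, not the Spark test framework) of how a result check that compares only the value can pass even when the result's nullable flag is wrong, which is exactly what a stricter check would catch.

```python
from dataclasses import dataclass

@dataclass
class Result:
    value: object
    data_type: str
    nullable: bool

def check_value_only(actual, expected):
    # The weaker check: ignores nullability, so a wrong flag slips through.
    return (actual.value == expected.value
            and actual.data_type == expected.data_type)

def check_with_nullable(actual, expected):
    # The stricter check: also compares the nullable flag of the DataType.
    return (actual.value == expected.value
            and actual.data_type == expected.data_type
            and actual.nullable == expected.nullable)

expected = Result(42, "int", nullable=False)
actual = Result(42, "int", nullable=True)   # value right, nullability wrong

value_only_passes = check_value_only(actual, expected)      # True: bug missed
strict_passes = check_with_nullable(actual, expected)       # False: bug caught
```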
[jira] [Updated] (SPARK-25366) Zstd and brotli CompressionCodec are not supported for parquet files
[ https://issues.apache.org/jira/browse/SPARK-25366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25366: Summary: Zstd and brotli CompressionCodec are not supported for parquet files (was: Zstd and brotil CompressionCodec are not supported for parquet files) > Zstd and brotli CompressionCodec are not supported for parquet files > - > > Key: SPARK-25366 > URL: https://issues.apache.org/jira/browse/SPARK-25366 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Minor > > Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class > org.apache.hadoop.io.compress.*BrotliCodec* was not found > at > org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235) > at > org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.(CodecFactory.java:142) > at > org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206) > at > org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189) > at > org.apache.parquet.hadoop.ParquetRecordWriter.(ParquetRecordWriter.java:153) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161) > > > > > > Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class > org.apache.hadoop.io.compress.*{color:#33}ZStandardCodec{color}* was not > found > at > org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235) > at > org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.(CodecFactory.java:142) > at > org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206) > at > 
org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189) > at > org.apache.parquet.hadoop.ParquetRecordWriter.(ParquetRecordWriter.java:153) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
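The failure mode in the stack traces above can be sketched as a name-to-class lookup that only fails at write time. Below is a minimal Python sketch (hypothetical tables and availability set, not the parquet-mr implementation) of how a codec name resolves to a Hadoop class name that then turns out not to be loadable.

```python
# Codec name -> Hadoop codec class, loosely following the names in the traces.
CODEC_CLASSES = {
    "snappy": "org.apache.parquet.hadoop.codec.SnappyCodec",
    "gzip": "org.apache.hadoop.io.compress.GzipCodec",
    "brotli": "org.apache.hadoop.io.compress.BrotliCodec",
    "zstd": "org.apache.hadoop.io.compress.ZStandardCodec",
}

# Pretend only these classes are actually on the classpath in this environment.
AVAILABLE_CLASSES = {
    "org.apache.parquet.hadoop.codec.SnappyCodec",
    "org.apache.hadoop.io.compress.GzipCodec",
}

class BadConfigurationException(Exception):
    pass

def get_codec(name):
    cls = CODEC_CLASSES[name]
    if cls not in AVAILABLE_CLASSES:
        # Mirrors: "Class org.apache.hadoop.io.compress.BrotliCodec was not found"
        raise BadConfigurationException(f"Class {cls} was not found")
    return cls

get_codec("gzip")  # resolves fine
# get_codec("zstd") and get_codec("brotli") raise BadConfigurationException
```

The sketch suggests why the error surfaces only when a write is attempted with `zstd` or `brotli`: the codec name itself is accepted by configuration, and the missing class is discovered when the compressor is first constructed.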
[jira] [Commented] (SPARK-25364) a better way to handle vector index and sparsity in FeatureHasher implementation ?
[ https://issues.apache.org/jira/browse/SPARK-25364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608612#comment-16608612 ] Vincent commented on SPARK-25364: - duplication. close this Jira. > a better way to handle vector index and sparsity in FeatureHasher > implementation ? > -- > > Key: SPARK-25364 > URL: https://issues.apache.org/jira/browse/SPARK-25364 > Project: Spark > Issue Type: Question > Components: ML >Affects Versions: 2.3.1 >Reporter: Vincent >Priority: Major > > In the current implementation of FeatureHasher.transform, a simple modulo on > the hashed value is used to determine the vector index, it's suggested to use > a large integer value as the numFeature parameter > we found several issues regarding current implementation: > # Cannot get the feature name back by its index after featureHasher > transform, for example. when getting feature importance from decision tree > training followed by a FeatureHasher > # when index conflict, which is a great chance to happen especially when > 'numFeature' is relatively small, its value would be updated with the sum of > current and old value, ie, the value of the conflicted feature vector would > be change by this module. > # to avoid confliction, we should set the 'numFeature' with a large number, > highly sparse vector increase the computation complexity of model training > we are working on fixing these problems due to our business need, thinking it > might or might not be an issue for others as well, we'd like to hear from the > community. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25364) a better way to handle vector index and sparsity in FeatureHasher implementation ?
[ https://issues.apache.org/jira/browse/SPARK-25364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vincent resolved SPARK-25364. - Resolution: Duplicate > a better way to handle vector index and sparsity in FeatureHasher > implementation ? > -- > > Key: SPARK-25364 > URL: https://issues.apache.org/jira/browse/SPARK-25364 > Project: Spark > Issue Type: Question > Components: ML >Affects Versions: 2.3.1 >Reporter: Vincent >Priority: Major > > In the current implementation of FeatureHasher.transform, a simple modulo on > the hashed value is used to determine the vector index, it's suggested to use > a large integer value as the numFeature parameter > we found several issues regarding current implementation: > # Cannot get the feature name back by its index after featureHasher > transform, for example. when getting feature importance from decision tree > training followed by a FeatureHasher > # when index conflict, which is a great chance to happen especially when > 'numFeature' is relatively small, its value would be updated with the sum of > current and old value, ie, the value of the conflicted feature vector would > be change by this module. > # to avoid confliction, we should set the 'numFeature' with a large number, > highly sparse vector increase the computation complexity of model training > we are working on fixing these problems due to our business need, thinking it > might or might not be an issue for others as well, we'd like to hear from the > community. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
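The three issues listed in the description all follow from the shape of the hashing trick. Below is a minimal Python sketch (hypothetical `hash_features` function; the real FeatureHasher uses MurmurHash3, not CRC32) showing that the index is `hash(name) % num_features`, that the mapping is one-way (the feature name cannot be recovered from the index), and that colliding features have their values summed.

```python
import zlib

def hash_features(features, num_features):
    """features: list of (name, value) pairs; returns a sparse {index: value} map."""
    vec = {}
    for name, value in features:
        # Deterministic stand-in hash; only the index survives, not the name.
        idx = zlib.crc32(name.encode()) % num_features
        # On collision the values are summed, silently corrupting both features.
        vec[idx] = vec.get(idx, 0.0) + value
    return vec

# With num_features=1 every feature collides into index 0 and the values merge:
hash_features([("age", 1.0), ("height", 2.0)], 1)   # {0: 3.0}
```

The trade-off raised in the ticket is visible here: a small `num_features` makes collisions (and thus merged values) likely, while a large `num_features` avoids them at the cost of a very wide, sparse vector.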
[jira] [Updated] (SPARK-25368) Incorrect constraint inference returns wrong result
[ https://issues.apache.org/jira/browse/SPARK-25368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25368: -- Affects Version/s: (was: 2.3.2) 2.3.0 > Incorrect constraint inference returns wrong result > --- > > Key: SPARK-25368 > URL: https://issues.apache.org/jira/browse/SPARK-25368 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Lev Katzav >Assignee: Yuming Wang >Priority: Blocker > Labels: correctness > Fix For: 2.3.2, 2.4.0 > > Attachments: plan.txt > > > there is a breaking change in spark 2.3 (I checked on 2.3.1 and 2.3.2-rc5) > the following code recreates the problem > (it's a bit convoluted examples, I tried to simplify it as much as possible > from my code) > {code:java} > import org.apache.spark.sql.{DataFrame, SQLContext} > import org.apache.spark.sql.expressions.Window > import org.apache.spark.sql.functions._ > import spark.implicits._ > case class Data(a: Option[Int],b: String,c: Option[String],d: String) > val df1 = spark.createDataFrame(Seq( >Data(Some(1), "1", None, "1"), >Data(None, "2", Some("2"), "2") > )) > val df2 = df1 > .where( $"a".isNotNull) > .withColumn("e", lit(null).cast("string")) > val columns = df2.columns.map(c => col(c)) > val df3 = df1 > .select( > $"c", > $"b" as "e" > ) > .withColumn("a", lit(null).cast("int")) > .withColumn("b", lit(null).cast("string")) > .withColumn("d", lit(null).cast("string")) > .select(columns :_*) > val df4 = > df2.union(df3) > .withColumn("e", last(col("e"), ignoreNulls = > true).over(Window.partitionBy($"c").orderBy($"d"))) > .filter($"a".isNotNull) > df4.show > {code} > > notice that the last statement in for df4 is to filter rows where a is null > in spark 2.2.1, the above code prints: > {code:java} > +---+---++---+---+ > | a| b| c| d| e| > +---+---++---+---+ > | 1| 1|null| 1| 1| > +---+---++---+---+ > {code} > in spark 2.3.x, it prints: > {code:java} > +++++---+ > | a| b| c| d| e| > 
+++++---+ > |null|null|null|null| 1| > | 1| 1|null| 1| 1| > |null|null| 2|null| 2| > +++++---+ > {code} > the column a still contains null values > > attached are the plans. > in the parsed logical plan, the filter for isnotnull('a), is on top, > but in the optimized logical plan, it is pushed down -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25368) Incorrect constraint inference returns wrong result
[ https://issues.apache.org/jira/browse/SPARK-25368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25368: Summary: Incorrect constraint inference returns wrong result (was: Incorrect predicate pushdown returns wrong result) > Incorrect constraint inference returns wrong result > --- > > Key: SPARK-25368 > URL: https://issues.apache.org/jira/browse/SPARK-25368 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.3.1, 2.3.2 >Reporter: Lev Katzav >Assignee: Yuming Wang >Priority: Blocker > Labels: correctness > Fix For: 2.3.2, 2.4.0 > > Attachments: plan.txt > > > there is a breaking change in spark 2.3 (I checked on 2.3.1 and 2.3.2-rc5) > the following code recreates the problem > (it's a bit convoluted examples, I tried to simplify it as much as possible > from my code) > {code:java} > import org.apache.spark.sql.{DataFrame, SQLContext} > import org.apache.spark.sql.expressions.Window > import org.apache.spark.sql.functions._ > import spark.implicits._ > case class Data(a: Option[Int],b: String,c: Option[String],d: String) > val df1 = spark.createDataFrame(Seq( >Data(Some(1), "1", None, "1"), >Data(None, "2", Some("2"), "2") > )) > val df2 = df1 > .where( $"a".isNotNull) > .withColumn("e", lit(null).cast("string")) > val columns = df2.columns.map(c => col(c)) > val df3 = df1 > .select( > $"c", > $"b" as "e" > ) > .withColumn("a", lit(null).cast("int")) > .withColumn("b", lit(null).cast("string")) > .withColumn("d", lit(null).cast("string")) > .select(columns :_*) > val df4 = > df2.union(df3) > .withColumn("e", last(col("e"), ignoreNulls = > true).over(Window.partitionBy($"c").orderBy($"d"))) > .filter($"a".isNotNull) > df4.show > {code} > > notice that the last statement in for df4 is to filter rows where a is null > in spark 2.2.1, the above code prints: > {code:java} > +---+---++---+---+ > | a| b| c| d| e| > +---+---++---+---+ > | 1| 1|null| 1| 1| > +---+---++---+---+ > {code} > in spark 
2.3.x, it prints: > {code:java} > +++++---+ > | a| b| c| d| e| > +++++---+ > |null|null|null|null| 1| > | 1| 1|null| 1| 1| > |null|null| 2|null| 2| > +++++---+ > {code} > the column a still contains null values > > attached are the plans. > in the parsed logical plan, the filter for isnotnull('a), is on top, > but in the optimized logical plan, it is pushed down -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
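The bug above comes from the optimizer inferring constraints and moving the final `filter($"a".isNotNull)` relative to the window; as a general illustration of why that filter cannot simply be evaluated below the window without changing results, here is a minimal Python sketch (plain lists of dicts, not Spark plans) where the window ("last non-null `e`") reads a row that the filter later discards.

```python
def fill_last_e(rows):
    # Stand-in for last(e, ignoreNulls=true) over an ordered window:
    # each row's e becomes the last non-null e seen so far.
    out, last = [], None
    for r in rows:
        if r["e"] is not None:
            last = r["e"]
        out.append({**r, "e": last})
    return out

def a_not_null(rows):
    return [r for r in rows if r["a"] is not None]

rows = [
    {"a": None, "e": "x"},   # union branch that only supplies e; a is null
    {"a": 1,    "e": None},
]

# Correct plan: window first, then filter. The null-a row feeds the window
# before being dropped, so the surviving row picks up e = "x".
correct = a_not_null(fill_last_e(rows))   # [{'a': 1, 'e': 'x'}]

# Pushed-down plan: filter first, then window. The null-a row is gone before
# the window runs, so the surviving row's e stays None.
pushed = fill_last_e(a_not_null(rows))    # [{'a': 1, 'e': None}]
```

The two plans disagree, which is why a filter above a window function is only safe to push down when the optimizer can prove the window does not depend on the filtered-out rows.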
[jira] [Assigned] (SPARK-25368) Incorrect predicate pushdown returns wrong result
[ https://issues.apache.org/jira/browse/SPARK-25368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-25368: --- Assignee: Yuming Wang > Incorrect predicate pushdown returns wrong result > - > > Key: SPARK-25368 > URL: https://issues.apache.org/jira/browse/SPARK-25368 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.3.1, 2.3.2 >Reporter: Lev Katzav >Assignee: Yuming Wang >Priority: Blocker > Labels: correctness > Fix For: 2.3.2, 2.4.0 > > Attachments: plan.txt > > > there is a breaking change in spark 2.3 (I checked on 2.3.1 and 2.3.2-rc5) > the following code recreates the problem > (it's a bit convoluted examples, I tried to simplify it as much as possible > from my code) > {code:java} > import org.apache.spark.sql.{DataFrame, SQLContext} > import org.apache.spark.sql.expressions.Window > import org.apache.spark.sql.functions._ > import spark.implicits._ > case class Data(a: Option[Int],b: String,c: Option[String],d: String) > val df1 = spark.createDataFrame(Seq( >Data(Some(1), "1", None, "1"), >Data(None, "2", Some("2"), "2") > )) > val df2 = df1 > .where( $"a".isNotNull) > .withColumn("e", lit(null).cast("string")) > val columns = df2.columns.map(c => col(c)) > val df3 = df1 > .select( > $"c", > $"b" as "e" > ) > .withColumn("a", lit(null).cast("int")) > .withColumn("b", lit(null).cast("string")) > .withColumn("d", lit(null).cast("string")) > .select(columns :_*) > val df4 = > df2.union(df3) > .withColumn("e", last(col("e"), ignoreNulls = > true).over(Window.partitionBy($"c").orderBy($"d"))) > .filter($"a".isNotNull) > df4.show > {code} > > notice that the last statement in for df4 is to filter rows where a is null > in spark 2.2.1, the above code prints: > {code:java} > +---+---++---+---+ > | a| b| c| d| e| > +---+---++---+---+ > | 1| 1|null| 1| 1| > +---+---++---+---+ > {code} > in spark 2.3.x, it prints: > {code:java} > +++++---+ > | a| b| c| d| e| > +++++---+ > 
|null|null|null|null| 1| > | 1| 1|null| 1| 1| > |null|null| 2|null| 2| > +++++---+ > {code} > the column a still contains null values > > attached are the plans. > in the parsed logical plan, the filter for isnotnull('a), is on top, > but in the optimized logical plan, it is pushed down -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25368) Incorrect predicate pushdown returns wrong result
[ https://issues.apache.org/jira/browse/SPARK-25368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25368: Labels: correctness (was: ) > Incorrect predicate pushdown returns wrong result > - > > Key: SPARK-25368 > URL: https://issues.apache.org/jira/browse/SPARK-25368 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.3.1, 2.3.2 >Reporter: Lev Katzav >Priority: Blocker > Labels: correctness > Fix For: 2.3.2, 2.4.0 > > Attachments: plan.txt > > > there is a breaking change in spark 2.3 (I checked on 2.3.1 and 2.3.2-rc5) > the following code recreates the problem > (it's a bit convoluted examples, I tried to simplify it as much as possible > from my code) > {code:java} > import org.apache.spark.sql.{DataFrame, SQLContext} > import org.apache.spark.sql.expressions.Window > import org.apache.spark.sql.functions._ > import spark.implicits._ > case class Data(a: Option[Int],b: String,c: Option[String],d: String) > val df1 = spark.createDataFrame(Seq( >Data(Some(1), "1", None, "1"), >Data(None, "2", Some("2"), "2") > )) > val df2 = df1 > .where( $"a".isNotNull) > .withColumn("e", lit(null).cast("string")) > val columns = df2.columns.map(c => col(c)) > val df3 = df1 > .select( > $"c", > $"b" as "e" > ) > .withColumn("a", lit(null).cast("int")) > .withColumn("b", lit(null).cast("string")) > .withColumn("d", lit(null).cast("string")) > .select(columns :_*) > val df4 = > df2.union(df3) > .withColumn("e", last(col("e"), ignoreNulls = > true).over(Window.partitionBy($"c").orderBy($"d"))) > .filter($"a".isNotNull) > df4.show > {code} > > notice that the last statement in for df4 is to filter rows where a is null > in spark 2.2.1, the above code prints: > {code:java} > +---+---++---+---+ > | a| b| c| d| e| > +---+---++---+---+ > | 1| 1|null| 1| 1| > +---+---++---+---+ > {code} > in spark 2.3.x, it prints: > {code:java} > +++++---+ > | a| b| c| d| e| > +++++---+ > |null|null|null|null| 1| > | 1| 1|null| 1| 
1| > |null|null| 2|null| 2| > +++++---+ > {code} > the column a still contains null values > > attached are the plans. > in the parsed logical plan, the filter for isnotnull('a), is on top, > but in the optimized logical plan, it is pushed down -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25368) Incorrect predicate pushdown returns wrong result
[ https://issues.apache.org/jira/browse/SPARK-25368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25368. - Resolution: Fixed Fix Version/s: 2.4.0, 2.3.2

> Incorrect predicate pushdown returns wrong result
> -------------------------------------------------
>
> Key: SPARK-25368
> URL: https://issues.apache.org/jira/browse/SPARK-25368
> Project: Spark
> Issue Type: Bug
> Components: Optimizer, SQL
> Affects Versions: 2.3.1, 2.3.2
> Reporter: Lev Katzav
> Assignee: Yuming Wang
> Priority: Blocker
> Labels: correctness
> Fix For: 2.3.2, 2.4.0
>
> Attachments: plan.txt
>
> There is a breaking change in Spark 2.3 (checked on 2.3.1 and 2.3.2-rc5). The following code recreates the problem (it is a somewhat convoluted example; I tried to simplify it as much as possible from my code):
> {code:java}
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions._
> import spark.implicits._
>
> case class Data(a: Option[Int], b: String, c: Option[String], d: String)
>
> val df1 = spark.createDataFrame(Seq(
>   Data(Some(1), "1", None, "1"),
>   Data(None, "2", Some("2"), "2")
> ))
>
> val df2 = df1
>   .where($"a".isNotNull)
>   .withColumn("e", lit(null).cast("string"))
>
> val columns = df2.columns.map(c => col(c))
>
> val df3 = df1
>   .select(
>     $"c",
>     $"b" as "e"
>   )
>   .withColumn("a", lit(null).cast("int"))
>   .withColumn("b", lit(null).cast("string"))
>   .withColumn("d", lit(null).cast("string"))
>   .select(columns: _*)
>
> val df4 =
>   df2.union(df3)
>     .withColumn("e", last(col("e"), ignoreNulls = true).over(Window.partitionBy($"c").orderBy($"d")))
>     .filter($"a".isNotNull)
>
> df4.show
> {code}
> Notice that the last statement for df4 filters out rows where a is null.
> In Spark 2.2.1, the above code prints:
> {code:java}
> +---+---+----+---+---+
> |  a|  b|   c|  d|  e|
> +---+---+----+---+---+
> |  1|  1|null|  1|  1|
> +---+---+----+---+---+
> {code}
> In Spark 2.3.x, it prints:
> {code:java}
> +----+----+----+----+---+
> |   a|   b|   c|   d|  e|
> +----+----+----+----+---+
> |null|null|null|null|  1|
> |   1|   1|null|   1|  1|
> |null|null|   2|null|  2|
> +----+----+----+----+---+
> {code}
> Column a still contains null values.
>
> Attached are the plans. In the parsed logical plan, the filter for isnotnull('a) is on top, but in the optimized logical plan it is pushed down.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25387) Malformed CSV causes NPE
[ https://issues.apache.org/jira/browse/SPARK-25387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25387: Assignee: Apache Spark > Malformed CSV causes NPE > > > Key: SPARK-25387 > URL: https://issues.apache.org/jira/browse/SPARK-25387 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > Loading a malformed CSV files or a dataset can cause NullPointerException, > for example the code: > {code:scala} > val schema = StructType(StructField("a", IntegerType) :: Nil) > val input = spark.createDataset(Seq("\u\u\u0001234")) > spark.read.schema(schema).csv(input).collect() > {code} > crashes with the exception: > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:219) > at > org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:210) > at > org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523) > at > org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523) > at > org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:68) > {code} > If schema is not specified, the following exception is thrown: > {code:java} > java.lang.NullPointerException > at > scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192) > at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:192) > at > scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:99) > at > scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.datasources.csv.CSVDataSource.makeSafeHeader(CSVDataSource.scala:109) > at > 
org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:247) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25387) Malformed CSV causes NPE
[ https://issues.apache.org/jira/browse/SPARK-25387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608476#comment-16608476 ] Apache Spark commented on SPARK-25387: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/22374 > Malformed CSV causes NPE > > > Key: SPARK-25387 > URL: https://issues.apache.org/jira/browse/SPARK-25387 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Major > > Loading a malformed CSV files or a dataset can cause NullPointerException, > for example the code: > {code:scala} > val schema = StructType(StructField("a", IntegerType) :: Nil) > val input = spark.createDataset(Seq("\u\u\u0001234")) > spark.read.schema(schema).csv(input).collect() > {code} > crashes with the exception: > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:219) > at > org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:210) > at > org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523) > at > org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523) > at > org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:68) > {code} > If schema is not specified, the following exception is thrown: > {code:java} > java.lang.NullPointerException > at > scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192) > at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:192) > at > scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:99) > at > scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:186) > at > 
org.apache.spark.sql.execution.datasources.csv.CSVDataSource.makeSafeHeader(CSVDataSource.scala:109) > at > org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:247) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25387) Malformed CSV causes NPE
[ https://issues.apache.org/jira/browse/SPARK-25387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25387: Assignee: (was: Apache Spark) > Malformed CSV causes NPE > > > Key: SPARK-25387 > URL: https://issues.apache.org/jira/browse/SPARK-25387 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Major > > Loading a malformed CSV files or a dataset can cause NullPointerException, > for example the code: > {code:scala} > val schema = StructType(StructField("a", IntegerType) :: Nil) > val input = spark.createDataset(Seq("\u\u\u0001234")) > spark.read.schema(schema).csv(input).collect() > {code} > crashes with the exception: > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:219) > at > org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:210) > at > org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523) > at > org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523) > at > org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:68) > {code} > If schema is not specified, the following exception is thrown: > {code:java} > java.lang.NullPointerException > at > scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192) > at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:192) > at > scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:99) > at > scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.datasources.csv.CSVDataSource.makeSafeHeader(CSVDataSource.scala:109) > at > 
org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:247) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25387) Malformed CSV causes NPE
Maxim Gekk created SPARK-25387:
--
Summary: Malformed CSV causes NPE
Key: SPARK-25387
URL: https://issues.apache.org/jira/browse/SPARK-25387
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk

Loading malformed CSV files or a dataset can cause a NullPointerException; for example, the code:
{code:scala}
val schema = StructType(StructField("a", IntegerType) :: Nil)
val input = spark.createDataset(Seq("\u\u\u0001234"))
spark.read.schema(schema).csv(input).collect()
{code}
crashes with the exception:
{code:java}
Caused by: java.lang.NullPointerException
 at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:219)
 at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:210)
 at org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
 at org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
 at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:68)
{code}
If the schema is not specified, the following exception is thrown:
{code:java}
java.lang.NullPointerException
 at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
 at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:192)
 at scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:99)
 at scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:186)
 at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.makeSafeHeader(CSVDataSource.scala:109)
 at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:247)
{code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
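Both stack traces come from code that assumes the underlying parser always hands back a well-formed token array of the expected arity. As a hedged illustration of the defensive pattern a fix needs (using Python's stdlib `csv` module as a stand-in for the univocity parser; the function name and the pad-with-None policy are illustrative, not Spark's actual behavior):

```python
import csv
import io

def parse_rows(text, num_fields):
    """Parse CSV text defensively: instead of assuming the parser always
    yields a token list of the right arity (the unchecked assumption behind
    the NPE above), pad short rows with None and collect empty ones as
    malformed records."""
    rows, malformed = [], []
    for tokens in csv.reader(io.StringIO(text)):
        if not tokens:                      # parser produced nothing for this line
            malformed.append(tokens)
        elif len(tokens) < num_fields:      # short row: pad rather than crash
            rows.append(tokens + [None] * (num_fields - len(tokens)))
        else:
            rows.append(tokens[:num_fields])
    return rows, malformed

rows, malformed = parse_rows("1,2\n\nx\n", 2)
```

The design point is that every branch terminates in an explicit record (good or bad) rather than letting a null or short array reach downstream conversion code.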
[jira] [Commented] (SPARK-25371) Vector Assembler with no input columns leads to opaque error
[ https://issues.apache.org/jira/browse/SPARK-25371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608444#comment-16608444 ] Apache Spark commented on SPARK-25371: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/22373

> Vector Assembler with no input columns leads to opaque error
> ------------------------------------------------------------
>
> Key: SPARK-25371
> URL: https://issues.apache.org/jira/browse/SPARK-25371
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 2.3.0, 2.3.1
> Reporter: Victor Alor
> Priority: Trivial
>
> When `VectorAssembler` is given an empty array as its input columns, it throws an opaque error. In versions before 2.3, `VectorAssembler` simply appended a column containing empty vectors.
> {code:java}
> val inputCols = Array.empty[String]
> val outputCol = "A"
> val vectorAssembler = new VectorAssembler()
>   .setInputCols(inputCols)
>   .setOutputCol(outputCol)
> vectorAssembler.transform(df)
> {code}
> In 2.3, this throws the exception below:
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 'named_struct()' due to data type mismatch: input to function named_struct requires at least one argument;;
> {code}
> whereas in versions before 2.3 it just adds a column containing an empty vector.
> I'm not certain whether this is an intentional choice or an actual bug. If it is a bug, `VectorAssembler` should be modified to append an empty vector column when it detects no inputCols. If it is a design decision, it would be nice to throw a human-readable exception explicitly stating that inputCols must not be empty. The current error is somewhat opaque.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25371) Vector Assembler with no input columns leads to opaque error
[ https://issues.apache.org/jira/browse/SPARK-25371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25371: Assignee: Apache Spark > Vector Assembler with no input columns leads to opaque error > > > Key: SPARK-25371 > URL: https://issues.apache.org/jira/browse/SPARK-25371 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.3.0, 2.3.1 >Reporter: Victor Alor >Assignee: Apache Spark >Priority: Trivial > > When `VectorAssembler ` is given an empty array as its inputColumns it throws > an opaque error. In versions less than 2.3 `VectorAssembler` it simply > appends a column containing empty vectors. > > {code:java} > val inputCols = Array() > val outputCols = Array("A") > val vectorAssembler = new VectorAssembler() > .setInputCols(inputCols) > .setOutputCol(outputCols) > vectorAssmbler.fit(data).transform(df) > {code} > In versions 2.3 > this throws the exception below > {code:java} > org.apache.spark.sql.AnalysisException: cannot resolve 'named_struct()' due > to data type mismatch: input to function named_struct requires at least one > argument;; > {code} > Whereas in versions less than 2.3 it just adds a column containing an empty > vector. > I'm not certain if this is an intentional choice or an actual bug. If this is > a bug, the `VectorAssembler` should be modified to append an empty vector > column if it detects no inputCols. > > If it is a design decision it would be nice to throw a human readable > exception explicitly stating inputColumns must not be empty. The current > error is somewhat opaque. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25371) Vector Assembler with no input columns leads to opaque error
[ https://issues.apache.org/jira/browse/SPARK-25371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25371: Assignee: (was: Apache Spark) > Vector Assembler with no input columns leads to opaque error > > > Key: SPARK-25371 > URL: https://issues.apache.org/jira/browse/SPARK-25371 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.3.0, 2.3.1 >Reporter: Victor Alor >Priority: Trivial > > When `VectorAssembler ` is given an empty array as its inputColumns it throws > an opaque error. In versions less than 2.3 `VectorAssembler` it simply > appends a column containing empty vectors. > > {code:java} > val inputCols = Array() > val outputCols = Array("A") > val vectorAssembler = new VectorAssembler() > .setInputCols(inputCols) > .setOutputCol(outputCols) > vectorAssmbler.fit(data).transform(df) > {code} > In versions 2.3 > this throws the exception below > {code:java} > org.apache.spark.sql.AnalysisException: cannot resolve 'named_struct()' due > to data type mismatch: input to function named_struct requires at least one > argument;; > {code} > Whereas in versions less than 2.3 it just adds a column containing an empty > vector. > I'm not certain if this is an intentional choice or an actual bug. If this is > a bug, the `VectorAssembler` should be modified to append an empty vector > column if it detects no inputCols. > > If it is a design decision it would be nice to throw a human readable > exception explicitly stating inputColumns must not be empty. The current > error is somewhat opaque. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
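The report's closing request is an explicit, human-readable validation of empty input columns. A minimal sketch of that idea in plain Python (hypothetical names; this is not the actual `VectorAssembler` code path, just the validate-early pattern being asked for):

```python
def assemble(rows, input_cols):
    """Concatenate the values of input_cols from each row dict into a
    vector (a plain list here). Validates up front that input_cols is
    non-empty, so the caller gets a readable error instead of a failure
    deep inside expression resolution (the named_struct() error above)."""
    if not input_cols:
        raise ValueError(
            "VectorAssembler-style transform requires at least one input "
            "column; got an empty inputCols list")
    return [[row[c] for c in input_cols] for row in rows]
```

Validating at the API boundary turns the opaque analyzer exception into a message that names the offending parameter directly.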
[jira] [Commented] (SPARK-25385) Upgrade jackson version to 2.7.8
[ https://issues.apache.org/jira/browse/SPARK-25385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608430#comment-16608430 ] Apache Spark commented on SPARK-25385: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/22372

> Upgrade jackson version to 2.7.8
> --------------------------------
>
> Key: SPARK-25385
> URL: https://issues.apache.org/jira/browse/SPARK-25385
> Project: Spark
> Issue Type: Sub-task
> Components: Build
> Affects Versions: 2.4.0
> Reporter: Yuming Wang
> Priority: Major
>
> This upgrade fixes a {{JsonMappingException}}:
> {noformat}
> export SPARK_PREPEND_CLASSES=true
> build/sbt clean package -Phadoop-3.1
> spark-shell
> scala> spark.range(10).write.parquet("/tmp/spark/parquet")
> com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson version: 2.7.8
>  at com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:64)
>  at com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:19)
>  at com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:730)
>  at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
>  at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala){noformat}

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25385) Upgrade jackson version to 2.7.8
[ https://issues.apache.org/jira/browse/SPARK-25385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25385: Assignee: Apache Spark > Upgrade jackson version to 2.7.8 > > > Key: SPARK-25385 > URL: https://issues.apache.org/jira/browse/SPARK-25385 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > This upgrade to fix {{JsonMappingException}}: > {noformat} > export SPARK_PREPEND_CLASSES=true > build/sbt clean package -Phadoop-3.1 > spark-shell > scala> spark.range(10).write.parquet("/tmp/spark/parquet") > com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson > version: 2.7.8 > at > com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:64) > at > com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:19) > at > com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:730) > at > org.apache.spark.rdd.RDDOperationScope$.(RDDOperationScope.scala:82) > at > org.apache.spark.rdd.RDDOperationScope$.(RDDOperationScope.scala){noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25385) Upgrade jackson version to 2.7.8
[ https://issues.apache.org/jira/browse/SPARK-25385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25385: Assignee: (was: Apache Spark) > Upgrade jackson version to 2.7.8 > > > Key: SPARK-25385 > URL: https://issues.apache.org/jira/browse/SPARK-25385 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > This upgrade to fix {{JsonMappingException}}: > {noformat} > export SPARK_PREPEND_CLASSES=true > build/sbt clean package -Phadoop-3.1 > spark-shell > scala> spark.range(10).write.parquet("/tmp/spark/parquet") > com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson > version: 2.7.8 > at > com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:64) > at > com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:19) > at > com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:730) > at > org.apache.spark.rdd.RDDOperationScope$.(RDDOperationScope.scala:82) > at > org.apache.spark.rdd.RDDOperationScope$.(RDDOperationScope.scala){noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25386) Don't need to synchronize the IndexShuffleBlockResolver for each writeIndexFileAndCommit call
[ https://issues.apache.org/jira/browse/SPARK-25386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyang Liu updated SPARK-25386: - Summary: Don't need to synchronize the IndexShuffleBlockResolver for each writeIndexFileAndCommit call (was: Don't need to synchronize the IndexShuffleBlockResolver for each writeIndexFileAndCommit)

> Don't need to synchronize the IndexShuffleBlockResolver for each writeIndexFileAndCommit call
> ---------------------------------------------------------------------------------------------
>
> Key: SPARK-25386
> URL: https://issues.apache.org/jira/browse/SPARK-25386
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.1
> Reporter: Xianyang Liu
> Priority: Major
>
> Currently, we synchronize on the IndexShuffleBlockResolver instance to make the commit check and temp-file rename atomic. This can be improved: we could synchronize on a lock that is distinct for each `shuffleId + mapId` pair, instead of synchronizing on the IndexShuffleBlockResolver for each writeIndexFileAndCommit call.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25386) Don't need to synchronize the IndexShuffleBlockResolver for each writeIndexFileAndCommit call
[ https://issues.apache.org/jira/browse/SPARK-25386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyang Liu updated SPARK-25386: - Description: Now, we need synchronize the instance of IndexShuffleBlockResolver in order to make the commit check and tmp file rename atomically. This can be improved. We could synchronize a lock which is different for each `shuffleId + mapId` instead of synchronize the indexShuffleBlockResolver for each writeIndexFileAndCommit call. (was: Now, we need synchronize the instance of IndexShuffleBlockResolver in order to make the commit check and tmp file rename atomically. This can be improved. We could synchronize a lock which is different for each `shuffleId + mapId` instead of synchronize the indexShuffleBlockResolver for each writeIndexFileAndCommit.) > Don't need to synchronize the IndexShuffleBlockResolver for each > writeIndexFileAndCommit call > - > > Key: SPARK-25386 > URL: https://issues.apache.org/jira/browse/SPARK-25386 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Xianyang Liu >Priority: Major > > Now, we need synchronize the instance of IndexShuffleBlockResolver in order > to make the commit check and tmp file rename atomically. This can be > improved. We could synchronize a lock which is different for each `shuffleId > + mapId` instead of synchronize the indexShuffleBlockResolver for each > writeIndexFileAndCommit call. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
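The per-`shuffleId + mapId` locking the description proposes can be sketched generically as a per-key lock table (a plain-Python `threading` sketch; the class and function names are illustrative, not Spark's actual code):

```python
import threading
from collections import defaultdict

class PerKeyLock:
    """One lock per key (e.g. a (shuffleId, mapId) pair) instead of a single
    coarse lock over the whole resolver: writers for DIFFERENT map outputs no
    longer serialize against each other, while commit-check plus rename for
    the SAME output stays atomic."""
    def __init__(self):
        self._guard = threading.Lock()             # protects the lock table itself
        self._locks = defaultdict(threading.Lock)  # lazily creates per-key locks

    def lock_for(self, key):
        with self._guard:                          # short critical section: table lookup only
            return self._locks[key]

resolver_locks = PerKeyLock()

def write_index_file_and_commit(shuffle_id, map_id, commit):
    # Only callers holding the SAME (shuffle_id, map_id) key contend here.
    with resolver_locks.lock_for((shuffle_id, map_id)):
        commit()
```

One design caveat worth noting: the lock table grows monotonically in this sketch; a production version would need to evict entries for completed shuffles (or use lock striping over a fixed-size array) to bound memory.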
[jira] [Commented] (SPARK-25386) Don't need to synchronize the IndexShuffleBlockResolver for each writeIndexFileAndCommit
[ https://issues.apache.org/jira/browse/SPARK-25386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608427#comment-16608427 ] Apache Spark commented on SPARK-25386: -- User 'ConeyLiu' has created a pull request for this issue: https://github.com/apache/spark/pull/22371 > Don't need to synchronize the IndexShuffleBlockResolver for each > writeIndexFileAndCommit > > > Key: SPARK-25386 > URL: https://issues.apache.org/jira/browse/SPARK-25386 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Xianyang Liu >Priority: Major > > Now, we need synchronize the instance of IndexShuffleBlockResolver in order > to make the commit check and tmp file rename atomically. This can be > improved. We could synchronize a lock which is different for each `shuffleId > + mapId` instead of synchronize the indexShuffleBlockResolver for each > writeIndexFileAndCommit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25386) Don't need to synchronize the IndexShuffleBlockResolver for each writeIndexFileAndCommit
[ https://issues.apache.org/jira/browse/SPARK-25386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25386: Assignee: (was: Apache Spark) > Don't need to synchronize the IndexShuffleBlockResolver for each > writeIndexFileAndCommit > > > Key: SPARK-25386 > URL: https://issues.apache.org/jira/browse/SPARK-25386 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.3.1 > Reporter: Xianyang Liu > Priority: Major > > Currently we synchronize on the IndexShuffleBlockResolver instance to make > the commit check and tmp-file rename atomic. This can be improved: we could > synchronize on a lock that is distinct for each `shuffleId + mapId` instead > of synchronizing on the IndexShuffleBlockResolver for each > writeIndexFileAndCommit.
[jira] [Assigned] (SPARK-25386) Don't need to synchronize the IndexShuffleBlockResolver for each writeIndexFileAndCommit
[ https://issues.apache.org/jira/browse/SPARK-25386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25386: Assignee: Apache Spark > Don't need to synchronize the IndexShuffleBlockResolver for each > writeIndexFileAndCommit > > > Key: SPARK-25386 > URL: https://issues.apache.org/jira/browse/SPARK-25386 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.3.1 > Reporter: Xianyang Liu > Assignee: Apache Spark > Priority: Major > > Currently we synchronize on the IndexShuffleBlockResolver instance to make > the commit check and tmp-file rename atomic. This can be improved: we could > synchronize on a lock that is distinct for each `shuffleId + mapId` instead > of synchronizing on the IndexShuffleBlockResolver for each > writeIndexFileAndCommit.
[jira] [Commented] (SPARK-25386) Don't need to synchronize the IndexShuffleBlockResolver for each writeIndexFileAndCommit
[ https://issues.apache.org/jira/browse/SPARK-25386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608426#comment-16608426 ] Apache Spark commented on SPARK-25386: -- User 'ConeyLiu' has created a pull request for this issue: https://github.com/apache/spark/pull/22371 > Don't need to synchronize the IndexShuffleBlockResolver for each > writeIndexFileAndCommit > > > Key: SPARK-25386 > URL: https://issues.apache.org/jira/browse/SPARK-25386 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.3.1 > Reporter: Xianyang Liu > Priority: Major > > Currently we synchronize on the IndexShuffleBlockResolver instance to make > the commit check and tmp-file rename atomic. This can be improved: we could > synchronize on a lock that is distinct for each `shuffleId + mapId` instead > of synchronizing on the IndexShuffleBlockResolver for each > writeIndexFileAndCommit.
[jira] [Updated] (SPARK-25385) Upgrade jackson version to 2.7.8
[ https://issues.apache.org/jira/browse/SPARK-25385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-25385: Summary: Upgrade jackson version to 2.7.8 (was: Upgrade fasterxml.jackson.databind.version to 2.7.8) > Upgrade jackson version to 2.7.8 > > > Key: SPARK-25385 > URL: https://issues.apache.org/jira/browse/SPARK-25385 > Project: Spark > Issue Type: Sub-task > Components: Build > Affects Versions: 2.4.0 > Reporter: Yuming Wang > Priority: Major > > This upgrade fixes a {{JsonMappingException}}: > {noformat} > export SPARK_PREPEND_CLASSES=true > build/sbt clean package -Phadoop-3.1 > spark-shell > scala> spark.range(10).write.parquet("/tmp/spark/parquet") > com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson > version: 2.7.8 > at > com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:64) > at > com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:19) > at > com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:730) > at > org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82) > at > org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala){noformat}
[jira] [Created] (SPARK-25386) Don't need to synchronize the IndexShuffleBlockResolver for each writeIndexFileAndCommit
Xianyang Liu created SPARK-25386: Summary: Don't need to synchronize the IndexShuffleBlockResolver for each writeIndexFileAndCommit Key: SPARK-25386 URL: https://issues.apache.org/jira/browse/SPARK-25386 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.1 Reporter: Xianyang Liu Currently we synchronize on the IndexShuffleBlockResolver instance to make the commit check and tmp-file rename atomic. This can be improved: we could synchronize on a lock that is distinct for each `shuffleId + mapId` instead of synchronizing on the IndexShuffleBlockResolver for each writeIndexFileAndCommit.
[jira] [Created] (SPARK-25385) Upgrade fasterxml.jackson.databind.version to 2.7.8
Yuming Wang created SPARK-25385: --- Summary: Upgrade fasterxml.jackson.databind.version to 2.7.8 Key: SPARK-25385 URL: https://issues.apache.org/jira/browse/SPARK-25385 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 2.4.0 Reporter: Yuming Wang This upgrade fixes a {{JsonMappingException}}: {noformat} export SPARK_PREPEND_CLASSES=true build/sbt clean package -Phadoop-3.1 spark-shell scala> spark.range(10).write.parquet("/tmp/spark/parquet") com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson version: 2.7.8 at com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:64) at com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:19) at com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:730) at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82) at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala){noformat}
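The exception above is thrown by jackson-module-scala's setup check when jackson-databind on the classpath is older than the Scala module expects; the fix amounts to pinning the databind version property in the build. A hypothetical Maven fragment (the property name is taken from this issue's title; the exact location in Spark's poms or the hadoop-3.1 profile may differ):

```xml
<properties>
  <!-- Pin jackson-databind so it matches the Jackson Scala module's required version -->
  <fasterxml.jackson.databind.version>2.7.8</fasterxml.jackson.databind.version>
</properties>
```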