[jira] [Assigned] (SPARK-25391) Make behaviors consistent when converting parquet hive table to parquet data source

2018-09-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25391:


Assignee: Apache Spark

> Make behaviors consistent when converting parquet hive table to parquet data 
> source
> ---
>
> Key: SPARK-25391
> URL: https://issues.apache.org/jira/browse/SPARK-25391
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Assignee: Apache Spark
>Priority: Major
>
> Parquet data source tables and Hive parquet tables resolve parquet fields 
> differently, so when {{spark.sql.hive.convertMetastoreParquet}} is true, users 
> may see inconsistent behavior. The differences are:
>  * Whether {{spark.sql.caseSensitive}} is respected. Without SPARK-25132, 
> neither data source tables nor Hive tables respect {{spark.sql.caseSensitive}}: 
> data source tables always resolve parquet fields case-sensitively, while Hive 
> tables always resolve them case-insensitively, regardless of whether 
> {{spark.sql.caseSensitive}} is set to true or false. SPARK-25132 makes data 
> source tables respect {{spark.sql.caseSensitive}}, while Hive serde table 
> behavior is unchanged.
>  * How ambiguity is resolved in case-insensitive mode. Without SPARK-25132, 
> data source tables resolve case-sensitively and return columns with the 
> matching letter case, while Hive tables always return the first matched 
> column, ignoring case. SPARK-25132 makes data source tables throw an exception 
> when there is ambiguity, while Hive table behavior is unchanged.
> This ticket aims to make the behaviors consistent when converting a Hive table 
> to a data source table.
>  * The behavior must be consistent for the conversion to be done at all, so we 
> skip the conversion in case-sensitive mode, because a Hive parquet table 
> always does case-insensitive field resolution.
>  * In case-insensitive mode, when converting a Hive parquet table to the 
> parquet data source, we switch the duplicated-field resolution mode so that 
> the parquet data source picks the first matched field - the same behavior as a 
> Hive parquet table - to keep the behaviors consistent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25391) Make behaviors consistent when converting parquet hive table to parquet data source

2018-09-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25391:


Assignee: (was: Apache Spark)

> Make behaviors consistent when converting parquet hive table to parquet data 
> source
> ---
>
> Key: SPARK-25391
> URL: https://issues.apache.org/jira/browse/SPARK-25391
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
>
> Parquet data source tables and Hive parquet tables resolve parquet fields 
> differently, so when {{spark.sql.hive.convertMetastoreParquet}} is true, users 
> may see inconsistent behavior. The differences are:
>  * Whether {{spark.sql.caseSensitive}} is respected. Without SPARK-25132, 
> neither data source tables nor Hive tables respect {{spark.sql.caseSensitive}}: 
> data source tables always resolve parquet fields case-sensitively, while Hive 
> tables always resolve them case-insensitively, regardless of whether 
> {{spark.sql.caseSensitive}} is set to true or false. SPARK-25132 makes data 
> source tables respect {{spark.sql.caseSensitive}}, while Hive serde table 
> behavior is unchanged.
>  * How ambiguity is resolved in case-insensitive mode. Without SPARK-25132, 
> data source tables resolve case-sensitively and return columns with the 
> matching letter case, while Hive tables always return the first matched 
> column, ignoring case. SPARK-25132 makes data source tables throw an exception 
> when there is ambiguity, while Hive table behavior is unchanged.
> This ticket aims to make the behaviors consistent when converting a Hive table 
> to a data source table.
>  * The behavior must be consistent for the conversion to be done at all, so we 
> skip the conversion in case-sensitive mode, because a Hive parquet table 
> always does case-insensitive field resolution.
>  * In case-insensitive mode, when converting a Hive parquet table to the 
> parquet data source, we switch the duplicated-field resolution mode so that 
> the parquet data source picks the first matched field - the same behavior as a 
> Hive parquet table - to keep the behaviors consistent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25391) Make behaviors consistent when converting parquet hive table to parquet data source

2018-09-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608750#comment-16608750
 ] 

Apache Spark commented on SPARK-25391:
--

User 'seancxmao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22343

> Make behaviors consistent when converting parquet hive table to parquet data 
> source
> ---
>
> Key: SPARK-25391
> URL: https://issues.apache.org/jira/browse/SPARK-25391
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
>
> Parquet data source tables and Hive parquet tables resolve parquet fields 
> differently, so when {{spark.sql.hive.convertMetastoreParquet}} is true, users 
> may see inconsistent behavior. The differences are:
>  * Whether {{spark.sql.caseSensitive}} is respected. Without SPARK-25132, 
> neither data source tables nor Hive tables respect {{spark.sql.caseSensitive}}: 
> data source tables always resolve parquet fields case-sensitively, while Hive 
> tables always resolve them case-insensitively, regardless of whether 
> {{spark.sql.caseSensitive}} is set to true or false. SPARK-25132 makes data 
> source tables respect {{spark.sql.caseSensitive}}, while Hive serde table 
> behavior is unchanged.
>  * How ambiguity is resolved in case-insensitive mode. Without SPARK-25132, 
> data source tables resolve case-sensitively and return columns with the 
> matching letter case, while Hive tables always return the first matched 
> column, ignoring case. SPARK-25132 makes data source tables throw an exception 
> when there is ambiguity, while Hive table behavior is unchanged.
> This ticket aims to make the behaviors consistent when converting a Hive table 
> to a data source table.
>  * The behavior must be consistent for the conversion to be done at all, so we 
> skip the conversion in case-sensitive mode, because a Hive parquet table 
> always does case-insensitive field resolution.
>  * In case-insensitive mode, when converting a Hive parquet table to the 
> parquet data source, we switch the duplicated-field resolution mode so that 
> the parquet data source picks the first matched field - the same behavior as a 
> Hive parquet table - to keep the behaviors consistent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25391) Make behaviors consistent when converting parquet hive table to parquet data source

2018-09-09 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-25391:


 Summary: Make behaviors consistent when converting parquet hive 
table to parquet data source
 Key: SPARK-25391
 URL: https://issues.apache.org/jira/browse/SPARK-25391
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Chenxiao Mao


Parquet data source tables and Hive parquet tables resolve parquet fields 
differently, so when {{spark.sql.hive.convertMetastoreParquet}} is true, users 
may see inconsistent behavior. The differences are:
 * Whether {{spark.sql.caseSensitive}} is respected. Without SPARK-25132, 
neither data source tables nor Hive tables respect {{spark.sql.caseSensitive}}: 
data source tables always resolve parquet fields case-sensitively, while Hive 
tables always resolve them case-insensitively, regardless of whether 
{{spark.sql.caseSensitive}} is set to true or false. SPARK-25132 makes data 
source tables respect {{spark.sql.caseSensitive}}, while Hive serde table 
behavior is unchanged.
 * How ambiguity is resolved in case-insensitive mode. Without SPARK-25132, 
data source tables resolve case-sensitively and return columns with the 
matching letter case, while Hive tables always return the first matched column, 
ignoring case. SPARK-25132 makes data source tables throw an exception when 
there is ambiguity, while Hive table behavior is unchanged.

This ticket aims to make the behaviors consistent when converting a Hive table 
to a data source table.
 * The behavior must be consistent for the conversion to be done at all, so we 
skip the conversion in case-sensitive mode, because a Hive parquet table always 
does case-insensitive field resolution.
 * In case-insensitive mode, when converting a Hive parquet table to the 
parquet data source, we switch the duplicated-field resolution mode so that the 
parquet data source picks the first matched field - the same behavior as a Hive 
parquet table - to keep the behaviors consistent (see the sketch below).
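
A hedged repro sketch of the inconsistency described above - the path, table 
name, and data are invented for illustration, and it assumes a Hive-enabled 
spark-shell session:
{code:java}
// Illustrative only; the behavior comments restate the description above
// rather than verified output.

// Produce a parquet file whose physical schema has two fields differing only
// by case (possible once case-sensitive analysis is enabled).
spark.conf.set("spark.sql.caseSensitive", "true")
Seq((1, 2)).toDF("id", "ID").write.mode("overwrite").parquet("/tmp/spark25391")

// A Hive parquet table with a single lower-case column over that location.
spark.sql("""CREATE EXTERNAL TABLE t (id INT) STORED AS PARQUET
             LOCATION '/tmp/spark25391'""")
spark.conf.set("spark.sql.caseSensitive", "false")

// Hive serde read path: always case-insensitive, silently picks the first match.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.sql("SELECT id FROM t").show()

// Converted (parquet data source) read path: resolves case-sensitively before
// SPARK-25132, or throws on the ambiguous match after it -- the inconsistency
// this ticket removes by also picking the first matched field.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
spark.sql("SELECT id FROM t").show()
{code}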



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25390) finalize the abstraction of data source V2 API

2018-09-09 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-25390:
---

 Summary: finalize the abstraction of data source V2 API
 Key: SPARK-25390
 URL: https://issues.apache.org/jira/browse/SPARK-25390
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan


Currently it is not very clear how the data source v2 API should be abstracted. 
The abstraction should either be unified between batch and streaming, or be 
similar with a well-defined difference between the two. It should also cover 
catalog/table.

An example of the abstraction:
{code}
batch: catalog -> table -> scan
streaming: catalog -> table -> stream -> scan
{code}
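
A minimal Scala sketch of that layering; every trait and method name below is 
an illustrative assumption, not the finalized v2 API:
{code:java}
trait Scan                                   // one concrete read of the data
trait ScanBuilder { def build(): Scan }
trait Stream { def latestScan(): Scan }      // a stream yields scans over time
trait StreamBuilder { def build(): Stream }

trait Table {
  def newScanBuilder(): ScanBuilder          // batch:     table -> scan
  def newStreamBuilder(): StreamBuilder      // streaming: table -> stream -> scan
}

trait Catalog {
  def loadTable(name: String): Table         // catalog -> table
}
{code}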



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25389) INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields

2018-09-09 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25389:
--
Description: 
Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY 
STORED AS` should not generate files with duplicate fields because Spark cannot 
read those files.

*INSERT OVERWRITE DIRECTORY USING*
{code}
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet 
SELECT 'id', 'id2' id")
... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ...
org.apache.spark.sql.AnalysisException: Found duplicate column(s) when 
inserting into file:/tmp/parquet: `id`;
{code}

*INSERT OVERWRITE DIRECTORY STORED AS*
{code}
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS parquet 
SELECT 'id', 'id2' id")

scala> spark.read.parquet("/tmp/parquet").show
18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data schema 
and the partition schema: `id`;
{code}

  was:
Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY 
STORED AS` should not generate files with duplicate fields because Spark cannot 
read those files.

*INSERT OVERWRITE DIRECTORY USING*
{code}
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet 
SELECT 'id', 'id2' id")
18/09/09 22:11:29 ERROR InsertIntoDataSourceDirCommand: Failed to write to 
directory Some(file:///tmp/parquet)
org.apache.spark.sql.AnalysisException: Found duplicate column(s) when 
inserting into file:/tmp/parquet: `id`;
{code}

*INSERT OVERWRITE DIRECTORY STORED AS*
{code}
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS parquet 
SELECT 'id', 'id2' id")

scala> spark.read.parquet("/tmp/parquet").show
18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data schema 
and the partition schema: `id`;
{code}


> INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields
> 
>
> Key: SPARK-25389
> URL: https://issues.apache.org/jira/browse/SPARK-25389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY 
> STORED AS` should not generate files with duplicate fields because Spark 
> cannot read those files.
> *INSERT OVERWRITE DIRECTORY USING*
> {code}
> scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet 
> SELECT 'id', 'id2' id")
> ... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ...
> org.apache.spark.sql.AnalysisException: Found duplicate column(s) when 
> inserting into file:/tmp/parquet: `id`;
> {code}
> *INSERT OVERWRITE DIRECTORY STORED AS*
> {code}
> scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS 
> parquet SELECT 'id', 'id2' id")
> scala> spark.read.parquet("/tmp/parquet").show
> 18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data 
> schema and the partition schema: `id`;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25389) INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields

2018-09-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25389:


Assignee: (was: Apache Spark)

> INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields
> 
>
> Key: SPARK-25389
> URL: https://issues.apache.org/jira/browse/SPARK-25389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY 
> STORED AS` should not generate files with duplicate fields because Spark 
> cannot read those files.
> *INSERT OVERWRITE DIRECTORY USING*
> {code}
> scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet 
> SELECT 'id', 'id2' id")
> ... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ...
> org.apache.spark.sql.AnalysisException: Found duplicate column(s) when 
> inserting into file:/tmp/parquet: `id`;
> {code}
> *INSERT OVERWRITE DIRECTORY STORED AS*
> {code}
> scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS 
> parquet SELECT 'id', 'id2' id")
> scala> spark.read.parquet("/tmp/parquet").show
> 18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data 
> schema and the partition schema: `id`;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25389) INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields

2018-09-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608741#comment-16608741
 ] 

Apache Spark commented on SPARK-25389:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22378

> INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields
> 
>
> Key: SPARK-25389
> URL: https://issues.apache.org/jira/browse/SPARK-25389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY 
> STORED AS` should not generate files with duplicate fields because Spark 
> cannot read those files.
> *INSERT OVERWRITE DIRECTORY USING*
> {code}
> scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet 
> SELECT 'id', 'id2' id")
> ... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ...
> org.apache.spark.sql.AnalysisException: Found duplicate column(s) when 
> inserting into file:/tmp/parquet: `id`;
> {code}
> *INSERT OVERWRITE DIRECTORY STORED AS*
> {code}
> scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS 
> parquet SELECT 'id', 'id2' id")
> scala> spark.read.parquet("/tmp/parquet").show
> 18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data 
> schema and the partition schema: `id`;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25389) INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields

2018-09-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25389:


Assignee: Apache Spark

> INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields
> 
>
> Key: SPARK-25389
> URL: https://issues.apache.org/jira/browse/SPARK-25389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY 
> STORED AS` should not generate files with duplicate fields because Spark 
> cannot read those files.
> *INSERT OVERWRITE DIRECTORY USING*
> {code}
> scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet 
> SELECT 'id', 'id2' id")
> ... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ...
> org.apache.spark.sql.AnalysisException: Found duplicate column(s) when 
> inserting into file:/tmp/parquet: `id`;
> {code}
> *INSERT OVERWRITE DIRECTORY STORED AS*
> {code}
> scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS 
> parquet SELECT 'id', 'id2' id")
> scala> spark.read.parquet("/tmp/parquet").show
> 18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data 
> schema and the partition schema: `id`;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25389) INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields

2018-09-09 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25389:
--
Summary: INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate 
fields  (was: INSERT OVERWRITE DIRECTORY STORED AS should not generate files 
with duplicate fields)

> INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields
> 
>
> Key: SPARK-25389
> URL: https://issues.apache.org/jira/browse/SPARK-25389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY 
> STORED AS` should not generate files with duplicate fields because Spark 
> cannot read those files.
> *INSERT OVERWRITE DIRECTORY USING*
> {code}
> scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet 
> SELECT 'id', 'id2' id")
> 18/09/09 22:11:29 ERROR InsertIntoDataSourceDirCommand: Failed to write to 
> directory Some(file:///tmp/parquet)
> org.apache.spark.sql.AnalysisException: Found duplicate column(s) when 
> inserting into file:/tmp/parquet: `id`;
> {code}
> *INSERT OVERWRITE DIRECTORY STORED AS*
> {code}
> scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS 
> parquet SELECT 'id', 'id2' id")
> scala> spark.read.parquet("/tmp/parquet").show
> 18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data 
> schema and the partition schema: `id`;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25389) INSERT OVERWRITE DIRECTORY STORED AS should not generate files with duplicate fields

2018-09-09 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-25389:
-

 Summary: INSERT OVERWRITE DIRECTORY STORED AS should not generate 
files with duplicate fields
 Key: SPARK-25389
 URL: https://issues.apache.org/jira/browse/SPARK-25389
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1, 2.3.0
Reporter: Dongjoon Hyun


Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY 
STORED AS` should not generate files with duplicate fields because Spark cannot 
read those files.

*INSERT OVERWRITE DIRECTORY USING*
{code}
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet 
SELECT 'id', 'id2' id")
18/09/09 22:11:29 ERROR InsertIntoDataSourceDirCommand: Failed to write to 
directory Some(file:///tmp/parquet)
org.apache.spark.sql.AnalysisException: Found duplicate column(s) when 
inserting into file:/tmp/parquet: `id`;
{code}

*INSERT OVERWRITE DIRECTORY STORED AS*
{code}
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS parquet 
SELECT 'id', 'id2' id")

scala> spark.read.parquet("/tmp/parquet").show
18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data schema 
and the partition schema: `id`;
{code}
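
Presumably the fix is to run the same duplicate-column validation on the 
STORED AS path that the USING path already performs before writing. A rough, 
hypothetical sketch of such a check (the helper name and message are 
assumptions modeled on the error above):
{code:java}
// Hypothetical validation sketch -- not the actual Spark implementation, which
// reports the problem as an AnalysisException like the USING output above.
def assertNoDuplicateFields(fieldNames: Seq[String], caseSensitive: Boolean): Unit = {
  val normalized = if (caseSensitive) fieldNames else fieldNames.map(_.toLowerCase)
  val dups = normalized.groupBy(identity).collect { case (n, vs) if vs.size > 1 => n }
  if (dups.nonEmpty) {
    throw new IllegalArgumentException(
      s"Found duplicate column(s): ${dups.map(n => s"`$n`").mkString(", ")}")
  }
}

// The queries above select 'id' and 'id2' AS id, i.e. two columns named `id`:
assertNoDuplicateFields(Seq("id", "id"), caseSensitive = false)  // throws
{code}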



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24911) SHOW CREATE TABLE drops escaping of nested column names

2018-09-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608701#comment-16608701
 ] 

Apache Spark commented on SPARK-24911:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/22377

> SHOW CREATE TABLE drops escaping of nested column names
> ---
>
> Key: SPARK-24911
> URL: https://issues.apache.org/jira/browse/SPARK-24911
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> Create a table with quoted nested column - *`b`*:
> {code:sql}
> create table `test` (`a` STRUCT<`b`:STRING>);
> {code}
> and show how the table was created:
> {code:sql}
> SHOW CREATE TABLE `test`
> {code}
> {code}
> CREATE TABLE `test`(`a` struct<b:string>)
> {code}
> The column *b* becomes unquoted.
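
For reference, Spark already has two string renderings of the same nested 
schema that differ exactly in this quoting; presumably SHOW CREATE TABLE should 
emit the quoted form. A small spark-shell sketch:
{code:java}
import org.apache.spark.sql.types._

// The nested schema from the example above, rendered two ways.
val a = StructType(Seq(StructField("b", StringType)))
a.catalogString  // struct<b:string>     -- unquoted, like the output shown above
a.sql            // STRUCT<`b`: STRING>  -- quoted form that preserves the backticks
{code}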



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24911) SHOW CREATE TABLE drops escaping of nested column names

2018-09-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608697#comment-16608697
 ] 

Apache Spark commented on SPARK-24911:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/22377

> SHOW CREATE TABLE drops escaping of nested column names
> ---
>
> Key: SPARK-24911
> URL: https://issues.apache.org/jira/browse/SPARK-24911
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> Create a table with quoted nested column - *`b`*:
> {code:sql}
> create table `test` (`a` STRUCT<`b`:STRING>);
> {code}
> and show how the table was created:
> {code:sql}
> SHOW CREATE TABLE `test`
> {code}
> {code}
> CREATE TABLE `test`(`a` struct<b:string>)
> {code}
> The column *b* becomes unquoted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24911) SHOW CREATE TABLE drops escaping of nested column names

2018-09-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608698#comment-16608698
 ] 

Apache Spark commented on SPARK-24911:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/22377

> SHOW CREATE TABLE drops escaping of nested column names
> ---
>
> Key: SPARK-24911
> URL: https://issues.apache.org/jira/browse/SPARK-24911
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> Create a table with quoted nested column - *`b`*:
> {code:sql}
> create table `test` (`a` STRUCT<`b`:STRING>);
> {code}
> and show how the table was created:
> {code:sql}
> SHOW CREATE TABLE `test`
> {code}
> {code}
> CREATE TABLE `test`(`a` struct<b:string>)
> {code}
> The column *b* becomes unquoted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24911) SHOW CREATE TABLE drops escaping of nested column names

2018-09-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608699#comment-16608699
 ] 

Apache Spark commented on SPARK-24911:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/22377

> SHOW CREATE TABLE drops escaping of nested column names
> ---
>
> Key: SPARK-24911
> URL: https://issues.apache.org/jira/browse/SPARK-24911
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> Create a table with quoted nested column - *`b`*:
> {code:sql}
> create table `test` (`a` STRUCT<`b`:STRING>);
> {code}
> and show how the table was created:
> {code:sql}
> SHOW CREATE TABLE `test`
> {code}
> {code}
> CREATE TABLE `test`(`a` struct<b:string>)
> {code}
> The column *b* becomes unquoted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24849) Convert StructType to DDL string

2018-09-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608696#comment-16608696
 ] 

Apache Spark commented on SPARK-24849:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/22377

> Convert StructType to DDL string
> 
>
> Key: SPARK-24849
> URL: https://issues.apache.org/jira/browse/SPARK-24849
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> We need to add new methods that convert a StructType value to a schema string 
> in DDL format. It should be possible to create a new table simply by 
> copy-pasting the new method's result. The existing methods simpleString(), 
> catalogString() and sql() put ':' between a top-level field name and its type, 
> and wrap the result in the *struct* keyword:
> {code}
> ds.schema.catalogString
> struct<metaData:...>{code}
> The output of the new method should instead be:
> {code}
> metaData struct<...>{code}
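
A rough sketch of what such a conversion could look like, written out inline 
here (the real method name, and the nested field used for illustration, may 
differ from what the change finally adopts):
{code:java}
import org.apache.spark.sql.types._

// Hand-rolled StructType -> DDL-string conversion; DataType.sql already yields
// SQL-style types, so only the top level needs the `name TYPE` form.
def toDDL(schema: StructType): String =
  schema.fields.map(f => s"`${f.name}` ${f.dataType.sql}").mkString(",")

val schema = StructType(Seq(
  StructField("metaData", StructType(Seq(StructField("eventId", StringType))))))
schema.catalogString  // struct<metaData:struct<eventId:string>>
toDDL(schema)         // `metaData` STRUCT<`eventId`: STRING>
{code}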



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24849) Convert StructType to DDL string

2018-09-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608695#comment-16608695
 ] 

Apache Spark commented on SPARK-24849:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/22377

> Convert StructType to DDL string
> 
>
> Key: SPARK-24849
> URL: https://issues.apache.org/jira/browse/SPARK-24849
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> We need to add new methods that convert a StructType value to a schema string 
> in DDL format. It should be possible to create a new table simply by 
> copy-pasting the new method's result. The existing methods simpleString(), 
> catalogString() and sql() put ':' between a top-level field name and its type, 
> and wrap the result in the *struct* keyword:
> {code}
> ds.schema.catalogString
> struct<metaData:...>{code}
> The output of the new method should instead be:
> {code}
> metaData struct<...>{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20918) Use FunctionIdentifier as function identifiers in FunctionRegistry

2018-09-09 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-20918:

Description: 
Currently, the unquoted string of a function identifier is used as the 
function identifier in the function registry. This can cause incorrect 
behavior when users use `.` in function names.

As an example, Spark can resolve a function call like this:
{code}
SELECT `d100.udf100`(`emp`.`name`) FROM `emp`;
{code}

Although the function name is wrapped in backticks, Spark still resolves it 
as database name + function name, which is wrong.

  was:Currently, the unquoted string of a function identifier is used as the 
function identifier in the function registry. This can cause incorrect behavior 
when users use `.` in function names.


> Use FunctionIdentifier as function identifiers in FunctionRegistry
> --
>
> Key: SPARK-20918
> URL: https://issues.apache.org/jira/browse/SPARK-20918
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, the unquoted string of a function identifier is used as the 
> function identifier in the function registry. This can cause incorrect 
> behavior when users use `.` in function names.
> As an example, Spark can resolve a function call like this:
> {code}
> SELECT `d100.udf100`(`emp`.`name`) FROM `emp`;
> {code}
> Although the function name is wrapped with backticks, Spark still resolves it 
> as database name + function name, which is wrong.
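
A small spark-shell sketch of why the raw string key is ambiguous while a 
FunctionIdentifier is not (the function and database names are the hypothetical 
ones from the example above):
{code:java}
import org.apache.spark.sql.catalyst.FunctionIdentifier

// Two different functions that collapse to the same unquoted string.
val literalDot = FunctionIdentifier("d100.udf100", None)     // a function literally named `d100.udf100`
val qualified  = FunctionIdentifier("udf100", Some("d100"))  // function udf100 in database d100

literalDot == qualified                                // false: distinct identifiers
literalDot.unquotedString == qualified.unquotedString  // true: the string form conflates them
{code}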



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20918) Use FunctionIdentifier as function identifiers in FunctionRegistry

2018-09-09 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-20918:

Issue Type: Bug  (was: Improvement)

> Use FunctionIdentifier as function identifiers in FunctionRegistry
> --
>
> Key: SPARK-20918
> URL: https://issues.apache.org/jira/browse/SPARK-20918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, the unquoted string of a function identifier is used as the 
> function identifier in the function registry. This can cause incorrect 
> behavior when users use `.` in function names.
> As an example, Spark can resolve a function call like this:
> {code}
> SELECT `d100.udf100`(`emp`.`name`) FROM `emp`;
> {code}
> Although the function name is wrapped with backticks, Spark still resolves it 
> as database name + function name, which is wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25367) The column attributes obtained by Spark sql are inconsistent with hive

2018-09-09 Thread yy (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yy updated SPARK-25367:
---
Description: 
We saved a dataframe as a Hive table in orc/parquet format in the spark shell.
After we modified the column type (bigint to double) of this table through hive 
jdbc, we found that the column type queried in spark-shell did not change, 
although it did change in hive jdbc. Even after restarting the spark-shell, the 
column type shown in spark-shell still disagrees with what hive jdbc shows.

The steps to reproduce are as follows:

spark-shell:
{code:java}
val df = spark.read.json("examples/src/main/resources/people.json");
df.write.format("orc").saveAsTable("people_test");
spark.sql("desc people_test").show()

+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|     age|   bigint|   null|
|    name|   string|   null|
+--------+---------+-------+
{code}
hive:
{code:java}
hive> desc people_test;
OK
age bigint 
name string 
Time taken: 0.454 seconds, Fetched: 2 row(s)
hive> alter table people_test change column age age double;
OK
Time taken: 0.68 seconds
hive> desc people_test;
OK
age double 
name string 
Time taken: 0.358 seconds, Fetched: 2 row(s){code}
spark-shell:
{code:java}
spark.catalog.refreshTable("people_test")
spark.sql("desc people_test").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|     age|   bigint|   null|
|    name|   string|   null|
+--------+---------+-------+
{code}
 

We also tested creating the table in spark-shell with spark.sql("create table 
XXX()"); in that case the modified columns are consistent.

  was:
We saved a dataframe as a Hive table in orc/parquet format in the spark shell.
After we modified the column type (bigint to double) of this table through hive 
jdbc, we found that the column type queried in spark-shell did not change, 
although it did change in hive jdbc. Even after restarting the spark-shell, the 
column type shown in spark-shell still disagrees with what hive jdbc shows.

The steps to reproduce are as follows:

spark-shell:
{code:java}
val df = spark.read.json("examples/src/main/resources/people.json");
df.write.format("orc").saveAsTable("people_test");
spark.sql("desc people_test").show()

++-+---+
|col_name|data_type|comment|
++-+---+
| age| bigint| null|
| name| string| null|
++-+---+
{code}
hive:
{code:java}
hive> desc people_test;
OK
age bigint 
name string 
Time taken: 0.454 seconds, Fetched: 2 row(s)
hive> alter table people_test change column age age1 double;
OK
Time taken: 0.68 seconds
hive> desc people_test;
OK
age1 double 
name string 
Time taken: 0.358 seconds, Fetched: 2 row(s){code}
spark-shell:
{code:java}
spark.catalog.refreshTable("people_test")
spark.sql("desc people_test").show()
++-+---+
|col_name|data_type|comment|
++-+---+
| age| bigint| null|
| name| string| null|
++-+---+
{code}
 

We also tested creating the table in spark-shell with spark.sql("create table 
XXX()"); in that case the modified columns are consistent.


> The column attributes obtained by Spark sql are inconsistent with hive
> --
>
> Key: SPARK-25367
> URL: https://issues.apache.org/jira/browse/SPARK-25367
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
> Environment: spark2.2.1-hadoop-2.6.0-chd-5.4.2
> hive-1.2.1
>Reporter: yy
>Priority: Major
>  Labels: sparksql
>
> We saved a dataframe as a Hive table in orc/parquet format in the spark shell.
> After we modified the column type (bigint to double) of this table through 
> hive jdbc, we found that the column type queried in spark-shell did not 
> change, although it did change in hive jdbc. Even after restarting the 
> spark-shell, the column type shown in spark-shell still disagrees with what 
> hive jdbc shows.
> The steps to reproduce are as follows:
> spark-shell:
> {code:java}
> val df = spark.read.json("examples/src/main/resources/people.json");
> df.write.format("orc").saveAsTable("people_test");
> spark.sql("desc people_test").show()
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> | age| bigint| null|
> | name| string| null|
> ++-+---+
> {code}
> hive:
> {code:java}
> hive> desc people_test;
> OK
> age bigint 
> name string 
> Time taken: 0.454 seconds, Fetched: 2 row(s)
> hive> alter table people_test change column age age double;
> OK
> Time taken: 0.68 seconds
> hive> desc people_test;
> OK
> age double 
> name string 
> Time taken: 0.358 seconds, Fetched: 2 row(s){code}
> spark-shell:
> {code:java}
> spark.catalog.refreshTable("people_test")
> spark.sql("desc people_test").show()
> ++-+---+
> |col_name|data_type|comment|
> 

[jira] [Updated] (SPARK-25367) The column attributes obtained by Spark sql are inconsistent with hive

2018-09-09 Thread yy (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yy updated SPARK-25367:
---
Priority: Critical  (was: Major)

> The column attributes obtained by Spark sql are inconsistent with hive
> --
>
> Key: SPARK-25367
> URL: https://issues.apache.org/jira/browse/SPARK-25367
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
> Environment: spark2.2.1-hadoop-2.6.0-chd-5.4.2
> hive-1.2.1
>Reporter: yy
>Priority: Critical
>  Labels: sparksql
>
> We saved a dataframe as a Hive table in orc/parquet format in the spark shell.
> After we modified the column type (bigint to double) of this table through 
> hive jdbc, we found that the column type queried in spark-shell did not 
> change, although it did change in hive jdbc. Even after restarting the 
> spark-shell, the column type shown in spark-shell still disagrees with what 
> hive jdbc shows.
> The steps to reproduce are as follows:
> spark-shell:
> {code:java}
> val df = spark.read.json("examples/src/main/resources/people.json");
> df.write.format("orc").saveAsTable("people_test");
> spark.sql("desc people_test").show()
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> | age| bigint| null|
> | name| string| null|
> ++-+---+
> {code}
> hive:
> {code:java}
> hive> desc people_test;
> OK
> age bigint 
> name string 
> Time taken: 0.454 seconds, Fetched: 2 row(s)
> hive> alter table people_test change column age age double;
> OK
> Time taken: 0.68 seconds
> hive> desc people_test;
> OK
> age double 
> name string 
> Time taken: 0.358 seconds, Fetched: 2 row(s){code}
> spark-shell:
> {code:java}
> spark.catalog.refreshTable("people_test")
> spark.sql("desc people_test").show()
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> | age| bigint| null|
> | name| string| null|
> ++-+---+
> {code}
>  
> We also tested creating the table in spark-shell with spark.sql("create table 
> XXX()"); in that case the modified columns are consistent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25021) Add spark.executor.pyspark.memory support to Kubernetes

2018-09-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608676#comment-16608676
 ] 

Apache Spark commented on SPARK-25021:
--

User 'ifilonenko' has created a pull request for this issue:
https://github.com/apache/spark/pull/22376

> Add spark.executor.pyspark.memory support to Kubernetes
> ---
>
> Key: SPARK-25021
> URL: https://issues.apache.org/jira/browse/SPARK-25021
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Ilan Filonenko
>Priority: Major
> Fix For: 3.0.0
>
>
> SPARK-25004 adds {{spark.executor.pyspark.memory}} to control the memory 
> allocation for PySpark and updates YARN to add this memory to its container 
> requests. Kubernetes should do something similar to account for the python 
> memory allocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25378) ArrayData.toArray assume UTF8String

2018-09-09 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608675#comment-16608675
 ] 

Wenchen Fan commented on SPARK-25378:
-

ArrayData is not a public interface, so I won't treat it as a breaking change. 
I'd say there is a bug in 2.3.1 that allows it, and the current behavior is 
expected.

I understand how hard it is to write a connector without using internal APIs; 
hopefully connectors won't need to rely on internal APIs anymore once the data 
source v2 API is stabilized.

Shall we close it as "not a problem"?

> ArrayData.toArray assume UTF8String
> ---
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The following code works in 2.3.1 but fails in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25367) The column attributes obtained by Spark sql are inconsistent with hive

2018-09-09 Thread yy (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yy updated SPARK-25367:
---
Summary: The column attributes obtained by Spark sql are inconsistent with 
hive  (was: Spark sql get incompatiable column schema as in hiveThe column 
attributes obtained by Spark sql are inconsistent with hive)

> The column attributes obtained by Spark sql are inconsistent with hive
> --
>
> Key: SPARK-25367
> URL: https://issues.apache.org/jira/browse/SPARK-25367
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
> Environment: spark2.2.1-hadoop-2.6.0-chd-5.4.2
> hive-1.2.1
>Reporter: yy
>Priority: Major
>  Labels: sparksql
>
> We saved a dataframe as a Hive table in orc/parquet format in the spark shell.
> After we modified the column type (bigint to double) of this table through 
> hive jdbc, we found that the column type queried in spark-shell did not 
> change, although it did change in hive jdbc. Even after restarting the 
> spark-shell, the column type shown in spark-shell still disagrees with what 
> hive jdbc shows.
> The steps to reproduce are as follows:
> spark-shell:
> {code:java}
> val df = spark.read.json("examples/src/main/resources/people.json");
> df.write.format("orc").saveAsTable("people_test");
> spark.sql("desc people_test").show()
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> | age| bigint| null|
> | name| string| null|
> ++-+---+
> {code}
> hive:
> {code:java}
> hive> desc people_test;
> OK
> age bigint 
> name string 
> Time taken: 0.454 seconds, Fetched: 2 row(s)
> hive> alter table people_test change column age age1 double;
> OK
> Time taken: 0.68 seconds
> hive> desc people_test;
> OK
> age1 double 
> name string 
> Time taken: 0.358 seconds, Fetched: 2 row(s){code}
> spark-shell:
> {code:java}
> spark.catalog.refreshTable("people_test")
> spark.sql("desc people_test").show()
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> | age| bigint| null|
> | name| string| null|
> ++-+---+
> {code}
>  
> We also tested creating the table in spark-shell with spark.sql("create table 
> XXX()"); in that case the modified columns are consistent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25367) Spark sql get incompatiable column schema as in hiveThe column attributes obtained by Spark sql are inconsistent with hive

2018-09-09 Thread yy (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yy updated SPARK-25367:
---
Summary: Spark sql get incompatiable column schema as in hiveThe column 
attributes obtained by Spark sql are inconsistent with hive  (was: Spark sql 
get incompatiable column schema as in hive)

> Spark sql get incompatiable column schema as in hiveThe column attributes 
> obtained by Spark sql are inconsistent with hive
> --
>
> Key: SPARK-25367
> URL: https://issues.apache.org/jira/browse/SPARK-25367
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
> Environment: spark2.2.1-hadoop-2.6.0-chd-5.4.2
> hive-1.2.1
>Reporter: yy
>Priority: Major
>  Labels: sparksql
>
> We saved a dataframe as a Hive table in orc/parquet format in the spark shell.
> After we modified the column type (bigint to double) of this table through 
> hive jdbc, we found that the column type queried in spark-shell did not 
> change, although it did change in hive jdbc. Even after restarting the 
> spark-shell, the column type shown in spark-shell still disagrees with what 
> hive jdbc shows.
> The steps to reproduce are as follows:
> spark-shell:
> {code:java}
> val df = spark.read.json("examples/src/main/resources/people.json");
> df.write.format("orc").saveAsTable("people_test");
> spark.sql("desc people_test").show()
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> | age| bigint| null|
> | name| string| null|
> ++-+---+
> {code}
> hive:
> {code:java}
> hive> desc people_test;
> OK
> age bigint 
> name string 
> Time taken: 0.454 seconds, Fetched: 2 row(s)
> hive> alter table people_test change column age age1 double;
> OK
> Time taken: 0.68 seconds
> hive> desc people_test;
> OK
> age1 double 
> name string 
> Time taken: 0.358 seconds, Fetched: 2 row(s){code}
> spark-shell:
> {code:java}
> spark.catalog.refreshTable("people_test")
> spark.sql("desc people_test").show()
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> | age| bigint| null|
> | name| string| null|
> ++-+---+
> {code}
>  
> We also tested creating the table in spark-shell with spark.sql("create table 
> XXX()"); in that case the modified columns are consistent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25367) Spark sql get incompatiable column schema as in hive

2018-09-09 Thread yy (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yy updated SPARK-25367:
---
Summary: Spark sql get incompatiable column schema as in hive  (was: Spark 
sql get incompatiable schema as in hive)

> Spark sql get incompatiable column schema as in hive
> 
>
> Key: SPARK-25367
> URL: https://issues.apache.org/jira/browse/SPARK-25367
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
> Environment: spark2.2.1-hadoop-2.6.0-chd-5.4.2
> hive-1.2.1
>Reporter: yy
>Priority: Major
>  Labels: sparksql
>
> We saved a dataframe as a Hive table in orc/parquet format in the spark shell.
> After we modified the column type (bigint to double) of this table through 
> hive jdbc, we found that the column type queried in spark-shell did not 
> change, although it did change in hive jdbc. Even after restarting the 
> spark-shell, the column type shown in spark-shell still disagrees with what 
> hive jdbc shows.
> The steps to reproduce are as follows:
> spark-shell:
> {code:java}
> val df = spark.read.json("examples/src/main/resources/people.json");
> df.write.format("orc").saveAsTable("people_test");
> spark.sql("desc people_test").show()
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> | age| bigint| null|
> | name| string| null|
> ++-+---+
> {code}
> hive:
> {code:java}
> hive> desc people_test;
> OK
> age bigint 
> name string 
> Time taken: 0.454 seconds, Fetched: 2 row(s)
> hive> alter table people_test change column age age1 double;
> OK
> Time taken: 0.68 seconds
> hive> desc people_test;
> OK
> age1 double 
> name string 
> Time taken: 0.358 seconds, Fetched: 2 row(s){code}
> spark-shell:
> {code:java}
> spark.catalog.refreshTable("people_test")
> spark.sql("desc people_test").show()
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> | age| bigint| null|
> | name| string| null|
> ++-+---+
> {code}
>  
> We also tested creating the table in spark-shell with spark.sql("create table 
> XXX()"); in that case the modified columns are consistent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25367) Spark sql get incompatiable schema as in hive

2018-09-09 Thread yy (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yy updated SPARK-25367:
---
Summary: Spark sql get incompatiable schema as in hive  (was: Spark sql got 
incompatiable schema as in hive)

> Spark sql get incompatiable schema as in hive
> -
>
> Key: SPARK-25367
> URL: https://issues.apache.org/jira/browse/SPARK-25367
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
> Environment: spark2.2.1-hadoop-2.6.0-chd-5.4.2
> hive-1.2.1
>Reporter: yy
>Priority: Major
>  Labels: sparksql
>
> We saved a dataframe as a Hive table in orc/parquet format in the spark shell.
> After we modified the column type (bigint to double) of this table through 
> hive jdbc, we found that the column type queried in spark-shell did not 
> change, although it did change in hive jdbc. Even after restarting the 
> spark-shell, the column type shown in spark-shell still disagrees with what 
> hive jdbc shows.
> The steps to reproduce are as follows:
> spark-shell:
> {code:java}
> val df = spark.read.json("examples/src/main/resources/people.json");
> df.write.format("orc").saveAsTable("people_test");
> spark.sql("desc people_test").show()
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> | age| bigint| null|
> | name| string| null|
> ++-+---+
> {code}
> hive:
> {code:java}
> hive> desc people_test;
> OK
> age bigint 
> name string 
> Time taken: 0.454 seconds, Fetched: 2 row(s)
> hive> alter table people_test change column age age1 double;
> OK
> Time taken: 0.68 seconds
> hive> desc people_test;
> OK
> age1 double 
> name string 
> Time taken: 0.358 seconds, Fetched: 2 row(s){code}
> spark-shell:
> {code:java}
> spark.catalog.refreshTable("people_test")
> spark.sql("desc people_test").show()
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> | age| bigint| null|
> | name| string| null|
> ++-+---+
> {code}
>  
> We also tested creating the table in spark-shell with spark.sql("create table 
> XXX()"); in that case the modified columns are consistent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25367) Spark sql got incompatiable schema as in hive

2018-09-09 Thread yy (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yy updated SPARK-25367:
---
Summary: Spark sql got incompatiable schema as in hive  (was: Hive table 
created by Spark dataFrame has incompatiable schema in spark and hive)

> Spark sql got incompatiable schema as in hive
> -
>
> Key: SPARK-25367
> URL: https://issues.apache.org/jira/browse/SPARK-25367
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
> Environment: spark2.2.1-hadoop-2.6.0-chd-5.4.2
> hive-1.2.1
>Reporter: yy
>Priority: Major
>  Labels: sparksql
>
> We saved a dataframe as a Hive table in orc/parquet format in the spark shell.
> After we modified the column type (bigint to double) of this table through 
> hive jdbc, we found that the column type queried in spark-shell did not 
> change, although it did change in hive jdbc. Even after restarting the 
> spark-shell, the column type shown in spark-shell still disagrees with what 
> hive jdbc shows.
> The steps to reproduce are as follows:
> spark-shell:
> {code:java}
> val df = spark.read.json("examples/src/main/resources/people.json");
> df.write.format("orc").saveAsTable("people_test");
> spark.sql("desc people_test").show()
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> | age| bigint| null|
> | name| string| null|
> ++-+---+
> {code}
> hive:
> {code:java}
> hive> desc people_test;
> OK
> age bigint 
> name string 
> Time taken: 0.454 seconds, Fetched: 2 row(s)
> hive> alter table people_test change column age age1 double;
> OK
> Time taken: 0.68 seconds
> hive> desc people_test;
> OK
> age1 double 
> name string 
> Time taken: 0.358 seconds, Fetched: 2 row(s){code}
> spark-shell:
> {code:java}
> spark.catalog.refreshTable("people_test")
> spark.sql("desc people_test").show()
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> | age| bigint| null|
> | name| string| null|
> ++-+---+
> {code}
>  
> We also tested creating the table in spark-shell with spark.sql("create table 
> XXX()"); in that case the modified columns are consistent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25367) Hive table created by Spark dataFrame has incompatiable schema in spark and hive

2018-09-09 Thread yy (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yy updated SPARK-25367:
---
Fix Version/s: (was: 2.3.2)

> Hive table created by Spark dataFrame has incompatiable schema in spark and 
> hive
> 
>
> Key: SPARK-25367
> URL: https://issues.apache.org/jira/browse/SPARK-25367
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
> Environment: spark2.2.1-hadoop-2.6.0-chd-5.4.2
> hive-1.2.1
>Reporter: yy
>Priority: Major
>  Labels: sparksql
>
> We save the dataframe as a Hive table in ORC/Parquet format in the 
> spark-shell.
> After we modify the column type (from bigint to double in the example below) of 
> this table through Hive JDBC, the column type queried in spark-shell does not 
> change, although it does change in Hive JDBC. Even after restarting the 
> spark-shell, the column type shown in spark-shell is still inconsistent with 
> what Hive JDBC shows.
> The steps to reproduce are as follows:
> spark-shell:
> {code:java}
> val df = spark.read.json("examples/src/main/resources/people.json");
> df.write.format("orc").saveAsTable("people_test");
> spark.sql("desc people_test").show()
> +--------+---------+-------+
> |col_name|data_type|comment|
> +--------+---------+-------+
> |     age|   bigint|   null|
> |    name|   string|   null|
> +--------+---------+-------+
> {code}
> hive:
> {code:java}
> hive> desc people_test;
> OK
> age bigint 
> name string 
> Time taken: 0.454 seconds, Fetched: 2 row(s)
> hive> alter table people_test change column age age1 double;
> OK
> Time taken: 0.68 seconds
> hive> desc people_test;
> OK
> age1 double 
> name string 
> Time taken: 0.358 seconds, Fetched: 2 row(s){code}
> spark-shell:
> {code:java}
> spark.catalog.refreshTable("people_test")
> spark.sql("desc people_test").show()
> +--------+---------+-------+
> |col_name|data_type|comment|
> +--------+---------+-------+
> |     age|   bigint|   null|
> |    name|   string|   null|
> +--------+---------+-------+
> {code}
>  
> We also tested creating the table in spark-shell with spark.sql("create 
> table XXX()"); in that case the modified columns stay consistent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25367) Hive table created by Spark dataFrame has incompatiable schema in spark and hive

2018-09-09 Thread yy (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yy updated SPARK-25367:
---
Description: 
We save the dataframe as a Hive table in ORC/Parquet format in the spark-shell.
After we modify the column type (from bigint to double in the example below) of this 
table through Hive JDBC, the column type queried in spark-shell does not change, 
although it does change in Hive JDBC. Even after restarting the spark-shell, the 
column type shown in spark-shell is still inconsistent with what Hive JDBC shows.

The steps to reproduce are as follows:

spark-shell:
{code:java}
val df = spark.read.json("examples/src/main/resources/people.json");
df.write.format("orc").saveAsTable("people_test");
spark.sql("desc people_test").show()

+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|     age|   bigint|   null|
|    name|   string|   null|
+--------+---------+-------+
{code}
hive:
{code:java}
hive> desc people_test;
OK
age bigint 
name string 
Time taken: 0.454 seconds, Fetched: 2 row(s)
hive> alter table people_test change column age age1 double;
OK
Time taken: 0.68 seconds
hive> desc people_test;
OK
age1 double 
name string 
Time taken: 0.358 seconds, Fetched: 2 row(s){code}
spark-shell:
{code:java}
spark.catalog.refreshTable("people_test")
spark.sql("desc people_test").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|     age|   bigint|   null|
|    name|   string|   null|
+--------+---------+-------+
{code}
 

We also tested creating the table in spark-shell with spark.sql("create table 
XXX()"); in that case the modified columns stay consistent.

  was:
We save the dataframe object as a hive table in orc/parquet format in the spark 
shell.
 After we modified the column type (int to double) of this table in hive jdbc, 
we  found the column type queried in spark-shell didn't change, but changed in 
hive jdbc. After we restarted the spark-shell, this table's column type is 
still incompatible as showed in hive jdbc.

The coding process are as follows:

spark-shell:
{code:java}
val df = spark.read.json("examples/src/main/resources/people.json");
df.write.format("orc").saveAsTable("people_test");
spark.catalog.refreshTable("people_test")
spark.sql("desc people_test").show()

+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|     age|   bigint|   null|
|    name|   string|   null|
+--------+---------+-------+
{code}
 

hive:
{code:java}
hive> desc people_test;
OK
age bigint 
name string 
Time taken: 0.454 seconds, Fetched: 2 row(s)
hive> alter table people_test change column age age1 double;
OK
Time taken: 0.68 seconds
hive> desc people_test;
OK
age1 double 
name string 
Time taken: 0.358 seconds, Fetched: 2 row(s){code}
spark-shell:
{code:java}
spark.sql("desc people_test").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|     age|   bigint|   null|
|    name|   string|   null|
+--------+---------+-------+
{code}
 

We also tested in spark-shell by creating a table using spark.sql("create table 
XXX()"),  the modified columns are consistent.


> Hive table created by Spark dataFrame has incompatiable schema in spark and 
> hive
> 
>
> Key: SPARK-25367
> URL: https://issues.apache.org/jira/browse/SPARK-25367
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
> Environment: spark2.2.1-hadoop-2.6.0-chd-5.4.2
> hive-1.2.1
>Reporter: yy
>Priority: Major
>  Labels: sparksql
> Fix For: 2.3.2
>
>
> We save the dataframe object as a hive table in orc/parquet format in the 
> spark shell.
>  After we modified the column type (int to double) of this table in hive 
> jdbc, we  found the column type queried in spark-shell didn't change, but 
> changed in hive jdbc. After we restarted the spark-shell, this table's column 
> type is still incompatible as showed in hive jdbc.
> The coding process are as follows:
> spark-shell:
> {code:java}
> val df = spark.read.json("examples/src/main/resources/people.json");
> df.write.format("orc").saveAsTable("people_test");
> spark.sql("desc people_test").show()
> +--------+---------+-------+
> |col_name|data_type|comment|
> +--------+---------+-------+
> |     age|   bigint|   null|
> |    name|   string|   null|
> +--------+---------+-------+
> {code}
> hive:
> {code:java}
> hive> desc people_test;
> OK
> age bigint 
> name string 
> Time taken: 0.454 seconds, Fetched: 2 row(s)
> hive> alter table people_test change column age age1 double;
> OK
> Time taken: 0.68 seconds
> hive> desc people_test;
> OK
> age1 double 
> name string 
> Time taken: 0.358 seconds, Fetched: 2 row(s){code}
> spark-shell:
> {code:java}
> spark.catalog.refreshTable("people_test")
> spark.sql("desc people_test").show()
> 

[jira] [Updated] (SPARK-25367) Hive table created by Spark dataFrame has incompatiable schema in spark and hive

2018-09-09 Thread yy (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yy updated SPARK-25367:
---
Description: 
We save the dataframe as a Hive table in ORC/Parquet format in the spark-shell.
After we modify the column type (from bigint to double in the example below) of this 
table through Hive JDBC, the column type queried in spark-shell does not change, 
although it does change in Hive JDBC. Even after restarting the spark-shell, the 
column type shown in spark-shell is still inconsistent with what Hive JDBC shows.

The steps to reproduce are as follows:

spark-shell:
{code:java}
val df = spark.read.json("examples/src/main/resources/people.json");
df.write.format("orc").saveAsTable("people_test");
spark.catalog.refreshTable("people_test")
spark.sql("desc people_test").show()

+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|     age|   bigint|   null|
|    name|   string|   null|
+--------+---------+-------+
{code}
 

hive:
{code:java}
hive> desc people_test;
OK
age bigint 
name string 
Time taken: 0.454 seconds, Fetched: 2 row(s)
hive> alter table people_test change column age age1 double;
OK
Time taken: 0.68 seconds
hive> desc people_test;
OK
age1 double 
name string 
Time taken: 0.358 seconds, Fetched: 2 row(s){code}
spark-shell:
{code:java}
spark.sql("desc people_test").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|     age|   bigint|   null|
|    name|   string|   null|
+--------+---------+-------+
{code}
 

We also tested creating the table in spark-shell with spark.sql("create table 
XXX()"); in that case the modified columns stay consistent.

  was:
We save the dataframe object as a hive table in orc/parquet format in the spark 
shell.
 After we modified the column type (int to double) of this table in hive jdbc, 
we  found the column type queried in spark-shell didn't change, but changed in 
hive jdbc. After we restarted the spark-shell, this table's column type is 
still incompatible as showed in hive jdbc.

The coding process are as follows:

spark-shell:
{code:java}
val df = spark.read.json("examples/src/main/resources/people.json");
df.write.format("orc").saveAsTable("people_test");
spark.catalog.refreshTable("people_test")
spark.sql("desc people").show()
{code}
 

hive:

 
{code:java}
alter table people_test change column age age1 double;
desc people_test;{code}
spark-shell:
{code:java}
spark.sql("desc people").show()
{code}
 

We also tested in spark-shell by creating a table using spark.sql("create table 
XXX()"),  the modified columns are consistent.


> Hive table created by Spark dataFrame has incompatiable schema in spark and 
> hive
> 
>
> Key: SPARK-25367
> URL: https://issues.apache.org/jira/browse/SPARK-25367
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
> Environment: spark2.2.1-hadoop-2.6.0-chd-5.4.2
> hive-1.2.1
>Reporter: yy
>Priority: Major
>  Labels: sparksql
> Fix For: 2.3.2
>
>
> We save the dataframe object as a hive table in orc/parquet format in the 
> spark shell.
>  After we modified the column type (int to double) of this table in hive 
> jdbc, we  found the column type queried in spark-shell didn't change, but 
> changed in hive jdbc. After we restarted the spark-shell, this table's column 
> type is still incompatible as showed in hive jdbc.
> The coding process are as follows:
> spark-shell:
> {code:java}
> val df = spark.read.json("examples/src/main/resources/people.json");
> df.write.format("orc").saveAsTable("people_test");
> spark.catalog.refreshTable("people_test")
> spark.sql("desc people_test").show()
> +--------+---------+-------+
> |col_name|data_type|comment|
> +--------+---------+-------+
> |     age|   bigint|   null|
> |    name|   string|   null|
> +--------+---------+-------+
> {code}
>  
> hive:
> {code:java}
> hive> desc people_test;
> OK
> age bigint 
> name string 
> Time taken: 0.454 seconds, Fetched: 2 row(s)
> hive> alter table people_test change column age age1 double;
> OK
> Time taken: 0.68 seconds
> hive> desc people_test;
> OK
> age1 double 
> name string 
> Time taken: 0.358 seconds, Fetched: 2 row(s){code}
> spark-shell:
> {code:java}
> spark.sql("desc people_test").show()
> +--------+---------+-------+
> |col_name|data_type|comment|
> +--------+---------+-------+
> |     age|   bigint|   null|
> |    name|   string|   null|
> +--------+---------+-------+
> {code}
>  
> We also tested in spark-shell by creating a table using spark.sql("create 
> table XXX()"),  the modified columns are consistent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, 

[jira] [Updated] (SPARK-25367) Hive table created by Spark dataFrame has incompatiable schema in spark and hive

2018-09-09 Thread yy (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yy updated SPARK-25367:
---
Description: 
We save the dataframe object as a hive table in orc/parquet format in the spark 
shell.
 After we modified the column type (int to double) of this table in hive jdbc, 
we  found the column type queried in spark-shell didn't change, but changed in 
hive jdbc. After we restarted the spark-shell, this table's column type is 
still incompatible as showed in hive jdbc.

The coding process are as follows:

spark-shell:
{code:java}
val df = spark.read.json("examples/src/main/resources/people.json");
df.write.format("orc").saveAsTable("people_test");
spark.catalog.refreshTable("people_test")
spark.sql("desc people").show()
{code}
 

hive:

 
{code:java}
alter table people_test change column age age1 double;
desc people_test;{code}
spark-shell:
{code:java}
spark.sql("desc people").show()
{code}
 

We also tested in spark-shell by creating a table using spark.sql("create table 
XXX()"),  the modified columns are consistent.

  was:
We save the dataframe object as a hive table in orc/parquet format in the spark 
shell.
 After we modified the column type (int to double) of this table in hive jdbc, 
we  found the column type queried in spark-shell didn't change, but changed in 
hive jdbc. After we restarted the spark-shell, this table's column type is 
still incompatible as showed in hive jdbc.

The coding process are as follows:

spark-shell:

val df = spark.read.json("examples/src/main/resources/people.json");
 df.write.format("orc").saveAsTable("people_test");
 spark.catalog.refreshTable("people_test")
 spark.sql("desc people").show()

hive:

alter table people_test change column age age1 double;

desc people_test;

spark-shell:

spark.sql("desc people").show()

 

We also tested in spark-shell by creating a table using spark.sql("create table 
XXX()"),  the modified columns are consistent.


> Hive table created by Spark dataFrame has incompatiable schema in spark and 
> hive
> 
>
> Key: SPARK-25367
> URL: https://issues.apache.org/jira/browse/SPARK-25367
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
> Environment: spark2.2.1-hadoop-2.6.0-chd-5.4.2
> hive-1.2.1
>Reporter: yy
>Priority: Major
>  Labels: sparksql
> Fix For: 2.3.2
>
>
> We save the dataframe object as a hive table in orc/parquet format in the 
> spark shell.
>  After we modified the column type (int to double) of this table in hive 
> jdbc, we  found the column type queried in spark-shell didn't change, but 
> changed in hive jdbc. After we restarted the spark-shell, this table's column 
> type is still incompatible as showed in hive jdbc.
> The coding process are as follows:
> spark-shell:
> {code:java}
> val df = spark.read.json("examples/src/main/resources/people.json");
> df.write.format("orc").saveAsTable("people_test");
> spark.catalog.refreshTable("people_test")
> spark.sql("desc people").show()
> {code}
>  
> hive:
>  
> {code:java}
> alter table people_test change column age age1 double;
> desc people_test;{code}
> spark-shell:
> {code:java}
> spark.sql("desc people").show()
> {code}
>  
> We also tested in spark-shell by creating a table using spark.sql("create 
> table XXX()"),  the modified columns are consistent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25367) Hive table created by Spark dataFrame has incompatiable schema in spark and hive

2018-09-09 Thread yy (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yy updated SPARK-25367:
---
Affects Version/s: 2.2.0
   2.2.2
   2.3.0
   2.3.1
 Target Version/s: 2.3.1
  Environment: 
spark2.2.1-hadoop-2.6.0-chd-5.4.2

hive-1.2.1

  was:
spark2.2.1

hive1.2.1

Fix Version/s: 2.3.2
  Component/s: SQL

> Hive table created by Spark dataFrame has incompatiable schema in spark and 
> hive
> 
>
> Key: SPARK-25367
> URL: https://issues.apache.org/jira/browse/SPARK-25367
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
> Environment: spark2.2.1-hadoop-2.6.0-chd-5.4.2
> hive-1.2.1
>Reporter: yy
>Priority: Major
>  Labels: sparksql
> Fix For: 2.3.2
>
>
> We save the dataframe object as a hive table in orc/parquet format in the 
> spark shell.
>  After we modified the column type (int to double) of this table in hive 
> jdbc, we  found the column type queried in spark-shell didn't change, but 
> changed in hive jdbc. After we restarted the spark-shell, this table's column 
> type is still incompatible as showed in hive jdbc.
> The coding process are as follows:
> spark-shell:
> val df = spark.read.json("examples/src/main/resources/people.json");
>  df.write.format("orc").saveAsTable("people_test");
>  spark.catalog.refreshTable("people_test")
>  spark.sql("desc people").show()
> hive:
> alter table people_test change column age age1 double;
> desc people_test;
> spark-shell:
> spark.sql("desc people").show()
>  
> We also tested in spark-shell by creating a table using spark.sql("create 
> table XXX()"),  the modified columns are consistent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25175) Field resolution should fail if there's ambiguity for ORC native reader

2018-09-09 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25175:
--
Fix Version/s: (was: 3.0.0)

> Field resolution should fail if there's ambiguity for ORC native reader
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Assignee: Chenxiao Mao
>Priority: Major
> Fix For: 2.4.0
>
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues, though not identical 
> to Parquet. Spark has two OrcFileFormat implementations.
>  * Since SPARK-2883, Spark has supported ORC inside the sql/hive module with a Hive 
> dependency. This hive OrcFileFormat always does case-insensitive field 
> resolution regardless of the case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the first matched field rather than failing the 
> read.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, but it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. If the ORC data 
> file has more fields than the table schema, hive serde tables simply cannot be read. 
> If the ORC data file does not have extra fields, hive serde tables always resolve 
> fields by ordinal rather than by name.
> Both the ORC data source hive impl and hive serde tables rely on the hive ORC 
> InputFormat/SerDe to read tables. I'm not sure whether we can change the 
> underlying hive classes to make all ORC read behaviors consistent.
> This ticket aims to make the read behavior of the native ORC data source impl 
> consistent with the Parquet data source.
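A minimal sketch of the ambiguity described above, assuming a local spark-shell and a scratch path (/tmp/orc_dup is made up); the exact error message depends on the Spark version:

{code:scala}
import spark.implicits._   // already in scope in spark-shell

// Write an ORC file whose physical schema contains both "a" and "A".
spark.conf.set("spark.sql.caseSensitive", true)
Seq((1, 2)).toDF("a", "A").write.mode("overwrite").orc("/tmp/orc_dup")

// In case-insensitive mode the requested column "a" matches both fields.
// With this fix the native ORC reader fails on the ambiguity instead of
// silently returning the first match.
spark.conf.set("spark.sql.caseSensitive", false)
spark.read.schema("a INT").orc("/tmp/orc_dup").show()
{code}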



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25175) Field resolution should fail if there's ambiguity for ORC native reader

2018-09-09 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25175:
-

Assignee: Chenxiao Mao

> Field resolution should fail if there's ambiguity for ORC native reader
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Assignee: Chenxiao Mao
>Priority: Major
> Fix For: 2.4.0, 3.0.0
>
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues, but not identical 
> to Parquet. Spark has two OrcFileFormat.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always do case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the first matched field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. If ORC data 
> file has more fields than table schema, we just can't read hive serde tables. 
> If ORC data file does not have more fields, hive serde tables always do field 
> resolution by ordinal, rather than by name.
> Both ORC data source hive impl and hive serde table rely on the hive orc 
> InputFormat/SerDe to read table. I'm not sure whether we can change 
> underlying hive classes to make all orc read behaviors consistent.
> This ticket aims to make read behavior of ORC data source native impl 
> consistent with Parquet data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25175) Field resolution should fail if there's ambiguity for ORC native reader

2018-09-09 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25175.
---
   Resolution: Fixed
Fix Version/s: 2.4.0
   3.0.0

Issue resolved by pull request 22262
[https://github.com/apache/spark/pull/22262]

> Field resolution should fail if there's ambiguity for ORC native reader
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Assignee: Chenxiao Mao
>Priority: Major
> Fix For: 3.0.0, 2.4.0
>
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues, but not identical 
> to Parquet. Spark has two OrcFileFormat.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always do case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the first matched field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. If ORC data 
> file has more fields than table schema, we just can't read hive serde tables. 
> If ORC data file does not have more fields, hive serde tables always do field 
> resolution by ordinal, rather than by name.
> Both ORC data source hive impl and hive serde table rely on the hive orc 
> InputFormat/SerDe to read table. I'm not sure whether we can change 
> underlying hive classes to make all orc read behaviors consistent.
> This ticket aims to make read behavior of ORC data source native impl 
> consistent with Parquet data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25388) checkEvaluation may miss incorrect nullable of DataType in the result

2018-09-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25388:


Assignee: (was: Apache Spark)

> checkEvaluation may miss incorrect nullable of DataType in the result
> -
>
> Key: SPARK-25388
> URL: https://issues.apache.org/jira/browse/SPARK-25388
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> Currently, {{checkEvaluation}} may miss an incorrect nullable setting of the {{DataType}} in 
> {{checkEvaluationWithUnsafeProjection}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25388) checkEvaluation may miss incorrect nullable of DataType in the result

2018-09-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25388:


Assignee: Apache Spark

> checkEvaluation may miss incorrect nullable of DataType in the result
> -
>
> Key: SPARK-25388
> URL: https://issues.apache.org/jira/browse/SPARK-25388
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, {{checkEvaluation}} may miss an incorrect nullable setting of the {{DataType}} in 
> {{checkEvaluationWithUnsafeProjection}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25388) checkEvaluation may miss incorrect nullable of DataType in the result

2018-09-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608624#comment-16608624
 ] 

Apache Spark commented on SPARK-25388:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22375

> checkEvaluation may miss incorrect nullable of DataType in the result
> -
>
> Key: SPARK-25388
> URL: https://issues.apache.org/jira/browse/SPARK-25388
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> Currently, {{checkEvaluation}} may miss an incorrect nullable setting of the {{DataType}} in 
> {{checkEvaluationWithUnsafeProjection}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25388) checkEvaluation may miss incorrect nullable of DataType in the result

2018-09-09 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-25388:


 Summary: checkEvaluation may miss incorrect nullable of DataType 
in the result
 Key: SPARK-25388
 URL: https://issues.apache.org/jira/browse/SPARK-25388
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 3.0.0
Reporter: Kazuaki Ishizaki


Currently, {{checkEvaluation}} may miss an incorrect nullable setting of the {{DataType}} in 
{{checkEvaluationWithUnsafeProjection}}.
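As a small illustration of the gap (not the actual test helper, just the kind of mismatch a value-only comparison cannot see): two DataTypes that differ only in their nullability flags compare as unequal, so checking the produced schema against the declared one catches the problem.

{code:scala}
import org.apache.spark.sql.types._

// Declared result type: array elements are never null.
val declared = ArrayType(IntegerType, containsNull = false)
// Type actually produced by evaluation: elements may be null.
val produced = ArrayType(IntegerType, containsNull = true)

// Comparing only the evaluated values would not notice the difference;
// comparing the DataTypes does, because equality includes nullability.
println(declared == produced)   // false
{code}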



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25366) Zstd and brotli CompressionCodec are not supported for parquet files

2018-09-09 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25366:

Summary: Zstd and brotli CompressionCodec are  not supported for parquet 
files  (was: Zstd and brotil CompressionCodec are  not supported for parquet 
files)

> Zstd and brotli CompressionCodec are  not supported for parquet files
> -
>
> Key: SPARK-25366
> URL: https://issues.apache.org/jira/browse/SPARK-25366
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class 
> org.apache.hadoop.io.compress.*BrotliCodec* was not found
>     at 
> org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
>     at 
> org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:142)
>     at 
> org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
>     at 
> org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
>     at 
> org.apache.parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:153)
>     at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
>     at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)
>     
>     
>     
>     
>     
> Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class 
> org.apache.hadoop.io.compress.*ZStandardCodec* was not 
> found
>     at 
> org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
>     at 
> org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:142)
>     at 
> org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
>     at 
> org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
>     at 
> org.apache.parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:153)
>     at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
>     at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)
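For reference, a sketch of how this error is typically hit (the path is made up, and whether the write actually fails depends on which Hadoop codec classes are on the classpath): Spark accepts the codec name, but parquet-mr then looks up a Hadoop codec class that may not be present, producing the BadConfigurationException above.

{code:scala}
// Request zstd (or "brotli") compression for Parquet output.
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
spark.range(10).write.mode("overwrite").parquet("/tmp/zstd_parquet_test")
{code}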



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25364) a better way to handle vector index and sparsity in FeatureHasher implementation ?

2018-09-09 Thread Vincent (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608612#comment-16608612
 ] 

Vincent commented on SPARK-25364:
-

Duplicate; closing this JIRA.

> a better way to handle vector index and sparsity in FeatureHasher 
> implementation ?
> --
>
> Key: SPARK-25364
> URL: https://issues.apache.org/jira/browse/SPARK-25364
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Vincent
>Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo on 
> the hashed value is used to determine the vector index, and users are advised 
> to use a large integer value for the numFeatures parameter.
> We found several issues with the current implementation: 
>  # The feature name cannot be recovered from its index after the FeatureHasher 
> transform, for example when getting feature importances from decision tree 
> training that follows a FeatureHasher.
>  # When indices collide, which is quite likely when 'numFeatures' is relatively 
> small, the new value is added to the existing one, i.e. the value at the 
> colliding vector index is silently changed by this modulo scheme.
>  # To avoid collisions, 'numFeatures' has to be set to a large number, and the 
> resulting highly sparse vectors increase the computational cost of model training.
> We are working on fixing these problems for our business need; since it may or 
> may not be an issue for others as well, we'd like to hear from the 
> community.
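A small sketch of the collision concern in point 2, with a deliberately tiny numFeatures so that collisions become likely (the column names and values are made up):

{code:scala}
import org.apache.spark.ml.feature.FeatureHasher

val df = spark.createDataFrame(Seq((1.0, 2.0, 3.0))).toDF("f1", "f2", "f3")

// With only 4 buckets, distinct features can hash to the same index;
// when that happens their values are summed into a single vector slot.
val hasher = new FeatureHasher()
  .setInputCols("f1", "f2", "f3")
  .setOutputCol("features")
  .setNumFeatures(4)

hasher.transform(df).select("features").show(false)
{code}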



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25364) a better way to handle vector index and sparsity in FeatureHasher implementation ?

2018-09-09 Thread Vincent (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent resolved SPARK-25364.
-
Resolution: Duplicate

> a better way to handle vector index and sparsity in FeatureHasher 
> implementation ?
> --
>
> Key: SPARK-25364
> URL: https://issues.apache.org/jira/browse/SPARK-25364
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Vincent
>Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo on 
> the hashed value is used to determine the vector index, it's suggested to use 
> a large integer value as the numFeature parameter
> we found several issues regarding current implementation: 
>  # Cannot get the feature name back by its index after featureHasher 
> transform, for example. when getting feature importance from decision tree 
> training followed by a FeatureHasher
>  # when index conflict, which is a great chance to happen especially when 
> 'numFeature' is relatively small, its value would be updated with the sum of 
> current and old value, ie, the value of the conflicted feature vector would 
> be change by this module.
>  #  to avoid confliction, we should set the 'numFeature' with a large number, 
> highly sparse vector increase the computation complexity of model training
> we are working on fixing these problems due to our business need, thinking it 
> might or might not be an issue for others as well, we'd like to hear from the 
> community.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25368) Incorrect constraint inference returns wrong result

2018-09-09 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25368:
--
Affects Version/s: (was: 2.3.2)
   2.3.0

> Incorrect constraint inference returns wrong result
> ---
>
> Key: SPARK-25368
> URL: https://issues.apache.org/jira/browse/SPARK-25368
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Lev Katzav
>Assignee: Yuming Wang
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.2, 2.4.0
>
> Attachments: plan.txt
>
>
> There is a breaking change in spark 2.3 (I checked on 2.3.1 and 2.3.2-rc5).
> The following code recreates the problem
> (it's a bit convoluted as an example; I tried to simplify it as much as 
> possible from my code):
> {code:java}
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions._
> import spark.implicits._
> case class Data(a: Option[Int],b: String,c: Option[String],d: String)
> val df1 = spark.createDataFrame(Seq(
>Data(Some(1), "1", None, "1"),
>Data(None, "2", Some("2"), "2")
> ))
> val df2 = df1
> .where( $"a".isNotNull)
> .withColumn("e", lit(null).cast("string"))
> val columns = df2.columns.map(c => col(c))
> val df3 = df1
> .select(
>   $"c",
>   $"b" as "e"
>   )
>   .withColumn("a", lit(null).cast("int"))
>   .withColumn("b", lit(null).cast("string"))
>   .withColumn("d", lit(null).cast("string"))
>   .select(columns :_*)
> val df4 =
>   df2.union(df3)
>   .withColumn("e", last(col("e"), ignoreNulls = 
> true).over(Window.partitionBy($"c").orderBy($"d")))
>   .filter($"a".isNotNull)
> df4.show
> {code}
>  
> Notice that the last statement for df4 filters out rows where a is null.
> In spark 2.2.1, the above code prints:
> {code:java}
> +---+---+----+---+---+
> |  a|  b|   c|  d|  e|
> +---+---+----+---+---+
> |  1|  1|null|  1|  1|
> +---+---+----+---+---+
> {code}
> in spark 2.3.x, it prints: 
> {code:java}
> +----+----+----+----+---+
> |   a|   b|   c|   d|  e|
> +----+----+----+----+---+
> |null|null|null|null|  1|
> |   1|   1|null|   1|  1|
> |null|null|   2|null|  2|
> +----+----+----+----+---+
> {code}
> The column a still contains null values.
>  
> Attached are the plans.
> In the parsed logical plan, the filter for isnotnull('a) is on top,
> but in the optimized logical plan it is pushed down.
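As a hedged aside (an assumption on my part, not something stated in this ticket): constraint propagation can be disabled globally, which may sidestep this class of problem on affected versions at some optimization cost; whether it avoids this particular wrong result should be verified against the repro above.

{code:scala}
// Existing Spark SQL config; turning it off disables inference/propagation of
// constraints such as the isnotnull('a) discussed above.
spark.conf.set("spark.sql.constraintPropagation.enabled", false)
{code}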



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25368) Incorrect constraint inference returns wrong result

2018-09-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25368:

Summary: Incorrect constraint inference returns wrong result  (was: 
Incorrect predicate pushdown returns wrong result)

> Incorrect constraint inference returns wrong result
> ---
>
> Key: SPARK-25368
> URL: https://issues.apache.org/jira/browse/SPARK-25368
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Lev Katzav
>Assignee: Yuming Wang
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.2, 2.4.0
>
> Attachments: plan.txt
>
>
> there is a breaking change in spark 2.3 (I checked on 2.3.1 and 2.3.2-rc5)
> the following code recreates the problem
>  (it's a bit convoluted examples, I tried to simplify it as much as possible 
> from my code)
> {code:java}
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions._
> import spark.implicits._
> case class Data(a: Option[Int],b: String,c: Option[String],d: String)
> val df1 = spark.createDataFrame(Seq(
>Data(Some(1), "1", None, "1"),
>Data(None, "2", Some("2"), "2")
> ))
> val df2 = df1
> .where( $"a".isNotNull)
> .withColumn("e", lit(null).cast("string"))
> val columns = df2.columns.map(c => col(c))
> val df3 = df1
> .select(
>   $"c",
>   $"b" as "e"
>   )
>   .withColumn("a", lit(null).cast("int"))
>   .withColumn("b", lit(null).cast("string"))
>   .withColumn("d", lit(null).cast("string"))
>   .select(columns :_*)
> val df4 =
>   df2.union(df3)
>   .withColumn("e", last(col("e"), ignoreNulls = 
> true).over(Window.partitionBy($"c").orderBy($"d")))
>   .filter($"a".isNotNull)
> df4.show
> {code}
>  
> notice that the last statement in for df4 is to filter rows where a is null
> in spark 2.2.1, the above code prints:
> {code:java}
> +---+---+----+---+---+
> |  a|  b|   c|  d|  e|
> +---+---+----+---+---+
> |  1|  1|null|  1|  1|
> +---+---+----+---+---+
> {code}
> in spark 2.3.x, it prints: 
> {code:java}
> +----+----+----+----+---+
> |   a|   b|   c|   d|  e|
> +----+----+----+----+---+
> |null|null|null|null|  1|
> |   1|   1|null|   1|  1|
> |null|null|   2|null|  2|
> +----+----+----+----+---+
> {code}
>  the column a still contains null values
>  
> attached are the plans.
> in the parsed logical plan, the filter for isnotnull('a), is on top,
>  but in the optimized logical plan, it is pushed down



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25368) Incorrect predicate pushdown returns wrong result

2018-09-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-25368:
---

Assignee: Yuming Wang

> Incorrect predicate pushdown returns wrong result
> -
>
> Key: SPARK-25368
> URL: https://issues.apache.org/jira/browse/SPARK-25368
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Lev Katzav
>Assignee: Yuming Wang
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.2, 2.4.0
>
> Attachments: plan.txt
>
>
> there is a breaking change in spark 2.3 (I checked on 2.3.1 and 2.3.2-rc5)
> the following code recreates the problem
>  (it's a bit convoluted examples, I tried to simplify it as much as possible 
> from my code)
> {code:java}
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions._
> import spark.implicits._
> case class Data(a: Option[Int],b: String,c: Option[String],d: String)
> val df1 = spark.createDataFrame(Seq(
>Data(Some(1), "1", None, "1"),
>Data(None, "2", Some("2"), "2")
> ))
> val df2 = df1
> .where( $"a".isNotNull)
> .withColumn("e", lit(null).cast("string"))
> val columns = df2.columns.map(c => col(c))
> val df3 = df1
> .select(
>   $"c",
>   $"b" as "e"
>   )
>   .withColumn("a", lit(null).cast("int"))
>   .withColumn("b", lit(null).cast("string"))
>   .withColumn("d", lit(null).cast("string"))
>   .select(columns :_*)
> val df4 =
>   df2.union(df3)
>   .withColumn("e", last(col("e"), ignoreNulls = 
> true).over(Window.partitionBy($"c").orderBy($"d")))
>   .filter($"a".isNotNull)
> df4.show
> {code}
>  
> notice that the last statement in for df4 is to filter rows where a is null
> in spark 2.2.1, the above code prints:
> {code:java}
> +---+---+----+---+---+
> |  a|  b|   c|  d|  e|
> +---+---+----+---+---+
> |  1|  1|null|  1|  1|
> +---+---+----+---+---+
> {code}
> in spark 2.3.x, it prints: 
> {code:java}
> +----+----+----+----+---+
> |   a|   b|   c|   d|  e|
> +----+----+----+----+---+
> |null|null|null|null|  1|
> |   1|   1|null|   1|  1|
> |null|null|   2|null|  2|
> +----+----+----+----+---+
> {code}
>  the column a still contains null values
>  
> attached are the plans.
> in the parsed logical plan, the filter for isnotnull('a), is on top,
>  but in the optimized logical plan, it is pushed down



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25368) Incorrect predicate pushdown returns wrong result

2018-09-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25368:

Labels: correctness  (was: )

> Incorrect predicate pushdown returns wrong result
> -
>
> Key: SPARK-25368
> URL: https://issues.apache.org/jira/browse/SPARK-25368
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Lev Katzav
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.2, 2.4.0
>
> Attachments: plan.txt
>
>
> there is a breaking change in spark 2.3 (I checked on 2.3.1 and 2.3.2-rc5)
> the following code recreates the problem
>  (it's a bit convoluted examples, I tried to simplify it as much as possible 
> from my code)
> {code:java}
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions._
> import spark.implicits._
> case class Data(a: Option[Int],b: String,c: Option[String],d: String)
> val df1 = spark.createDataFrame(Seq(
>Data(Some(1), "1", None, "1"),
>Data(None, "2", Some("2"), "2")
> ))
> val df2 = df1
> .where( $"a".isNotNull)
> .withColumn("e", lit(null).cast("string"))
> val columns = df2.columns.map(c => col(c))
> val df3 = df1
> .select(
>   $"c",
>   $"b" as "e"
>   )
>   .withColumn("a", lit(null).cast("int"))
>   .withColumn("b", lit(null).cast("string"))
>   .withColumn("d", lit(null).cast("string"))
>   .select(columns :_*)
> val df4 =
>   df2.union(df3)
>   .withColumn("e", last(col("e"), ignoreNulls = 
> true).over(Window.partitionBy($"c").orderBy($"d")))
>   .filter($"a".isNotNull)
> df4.show
> {code}
>  
> notice that the last statement in for df4 is to filter rows where a is null
> in spark 2.2.1, the above code prints:
> {code:java}
> +---+---+----+---+---+
> |  a|  b|   c|  d|  e|
> +---+---+----+---+---+
> |  1|  1|null|  1|  1|
> +---+---+----+---+---+
> {code}
> in spark 2.3.x, it prints: 
> {code:java}
> +----+----+----+----+---+
> |   a|   b|   c|   d|  e|
> +----+----+----+----+---+
> |null|null|null|null|  1|
> |   1|   1|null|   1|  1|
> |null|null|   2|null|  2|
> +----+----+----+----+---+
> {code}
>  the column a still contains null values
>  
> attached are the plans.
> in the parsed logical plan, the filter for isnotnull('a), is on top,
>  but in the optimized logical plan, it is pushed down



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25368) Incorrect predicate pushdown returns wrong result

2018-09-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25368.
-
   Resolution: Fixed
Fix Version/s: 2.4.0
   2.3.2

> Incorrect predicate pushdown returns wrong result
> -
>
> Key: SPARK-25368
> URL: https://issues.apache.org/jira/browse/SPARK-25368
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Lev Katzav
>Assignee: Yuming Wang
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.2, 2.4.0
>
> Attachments: plan.txt
>
>
> there is a breaking change in spark 2.3 (I checked on 2.3.1 and 2.3.2-rc5)
> the following code recreates the problem
>  (it's a bit convoluted examples, I tried to simplify it as much as possible 
> from my code)
> {code:java}
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions._
> import spark.implicits._
> case class Data(a: Option[Int],b: String,c: Option[String],d: String)
> val df1 = spark.createDataFrame(Seq(
>Data(Some(1), "1", None, "1"),
>Data(None, "2", Some("2"), "2")
> ))
> val df2 = df1
> .where( $"a".isNotNull)
> .withColumn("e", lit(null).cast("string"))
> val columns = df2.columns.map(c => col(c))
> val df3 = df1
> .select(
>   $"c",
>   $"b" as "e"
>   )
>   .withColumn("a", lit(null).cast("int"))
>   .withColumn("b", lit(null).cast("string"))
>   .withColumn("d", lit(null).cast("string"))
>   .select(columns :_*)
> val df4 =
>   df2.union(df3)
>   .withColumn("e", last(col("e"), ignoreNulls = 
> true).over(Window.partitionBy($"c").orderBy($"d")))
>   .filter($"a".isNotNull)
> df4.show
> {code}
>  
> notice that the last statement in for df4 is to filter rows where a is null
> in spark 2.2.1, the above code prints:
> {code:java}
> +---+---+----+---+---+
> |  a|  b|   c|  d|  e|
> +---+---+----+---+---+
> |  1|  1|null|  1|  1|
> +---+---+----+---+---+
> {code}
> in spark 2.3.x, it prints: 
> {code:java}
> +----+----+----+----+---+
> |   a|   b|   c|   d|  e|
> +----+----+----+----+---+
> |null|null|null|null|  1|
> |   1|   1|null|   1|  1|
> |null|null|   2|null|  2|
> +----+----+----+----+---+
> {code}
>  the column a still contains null values
>  
> attached are the plans.
> in the parsed logical plan, the filter for isnotnull('a), is on top,
>  but in the optimized logical plan, it is pushed down



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25387) Malformed CSV causes NPE

2018-09-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25387:


Assignee: Apache Spark

> Malformed CSV causes NPE
> 
>
> Key: SPARK-25387
> URL: https://issues.apache.org/jira/browse/SPARK-25387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Loading a malformed CSV files or a dataset can cause NullPointerException, 
> for example the code:
> {code:scala}
> val schema = StructType(StructField("a", IntegerType) :: Nil)
> val input = spark.createDataset(Seq("\u\u\u0001234"))
> spark.read.schema(schema).csv(input).collect()
> {code} 
> crashes with the exception:
> {code:java}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:219)
>   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:210)
>   at 
> org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
>   at 
> org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
>   at 
> org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:68)
> {code}
> If schema is not specified, the following exception is thrown:
> {code:java}
> java.lang.NullPointerException
>   at 
> scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
>   at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:192)
>   at 
> scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:99)
>   at 
> scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVDataSource.makeSafeHeader(CSVDataSource.scala:109)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:247)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25387) Malformed CSV causes NPE

2018-09-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608476#comment-16608476
 ] 

Apache Spark commented on SPARK-25387:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22374

> Malformed CSV causes NPE
> 
>
> Key: SPARK-25387
> URL: https://issues.apache.org/jira/browse/SPARK-25387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Major
>
> Loading a malformed CSV files or a dataset can cause NullPointerException, 
> for example the code:
> {code:scala}
> val schema = StructType(StructField("a", IntegerType) :: Nil)
> val input = spark.createDataset(Seq("\u\u\u0001234"))
> spark.read.schema(schema).csv(input).collect()
> {code} 
> crashes with the exception:
> {code:java}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:219)
>   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:210)
>   at 
> org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
>   at 
> org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
>   at 
> org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:68)
> {code}
> If schema is not specified, the following exception is thrown:
> {code:java}
> java.lang.NullPointerException
>   at 
> scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
>   at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:192)
>   at 
> scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:99)
>   at 
> scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVDataSource.makeSafeHeader(CSVDataSource.scala:109)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:247)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25387) Malformed CSV causes NPE

2018-09-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608477#comment-16608477
 ] 

Apache Spark commented on SPARK-25387:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22374

> Malformed CSV causes NPE
> 
>
> Key: SPARK-25387
> URL: https://issues.apache.org/jira/browse/SPARK-25387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Loading a malformed CSV files or a dataset can cause NullPointerException, 
> for example the code:
> {code:scala}
> val schema = StructType(StructField("a", IntegerType) :: Nil)
> val input = spark.createDataset(Seq("\u\u\u0001234"))
> spark.read.schema(schema).csv(input).collect()
> {code} 
> crashes with the exception:
> {code:java}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:219)
>   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:210)
>   at 
> org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
>   at 
> org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
>   at 
> org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:68)
> {code}
> If schema is not specified, the following exception is thrown:
> {code:java}
> java.lang.NullPointerException
>   at 
> scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
>   at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:192)
>   at 
> scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:99)
>   at 
> scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVDataSource.makeSafeHeader(CSVDataSource.scala:109)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:247)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25387) Malformed CSV causes NPE

2018-09-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25387:


Assignee: (was: Apache Spark)

> Malformed CSV causes NPE
> 
>
> Key: SPARK-25387
> URL: https://issues.apache.org/jira/browse/SPARK-25387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Major
>
> Loading a malformed CSV files or a dataset can cause NullPointerException, 
> for example the code:
> {code:scala}
> val schema = StructType(StructField("a", IntegerType) :: Nil)
> val input = spark.createDataset(Seq("\u\u\u0001234"))
> spark.read.schema(schema).csv(input).collect()
> {code} 
> crashes with the exception:
> {code:java}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:219)
>   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:210)
>   at 
> org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
>   at 
> org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
>   at 
> org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:68)
> {code}
> If schema is not specified, the following exception is thrown:
> {code:java}
> java.lang.NullPointerException
>   at 
> scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
>   at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:192)
>   at 
> scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:99)
>   at 
> scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVDataSource.makeSafeHeader(CSVDataSource.scala:109)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:247)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25387) Malformed CSV causes NPE

2018-09-09 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25387:
--

 Summary: Malformed CSV causes NPE
 Key: SPARK-25387
 URL: https://issues.apache.org/jira/browse/SPARK-25387
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Loading malformed CSV files or a malformed dataset can cause a 
NullPointerException; for example, the code:
{code:scala}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import spark.implicits._
val schema = StructType(StructField("a", IntegerType) :: Nil)
val input = spark.createDataset(Seq("\u0000\u0000\u0001234"))
spark.read.schema(schema).csv(input).collect()
{code}
crashes with the exception:
{code:java}
Caused by: java.lang.NullPointerException
at 
org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:219)
at 
org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:210)
at 
org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
at 
org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
at 
org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:68)
{code}

If the schema is not specified, the following exception is thrown:
{code:java}
java.lang.NullPointerException
at 
scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:192)
at 
scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:99)
at 
scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:186)
at 
org.apache.spark.sql.execution.datasources.csv.CSVDataSource.makeSafeHeader(CSVDataSource.scala:109)
at 
org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:247)
{code}
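
For context, here is a minimal defensive sketch of the kind of null guard in the 
CSV header handling that would avoid the second NPE. It is illustrative only, not 
the actual Spark patch; the method name, its parameters, and the positional 
`_cN` fallback names are assumptions made for the example.
{code:scala}
// Hypothetical guard (a sketch, not the fix that was merged for this ticket):
// treat a null first row as "no header" before zipping it with indices.
def makeSafeHeaderSketch(firstRow: Array[String], caseSensitive: Boolean): Array[String] = {
  val row = Option(firstRow).getOrElse(Array.empty[String])
  row.zipWithIndex.map { case (value, index) =>
    if (value == null || value.isEmpty) s"_c$index" // fall back to positional names
    else if (caseSensitive) value
    else value.toLowerCase
  }
}
{code}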



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25371) Vector Assembler with no input columns leads to opaque error

2018-09-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608444#comment-16608444
 ] 

Apache Spark commented on SPARK-25371:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22373

> Vector Assembler with no input columns leads to opaque error
> 
>
> Key: SPARK-25371
> URL: https://issues.apache.org/jira/browse/SPARK-25371
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Victor Alor
>Priority: Trivial
>
> When `VectorAssembler` is given an empty array as its inputCols, it throws an 
> opaque error. In versions earlier than 2.3, `VectorAssembler` simply appends 
> a column containing empty vectors.
>  
> {code:scala}
> import org.apache.spark.ml.feature.VectorAssembler
> val inputCols = Array.empty[String]
> val vectorAssembler = new VectorAssembler()
>   .setInputCols(inputCols)
>   .setOutputCol("A")
> vectorAssembler.transform(df)
> {code}
> In versions 2.3 and later, this throws the exception below:
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 'named_struct()' due 
> to data type mismatch: input to function named_struct requires at least one 
> argument;;
> {code}
> Whereas in versions earlier than 2.3 it just adds a column containing an 
> empty vector.
> I'm not certain whether this is an intentional choice or an actual bug. If it 
> is a bug, `VectorAssembler` should be modified to append an empty vector 
> column when it detects no inputCols.
>  
> If it is a design decision, it would be nice to throw a human-readable 
> exception explicitly stating that inputCols must not be empty; the current 
> error is somewhat opaque.
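>  
> As a concrete illustration of the second option, here is a minimal fail-fast 
> sketch; it is an assumption-laden example, not the change made in the linked 
> pull request, and `validateInputCols` is a name invented for this sketch.
> {code:scala}
> // Hypothetical fail-fast validation: reject an empty or missing inputCols
> // with a message that names the misconfiguration instead of an opaque error.
> def validateInputCols(inputCols: Array[String]): Unit = {
>   require(inputCols != null && inputCols.nonEmpty,
>     "VectorAssembler requires at least one input column; inputCols must not be empty.")
> }
> {code}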



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25371) Vector Assembler with no input columns leads to opaque error

2018-09-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25371:


Assignee: Apache Spark

> Vector Assembler with no input columns leads to opaque error
> 
>
> Key: SPARK-25371
> URL: https://issues.apache.org/jira/browse/SPARK-25371
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Victor Alor
>Assignee: Apache Spark
>Priority: Trivial
>
> When `VectorAssembler` is given an empty array as its inputCols, it throws an 
> opaque error. In versions earlier than 2.3, `VectorAssembler` simply appends 
> a column containing empty vectors.
>  
> {code:scala}
> import org.apache.spark.ml.feature.VectorAssembler
> val inputCols = Array.empty[String]
> val vectorAssembler = new VectorAssembler()
>   .setInputCols(inputCols)
>   .setOutputCol("A")
> vectorAssembler.transform(df)
> {code}
> In versions 2.3 and later, this throws the exception below:
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 'named_struct()' due 
> to data type mismatch: input to function named_struct requires at least one 
> argument;;
> {code}
> Whereas in versions earlier than 2.3 it just adds a column containing an 
> empty vector.
> I'm not certain whether this is an intentional choice or an actual bug. If it 
> is a bug, `VectorAssembler` should be modified to append an empty vector 
> column when it detects no inputCols.
>  
> If it is a design decision, it would be nice to throw a human-readable 
> exception explicitly stating that inputCols must not be empty; the current 
> error is somewhat opaque.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25371) Vector Assembler with no input columns leads to opaque error

2018-09-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25371:


Assignee: (was: Apache Spark)

> Vector Assembler with no input columns leads to opaque error
> 
>
> Key: SPARK-25371
> URL: https://issues.apache.org/jira/browse/SPARK-25371
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Victor Alor
>Priority: Trivial
>
> When `VectorAssembler` is given an empty array as its inputCols, it throws an 
> opaque error. In versions earlier than 2.3, `VectorAssembler` simply appends 
> a column containing empty vectors.
>  
> {code:scala}
> import org.apache.spark.ml.feature.VectorAssembler
> val inputCols = Array.empty[String]
> val vectorAssembler = new VectorAssembler()
>   .setInputCols(inputCols)
>   .setOutputCol("A")
> vectorAssembler.transform(df)
> {code}
> In versions 2.3 and later, this throws the exception below:
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 'named_struct()' due 
> to data type mismatch: input to function named_struct requires at least one 
> argument;;
> {code}
> Whereas in versions earlier than 2.3 it just adds a column containing an 
> empty vector.
> I'm not certain whether this is an intentional choice or an actual bug. If it 
> is a bug, `VectorAssembler` should be modified to append an empty vector 
> column when it detects no inputCols.
>  
> If it is a design decision, it would be nice to throw a human-readable 
> exception explicitly stating that inputCols must not be empty; the current 
> error is somewhat opaque.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25371) Vector Assembler with no input columns leads to opaque error

2018-09-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608443#comment-16608443
 ] 

Apache Spark commented on SPARK-25371:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22373

> Vector Assembler with no input columns leads to opaque error
> 
>
> Key: SPARK-25371
> URL: https://issues.apache.org/jira/browse/SPARK-25371
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Victor Alor
>Priority: Trivial
>
> When `VectorAssembler` is given an empty array as its inputCols, it throws an 
> opaque error. In versions earlier than 2.3, `VectorAssembler` simply appends 
> a column containing empty vectors.
>  
> {code:scala}
> import org.apache.spark.ml.feature.VectorAssembler
> val inputCols = Array.empty[String]
> val vectorAssembler = new VectorAssembler()
>   .setInputCols(inputCols)
>   .setOutputCol("A")
> vectorAssembler.transform(df)
> {code}
> In versions 2.3 and later, this throws the exception below:
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 'named_struct()' due 
> to data type mismatch: input to function named_struct requires at least one 
> argument;;
> {code}
> Whereas in versions earlier than 2.3 it just adds a column containing an 
> empty vector.
> I'm not certain whether this is an intentional choice or an actual bug. If it 
> is a bug, `VectorAssembler` should be modified to append an empty vector 
> column when it detects no inputCols.
>  
> If it is a design decision, it would be nice to throw a human-readable 
> exception explicitly stating that inputCols must not be empty; the current 
> error is somewhat opaque.
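>  
> And a sketch of the first option mentioned above, appending a column of empty 
> vectors when no input columns are configured; the UDF-based approach and the 
> column name "A" are illustrative assumptions, not how pre-2.3 Spark did this 
> internally.
> {code:scala}
> // Hypothetical workaround: emit an empty (zero-length) vector for every row.
> import org.apache.spark.ml.linalg.Vectors
> import org.apache.spark.sql.functions.udf
> val emptyVectorUdf = udf(() => Vectors.dense(Array.empty[Double]))
> val withEmptyVec = df.withColumn("A", emptyVectorUdf())
> {code}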



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25385) Upgrade jackson version to 2.7.8

2018-09-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608430#comment-16608430
 ] 

Apache Spark commented on SPARK-25385:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22372

> Upgrade jackson version to 2.7.8
> 
>
> Key: SPARK-25385
> URL: https://issues.apache.org/jira/browse/SPARK-25385
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> This upgrade fixes the following {{JsonMappingException}}:
> {noformat}
> export SPARK_PREPEND_CLASSES=true
> build/sbt clean package -Phadoop-3.1
> spark-shell
> scala> spark.range(10).write.parquet("/tmp/spark/parquet")
> com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson 
> version: 2.7.8
>   at 
> com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:64)
>   at 
> com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:19)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:730)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25385) Upgrade jackson version to 2.7.8

2018-09-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608429#comment-16608429
 ] 

Apache Spark commented on SPARK-25385:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22372

> Upgrade jackson version to 2.7.8
> 
>
> Key: SPARK-25385
> URL: https://issues.apache.org/jira/browse/SPARK-25385
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> This upgrade fixes the following {{JsonMappingException}}:
> {noformat}
> export SPARK_PREPEND_CLASSES=true
> build/sbt clean package -Phadoop-3.1
> spark-shell
> scala> spark.range(10).write.parquet("/tmp/spark/parquet")
> com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson 
> version: 2.7.8
>   at 
> com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:64)
>   at 
> com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:19)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:730)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25385) Upgrade jackson version to 2.7.8

2018-09-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25385:


Assignee: Apache Spark

> Upgrade jackson version to 2.7.8
> 
>
> Key: SPARK-25385
> URL: https://issues.apache.org/jira/browse/SPARK-25385
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> This upgrade fixes the following {{JsonMappingException}}:
> {noformat}
> export SPARK_PREPEND_CLASSES=true
> build/sbt clean package -Phadoop-3.1
> spark-shell
> scala> spark.range(10).write.parquet("/tmp/spark/parquet")
> com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson 
> version: 2.7.8
>   at 
> com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:64)
>   at 
> com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:19)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:730)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25385) Upgrade jackson version to 2.7.8

2018-09-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25385:


Assignee: (was: Apache Spark)

> Upgrade jackson version to 2.7.8
> 
>
> Key: SPARK-25385
> URL: https://issues.apache.org/jira/browse/SPARK-25385
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> This upgrade fixes the following {{JsonMappingException}}:
> {noformat}
> export SPARK_PREPEND_CLASSES=true
> build/sbt clean package -Phadoop-3.1
> spark-shell
> scala> spark.range(10).write.parquet("/tmp/spark/parquet")
> com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson 
> version: 2.7.8
>   at 
> com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:64)
>   at 
> com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:19)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:730)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25386) Don't need to synchronize the IndexShuffleBlockResolver for each writeIndexFileAndCommit call

2018-09-09 Thread Xianyang Liu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyang Liu updated SPARK-25386:
-
Summary: Don't need to synchronize the IndexShuffleBlockResolver for each 
writeIndexFileAndCommit call  (was: Don't need to synchronize the 
IndexShuffleBlockResolver for each writeIndexFileAndCommit)

> Don't need to synchronize the IndexShuffleBlockResolver for each 
> writeIndexFileAndCommit call
> -
>
> Key: SPARK-25386
> URL: https://issues.apache.org/jira/browse/SPARK-25386
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Xianyang Liu
>Priority: Major
>
> Now, we need to synchronize on the instance of IndexShuffleBlockResolver in 
> order to make the commit check and the tmp-file rename atomic. This can be 
> improved: we could synchronize on a lock that is different for each `shuffleId 
> + mapId` instead of synchronizing on the IndexShuffleBlockResolver for each 
> writeIndexFileAndCommit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25386) Don't need to synchronize the IndexShuffleBlockResolver for each writeIndexFileAndCommit call

2018-09-09 Thread Xianyang Liu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyang Liu updated SPARK-25386:
-
Description: Now, we need to synchronize on the instance of 
IndexShuffleBlockResolver in order to make the commit check and the tmp-file 
rename atomic. This can be improved: we could synchronize on a lock that is 
different for each `shuffleId + mapId` instead of synchronizing on the 
IndexShuffleBlockResolver for each writeIndexFileAndCommit call.  (was: Now, we 
need to synchronize on the instance of IndexShuffleBlockResolver in order to 
make the commit check and the tmp-file rename atomic. This can be improved: we 
could synchronize on a lock that is different for each `shuffleId + mapId` 
instead of synchronizing on the IndexShuffleBlockResolver for each 
writeIndexFileAndCommit.)

> Don't need to synchronize the IndexShuffleBlockResolver for each 
> writeIndexFileAndCommit call
> -
>
> Key: SPARK-25386
> URL: https://issues.apache.org/jira/browse/SPARK-25386
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Xianyang Liu
>Priority: Major
>
> Now, we need to synchronize on the instance of IndexShuffleBlockResolver in 
> order to make the commit check and the tmp-file rename atomic. This can be 
> improved: we could synchronize on a lock that is different for each `shuffleId 
> + mapId` instead of synchronizing on the IndexShuffleBlockResolver for each 
> writeIndexFileAndCommit call.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25386) Don't need to synchronize the IndexShuffleBlockResolver for each writeIndexFileAndCommit

2018-09-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608427#comment-16608427
 ] 

Apache Spark commented on SPARK-25386:
--

User 'ConeyLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/22371

> Don't need to synchronize the IndexShuffleBlockResolver for each 
> writeIndexFileAndCommit
> 
>
> Key: SPARK-25386
> URL: https://issues.apache.org/jira/browse/SPARK-25386
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Xianyang Liu
>Priority: Major
>
> Now, we need to synchronize on the instance of IndexShuffleBlockResolver in 
> order to make the commit check and the tmp-file rename atomic. This can be 
> improved: we could synchronize on a lock that is different for each `shuffleId 
> + mapId` instead of synchronizing on the IndexShuffleBlockResolver for each 
> writeIndexFileAndCommit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25386) Don't need to synchronize the IndexShuffleBlockResolver for each writeIndexFileAndCommit

2018-09-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25386:


Assignee: (was: Apache Spark)

> Don't need to synchronize the IndexShuffleBlockResolver for each 
> writeIndexFileAndCommit
> 
>
> Key: SPARK-25386
> URL: https://issues.apache.org/jira/browse/SPARK-25386
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Xianyang Liu
>Priority: Major
>
> Now, we need to synchronize on the instance of IndexShuffleBlockResolver in 
> order to make the commit check and the tmp-file rename atomic. This can be 
> improved: we could synchronize on a lock that is different for each `shuffleId 
> + mapId` instead of synchronizing on the IndexShuffleBlockResolver for each 
> writeIndexFileAndCommit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25386) Don't need to synchronize the IndexShuffleBlockResolver for each writeIndexFileAndCommit

2018-09-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25386:


Assignee: Apache Spark

> Don't need to synchronize the IndexShuffleBlockResolver for each 
> writeIndexFileAndCommit
> 
>
> Key: SPARK-25386
> URL: https://issues.apache.org/jira/browse/SPARK-25386
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Xianyang Liu
>Assignee: Apache Spark
>Priority: Major
>
> Now, we need to synchronize on the instance of IndexShuffleBlockResolver in 
> order to make the commit check and the tmp-file rename atomic. This can be 
> improved: we could synchronize on a lock that is different for each `shuffleId 
> + mapId` instead of synchronizing on the IndexShuffleBlockResolver for each 
> writeIndexFileAndCommit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25386) Don't need to synchronize the IndexShuffleBlockResolver for each writeIndexFileAndCommit

2018-09-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608426#comment-16608426
 ] 

Apache Spark commented on SPARK-25386:
--

User 'ConeyLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/22371

> Don't need to synchronize the IndexShuffleBlockResolver for each 
> writeIndexFileAndCommit
> 
>
> Key: SPARK-25386
> URL: https://issues.apache.org/jira/browse/SPARK-25386
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Xianyang Liu
>Priority: Major
>
> Now, we need to synchronize on the instance of IndexShuffleBlockResolver in 
> order to make the commit check and the tmp-file rename atomic. This can be 
> improved: we could synchronize on a lock that is different for each `shuffleId 
> + mapId` instead of synchronizing on the IndexShuffleBlockResolver for each 
> writeIndexFileAndCommit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25385) Upgrade jackson version to 2.7.8

2018-09-09 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-25385:

Summary: Upgrade jackson version to 2.7.8  (was: Upgrade 
fasterxml.jackson.databind.version to 2.7.8)

> Upgrade jackson version to 2.7.8
> 
>
> Key: SPARK-25385
> URL: https://issues.apache.org/jira/browse/SPARK-25385
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> This upgrade fixes the following {{JsonMappingException}}:
> {noformat}
> export SPARK_PREPEND_CLASSES=true
> build/sbt clean package -Phadoop-3.1
> spark-shell
> scala> spark.range(10).write.parquet("/tmp/spark/parquet")
> com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson 
> version: 2.7.8
>   at 
> com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:64)
>   at 
> com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:19)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:730)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25386) Don't need to synchronize the IndexShuffleBlockResolver for each writeIndexFileAndCommit

2018-09-09 Thread Xianyang Liu (JIRA)
Xianyang Liu created SPARK-25386:


 Summary: Don't need to synchronize the IndexShuffleBlockResolver 
for each writeIndexFileAndCommit
 Key: SPARK-25386
 URL: https://issues.apache.org/jira/browse/SPARK-25386
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Xianyang Liu


Now, we need to synchronize on the instance of IndexShuffleBlockResolver in 
order to make the commit check and the tmp-file rename atomic. This can be 
improved: we could synchronize on a lock that is different for each `shuffleId + 
mapId` instead of synchronizing on the IndexShuffleBlockResolver for each 
writeIndexFileAndCommit.
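
To make the proposal concrete, here is a minimal lock-striping sketch. The class 
and method names are invented for illustration and the mapId is assumed to be an 
Int; this is not the actual change to IndexShuffleBlockResolver.
{code:scala}
import java.util.concurrent.ConcurrentHashMap

// Hypothetical sketch: one lock object per (shuffleId, mapId), so unrelated map
// outputs no longer contend on the single IndexShuffleBlockResolver monitor.
class CommitLocks {
  private val locks = new ConcurrentHashMap[(Int, Int), AnyRef]()

  def withCommitLock[T](shuffleId: Int, mapId: Int)(body: => T): T = {
    val key = (shuffleId, mapId)
    val candidate = new AnyRef
    val existing = locks.putIfAbsent(key, candidate)
    val lock = if (existing != null) existing else candidate
    lock.synchronized {
      // The commit check and the tmp-file rename run atomically per map output.
      body
    }
  }
}
{code}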



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25385) Upgrade fasterxml.jackson.databind.version to 2.7.8

2018-09-09 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-25385:
---

 Summary: Upgrade fasterxml.jackson.databind.version to 2.7.8
 Key: SPARK-25385
 URL: https://issues.apache.org/jira/browse/SPARK-25385
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 2.4.0
Reporter: Yuming Wang


This upgrade fixes the following {{JsonMappingException}}:

{noformat}
export SPARK_PREPEND_CLASSES=true
build/sbt clean package -Phadoop-3.1

spark-shell

scala> spark.range(10).write.parquet("/tmp/spark/parquet")

com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson version: 2.7.8
  at com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:64)
  at com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:19)
  at com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:730)
  at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
  at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala){noformat}
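
As a quick way to see the mismatch the exception complains about, the snippet 
below prints the versions reported by the two Jackson artifacts involved. It is 
an illustrative diagnostic, not part of the reported reproduction, and assumes 
both jackson-databind and jackson-module-scala are on the classpath (as they are 
in a Spark build).
{code:scala}
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Version of jackson-databind actually on the classpath.
println(new ObjectMapper().version)
// Version reported by jackson-module-scala, which must be compatible with it.
println(DefaultScalaModule.version)
{code}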



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org