[jira] [Commented] (SPARK-32778) Accidental Data Deletion on calling saveAsTable

2020-09-02 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189848#comment-17189848
 ] 

Takeshi Yamamuro commented on SPARK-32778:
--

Have you tried the latest releases, v2.4.6 or v3.0.0?

> Accidental Data Deletion on calling saveAsTable
> ---
>
> Key: SPARK-32778
> URL: https://issues.apache.org/jira/browse/SPARK-32778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Aman Rastogi
>Priority: Major
>
> {code:java}
> df.write.option("path", 
> "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table)
> {code}
> The code above deleted the data present at the path "/already/existing/path". 
> This happened because the table did not yet exist in the Hive metastore, even 
> though the given path already contained data. If the table is not present in 
> the Hive metastore, SaveMode is internally changed to SaveMode.Overwrite 
> regardless of what the user provided, which leads to data deletion. This 
> behavior was introduced as part of 
> https://issues.apache.org/jira/browse/SPARK-19583. 
> Now suppose the user is not using an external Hive metastore (the metastore is 
> tied to a cluster) and the cluster goes down, or the user has to migrate to a 
> new cluster for some other reason. When the user tries to save data with the 
> code above on the new cluster, Spark first deletes the data at the path. This 
> could be production data, and the user is completely unaware of the deletion 
> because they provided SaveMode.Append or ErrorIfExists. This is an accidental 
> data deletion.
>  
> Repro Steps:
>  
>  # Save data through a Hive table, as shown in the code above.
>  # Create another cluster and save data into a new table on that cluster, using 
> the same path.
>  
> Proposed Fix:
> Instead of changing SaveMode to Overwrite, we should change it to ErrorIfExists 
> in the class CreateDataSourceTableAsSelectCommand.
> Change (line 154)
>  
> {code:java}
> val result = saveDataIntoTable(
>   sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = false)
> {code}
> to
>  
> {code:java}
> val result = saveDataIntoTable(
>   sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, tableExists = false)
> {code}
> This should not break CTAS. Even in the CTAS case, the user may not want to 
> delete data that already exists, as the deletion could be accidental.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32778) Accidental Data Deletion on calling saveAsTable

2020-09-02 Thread Aman Rastogi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Rastogi updated SPARK-32778:
-
Description: 
{code:java}
df.write.option("path", 
"/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table)
{code}
The code above deleted the data present at the path "/already/existing/path". 
This happened because the table did not yet exist in the Hive metastore, even 
though the given path already contained data. If the table is not present in the 
Hive metastore, SaveMode is internally changed to SaveMode.Overwrite regardless 
of what the user provided, which leads to data deletion. This behavior was 
introduced as part of https://issues.apache.org/jira/browse/SPARK-19583. 

Now suppose the user is not using an external Hive metastore (the metastore is 
tied to a cluster) and the cluster goes down, or the user has to migrate to a new 
cluster for some other reason. When the user tries to save data with the code 
above on the new cluster, Spark first deletes the data at the path. This could be 
production data, and the user is completely unaware of the deletion because they 
provided SaveMode.Append or ErrorIfExists. This is an accidental data deletion.

 

Repro Steps:

 
 # Save data through a Hive table, as shown in the code above.
 # Create another cluster and save data into a new table on that cluster, using 
the same path.

 

Proposed Fix:

Instead of changing SaveMode to Overwrite, we should change it to ErrorIfExists 
in the class CreateDataSourceTableAsSelectCommand.

Change (line 154)

 
{code:java}
val result = saveDataIntoTable(
  sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = false)
{code}
to

 
{code:java}
val result = saveDataIntoTable(
  sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, tableExists = false)
{code}
This should not break CTAS. Even in the CTAS case, the user may not want to 
delete data that already exists, as the deletion could be accidental.
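
For illustration, here is a minimal user-side guard, sketched under the assumption 
of an active SparkSession named spark and an existing DataFrame df (the path and 
table names are placeholders). It refuses to write when the path already holds 
data but the table is missing from the metastore, which is exactly the situation 
where the internal switch to SaveMode.Overwrite would delete data:
{code:java}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

val targetPath = "/already/existing/path"   // placeholder path
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val path = new Path(targetPath)
val pathHasData = fs.exists(path) && fs.listStatus(path).nonEmpty
val tableInMetastore = spark.catalog.tableExists("db.table")

if (pathHasData && !tableInMetastore) {
  // Writing here would currently trigger the internal switch to SaveMode.Overwrite.
  throw new IllegalStateException(
    s"Refusing to write: $targetPath already has data but db.table is not in the metastore")
} else {
  df.write.option("path", targetPath).mode(SaveMode.Append).format("json").saveAsTable("db.table")
}
{code}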

 

  was:
{code:java}
df.write.option("path", 
"/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table)
{code}
The code above deleted the data present at the path "/already/existing/path". 
This happened because the table did not yet exist in the Hive metastore, even 
though the given path already contained data. If the table is not present in the 
Hive metastore, SaveMode is internally changed to SaveMode.Overwrite regardless 
of what the user provided, which leads to data deletion. This behavior was 
introduced as part of https://issues.apache.org/jira/browse/SPARK-19583. 

Now suppose the user is not using an external Hive metastore (the metastore is 
tied to a cluster) and the cluster goes down, or the user has to migrate to a new 
cluster for some other reason. When the user tries to save data with the code 
above on the new cluster, Spark first deletes the data at the path. This could be 
production data, and the user is completely unaware of the deletion because they 
provided SaveMode.Append or ErrorIfExists. This is an accidental data deletion.

 

Proposed Fix:

Instead of changing SaveMode to Overwrite, we should change it to ErrorIfExists 
in the class CreateDataSourceTableAsSelectCommand.

Change (line 154)

 
{code:java}
val result = saveDataIntoTable(
  sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = false)
{code}
to

 
{code:java}
val result = saveDataIntoTable(
  sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, tableExists = false)
{code}
This should not break CTAS. Even in the CTAS case, the user may not want to 
delete data that already exists, as the deletion could be accidental.

 


> Accidental Data Deletion on calling saveAsTable
> ---
>
> Key: SPARK-32778
> URL: https://issues.apache.org/jira/browse/SPARK-32778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Aman Rastogi
>Priority: Major
>
> {code:java}
> df.write.option("path", 
> "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table)
> {code}
> The code above deleted the data present at the path "/already/existing/path". 
> This happened because the table did not yet exist in the Hive metastore, even 
> though the given path already contained data. If the table is not present in 
> the Hive metastore, SaveMode is internally changed to SaveMode.Overwrite 
> regardless of what the user provided, which leads to data deletion. This 
> behavior was introduced as part of 
> https://issues.apache.org/jira/browse/SPARK-19583. 
> Now suppose the user is not using an external Hive metastore (the metastore is 
> tied to a cluster) and the cluster goes down, or the user has to migrate to a 
> new cluster for some other reason. When the user tries to save data with the 
> code above on the new cluster, Spark first deletes the data at the path. This 
> could be production data, and the user is completely unaware of the deletion 
> because they provided SaveMode.Append or ErrorIfExists.

[jira] [Commented] (SPARK-32747) Deduplicate configuration set/unset in test_sparkSQL_arrow.R

2020-09-02 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189841#comment-17189841
 ] 

Takeshi Yamamuro commented on SPARK-32747:
--

Yay! (Note: I've checked all the related PRs/Tickets)

> Deduplicate configuration set/unset in test_sparkSQL_arrow.R
> 
>
> Key: SPARK-32747
> URL: https://issues.apache.org/jira/browse/SPARK-32747
> Project: Spark
>  Issue Type: Test
>  Components: R, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0, 3.0.2
>
>
> Currently, many configuration set/unset calls are duplicated across the 
> `test_sparkSQL_arrow.R` test cases. We can set the configuration once globally 
> and deduplicate such try-catch logic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32747) Deduplicate configuration set/unset in test_sparkSQL_arrow.R

2020-09-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189835#comment-17189835
 ] 

Hyukjin Kwon commented on SPARK-32747:
--

Thank you, Takeshi, for correcting it here and in the other JIRAs!

> Deduplicate configuration set/unset in test_sparkSQL_arrow.R
> 
>
> Key: SPARK-32747
> URL: https://issues.apache.org/jira/browse/SPARK-32747
> Project: Spark
>  Issue Type: Test
>  Components: R, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0, 3.0.2
>
>
> Currently, many configuration set/unset calls are duplicated across the 
> `test_sparkSQL_arrow.R` test cases. We can set the configuration once globally 
> and deduplicate such try-catch logic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32774) Don't track docs/.jekyll-cache

2020-09-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189834#comment-17189834
 ] 

Hyukjin Kwon commented on SPARK-32774:
--

Thanks [~maropu]!

> Don't track docs/.jekyll-cache
> --
>
> Key: SPARK-32774
> URL: https://issues.apache.org/jira/browse/SPARK-32774
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.1.0, 3.0.2
>
>
> I noticed that docs/.jekyll-cache can sometimes be created, and it should not 
> be tracked.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32771) The example of expressions.Aggregator in Javadoc / Scaladoc is wrong

2020-09-02 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32771:
-
Fix Version/s: (was: 3.0.1)
   3.0.2

> The example of expressions.Aggregator in Javadoc / Scaladoc is wrong
> 
>
> Key: SPARK-32771
> URL: https://issues.apache.org/jira/browse/SPARK-32771
> Project: Spark
>  Issue Type: Bug
>  Components: docs
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 2.4.7, 3.1.0, 3.0.2
>
>
> There is an example of expressions.Aggregator in the Javadoc and Scaladoc, as 
> follows.
> {code:java}
> val customSummer =  new Aggregator[Data, Int, Int] {
>   def zero: Int = 0
>   def reduce(b: Int, a: Data): Int = b + a.i
>   def merge(b1: Int, b2: Int): Int = b1 + b2
>   def finish(r: Int): Int = r
> }.toColumn(){code}
> But this example doesn't work because it doesn't define bufferEncoder and 
> outputEncoder.
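> For reference, here is a minimal sketch of a version that compiles (the Data 
> case class and the encoder choices are assumptions for illustration, not taken 
> from the existing docs):
> {code:java}
> import org.apache.spark.sql.{Encoder, Encoders}
> import org.apache.spark.sql.expressions.Aggregator
> 
> case class Data(i: Int)
> 
> val customSummer = new Aggregator[Data, Int, Int] {
>   def zero: Int = 0
>   def reduce(b: Int, a: Data): Int = b + a.i
>   def merge(b1: Int, b2: Int): Int = b1 + b2
>   def finish(r: Int): Int = r
>   // Both encoders are required abstract members of Aggregator.
>   def bufferEncoder: Encoder[Int] = Encoders.scalaInt
>   def outputEncoder: Encoder[Int] = Encoders.scalaInt
> }.toColumn
> {code}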



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32782) Refactory StreamingRelationV2 and move it to catalyst

2020-09-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189826#comment-17189826
 ] 

Apache Spark commented on SPARK-32782:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/29633

> Refactory StreamingRelationV2 and move it to catalyst
> -
>
> Key: SPARK-32782
> URL: https://issues.apache.org/jira/browse/SPARK-32782
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Yuanjian Li
>Priority: Major
>
> Currently, StreamingRelationV2 is bound to TableProvider. To make it more 
> flexible and extensible, it should be moved to the catalyst module and bound to 
> the Table interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32774) Don't track docs/.jekyll-cache

2020-09-02 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32774:
-
Fix Version/s: (was: 3.0.1)
   3.0.2

> Don't track docs/.jekyll-cache
> --
>
> Key: SPARK-32774
> URL: https://issues.apache.org/jira/browse/SPARK-32774
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.1.0, 3.0.2
>
>
> I noticed that docs/.jekyll-cache can sometimes be created, and it should not 
> be tracked.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32782) Refactory StreamingRelationV2 and move it to catalyst

2020-09-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32782:


Assignee: (was: Apache Spark)

> Refactory StreamingRelationV2 and move it to catalyst
> -
>
> Key: SPARK-32782
> URL: https://issues.apache.org/jira/browse/SPARK-32782
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Yuanjian Li
>Priority: Major
>
> Currently, StreamingRelationV2 is bound to TableProvider. To make it more 
> flexible and extensible, it should be moved to the catalyst module and bound to 
> the Table interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32782) Refactory StreamingRelationV2 and move it to catalyst

2020-09-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189829#comment-17189829
 ] 

Apache Spark commented on SPARK-32782:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/29633

> Refactory StreamingRelationV2 and move it to catalyst
> -
>
> Key: SPARK-32782
> URL: https://issues.apache.org/jira/browse/SPARK-32782
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Yuanjian Li
>Priority: Major
>
> Currently, StreamingRelationV2 is bound to TableProvider. To make it more 
> flexible and extensible, it should be moved to the catalyst module and bound to 
> the Table interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32782) Refactory StreamingRelationV2 and move it to catalyst

2020-09-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32782:


Assignee: Apache Spark

> Refactory StreamingRelationV2 and move it to catalyst
> -
>
> Key: SPARK-32782
> URL: https://issues.apache.org/jira/browse/SPARK-32782
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Yuanjian Li
>Assignee: Apache Spark
>Priority: Major
>
> Currently, StreamingRelationV2 is bound to TableProvider. To make it more 
> flexible and extensible, it should be moved to the catalyst module and bound to 
> the Table interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32771) The example of expressions.Aggregator in Javadoc / Scaladoc is wrong

2020-09-02 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189828#comment-17189828
 ] 

Takeshi Yamamuro commented on SPARK-32771:
--

Since v3.0.1 does not include this fix, I reset "Target Version/s" from 3.0.1 
to 3.0.2.

> The example of expressions.Aggregator in Javadoc / Scaladoc is wrong
> 
>
> Key: SPARK-32771
> URL: https://issues.apache.org/jira/browse/SPARK-32771
> Project: Spark
>  Issue Type: Bug
>  Components: docs
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> There is an example of expressions.Aggregator in the Javadoc and Scaladoc, as 
> follows.
> {code:java}
> val customSummer =  new Aggregator[Data, Int, Int] {
>   def zero: Int = 0
>   def reduce(b: Int, a: Data): Int = b + a.i
>   def merge(b1: Int, b2: Int): Int = b1 + b2
>   def finish(r: Int): Int = r
> }.toColumn(){code}
> But this example doesn't work because it doesn't define bufferEncoder and 
> outputEncoder.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32774) Don't track docs/.jekyll-cache

2020-09-02 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189827#comment-17189827
 ] 

Takeshi Yamamuro commented on SPARK-32774:
--

Since v3.0.1 does not include this fix, I reset "Target Version/s" from 3.0.1 
to 3.0.2.

> Don't track docs/.jekyll-cache
> --
>
> Key: SPARK-32774
> URL: https://issues.apache.org/jira/browse/SPARK-32774
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
>
> I noticed that docs/.jekyll-cache can sometimes be created, and it should not 
> be tracked.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32747) Deduplicate configuration set/unset in test_sparkSQL_arrow.R

2020-09-02 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189825#comment-17189825
 ] 

Takeshi Yamamuro commented on SPARK-32747:
--

Since v3.0.1 does not include this fix, I reset "Target Version/s" from 3.0.1 
to 3.0.2.

> Deduplicate configuration set/unset in test_sparkSQL_arrow.R
> 
>
> Key: SPARK-32747
> URL: https://issues.apache.org/jira/browse/SPARK-32747
> Project: Spark
>  Issue Type: Test
>  Components: R, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> Currently, many configuration set/unset calls are duplicated across the 
> `test_sparkSQL_arrow.R` test cases. We can set the configuration once globally 
> and deduplicate such try-catch logic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32747) Deduplicate configuration set/unset in test_sparkSQL_arrow.R

2020-09-02 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32747:
-
Fix Version/s: (was: 3.0.1)
   3.0.2

> Deduplicate configuration set/unset in test_sparkSQL_arrow.R
> 
>
> Key: SPARK-32747
> URL: https://issues.apache.org/jira/browse/SPARK-32747
> Project: Spark
>  Issue Type: Test
>  Components: R, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0, 3.0.2
>
>
> Currently, many configuration set/unset calls are duplicated across the 
> `test_sparkSQL_arrow.R` test cases. We can set the configuration once globally 
> and deduplicate such try-catch logic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32693) Compare two dataframes with same schema except nullable property

2020-09-02 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32693:
-
Fix Version/s: (was: 3.0.1)
   3.0.2

> Compare two dataframes with same schema except nullable property
> 
>
> Key: SPARK-32693
> URL: https://issues.apache.org/jira/browse/SPARK-32693
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.0
>Reporter: david bernuau
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 2.4.7, 3.1.0, 3.0.2
>
>
> My aim is to compare two dataframes with very close schemas: the same number of 
> fields, with the same names, types and metadata. The only difference is that a 
> given field might be nullable in one dataframe and not in the other.
> Here is the code that I used:
> {code:java}
> val session = SparkSession.builder().getOrCreate()
> import org.apache.spark.sql.Row
> import java.sql.Timestamp
> import scala.collection.JavaConverters._
> case class A(g: Timestamp, h: Option[Timestamp], i: Int)
> case class B(e: Int, f: Seq[A])
> case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int])
> case class D(e: Option[Int], f: Seq[C])
> val schema1 = StructType(Array(StructField("a", IntegerType, false), 
> StructField("b", IntegerType, false), StructField("c", IntegerType, false)))
> val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2))
> val df1 = session.createDataFrame(rowSeq1.asJava, schema1)
> df1.printSchema()
> val schema2 = StructType(Array(StructField("a", IntegerType), 
> StructField("b", IntegerType), StructField("c", IntegerType)))
> val rowSeq2: List[Row] = List(Row(10, 1, 1))
> val df2 = session.createDataFrame(rowSeq2.asJava, schema2)
> df2.printSchema()
> println(s"Number of records for first case : ${df1.except(df2).count()}")
> val schema3 = StructType(
>  Array(
>  StructField("a", IntegerType, false),
>  StructField("b", IntegerType, false), 
>  StructField("c", IntegerType, false), 
>  StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType, 
> false), StructField("f", ArrayType(StructType(Array(StructField("g", 
> TimestampType), StructField("h", TimestampType), StructField("i", 
> IntegerType, false)
>  
>  
>  )
>  )
> val date1 = new Timestamp(1597589638L)
> val date2 = new Timestamp(1597599638L)
> val rowSeq3: List[Row] = List(Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, 
> 1), Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2))
> val df3 = session.createDataFrame(rowSeq3.asJava, schema3)
> df3.printSchema()
> val schema4 = StructType(
>  Array(
>  StructField("a", IntegerType), 
>  StructField("b", IntegerType), 
>  StructField("b", IntegerType), 
>  StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType), 
> StructField("f", ArrayType(StructType(Array(StructField("g", TimestampType), 
> StructField("h", TimestampType), StructField("i", IntegerType)
>  
>  
>  )
>  )
> val rowSeq4: List[Row] = List(Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, 
> None, Some(1)))
> val df4 = session.createDataFrame(rowSeq4.asJava, schema3)
> df4.printSchema()
> println(s"Number of records for second case : ${df3.except(df4).count()}")
> {code}
> The preceding code shows what seems to me to be a bug in Spark:
>  * If you consider two dataframes (df1 and df2) that have exactly the same 
> schema, except that fields are not nullable in the first dataframe and are 
> nullable in the second, then df1.except(df2).count() works well.
>  * Now consider two other dataframes (df3 and df4) that have the same schema 
> (with fields nullable on one side and not on the other). If these two 
> dataframes contain nested fields, then this time the action 
> df3.except(df4).count gives the following exception: 
> java.lang.IllegalArgumentException: requirement failed: Join keys from two 
> sides should have same types



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32693) Compare two dataframes with same schema except nullable property

2020-09-02 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189823#comment-17189823
 ] 

Takeshi Yamamuro commented on SPARK-32693:
--

Unfortunately, v3.0.1 does not include this fix, so I reset it back: 
[http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-RESULT-Release-Spark-3-0-1-RC3-td30114.html]

> Compare two dataframes with same schema except nullable property
> 
>
> Key: SPARK-32693
> URL: https://issues.apache.org/jira/browse/SPARK-32693
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.0
>Reporter: david bernuau
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> My aim is to compare two dataframes with very close schemas: the same number of 
> fields, with the same names, types and metadata. The only difference is that a 
> given field might be nullable in one dataframe and not in the other.
> Here is the code that I used:
> {code:java}
> val session = SparkSession.builder().getOrCreate()
> import org.apache.spark.sql.Row
> import java.sql.Timestamp
> import scala.collection.JavaConverters._
> case class A(g: Timestamp, h: Option[Timestamp], i: Int)
> case class B(e: Int, f: Seq[A])
> case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int])
> case class D(e: Option[Int], f: Seq[C])
> val schema1 = StructType(Array(StructField("a", IntegerType, false), 
> StructField("b", IntegerType, false), StructField("c", IntegerType, false)))
> val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2))
> val df1 = session.createDataFrame(rowSeq1.asJava, schema1)
> df1.printSchema()
> val schema2 = StructType(Array(StructField("a", IntegerType), 
> StructField("b", IntegerType), StructField("c", IntegerType)))
> val rowSeq2: List[Row] = List(Row(10, 1, 1))
> val df2 = session.createDataFrame(rowSeq2.asJava, schema2)
> df2.printSchema()
> println(s"Number of records for first case : ${df1.except(df2).count()}")
> val schema3 = StructType(
>  Array(
>  StructField("a", IntegerType, false),
>  StructField("b", IntegerType, false), 
>  StructField("c", IntegerType, false), 
>  StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType, 
> false), StructField("f", ArrayType(StructType(Array(StructField("g", 
> TimestampType), StructField("h", TimestampType), StructField("i", 
> IntegerType, false)
>  
>  
>  )
>  )
> val date1 = new Timestamp(1597589638L)
> val date2 = new Timestamp(1597599638L)
> val rowSeq3: List[Row] = List(Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, 
> 1), Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2))
> val df3 = session.createDataFrame(rowSeq3.asJava, schema3)
> df3.printSchema()
> val schema4 = StructType(
>  Array(
>  StructField("a", IntegerType), 
>  StructField("b", IntegerType), 
>  StructField("b", IntegerType), 
>  StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType), 
> StructField("f", ArrayType(StructType(Array(StructField("g", TimestampType), 
> StructField("h", TimestampType), StructField("i", IntegerType)
>  
>  
>  )
>  )
> val rowSeq4: List[Row] = List(Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, 
> None, Some(1)))
> val df4 = session.createDataFrame(rowSeq4.asJava, schema3)
> df4.printSchema()
> println(s"Number of records for second case : ${df3.except(df4).count()}")
> {code}
> The preceding code shows what seems to me to be a bug in Spark:
>  * If you consider two dataframes (df1 and df2) that have exactly the same 
> schema, except that fields are not nullable in the first dataframe and are 
> nullable in the second, then df1.except(df2).count() works well.
>  * Now consider two other dataframes (df3 and df4) that have the same schema 
> (with fields nullable on one side and not on the other). If these two 
> dataframes contain nested fields, then this time the action 
> df3.except(df4).count gives the following exception: 
> java.lang.IllegalArgumentException: requirement failed: Join keys from two 
> sides should have same types



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32782) Refactory StreamingRelationV2 and move it to catalyst

2020-09-02 Thread Yuanjian Li (Jira)
Yuanjian Li created SPARK-32782:
---

 Summary: Refactory StreamingRelationV2 and move it to catalyst
 Key: SPARK-32782
 URL: https://issues.apache.org/jira/browse/SPARK-32782
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.1.0
Reporter: Yuanjian Li


Currently, StreamingRelationV2 is bound to TableProvider. To make it more 
flexible and extensible, it should be moved to the catalyst module and bound to 
the Table interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32739) support prune right for left semi join in DPP

2020-09-02 Thread Zhenhua Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189819#comment-17189819
 ] 

Zhenhua Wang commented on SPARK-32739:
--

[~maropu] Added the description, thanks for the reminder.

> support prune right for left semi join in DPP
> -
>
> Key: SPARK-32739
> URL: https://issues.apache.org/jira/browse/SPARK-32739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Minor
> Fix For: 3.1.0
>
>
> Currently in DPP, a left semi join can only prune the left side; it should also 
> support pruning the right side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32739) support prune right for left semi join in DPP

2020-09-02 Thread Zhenhua Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-32739:
-
Description: Currently in DPP, a left semi join can only prune the left side; it 
should also support pruning the right side.

> support prune right for left semi join in DPP
> -
>
> Key: SPARK-32739
> URL: https://issues.apache.org/jira/browse/SPARK-32739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Minor
> Fix For: 3.1.0
>
>
> Currently in DPP, a left semi join can only prune the left side; it should also 
> support pruning the right side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32781) Non-ASCII characters are mistakenly omitted in the middle of intervals

2020-09-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189808#comment-17189808
 ] 

Apache Spark commented on SPARK-32781:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29632

> Non-ASCII characters are mistakenly omitted in the middle of intervals
> --
>
> Key: SPARK-32781
> URL: https://issues.apache.org/jira/browse/SPARK-32781
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> select interval '1中国day'; 
> The query above should fail instead of silently dropping the non-ASCII 
> characters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32781) Non-ASCII characters are mistakenly omitted in the middle of intervals

2020-09-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32781:


Assignee: (was: Apache Spark)

> Non-ASCII characters are mistakenly omitted in the middle of intervals
> --
>
> Key: SPARK-32781
> URL: https://issues.apache.org/jira/browse/SPARK-32781
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> select interval '1中国day'; 
> The query above should fail instead of silently dropping the non-ASCII 
> characters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32781) Non-ASCII characters are mistakenly omitted in the middle of intervals

2020-09-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32781:


Assignee: Apache Spark

> Non-ASCII characters are mistakenly omitted in the middle of intervals
> --
>
> Key: SPARK-32781
> URL: https://issues.apache.org/jira/browse/SPARK-32781
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> select interval '1中国day'; 
> The query above should fail instead of silently dropping the non-ASCII 
> characters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32781) Non-ASCII characters are mistakenly omitted in the middle of intervals

2020-09-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189807#comment-17189807
 ] 

Apache Spark commented on SPARK-32781:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29632

> Non-ASCII characters are mistakenly omitted in the middle of intervals
> --
>
> Key: SPARK-32781
> URL: https://issues.apache.org/jira/browse/SPARK-32781
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> select interval '1中国day'; 
> we should fail case above



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32738) thread safe endpoints may hang due to fatal error

2020-09-02 Thread Zhenhua Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-32738:
-
Description: 
Processing for `ThreadSafeRpcEndpoint` is controlled by 'numActiveThreads' in 
`Inbox`. Now if any fatal error happens during `Inbox.process`, 
'numActiveThreads' is not reduced. Then other threads can not process messages 
in that inbox, which causes the endpoint to "hang".

This problem is more serious in previous Spark 2.x versions since the driver, 
executor and block manager endpoints are all thread safe endpoints.
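
As a sketch of the general fix pattern (assumed names, not the actual Inbox 
code): the active-thread counter has to be released in a finally block so that 
even a fatal error cannot leave it held.
{code:java}
// Sketch only: stands in for the bookkeeping described above, not Spark's Inbox.
object InboxSketch {
  private var numActiveThreads = 0

  private def process(message: Any): Unit = {
    // placeholder for real message handling, which may throw a fatal error
  }

  def safeProcess(message: Any): Unit = {
    synchronized { numActiveThreads += 1 }
    try {
      process(message)
    } finally {
      // runs even for fatal errors such as OutOfMemoryError, so the slot is released
      synchronized { numActiveThreads -= 1 }
    }
  }
}
{code}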

  was:
Processing for `ThreadSafeRpcEndpoint` is controlled by 'numActiveThreads' in 
`Inbox`. Now if any fatal exception happens during `Inbox.process`, 
'numActiveThreads' is not reduced. Then other threads can not process messages 
in that inbox, which causes the endpoint to hang.

This problem is more serious in previous Spark 2.x versions since the driver, 
executor and block manager endpoints are all thread safe endpoints.


> thread safe endpoints may hang due to fatal error
> -
>
> Key: SPARK-32738
> URL: https://issues.apache.org/jira/browse/SPARK-32738
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4, 2.4.6, 3.0.0
>Reporter: Zhenhua Wang
>Priority: Major
>
> Processing for `ThreadSafeRpcEndpoint` is controlled by 'numActiveThreads' in 
> `Inbox`. If any fatal error happens during `Inbox.process`, 'numActiveThreads' 
> is not decremented. Other threads then cannot process messages in that inbox, 
> which causes the endpoint to "hang".
> This problem is more serious in earlier Spark 2.x versions since the driver, 
> executor and block manager endpoints are all thread-safe endpoints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32781) Non-ASCII characters are mistakenly omitted in the middle of intervals

2020-09-02 Thread Kent Yao (Jira)
Kent Yao created SPARK-32781:


 Summary: Non-ASCII characters are mistakenly omitted in the middle 
of intervals
 Key: SPARK-32781
 URL: https://issues.apache.org/jira/browse/SPARK-32781
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Kent Yao


select interval '1中国day'; 

The query above should fail instead of silently dropping the non-ASCII 
characters.
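
A minimal way to exercise this from Scala (a sketch assuming an active 
SparkSession named spark; the expected failure reflects this ticket's intent, 
not the current behavior):
{code:java}
// Per this ticket, the literal below currently parses because the non-ASCII
// characters are silently dropped; the intended behavior is a parse error.
spark.sql("select interval '1中国day'").show()
{code}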



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32738) thread safe endpoints may hang due to fatal error

2020-09-02 Thread Zhenhua Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-32738:
-
Summary: thread safe endpoints may hang due to fatal error  (was: thread 
safe endpoints may hang due to fatal exception)

> thread safe endpoints may hang due to fatal error
> -
>
> Key: SPARK-32738
> URL: https://issues.apache.org/jira/browse/SPARK-32738
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4, 2.4.6, 3.0.0
>Reporter: Zhenhua Wang
>Priority: Major
>
> Processing for `ThreadSafeRpcEndpoint` is controlled by 'numActiveThreads' in 
> `Inbox`. Now if any fatal exception happens during `Inbox.process`, 
> 'numActiveThreads' is not reduced. Then other threads can not process 
> messages in that inbox, which causes the endpoint to hang.
> This problem is more serious in previous Spark 2.x versions since the driver, 
> executor and block manager endpoints are all thread safe endpoints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins

2020-09-02 Thread huangtianhua (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189785#comment-17189785
 ] 

huangtianhua commented on SPARK-32691:
--

[~dongjoon], ok, thanks. And if anyone wants to reproduce the failure on ARM, I 
can provide an arm instance:)

> Test org.apache.spark.DistributedSuite failed on arm64 jenkins
> --
>
> Key: SPARK-32691
> URL: https://issues.apache.org/jira/browse/SPARK-32691
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
> Environment: ARM64
>Reporter: huangtianhua
>Priority: Major
> Attachments: failure.log, success.log
>
>
> Tests in org.apache.spark.DistributedSuite fail on the arm64 Jenkins: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ 
> - caching in memory and disk, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory and disk, serialized, replicated (encryption = on) 
> (with replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory, serialized, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> ...
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32312) Upgrade Apache Arrow to 1.0.0

2020-09-02 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189784#comment-17189784
 ] 

Kazuaki Ishizaki commented on SPARK-32312:
--

I think that [this|https://github.com/apache/arrow/pull/7746] is the work needed 
to make the build succeed. What further work do we need?

> Upgrade Apache Arrow to 1.0.0
> -
>
> Key: SPARK-32312
> URL: https://issues.apache.org/jira/browse/SPARK-32312
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Apache Arrow will soon release v1.0.0 which provides backward/forward 
> compatibility guarantees as well as a number of fixes and improvements. This 
> will upgrade the Java artifact and PySpark API. Although PySpark will not 
> need special changes, it might be a good idea to bump up minimum supported 
> version and CI testing.
> TBD: list of important improvements and fixes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32772) Reduce log messages for spark-sql CLI

2020-09-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189779#comment-17189779
 ] 

Apache Spark commented on SPARK-32772:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/29631

> Reduce log messages for spark-sql CLI
> -
>
> Key: SPARK-32772
> URL: https://issues.apache.org/jira/browse/SPARK-32772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.1.0
>
>
> When we launch the spark-sql CLI, too many log messages are shown and it's 
> sometimes difficult to find the result of a query.
> So I think it's better to reduce the log messages, as the spark-shell and 
> pyspark CLIs do.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32772) Reduce log messages for spark-sql CLI

2020-09-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189778#comment-17189778
 ] 

Apache Spark commented on SPARK-32772:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/29631

> Reduce log messages for spark-sql CLI
> -
>
> Key: SPARK-32772
> URL: https://issues.apache.org/jira/browse/SPARK-32772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.1.0
>
>
> When we launch the spark-sql CLI, too many log messages are shown and it's 
> sometimes difficult to find the result of a query.
> So I think it's better to reduce the log messages, as the spark-shell and 
> pyspark CLIs do.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32637) SPARK SQL JDBC truncates last value of seconds for datetime2 values for Azure SQL DB

2020-09-02 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189771#comment-17189771
 ] 

Takeshi Yamamuro commented on SPARK-32637:
--

+1 on the Maxim comment and I'll close this.

> SPARK SQL JDBC truncates last value of seconds for datetime2 values for Azure 
> SQL DB 
> -
>
> Key: SPARK-32637
> URL: https://issues.apache.org/jira/browse/SPARK-32637
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.1
>Reporter: Mohit Dave
>Priority: Major
>
> Spark JDBC truncates the last fractional-second digit of TIMESTAMP values when 
> the datetime2 data type is used with the Microsoft SQL Server JDBC driver.
>  
> Source data (datetime2): '2007-08-08 12:35:29.1234567'
>  
> After loading to the target using Spark DataFrames:
>  
> Target data (datetime2): '2007-08-08 12:35:29.1234560'
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32637) SPARK SQL JDBC truncates last value of seconds for datetime2 values for Azure SQL DB

2020-09-02 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32637.
--
Resolution: Invalid

> SPARK SQL JDBC truncates last value of seconds for datetime2 values for Azure 
> SQL DB 
> -
>
> Key: SPARK-32637
> URL: https://issues.apache.org/jira/browse/SPARK-32637
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.1
>Reporter: Mohit Dave
>Priority: Major
>
> Spark JDBC truncates the last fractional-second digit of TIMESTAMP values when 
> the datetime2 data type is used with the Microsoft SQL Server JDBC driver.
>  
> Source data (datetime2): '2007-08-08 12:35:29.1234567'
>  
> After loading to the target using Spark DataFrames:
>  
> Target data (datetime2): '2007-08-08 12:35:29.1234560'
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32734) RDD actions in DStream.transfrom delays batch submission

2020-09-02 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32734:
-
Labels:   (was: pull-request-available)

> RDD actions in DStream.transfrom delays batch submission
> 
>
> Key: SPARK-32734
> URL: https://issues.apache.org/jira/browse/SPARK-32734
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 3.0.0
>Reporter: Liechuan Ou
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h4. Issue
> Some Spark applications develop a batch creation delay after running for some 
> time. For instance, batch 10:03 is submitted at 10:06. In the Spark UI, the 
> latest batch doesn't match the current time.
>   
> ||Clock||BatchTime||
> |10:00|10:00|
> |10:02|10:01|
> |10:04|10:02|
> |10:06|10:03|
> h4. Investigation
> We observe that such applications share a commonality: RDD actions exist in 
> DStream.transform. Those actions are executed in dstream.compute, which is 
> called by JobGenerator. JobGenerator runs a single-threaded event loop, so any 
> synchronous (blocking) operations there stall event processing.
> h4. Proposal
> Delegate dstream.compute to JobScheduler.
>  
> {code:java}
> // class ForEachDStream
> override def generateJob(time: Time): Option[Job] = {
>   val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
> parent.getOrCompute(time).foreach(rdd => foreachFunc(rdd, time))
>   }
>   Some(new Job(time, jobFunc))
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32735) RDD actions in DStream.transfrom don't show at batch page

2020-09-02 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32735:
-
Labels:   (was: pull-request-available)

> RDD actions in DStream.transfrom don't show at batch page
> -
>
> Key: SPARK-32735
> URL: https://issues.apache.org/jira/browse/SPARK-32735
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Web UI
>Affects Versions: 3.0.0
>Reporter: Liechuan Ou
>Priority: Major
>
> h4. Issue
> {code:java}
> val lines = ssc.socketTextStream("localhost", )
> val words = lines.flatMap(_.split(" "))
> val mappedStream= words.transform(rdd => {
>   val c = rdd.count();
>   rdd.map(x => s"$c x")}
> )
> mappedStream.foreachRDD(rdd => rdd.foreach(x => println(x))){code}
> Two Spark jobs are created for every batch. Only the second one is associated 
> with the streaming output operation and shows up on the batch page.
> h4. Investigation
> The first action, rdd.count(), is invoked by JobGenerator.generateJobs. The 
> batch time and output op id are not available in the Spark context because they 
> are set later in JobScheduler.
> h4. Proposal
> Delegate dstream.getOrCompute to JobScheduler so that all RDD actions run in 
> the Spark context with the correct local properties.
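> A possible user-side workaround, sketched with placeholder host/port and app 
> settings (not taken from this ticket): keep transform free of eager actions and 
> run the count inside foreachRDD, so the action executes as part of the output 
> operation on the JobScheduler side.
> {code:java}
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming.{Seconds, StreamingContext}
> 
> val conf = new SparkConf().setAppName("transform-workaround")  // placeholder app name
> val ssc = new StreamingContext(conf, Seconds(1))
> 
> val lines = ssc.socketTextStream("localhost", 9999)            // placeholder port
> val words = lines.flatMap(_.split(" "))
> 
> words.foreachRDD { rdd =>
>   val c = rdd.count()                    // action now runs in the output operation
>   rdd.map(x => s"$c $x").foreach(println)
> }
> 
> ssc.start()
> ssc.awaitTermination()
> {code}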



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32735) RDD actions in DStream.transfrom don't show at batch page

2020-09-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189766#comment-17189766
 ] 

Apache Spark commented on SPARK-32735:
--

User 'Olwn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29578

> RDD actions in DStream.transfrom don't show at batch page
> -
>
> Key: SPARK-32735
> URL: https://issues.apache.org/jira/browse/SPARK-32735
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Web UI
>Affects Versions: 3.0.0
>Reporter: Liechuan Ou
>Priority: Major
>  Labels: pull-request-available
>
> h4. Issue
> {code:java}
> val lines = ssc.socketTextStream("localhost", )
> val words = lines.flatMap(_.split(" "))
> val mappedStream= words.transform(rdd => {
>   val c = rdd.count();
>   rdd.map(x => s"$c x")}
> )
> mappedStream.foreachRDD(rdd => rdd.foreach(x => println(x))){code}
> Two Spark jobs are created for every batch. Only the second one is associated 
> with the streaming output operation and shows up on the batch page.
> h4. Investigation
> The first action, rdd.count(), is invoked by JobGenerator.generateJobs. The 
> batch time and output op id are not available in the Spark context because they 
> are set later in JobScheduler.
> h4. Proposal
> Delegate dstream.getOrCompute to JobScheduler so that all RDD actions run in 
> the Spark context with the correct local properties.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32735) RDD actions in DStream.transfrom don't show at batch page

2020-09-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32735:


Assignee: Apache Spark

> RDD actions in DStream.transfrom don't show at batch page
> -
>
> Key: SPARK-32735
> URL: https://issues.apache.org/jira/browse/SPARK-32735
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Web UI
>Affects Versions: 3.0.0
>Reporter: Liechuan Ou
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> h4. Issue
> {code:java}
> val lines = ssc.socketTextStream("localhost", )
> val words = lines.flatMap(_.split(" "))
> val mappedStream= words.transform(rdd => {
>   val c = rdd.count();
>   rdd.map(x => s"$c x")}
> )
> mappedStream.foreachRDD(rdd => rdd.foreach(x => println(x))){code}
> Two Spark jobs are created for every batch. Only the second one is associated 
> with the streaming output operation and shows up on the batch page.
> h4. Investigation
> The first action, rdd.count(), is invoked by JobGenerator.generateJobs. The 
> batch time and output op id are not available in the Spark context because they 
> are set later in JobScheduler.
> h4. Proposal
> Delegate dstream.getOrCompute to JobScheduler so that all RDD actions run in 
> the Spark context with the correct local properties.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32735) RDD actions in DStream.transfrom don't show at batch page

2020-09-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189765#comment-17189765
 ] 

Apache Spark commented on SPARK-32735:
--

User 'Olwn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29578

> RDD actions in DStream.transfrom don't show at batch page
> -
>
> Key: SPARK-32735
> URL: https://issues.apache.org/jira/browse/SPARK-32735
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Web UI
>Affects Versions: 3.0.0
>Reporter: Liechuan Ou
>Priority: Major
>  Labels: pull-request-available
>
> h4. Issue
> {code:java}
> val lines = ssc.socketTextStream("localhost", )
> val words = lines.flatMap(_.split(" "))
> val mappedStream= words.transform(rdd => {
>   val c = rdd.count();
>   rdd.map(x => s"$c x")}
> )
> mappedStream.foreachRDD(rdd => rdd.foreach(x => println(x))){code}
> Two Spark jobs are created for every batch. Only the second one is associated 
> with the streaming output operation and shows up on the batch page.
> h4. Investigation
> The first action, rdd.count(), is invoked by JobGenerator.generateJobs. The 
> batch time and output op id are not available in the Spark context because they 
> are set later in JobScheduler.
> h4. Proposal
> Delegate dstream.getOrCompute to JobScheduler so that all RDD actions run in 
> the Spark context with the correct local properties.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32735) RDD actions in DStream.transfrom don't show at batch page

2020-09-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32735:


Assignee: (was: Apache Spark)

> RDD actions in DStream.transfrom don't show at batch page
> -
>
> Key: SPARK-32735
> URL: https://issues.apache.org/jira/browse/SPARK-32735
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Web UI
>Affects Versions: 3.0.0
>Reporter: Liechuan Ou
>Priority: Major
>  Labels: pull-request-available
>
> h4. Issue
> {code:java}
> val lines = ssc.socketTextStream("localhost", )
> val words = lines.flatMap(_.split(" "))
> val mappedStream= words.transform(rdd => {
>   val c = rdd.count();
>   rdd.map(x => s"$c x")}
> )
> mappedStream.foreachRDD(rdd => rdd.foreach(x => println(x))){code}
> Every batch two spark jobs are created. Only the second one is associated 
> with the streaming output operation and shows at batch page.
> h4. Investigation
> The first action rdd.count() is invoked by JobGenerator.generateJobs. Batch 
> time and output op id are not available in spark context because they are set 
> in JobScheduler later.
> h4. Proposal
> delegate dstream.getOrCompute to JobScheduler so that all rdd actions can run 
> in spark context with correct local properties.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32734) RDD actions in DStream.transfrom delays batch submission

2020-09-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189764#comment-17189764
 ] 

Apache Spark commented on SPARK-32734:
--

User 'Olwn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29578

> RDD actions in DStream.transfrom delays batch submission
> 
>
> Key: SPARK-32734
> URL: https://issues.apache.org/jira/browse/SPARK-32734
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 3.0.0
>Reporter: Liechuan Ou
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h4. Issue
> Some spark applications have batch creation delay after running for some 
> time. For instance, Batch 10:03 is submitted at 10:06. In spark UI, the 
> latest batch doesn't match current time.
>   
> ||Clock||BatchTime||
> |10:00|10:00|
> |10:02|10:01|
> |10:04|10:02|
> |10:06|10:03|
> h4. Investigation
> We observe that such applications share a commonality: RDD actions exist in 
> dstream.transform. Those actions are executed in dstream.compute, which is 
> called by JobGenerator. JobGenerator runs a single-threaded event loop, so 
> any blocking operation there holds up event processing.
> h4. Proposal
> delegate dstream.compute to JobScheduler
>  
> {code:java}
> // class ForEachDStream
> override def generateJob(time: Time): Option[Job] = {
>   val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
> parent.getOrCompute(time).foreach(rdd => foreachFunc(rdd, time))
>   }
>   Some(new Job(time, jobFunc))
> }
> {code}
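> For reference, a minimal pattern that exhibits the delay (illustrative only; 
> it assumes a StreamingContext named ssc, and the port number is arbitrary) is 
> an eager action inside transform, which currently runs on the JobGenerator 
> thread while the batch's jobs are being generated:
> {code:java}
> val lines = ssc.socketTextStream("localhost", 9999)  // port chosen for illustration
> val annotated = lines.transform { rdd =>
>   // eager action: executed during job generation, so a slow count()
>   // here blocks the single-threaded JobGenerator event loop
>   val c = rdd.count()
>   rdd.map(x => s"$c $x")
> }
> annotated.foreachRDD(rdd => rdd.foreach(println))
> {code}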



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32734) RDD actions in DStream.transfrom delays batch submission

2020-09-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32734:


Assignee: Apache Spark

> RDD actions in DStream.transfrom delays batch submission
> 
>
> Key: SPARK-32734
> URL: https://issues.apache.org/jira/browse/SPARK-32734
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 3.0.0
>Reporter: Liechuan Ou
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h4. Issue
> Some spark applications have batch creation delay after running for some 
> time. For instance, Batch 10:03 is submitted at 10:06. In spark UI, the 
> latest batch doesn't match current time.
>   
> ||Clock||BatchTime||
> |10:00|10:00|
> |10:02|10:01|
> |10:04|10:02|
> |10:06|10:03|
> h4. Investigation
> We observe that such applications share a commonality: RDD actions exist in 
> dstream.transform. Those actions are executed in dstream.compute, which is 
> called by JobGenerator. JobGenerator runs a single-threaded event loop, so 
> any blocking operation there holds up event processing.
> h4. Proposal
> delegate dstream.compute to JobScheduler
>  
> {code:java}
> // class ForEachDStream
> override def generateJob(time: Time): Option[Job] = {
>   val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
> parent.getOrCompute(time).foreach(rdd => foreachFunc(rdd, time))
>   }
>   Some(new Job(time, jobFunc))
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32734) RDD actions in DStream.transfrom delays batch submission

2020-09-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32734:


Assignee: (was: Apache Spark)

> RDD actions in DStream.transfrom delays batch submission
> 
>
> Key: SPARK-32734
> URL: https://issues.apache.org/jira/browse/SPARK-32734
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 3.0.0
>Reporter: Liechuan Ou
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h4. Issue
> Some spark applications have batch creation delay after running for some 
> time. For instance, Batch 10:03 is submitted at 10:06. In spark UI, the 
> latest batch doesn't match current time.
>   
> ||Clock||BatchTime||
> |10:00|10:00|
> |10:02|10:01|
> |10:04|10:02|
> |10:06|10:03|
> h4. Investigation
> We observe that such applications share a commonality: RDD actions exist in 
> dstream.transform. Those actions are executed in dstream.compute, which is 
> called by JobGenerator. JobGenerator runs a single-threaded event loop, so 
> any blocking operation there holds up event processing.
> h4. Proposal
> delegate dstream.compute to JobScheduler
>  
> {code:java}
> // class ForEachDStream
> override def generateJob(time: Time): Option[Job] = {
>   val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
> parent.getOrCompute(time).foreach(rdd => foreachFunc(rdd, time))
>   }
>   Some(new Job(time, jobFunc))
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32780) Fill since fields for all the expressions

2020-09-02 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-32780:


 Summary: Fill since fields for all the expressions
 Key: SPARK-32780
 URL: https://issues.apache.org/jira/browse/SPARK-32780
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Takeshi Yamamuro


Some since fields in ExpressionDescription are missing now; it is worth filling 
them in to improve the generated documentation:
{code:java}
  test("Since has a valid value") {
val badExpressions = spark.sessionState.functionRegistry.listFunction()
  .map(spark.sessionState.catalog.lookupFunctionInfo)
  .filter(funcInfo => 
!funcInfo.getSince.matches("[0-9]+\\.[0-9]+\\.[0-9]+"))
  .map(_.getClassName)
  .distinct
  .sorted

if (badExpressions.nonEmpty) {
  fail(s"${badExpressions.length} expressions with invalid 'since':\n"
+ badExpressions.mkString("\n"))
}
  }
[info] - Since has a valid value *** FAILED *** (16 milliseconds)
[info]   67 expressions with invalid 'since':
[info]   org.apache.spark.sql.catalyst.expressions.Abs
[info]   org.apache.spark.sql.catalyst.expressions.Add
[info]   org.apache.spark.sql.catalyst.expressions.And
[info]   org.apache.spark.sql.catalyst.expressions.ArrayContains
[info]   org.apache.spark.sql.catalyst.expressions.AssertTrue
[info]   org.apache.spark.sql.catalyst.expressions.BitwiseAnd
[info]   org.apache.spark.sql.catalyst.expressions.BitwiseNot
[info]   org.apache.spark.sql.catalyst.expressions.BitwiseOr
[info]   org.apache.spark.sql.catalyst.expressions.BitwiseXor
[info]   org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection
[info]   org.apache.spark.sql.catalyst.expressions.CaseWhen
[info]   org.apache.spark.sql.catalyst.expressions.Cast
[info]   org.apache.spark.sql.catalyst.expressions.Concat
[info]   org.apache.spark.sql.catalyst.expressions.Crc32
[info]   org.apache.spark.sql.catalyst.expressions.CreateArray
[info]   org.apache.spark.sql.catalyst.expressions.CreateMap
[info]   org.apache.spark.sql.catalyst.expressions.CreateNamedStruct
[info]   org.apache.spark.sql.catalyst.expressions.CurrentDatabase
[info]   org.apache.spark.sql.catalyst.expressions.Divide
[info]   org.apache.spark.sql.catalyst.expressions.EqualNullSafe
[info]   org.apache.spark.sql.catalyst.expressions.EqualTo
[info]   org.apache.spark.sql.catalyst.expressions.Explode
[info]   org.apache.spark.sql.catalyst.expressions.GetJsonObject
[info]   org.apache.spark.sql.catalyst.expressions.GreaterThan
[info]   org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual
[info]   org.apache.spark.sql.catalyst.expressions.Greatest
[info]   org.apache.spark.sql.catalyst.expressions.If
[info]   org.apache.spark.sql.catalyst.expressions.In
[info]   org.apache.spark.sql.catalyst.expressions.Inline
[info]   org.apache.spark.sql.catalyst.expressions.InputFileBlockLength
[info]   org.apache.spark.sql.catalyst.expressions.InputFileBlockStart
[info]   org.apache.spark.sql.catalyst.expressions.InputFileName
[info]   org.apache.spark.sql.catalyst.expressions.JsonTuple
[info]   org.apache.spark.sql.catalyst.expressions.Least
[info]   org.apache.spark.sql.catalyst.expressions.LessThan
[info]   org.apache.spark.sql.catalyst.expressions.LessThanOrEqual
[info]   org.apache.spark.sql.catalyst.expressions.MapKeys
[info]   org.apache.spark.sql.catalyst.expressions.MapValues
[info]   org.apache.spark.sql.catalyst.expressions.Md5
[info]   org.apache.spark.sql.catalyst.expressions.MonotonicallyIncreasingID
[info]   org.apache.spark.sql.catalyst.expressions.Multiply
[info]   org.apache.spark.sql.catalyst.expressions.Murmur3Hash
[info]   org.apache.spark.sql.catalyst.expressions.Not
[info]   org.apache.spark.sql.catalyst.expressions.Or
[info]   org.apache.spark.sql.catalyst.expressions.Overlay
[info]   org.apache.spark.sql.catalyst.expressions.Pmod
[info]   org.apache.spark.sql.catalyst.expressions.PosExplode
[info]   org.apache.spark.sql.catalyst.expressions.Remainder
[info]   org.apache.spark.sql.catalyst.expressions.Sha1
[info]   org.apache.spark.sql.catalyst.expressions.Sha2
[info]   org.apache.spark.sql.catalyst.expressions.Size
[info]   org.apache.spark.sql.catalyst.expressions.SortArray
[info]   org.apache.spark.sql.catalyst.expressions.SparkPartitionID
[info]   org.apache.spark.sql.catalyst.expressions.Stack
[info]   org.apache.spark.sql.catalyst.expressions.Subtract
[info]   org.apache.spark.sql.catalyst.expressions.TimeWindow
[info]   org.apache.spark.sql.catalyst.expressions.UnaryMinus
[info]   org.apache.spark.sql.catalyst.expressions.UnaryPositive
[info]   org.apache.spark.sql.catalyst.expressions.Uuid
[info]   org.apache.spark.sql.catalyst.expressions.xml.XPathBoolean
[info]   org.apache.spark.sql.catalyst.expressions.xml.XPathDouble
[info]   org.apache.spark.sql.catalyst.expressions.xml.XPathFloat
[info]   org.apache.spark.sql.catalyst.expressions.xml.XPathInt
[info]   
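A fix would add the missing value to each expression's ExpressionDescription 
annotation, roughly as below (the version string is a placeholder for 
illustration only; the real value must be the release in which the expression 
first appeared):
{code:java}
@ExpressionDescription(
  usage = "_FUNC_(expr) - Returns the absolute value of `expr`.",
  since = "1.0.0")  // placeholder value, to be verified per expression
{code}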

[jira] [Commented] (SPARK-32097) Allow reading history log files from multiple directories

2020-09-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189758#comment-17189758
 ] 

Apache Spark commented on SPARK-32097:
--

User 'Gaurangi94' has created a pull request for this issue:
https://github.com/apache/spark/pull/29630

> Allow reading history log files from multiple directories
> -
>
> Key: SPARK-32097
> URL: https://issues.apache.org/jira/browse/SPARK-32097
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Gaurangi Saxena
>Priority: Minor
>
> We would like to configure SparkHistoryServer to display applications from 
> multiple clusters/environments. Data displayed on this UI comes from 
> directory configured as log-directory. It would be nice if this log-directory 
> also accepted regex. This way we will be able to read and display 
> applications from multiple directories.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32097) Allow reading history log files from multiple directories

2020-09-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32097:


Assignee: (was: Apache Spark)

> Allow reading history log files from multiple directories
> -
>
> Key: SPARK-32097
> URL: https://issues.apache.org/jira/browse/SPARK-32097
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Gaurangi Saxena
>Priority: Minor
>
> We would like to configure SparkHistoryServer to display applications from 
> multiple clusters/environments. Data displayed on this UI comes from 
> directory configured as log-directory. It would be nice if this log-directory 
> also accepted regex. This way we will be able to read and display 
> applications from multiple directories.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32097) Allow reading history log files from multiple directories

2020-09-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32097:


Assignee: Apache Spark

> Allow reading history log files from multiple directories
> -
>
> Key: SPARK-32097
> URL: https://issues.apache.org/jira/browse/SPARK-32097
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Gaurangi Saxena
>Assignee: Apache Spark
>Priority: Minor
>
> We would like to configure SparkHistoryServer to display applications from 
> multiple clusters/environments. Data displayed on this UI comes from 
> directory configured as log-directory. It would be nice if this log-directory 
> also accepted regex. This way we will be able to read and display 
> applications from multiple directories.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32779) Spark/Hive3 interaction potentially causes deadlock

2020-09-02 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-32779:
-

 Summary: Spark/Hive3 interaction potentially causes deadlock
 Key: SPARK-32779
 URL: https://issues.apache.org/jira/browse/SPARK-32779
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Bruce Robbins


This is an issue for applications that share a Spark Session across multiple 
threads.

sessionCatalog.loadPartition (after checking that the table exists) grabs locks 
in this order:
 - HiveExternalCatalog
 - HiveSessionCatalog (in Shim_v3_0)

Other operations (e.g., sessionCatalog.tableExists), grab locks in this order:
 - HiveSessionCatalog
 - HiveExternalCatalog

[This|https://github.com/apache/spark/blob/ad6b887541bf90cc3ea830a1a3322b71ccdd80ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1332]
 appears to be the culprit. Maybe db name should be defaulted _before_ the call 
to HiveClient so that Shim_v3_0 doesn't have to call back into SessionCatalog. 
Or possibly this is not needed at all, since loadPartition in Shim_v2_1 doesn't 
worry about the default db name, but that might be because of differences 
between Hive client libraries.

Reproduction case:
 - You need to have a running Hive 3.x HMS instance and the appropriate 
hive-site.xml for your Spark instance
 - Adjust your spark.sql.hive.metastore.version accordingly
 - It might take more than one try to hit the deadlock

Launch Spark:
{noformat}
bin/spark-shell --conf "spark.sql.hive.metastore.jars=${HIVE_HOME}/lib/*" --conf spark.sql.hive.metastore.version=3.1
{noformat}
Then use the following code:
{noformat}
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

val tableCount = 4
for (i <- 0 until tableCount) {
  val tableName = s"partitioned${i+1}"
  sql(s"drop table if exists $tableName")
  sql(s"create table $tableName (a bigint) partitioned by (b bigint) stored as 
orc")
}

val threads = new ArrayBuffer[Thread]
for (i <- 0 until tableCount) {
  threads.append(new Thread( new Runnable {
override def run: Unit = {
  val tableName = s"partitioned${i + 1}"
  val rand = Random
  val df = spark.range(0, 2).toDF("a")
  val location = s"/tmp/${rand.nextLong.abs}"
  df.write.mode("overwrite").orc(location)
  sql(
s"""
LOAD DATA LOCAL INPATH '$location' INTO TABLE $tableName partition (b=$i)""")
}
  }, s"worker$i"))
  threads(i).start()
}

for (i <- 0 until tableCount) {
  println(s"Joining with thread $i")
  threads(i).join()
}
println("All done")
{noformat}
The job often gets stuck after one or two "Joining..." lines.

{{kill -3}} shows something like this:
{noformat}
Found one Java-level deadlock:
=
"worker3":
  waiting to lock monitor 0x7fdc3cde6798 (object 0x000784d98ac8, a org.apache.spark.sql.hive.HiveSessionCatalog),
  which is held by "worker0"
"worker0":
  waiting to lock monitor 0x7fdc441d1b88 (object 0x0007861d1208, a org.apache.spark.sql.hive.HiveExternalCatalog),
  which is held by "worker3"
{noformat}
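The hazard is the classic inconsistent lock-ordering deadlock. A minimal 
standalone illustration (plain Scala, not Spark code; the two objects merely 
stand in for the two catalogs) is:
{code:java}
object LockOrderDemo {
  private val externalCatalog = new Object  // stands in for HiveExternalCatalog
  private val sessionCatalog  = new Object  // stands in for HiveSessionCatalog

  def main(args: Array[String]): Unit = {
    // locks taken in the order external -> session, like loadPartition
    val t1 = new Thread(() => externalCatalog.synchronized {
      Thread.sleep(100)
      sessionCatalog.synchronized { println("t1 done") }
    })
    // locks taken in the order session -> external, like tableExists
    val t2 = new Thread(() => sessionCatalog.synchronized {
      Thread.sleep(100)
      externalCatalog.synchronized { println("t2 done") }
    })
    t1.start(); t2.start()
    t1.join(); t2.join()  // typically never returns: each thread waits on the other's lock
  }
}
{code}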



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32135) Show Spark Driver name on Spark history web page

2020-09-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189744#comment-17189744
 ] 

Apache Spark commented on SPARK-32135:
--

User 'Gaurangi94' has created a pull request for this issue:
https://github.com/apache/spark/pull/29629

> Show Spark Driver name on Spark history web page
> 
>
> Key: SPARK-32135
> URL: https://issues.apache.org/jira/browse/SPARK-32135
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Gaurangi Saxena
>Priority: Minor
> Attachments: image-2020-09-02-12-37-55-860.png
>
>
> We would like to see spark driver host on the history server web page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32135) Show Spark Driver name on Spark history web page

2020-09-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32135:


Assignee: (was: Apache Spark)

> Show Spark Driver name on Spark history web page
> 
>
> Key: SPARK-32135
> URL: https://issues.apache.org/jira/browse/SPARK-32135
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Gaurangi Saxena
>Priority: Minor
> Attachments: image-2020-09-02-12-37-55-860.png
>
>
> We would like to see spark driver host on the history server web page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32135) Show Spark Driver name on Spark history web page

2020-09-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32135:


Assignee: Apache Spark

> Show Spark Driver name on Spark history web page
> 
>
> Key: SPARK-32135
> URL: https://issues.apache.org/jira/browse/SPARK-32135
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Gaurangi Saxena
>Assignee: Apache Spark
>Priority: Minor
> Attachments: image-2020-09-02-12-37-55-860.png
>
>
> We would like to see spark driver host on the history server web page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32135) Show Spark Driver name on Spark history web page

2020-09-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189743#comment-17189743
 ] 

Apache Spark commented on SPARK-32135:
--

User 'Gaurangi94' has created a pull request for this issue:
https://github.com/apache/spark/pull/29629

> Show Spark Driver name on Spark history web page
> 
>
> Key: SPARK-32135
> URL: https://issues.apache.org/jira/browse/SPARK-32135
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Gaurangi Saxena
>Priority: Minor
> Attachments: image-2020-09-02-12-37-55-860.png
>
>
> We would like to see spark driver host on the history server web page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32764) compare of -0.0 < 0.0 return true

2020-09-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32764:
--
Description: 
{code:scala}
 val spark: SparkSession = SparkSession
  .builder()
  .master("local")
  .appName("SparkByExamples.com")
  .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

import spark.sqlContext.implicits._

val df = Seq((-0.0, 0.0)).toDF("neg", "pos")
  .withColumn("comp", col("neg") < col("pos"))
  df.show(false)

==

+----+---+----+
|neg |pos|comp|
+----+---+----+
|-0.0|0.0|true|
+----+---+----+{code}

I think that result should be false.

**Apache Spark 2.4.6 RESULT**
{code}
scala> spark.version
res0: String = 2.4.6

scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < 
col("pos")).show
+----+---+-----+
| neg|pos| comp|
+----+---+-----+
|-0.0|0.0|false|
+----+---+-----+
{code}
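For reference, the plain JVM semantics (illustration only, not Spark internals):
{code:scala}
println(-0.0 < 0.0)                           // false: IEEE 754 treats -0.0 and 0.0 as equal
println(java.lang.Double.compare(-0.0, 0.0))  // -1: Double's total ordering puts -0.0 first
{code}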

  was:
{code:scala}
 val spark: SparkSession = SparkSession
  .builder()
  .master("local")
  .appName("SparkByExamples.com")
  .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

import spark.sqlContext.implicits._

val df = Seq((-0.0, 0.0)).toDF("neg", "pos")
  .withColumn("comp", col("neg") < col("pos"))
  df.show(false)

==

+----+---+----+
|neg |pos|comp|
+----+---+----+
|-0.0|0.0|true|
+----+---+----+{code}

I think that result should be false.


> compare of -0.0 < 0.0 return true
> -
>
> Key: SPARK-32764
> URL: https://issues.apache.org/jira/browse/SPARK-32764
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: correctness
>
> {code:scala}
>  val spark: SparkSession = SparkSession
>   .builder()
>   .master("local")
>   .appName("SparkByExamples.com")
>   .getOrCreate()
> spark.sparkContext.setLogLevel("ERROR")
> import spark.sqlContext.implicits._
> val df = Seq((-0.0, 0.0)).toDF("neg", "pos")
>   .withColumn("comp", col("neg") < col("pos"))
>   df.show(false)
> ==
> +----+---+----+
> |neg |pos|comp|
> +----+---+----+
> |-0.0|0.0|true|
> +----+---+----+{code}
> I think that result should be false.
> **Apache Spark 2.4.6 RESULT**
> {code}
> scala> spark.version
> res0: String = 2.4.6
> scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < 
> col("pos")).show
> +----+---+-----+
> | neg|pos| comp|
> +----+---+-----+
> |-0.0|0.0|false|
> +----+---+-----+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32764) compare of -0.0 < 0.0 return true

2020-09-02 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189720#comment-17189720
 ] 

Dongjoon Hyun commented on SPARK-32764:
---

cc [~cloud_fan] and [~smilegator]

> compare of -0.0 < 0.0 return true
> -
>
> Key: SPARK-32764
> URL: https://issues.apache.org/jira/browse/SPARK-32764
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: correctness
>
> {code:scala}
>  val spark: SparkSession = SparkSession
>   .builder()
>   .master("local")
>   .appName("SparkByExamples.com")
>   .getOrCreate()
> spark.sparkContext.setLogLevel("ERROR")
> import spark.sqlContext.implicits._
> val df = Seq((-0.0, 0.0)).toDF("neg", "pos")
>   .withColumn("comp", col("neg") < col("pos"))
>   df.show(false)
> ==
> +----+---+----+
> |neg |pos|comp|
> +----+---+----+
> |-0.0|0.0|true|
> +----+---+----+{code}
> I think that result should be false.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32764) compare of -0.0 < 0.0 return true

2020-09-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32764:
--
Affects Version/s: 3.0.1

> compare of -0.0 < 0.0 return true
> -
>
> Key: SPARK-32764
> URL: https://issues.apache.org/jira/browse/SPARK-32764
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: correctness
>
> {code:scala}
>  val spark: SparkSession = SparkSession
>   .builder()
>   .master("local")
>   .appName("SparkByExamples.com")
>   .getOrCreate()
> spark.sparkContext.setLogLevel("ERROR")
> import spark.sqlContext.implicits._
> val df = Seq((-0.0, 0.0)).toDF("neg", "pos")
>   .withColumn("comp", col("neg") < col("pos"))
>   df.show(false)
> ==
> +----+---+----+
> |neg |pos|comp|
> +----+---+----+
> |-0.0|0.0|true|
> +----+---+----+{code}
> I think that result should be false.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32764) compare of -0.0 < 0.0 return true

2020-09-02 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189719#comment-17189719
 ] 

Dongjoon Hyun commented on SPARK-32764:
---

Thank you for reporting, [~igreenfi]. I also confirm this regression.

> compare of -0.0 < 0.0 return true
> -
>
> Key: SPARK-32764
> URL: https://issues.apache.org/jira/browse/SPARK-32764
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Izek Greenfield
>Priority: Major
>
> {code:scala}
>  val spark: SparkSession = SparkSession
>   .builder()
>   .master("local")
>   .appName("SparkByExamples.com")
>   .getOrCreate()
> spark.sparkContext.setLogLevel("ERROR")
> import spark.sqlContext.implicits._
> val df = Seq((-0.0, 0.0)).toDF("neg", "pos")
>   .withColumn("comp", col("neg") < col("pos"))
>   df.show(false)
> ==
> +----+---+----+
> |neg |pos|comp|
> +----+---+----+
> |-0.0|0.0|true|
> +----+---+----+{code}
> I think that result should be false.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32764) compare of -0.0 < 0.0 return true

2020-09-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32764:
--
Target Version/s: 3.0.2

> compare of -0.0 < 0.0 return true
> -
>
> Key: SPARK-32764
> URL: https://issues.apache.org/jira/browse/SPARK-32764
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: correctness
>
> {code:scala}
>  val spark: SparkSession = SparkSession
>   .builder()
>   .master("local")
>   .appName("SparkByExamples.com")
>   .getOrCreate()
> spark.sparkContext.setLogLevel("ERROR")
> import spark.sqlContext.implicits._
> val df = Seq((-0.0, 0.0)).toDF("neg", "pos")
>   .withColumn("comp", col("neg") < col("pos"))
>   df.show(false)
> ==
> +----+---+----+
> |neg |pos|comp|
> +----+---+----+
> |-0.0|0.0|true|
> +----+---+----+{code}
> I think that result should be false.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32764) compare of -0.0 < 0.0 return true

2020-09-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32764:
--
Labels: correctness  (was: )

> compare of -0.0 < 0.0 return true
> -
>
> Key: SPARK-32764
> URL: https://issues.apache.org/jira/browse/SPARK-32764
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: correctness
>
> {code:scala}
>  val spark: SparkSession = SparkSession
>   .builder()
>   .master("local")
>   .appName("SparkByExamples.com")
>   .getOrCreate()
> spark.sparkContext.setLogLevel("ERROR")
> import spark.sqlContext.implicits._
> val df = Seq((-0.0, 0.0)).toDF("neg", "pos")
>   .withColumn("comp", col("neg") < col("pos"))
>   df.show(false)
> ==
> +----+---+----+
> |neg |pos|comp|
> +----+---+----+
> |-0.0|0.0|true|
> +----+---+----+{code}
> I think that result should be false.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32766) s3a: bucket names with dots cannot be used

2020-09-02 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189701#comment-17189701
 ] 

Dongjoon Hyun commented on SPARK-32766:
---

Thank you for the pointer, [~ste...@apache.org].

> s3a: bucket names with dots cannot be used
> --
>
> Key: SPARK-32766
> URL: https://issues.apache.org/jira/browse/SPARK-32766
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
>Reporter: Ondrej Kokes
>Priority: Minor
>
> Running vanilla spark with
> {noformat}
> --packages=org.apache.hadoop:hadoop-aws:x.y.z{noformat}
> I cannot read from S3, if the bucket name contains a dot (a valid name).
> A minimal reproducible example looks like this
> {{from pyspark.sql import SparkSession}}
> {{import pyspark.sql.functions as f}}
> {{if __name__ == '__main__':}}
> {{  spark = (SparkSession}}
> {{    .builder}}
> {{    .appName('my_app')}}
> {{    .master("local[*]")}}
> {{    .getOrCreate()}}
> {{  )}}
> {{  spark.read.csv("s3a://test-bucket-name-v1.0/foo.csv")}}
> Or just launch a spark-shell with `--packages=(...)hadoop-aws(...)` and read 
> that CSV. I created the same bucket without the period and it worked fine.
> *Now I'm not sure whether this is a matter of prepping the path names and 
> passing them to the aws-sdk, or whether the fault is within the SDK itself. I 
> am not Java-savvy enough to investigate the issue further, but I tried to make 
> the repro as short as possible.*
> 
> I get different errors depending on which Hadoop distributions I use. If I 
> use the default PySpark distribution (which includes Hadoop 2), I get the 
> following (using hadoop-aws:2.7.4)
> {{scala> spark.read.csv("s3a://okokes-test-v2.5/foo.csv").show()}}
> {{java.lang.IllegalArgumentException: The bucketName parameter must be 
> specified.}}
> {{ at 
> com.amazonaws.services.s3.AmazonS3Client.assertParameterNotNull(AmazonS3Client.java:2816)}}
> {{ at 
> com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1026)}}
> {{ at 
> com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)}}
> {{ at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)}}
> {{ at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)}}
> {{ at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)}}
> {{ at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)}}
> {{ at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)}}
> {{ at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)}}
> {{ at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)}}
> {{ at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)}}
> {{ at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)}}
> {{ at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)}}
> {{ at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)}}
> {{ at scala.Option.getOrElse(Option.scala:189)}}
> {{ at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)}}
> {{ at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)}}
> {{ at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)}}
> {{ ... 47 elided}}
> When I downloaded 3.0.0 with Hadoop 3 and ran a spark-shell there, I got this 
> error (with hadoop-aws:3.2.0):
> {{java.lang.NullPointerException: null uri host.}}
> {{ at java.base/java.util.Objects.requireNonNull(Objects.java:246)}}
> {{ at 
> org.apache.hadoop.fs.s3native.S3xLoginHelper.buildFSURI(S3xLoginHelper.java:71)}}
> {{ at org.apache.hadoop.fs.s3a.S3AFileSystem.setUri(S3AFileSystem.java:470)}}
> {{ at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:235)}}
> {{ at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)}}
> {{ at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)}}
> {{ at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)}}
> {{ at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)}}
> {{ at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)}}
> {{ at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)}}
> {{ at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)}}
> {{ at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)}}
> {{ at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)}}
> {{ at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)}}
> {{ at scala.Option.getOrElse(Option.scala:189)}}
> {{ at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)}}
> 

[jira] [Resolved] (SPARK-32772) Reduce log messages for spark-sql CLI

2020-09-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32772.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29619
[https://github.com/apache/spark/pull/29619]

> Reduce log messages for spark-sql CLI
> -
>
> Key: SPARK-32772
> URL: https://issues.apache.org/jira/browse/SPARK-32772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.1.0
>
>
> When we launch the spark-sql CLI, too many log messages are shown and it's 
> sometimes difficult to find the result of a query.
> So I think it's better to reduce log messages, as the spark-shell and pyspark 
> CLIs do.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32077) Support host-local shuffle data reading with external shuffle service disabled

2020-09-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32077.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28911
[https://github.com/apache/spark/pull/28911]

> Support host-local shuffle data reading with external shuffle service disabled
> --
>
> Key: SPARK-32077
> URL: https://issues.apache.org/jira/browse/SPARK-32077
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.1.0
>
>
> After SPARK-27651, Spark can read host-local shuffle data directly from disk 
> with the external shuffle service enabled. To extend this feature, we can also 
> support it with the external shuffle service disabled.
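> As an illustration, the relevant settings would look roughly like this 
> (property names as of Spark 3.0/3.1; treat them as an assumption to verify 
> against the configuration docs):
> {noformat}
> spark.shuffle.readHostLocalDisk=true   # read host-local shuffle blocks directly from disk
> spark.shuffle.service.enabled=false    # with this change, the external shuffle service is no longer required for it
> {noformat}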



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32077) Support host-local shuffle data reading with external shuffle service disabled

2020-09-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32077:
-

Assignee: wuyi

> Support host-local shuffle data reading with external shuffle service disabled
> --
>
> Key: SPARK-32077
> URL: https://issues.apache.org/jira/browse/SPARK-32077
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> After SPARK-27651, Spark can read host-local shuffle data directly from disk 
> with the external shuffle service enabled. To extend this feature, we can also 
> support it with the external shuffle service disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32135) Show Spark Driver name on Spark history web page

2020-09-02 Thread Gaurangi Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189668#comment-17189668
 ] 

Gaurangi Saxena edited comment on SPARK-32135 at 9/2/20, 7:45 PM:
--

[~tgraves] Apologies for the late response.

We would like to see the driver host of an application on the history server 
page. We can see it on the executors page, but we would need two clicks to get 
there.

This feature is useful in a multi-cluster environment with a single JHS used to 
index history files for all clusters, to understand on which cluster a job was 
executed. The driver host will help identify the cluster.

 

I have added another Jira (https://issues.apache.org/jira/browse/SPARK-32097) 
that will allow the log directory to accept wild-cards. Together these changes 
can help configure a single UI for multiple Spark clusters.

 

!image-2020-09-02-12-37-55-860.png!


was (Author: gaurangi):
[~tgraves] Apologies for the late response.

We would like to see the driver host of an application on the history server 
page. We could see it in executors page, but we would need 2 clicks to get 
there.

This feature is useful in a multi-cluster environment with a single JHS used to 
index history files for all clusters to understand on what cluster a job was 
executed. Driver-host will help identify the cluster. 

 

!image-2020-09-02-12-37-55-860.png!

> Show Spark Driver name on Spark history web page
> 
>
> Key: SPARK-32135
> URL: https://issues.apache.org/jira/browse/SPARK-32135
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Gaurangi Saxena
>Priority: Minor
> Attachments: image-2020-09-02-12-37-55-860.png
>
>
> We would like to see spark driver host on the history server web page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32097) Allow reading history log files from multiple directories

2020-09-02 Thread Gaurangi Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurangi Saxena updated SPARK-32097:

Description: 
We would like to configure the Spark History Server to display applications from 
multiple clusters/environments. The data displayed on this UI comes from the 
directory configured as the log directory. It would be nice if this log directory 
also accepted a regex/glob pattern. That way we would be able to read and display 
applications from multiple directories.
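As an illustration (the pattern form is the wished-for behaviour, not something 
that works today):
{noformat}
# spark-defaults.conf (illustrative)
spark.history.fs.logDirectory   hdfs:///cluster-a/spark-logs      # supported today: a single directory
# desired: a pattern covering several clusters, e.g.
# spark.history.fs.logDirectory hdfs:///clusters/*/spark-logs
{noformat}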

 

  was:We would like to have a regex kind support in displaying log files on the 
Spark history server. This way we will be able to see applications that were 
run on multiple clusters instead of just one.


> Allow reading history log files from multiple directories
> -
>
> Key: SPARK-32097
> URL: https://issues.apache.org/jira/browse/SPARK-32097
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Gaurangi Saxena
>Priority: Minor
>
> We would like to configure SparkHistoryServer to display applications from 
> multiple clusters/environments. Data displayed on this UI comes from 
> directory configured as log-directory. It would be nice if this log-directory 
> also accepted regex. This way we will be able to read and display 
> applications from multiple directories.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32097) Allow reading history log files from multiple directories

2020-09-02 Thread Gaurangi Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurangi Saxena updated SPARK-32097:

Issue Type: Wish  (was: Bug)

> Allow reading history log files from multiple directories
> -
>
> Key: SPARK-32097
> URL: https://issues.apache.org/jira/browse/SPARK-32097
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Gaurangi Saxena
>Priority: Minor
>
> We would like to have a regex kind support in displaying log files on the 
> Spark history server. This way we will be able to see applications that were 
> run on multiple clusters instead of just one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32135) Show Spark Driver name on Spark history web page

2020-09-02 Thread Gaurangi Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189668#comment-17189668
 ] 

Gaurangi Saxena commented on SPARK-32135:
-

[~tgraves] Apologies for the late response.

We would like to see the driver host of an application on the history server 
page. We could see it in executors page, but we would need 2 clicks to get 
there.

This feature is useful in a multi-cluster environment with a single JHS used to 
index history files for all clusters to understand on what cluster a job was 
executed. Driver-host will help identify the cluster. 

 

!image-2020-09-02-12-37-55-860.png!

> Show Spark Driver name on Spark history web page
> 
>
> Key: SPARK-32135
> URL: https://issues.apache.org/jira/browse/SPARK-32135
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Gaurangi Saxena
>Priority: Minor
> Attachments: image-2020-09-02-12-37-55-860.png
>
>
> We would like to see spark driver host on the history server web page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32135) Show Spark Driver name on Spark history web page

2020-09-02 Thread Gaurangi Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurangi Saxena updated SPARK-32135:

Attachment: image-2020-09-02-12-37-55-860.png

> Show Spark Driver name on Spark history web page
> 
>
> Key: SPARK-32135
> URL: https://issues.apache.org/jira/browse/SPARK-32135
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Gaurangi Saxena
>Priority: Minor
> Attachments: image-2020-09-02-12-37-55-860.png
>
>
> We would like to see spark driver host on the history server web page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32778) Accidental Data Deletion on calling saveAsTable

2020-09-02 Thread Aman Rastogi (Jira)
Aman Rastogi created SPARK-32778:


 Summary: Accidental Data Deletion on calling saveAsTable
 Key: SPARK-32778
 URL: https://issues.apache.org/jira/browse/SPARK-32778
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Aman Rastogi


{code:java}
df.write.option("path", 
"/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table)
{code}
Above code deleted the data present in path "/already/existing/path". This 
happened because table was already not there in hive metastore however, path 
given had data. And if table is not present in Hive Metastore, SaveMode gets 
modified internally to SaveMode.Overwrite irrespective of what user has 
provided, which leads to data deletion. This change was introduced as part of 
https://issues.apache.org/jira/browse/SPARK-19583. 

Now, suppose if user is not using external hive metastore (hive metastore is 
associated with a cluster) and if cluster goes down or due to some reason user 
has to migrate to a new cluster. Once user tries to save data using above code 
in new cluster, it will first delete the data. It could be a production data 
and user is completely unaware of it as they have provided SaveMode.Append or 
ErrorIfExists. This will be an accidental data deletion.

 

Proposed Fix:

Instead of modifying SaveMode to Overwrite, we should modify it to 
ErrorIfExists in class CreateDataSourceTableAsSelectCommand.

Change (line 154)

 
{code:java}
val result = saveDataIntoTable(
 sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = 
false)
 
{code}
to

 
{code:java}
val result = saveDataIntoTable(
 sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, tableExists 
= false){code}
This should not break CTAS. Even in case of CTAS, user may not want to delete 
data if already exists as it could be accidental.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28079) CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema

2020-09-02 Thread Jeff Harrison (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189634#comment-17189634
 ] 

Jeff Harrison commented on SPARK-28079:
---

I would also like something along these lines - it would greatly simplify our 
product.

columnNameOfCorruptRecord would ideally be created when mode is PERMISSIVE.

columnNameOfCorruptRecord is not created when the schema is inferred from 
headers – I would have to load the data twice. That workaround wouldn't work at 
all for some of our JSON data.

> CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is 
> manually added to the schema
> -
>
> Key: SPARK-28079
> URL: https://issues.apache.org/jira/browse/SPARK-28079
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.3
>Reporter: F Jimenez
>Priority: Major
>
> When reading a CSV with mode = "PERMISSIVE", corrupt records are not flagged 
> as such and are read in. The only way to get them flagged is to manually set 
> "columnNameOfCorruptRecord" AND manually set the schema, including this 
> column. Example:
> {code:java}
> // Second row has a 4th column that is not declared in the header/schema
> val csvText = s"""
>  | FieldA, FieldB, FieldC
>  | a1,b1,c1
>  | a2,b2,c2,d*""".stripMargin
> val csvFile = new File("/tmp/file.csv")
> FileUtils.write(csvFile, csvText)
> val reader = sqlContext.read
>   .format("csv")
>   .option("header", "true")
>   .option("mode", "PERMISSIVE")
>   .option("columnNameOfCorruptRecord", "corrupt")
>   .schema("corrupt STRING, fieldA STRING, fieldB STRING, fieldC STRING")
> reader.load(csvFile.getAbsolutePath).show(truncate = false)
> {code}
> This produces the correct result:
> {code:java}
> +------------+------+------+------+
> |corrupt     |fieldA|fieldB|fieldC|
> +------------+------+------+------+
> |null        | a1   |b1    |c1    |
> | a2,b2,c2,d*| a2   |b2    |c2    |
> +------------+------+------+------+
> {code}
> However removing the "schema" option and going:
> {code:java}
> val reader = sqlContext.read
>   .format("csv")
>   .option("header", "true")
>   .option("mode", "PERMISSIVE")
>   .option("columnNameOfCorruptRecord", "corrupt")
> reader.load(csvFile.getAbsolutePath).show(truncate = false)
> {code}
> Yields:
> {code:java}
> +-------+-------+-------+
> | FieldA| FieldB| FieldC|
> +-------+-------+-------+
> | a1    |b1     |c1     |
> | a2    |b2     |c2     |
> +-------+-------+-------+
> {code}
> The fourth value "d*" in the second row has been removed and the row is not 
> marked as corrupt.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32746) Not able to run Pandas UDF

2020-09-02 Thread Rahul Bhatia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189589#comment-17189589
 ] 

Rahul Bhatia commented on SPARK-32746:
--

Yes, I can run other PySpark code easily. I tried changing from PySpark 3.0.0 
to PySpark 2.4.0, and now it works. However, looking at the Spark UI, I can see 
that it does not run in parallel: it runs on only one core on one of the 
executors, and all other executors are idle. I have about 20 executors, each 
with 16 cores and 16 GB RAM. Can you suggest what might be wrong here? I am 
partitioning on the groupBy key and have 200 partitions for now. Can I change 
something to achieve maximum possible parallelism and utilise all executor 
cores?

> Not able to run Pandas UDF 
> ---
>
> Key: SPARK-32746
> URL: https://issues.apache.org/jira/browse/SPARK-32746
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: Pyspark 3.0.0
> PyArrow - 1.0.1(also tried with Pyarrrow 0.15.1, no progress there)
> Pandas - 0.25.3
>  
>Reporter: Rahul Bhatia
>Priority: Major
> Attachments: Screenshot 2020-08-31 at 9.04.07 AM.png
>
>
> Hi,
> I am facing issues in running Pandas UDF on a yarn cluster with multiple 
> nodes, I am trying to perform a simple DBSCAN algorithm to multiple groups in 
> my dataframe, to start with, I am just using a simple example to test things 
> out - 
> {code:python}
> import pandas as pd
> from pyspark.sql.types import StructType, StructField, DoubleType, 
> StringType, IntegerType
> from sklearn.cluster import DBSCAN
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> data = [(1, 11.6133, 48.1075),
>  (1, 11.6142, 48.1066),
>  (1, 11.6108, 48.1061),
>  (1, 11.6207, 48.1192),
>  (1, 11.6221, 48.1223),
>  (1, 11.5969, 48.1276),
>  (2, 11.5995, 48.1258),
>  (2, 11.6127, 48.1066),
>  (2, 11.6430, 48.1275),
>  (2, 11.6368, 48.1278),
>  (2, 11.5930, 48.1156)]
> df = spark.createDataFrame(data, ["id", "X", "Y"])
> output_schema = StructType(
> [
> StructField('id', IntegerType()),
> StructField('X', DoubleType()),
> StructField('Y', DoubleType()),
> StructField('cluster', IntegerType())
>  ]
> )
> @pandas_udf(output_schema, PandasUDFType.GROUPED_MAP)
> def dbscan(data):
> data["cluster"] = DBSCAN(eps=5, min_samples=3).fit_predict(data[["X", 
> "Y"]])
> result = pd.DataFrame(data, columns=["id", "X", "Y", "cluster"])
> return result
> res = df.groupby("id").apply(dbscan)
> res.show()
> {code}
>  
> The code keeps running forever on the YARN cluster. I expect it to be 
> finished within seconds (it works fine in standalone mode and finishes in 
> 2-4 seconds). On checking the Spark UI, I can see that the Spark job is 
> stuck (99/580) and never makes any progress.
>  
> Also it doesn't run in parallel, am I missing something?  !Screenshot 
> 2020-08-31 at 9.04.07 AM.png!
>  
>  
> I am new to Spark, and still trying to understand a lot of things. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32766) s3a: bucket names with dots cannot be used

2020-09-02 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189279#comment-17189279
 ] 

Steve Loughran commented on SPARK-32766:


This is not going to be fixed in the s3a code, even if there were an easy way. By the 
end of the month, it will be impossible to talk to any newly created S3 bucket 
containing a "." in its name. Existing ones may work, but not in this case, 
where the mix of hostnames and numbers confuses the Java URI parser:

https://aws.amazon.com/blogs/aws/amazon-s3-path-deprecation-plan-the-rest-of-the-story/


> s3a: bucket names with dots cannot be used
> --
>
> Key: SPARK-32766
> URL: https://issues.apache.org/jira/browse/SPARK-32766
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
>Reporter: Ondrej Kokes
>Priority: Minor
>
> Running vanilla spark with
> {noformat}
> --packages=org.apache.hadoop:hadoop-aws:x.y.z{noformat}
> I cannot read from S3, if the bucket name contains a dot (a valid name).
> A minimal reproducible example looks like this
> {{from pyspark.sql import SparkSession}}
> {{import pyspark.sql.functions as f}}
> {{if __name__ == '__main__':}}
> {{  spark = (SparkSession}}
> {{    .builder}}
> {{    .appName('my_app')}}
> {{    .master("local[*]")}}
> {{    .getOrCreate()}}
> {{  )}}
> {{  spark.read.csv("s3a://test-bucket-name-v1.0/foo.csv")}}
> Or just launch a spark-shell with `--packages=(...)hadoop-aws(...)` and read 
> that CSV. I created the same bucket without the period and it worked fine.
> *Now I'm not sure whether this is a matter of prepping the path names and 
> passing them to the aws-sdk, or whether the fault is within the SDK itself. I 
> am not Java-savvy enough to investigate the issue further, but I tried to make 
> the repro as short as possible.*
> 
> I get different errors depending on which Hadoop distributions I use. If I 
> use the default PySpark distribution (which includes Hadoop 2), I get the 
> following (using hadoop-aws:2.7.4)
> {{scala> spark.read.csv("s3a://okokes-test-v2.5/foo.csv").show()}}
> {{java.lang.IllegalArgumentException: The bucketName parameter must be 
> specified.}}
> {{ at 
> com.amazonaws.services.s3.AmazonS3Client.assertParameterNotNull(AmazonS3Client.java:2816)}}
> {{ at 
> com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1026)}}
> {{ at 
> com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)}}
> {{ at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)}}
> {{ at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)}}
> {{ at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)}}
> {{ at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)}}
> {{ at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)}}
> {{ at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)}}
> {{ at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)}}
> {{ at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)}}
> {{ at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)}}
> {{ at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)}}
> {{ at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)}}
> {{ at scala.Option.getOrElse(Option.scala:189)}}
> {{ at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)}}
> {{ at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)}}
> {{ at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)}}
> {{ ... 47 elided}}
> When I downloaded 3.0.0 with Hadoop 3 and ran a spark-shell there, I got this 
> error (with hadoop-aws:3.2.0):
> {{java.lang.NullPointerException: null uri host.}}
> {{ at java.base/java.util.Objects.requireNonNull(Objects.java:246)}}
> {{ at 
> org.apache.hadoop.fs.s3native.S3xLoginHelper.buildFSURI(S3xLoginHelper.java:71)}}
> {{ at org.apache.hadoop.fs.s3a.S3AFileSystem.setUri(S3AFileSystem.java:470)}}
> {{ at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:235)}}
> {{ at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)}}
> {{ at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)}}
> {{ at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)}}
> {{ at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)}}
> {{ at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)}}
> {{ at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)}}
> {{ at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)}}
> {{ at 
> 

[jira] [Resolved] (SPARK-31670) Using complex type in Aggregation with cube failed Analysis error

2020-09-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31670.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28490
[https://github.com/apache/spark/pull/28490]

> Using complex type in Aggregation with cube failed Analysis error
> -
>
> Key: SPARK-31670
> URL: https://issues.apache.org/jira/browse/SPARK-31670
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.1.0
>
>
> The following SQL fails with an analysis error:
> {code:java}
> test("TEST STRUCT FIELD WITH GROUP BY with CUBE") {
>   withTable("t1") {
> sql(
>   """create table t1(
> |a string,
> |b int,
> |c array<struct<json_string:string>>)
> |using orc""".stripMargin)
> sql(
>   """
> |select a, 
> coalesce(get_json_object(each.json_string,'$.iType'),'-127') as iType, sum(b)
> |from t1
> |LATERAL VIEW explode(c) x AS each
> |group by a, get_json_object(each.json_string,'$.iType')
> |with cube
> |""".stripMargin).explain(true)
>   }
> }
> {code}
> Error 
> {code:java}
> expression 'x.`each`' is neither present in the group by, nor is it an 
> aggregate function. Add to group by or wrap in first() (or first_value) if 
> you don't care which value you get.;;
> Aggregate [a#230, get_json_object(each#222.json_string AS json_string#223, 
> $.iType)#231, spark_grouping_id#229L], [a#230, 
> coalesce(get_json_object(each#222.json_string, $.iType), -127) AS iType#218, 
> sum(cast(b#220 as bigint)) AS sum(b)#226L]
> +- Expand [List(a#219, b#220, c#221, each#222, a#227, 
> get_json_object(each#222.json_string AS json_string#223, $.iType)#228, 0), 
> List(a#219, b#220, c#221, each#222, a#227, null, 1), List(a#219, b#220, 
> c#221, each#222, null, get_json_object(each#222.json_string AS 
> json_string#223, $.iType)#228, 2), List(a#219, b#220, c#221, each#222, null, 
> null, 3)], [a#219, b#220, c#221, each#222, a#230, 
> get_json_object(each#222.json_string AS json_string#223, $.iType)#231, 
> spark_grouping_id#229L]
>+- Project [a#219, b#220, c#221, each#222, a#219 AS a#227, 
> get_json_object(each#222.json_string, $.iType) AS 
> get_json_object(each#222.json_string AS json_string#223, $.iType)#228]
>   +- Generate explode(c#221), false, x, [each#222]
>  +- SubqueryAlias spark_catalog.default.t1
> +- Relation[a#219,b#220,c#221] orc
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31670) Using complex type in Aggregation with cube failed Analysis error

2020-09-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31670:
---

Assignee: angerszhu

> Using complex type in Aggregation with cube failed Analysis error
> -
>
> Key: SPARK-31670
> URL: https://issues.apache.org/jira/browse/SPARK-31670
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> The following SQL fails with an analysis error:
> {code:java}
> test("TEST STRUCT FIELD WITH GROUP BY with CUBE") {
>   withTable("t1") {
> sql(
>   """create table t1(
> |a string,
> |b int,
> |c array<struct<json_string:string>>)
> |using orc""".stripMargin)
> sql(
>   """
> |select a, 
> coalesce(get_json_object(each.json_string,'$.iType'),'-127') as iType, sum(b)
> |from t1
> |LATERAL VIEW explode(c) x AS each
> |group by a, get_json_object(each.json_string,'$.iType')
> |with cube
> |""".stripMargin).explain(true)
>   }
> }
> {code}
> Error 
> {code:java}
> expression 'x.`each`' is neither present in the group by, nor is it an 
> aggregate function. Add to group by or wrap in first() (or first_value) if 
> you don't care which value you get.;;
> Aggregate [a#230, get_json_object(each#222.json_string AS json_string#223, 
> $.iType)#231, spark_grouping_id#229L], [a#230, 
> coalesce(get_json_object(each#222.json_string, $.iType), -127) AS iType#218, 
> sum(cast(b#220 as bigint)) AS sum(b)#226L]
> +- Expand [List(a#219, b#220, c#221, each#222, a#227, 
> get_json_object(each#222.json_string AS json_string#223, $.iType)#228, 0), 
> List(a#219, b#220, c#221, each#222, a#227, null, 1), List(a#219, b#220, 
> c#221, each#222, null, get_json_object(each#222.json_string AS 
> json_string#223, $.iType)#228, 2), List(a#219, b#220, c#221, each#222, null, 
> null, 3)], [a#219, b#220, c#221, each#222, a#230, 
> get_json_object(each#222.json_string AS json_string#223, $.iType)#231, 
> spark_grouping_id#229L]
>+- Project [a#219, b#220, c#221, each#222, a#219 AS a#227, 
> get_json_object(each#222.json_string, $.iType) AS 
> get_json_object(each#222.json_string AS json_string#223, $.iType)#228]
>   +- Generate explode(c#221), false, x, [each#222]
>  +- SubqueryAlias spark_catalog.default.t1
> +- Relation[a#219,b#220,c#221] orc
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32739) support prune right for left semi join in DPP

2020-09-02 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-32739.
-
Fix Version/s: 3.1.0
 Assignee: Zhenhua Wang
   Resolution: Fixed

Issue resolved by pull request 29582
https://github.com/apache/spark/pull/29582

> support prune right for left semi join in DPP
> -
>
> Key: SPARK-32739
> URL: https://issues.apache.org/jira/browse/SPARK-32739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Minor
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13

2020-09-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-32756.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29584
[https://github.com/apache/spark/pull/29584]

> Fix CaseInsensitiveMap in Scala 2.13
> 
>
> Key: SPARK-32756
> URL: https://issues.apache.org/jira/browse/SPARK-32756
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Karol Chmist
>Assignee: Karol Chmist
>Priority: Minor
> Fix For: 3.1.0
>
>
>  
> "Spark SQL" module doesn't compile in Scala 2.13:
> {code:java}
> [info] Compiling 26 Scala sources to 
> /home/karol/workspace/open-source/spark/sql/core/target/scala-2.13/classes...
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:121:
>  value += is not a member of 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] Expression does not convert to assignment because:
> [error] type mismatch;
> [error] found : scala.collection.immutable.Map[String,String]
> [error] required: 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] expansion: this.extraOptions = 
> this.extraOptions.+(key.$minus$greater(value))
> [error] this.extraOptions += (key -> value)
> [error] ^
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:132:
>  value += is not a member of 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] Expression does not convert to assignment because:
> [error] type mismatch;
> [error] found : scala.collection.immutable.Map[String,String]
> [error] required: 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] expansion: this.extraOptions = 
> this.extraOptions.+(key.$minus$greater(value))
> [error] this.extraOptions += (key -> value)
> [error] ^
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:294:
>  value += is not a member of 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] Expression does not convert to assignment because:
> [error] type mismatch;
> [error] found : scala.collection.immutable.Map[String,String]
> [error] required: 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] expansion: this.extraOptions = 
> this.extraOptions.+("path".$minus$greater(path))
> [error] Error occurred in an application involving default arguments.
> [error] this.extraOptions += ("path" -> path)
> [error] ^
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:317:
>  type mismatch;
> [error] found : Iterable[(String, String)]
> [error] required: java.util.Map[String,String]
> [error] Error occurred in an application involving default arguments.
> [error] val dsOptions = new CaseInsensitiveStringMap(options.asJava)
> [error] ^
> [info] Iterable[(String, String)] <: java.util.Map[String,String]?
> [info] false
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:412:
>  value += is not a member of 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] Expression does not convert to assignment because:
> [error] type mismatch;
> [error] found : scala.collection.immutable.Map[String,String]
> [error] required: 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] expansion: DataFrameWriter.this.extraOptions = 
> DataFrameWriter.this.extraOptions.+(DataSourceUtils.PARTITIONING_COLUMNS_KEY.$minus$greater(DataSourceUtils.encodePartitioningColumns(columns)))
> [error] extraOptions += (DataSourceUtils.PARTITIONING_COLUMNS_KEY ->
> [error] ^
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala:85:
>  type mismatch;
> [error] found : 
> scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField]
> [error] required: Map[String,OrcFiltersBase.this.OrcPrimitiveField]
> [error] CaseInsensitiveMap(dedupPrimitiveFields)
> [error] ^
> [info] scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] 
> <: Map[String,OrcFiltersBase.this.OrcPrimitiveField]?
> [info] false
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala:64:
>  type mismatch;
> [error] found : Iterable[(String, String)]
> [error] required: java.util.Map[String,String]
> [error] new CaseInsensitiveStringMap(withoutPath.asJava)
> [error] ^
> [info] Iterable[(String, String)] <: 

[jira] [Assigned] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13

2020-09-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-32756:


Assignee: Karol Chmist

> Fix CaseInsensitiveMap in Scala 2.13
> 
>
> Key: SPARK-32756
> URL: https://issues.apache.org/jira/browse/SPARK-32756
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Karol Chmist
>Assignee: Karol Chmist
>Priority: Minor
>
>  
> "Spark SQL" module doesn't compile in Scala 2.13:
> {code:java}
> [info] Compiling 26 Scala sources to 
> /home/karol/workspace/open-source/spark/sql/core/target/scala-2.13/classes...
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:121:
>  value += is not a member of 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] Expression does not convert to assignment because:
> [error] type mismatch;
> [error] found : scala.collection.immutable.Map[String,String]
> [error] required: 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] expansion: this.extraOptions = 
> this.extraOptions.+(key.$minus$greater(value))
> [error] this.extraOptions += (key -> value)
> [error] ^
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:132:
>  value += is not a member of 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] Expression does not convert to assignment because:
> [error] type mismatch;
> [error] found : scala.collection.immutable.Map[String,String]
> [error] required: 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] expansion: this.extraOptions = 
> this.extraOptions.+(key.$minus$greater(value))
> [error] this.extraOptions += (key -> value)
> [error] ^
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:294:
>  value += is not a member of 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] Expression does not convert to assignment because:
> [error] type mismatch;
> [error] found : scala.collection.immutable.Map[String,String]
> [error] required: 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] expansion: this.extraOptions = 
> this.extraOptions.+("path".$minus$greater(path))
> [error] Error occurred in an application involving default arguments.
> [error] this.extraOptions += ("path" -> path)
> [error] ^
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:317:
>  type mismatch;
> [error] found : Iterable[(String, String)]
> [error] required: java.util.Map[String,String]
> [error] Error occurred in an application involving default arguments.
> [error] val dsOptions = new CaseInsensitiveStringMap(options.asJava)
> [error] ^
> [info] Iterable[(String, String)] <: java.util.Map[String,String]?
> [info] false
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:412:
>  value += is not a member of 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] Expression does not convert to assignment because:
> [error] type mismatch;
> [error] found : scala.collection.immutable.Map[String,String]
> [error] required: 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] expansion: DataFrameWriter.this.extraOptions = 
> DataFrameWriter.this.extraOptions.+(DataSourceUtils.PARTITIONING_COLUMNS_KEY.$minus$greater(DataSourceUtils.encodePartitioningColumns(columns)))
> [error] extraOptions += (DataSourceUtils.PARTITIONING_COLUMNS_KEY ->
> [error] ^
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala:85:
>  type mismatch;
> [error] found : 
> scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField]
> [error] required: Map[String,OrcFiltersBase.this.OrcPrimitiveField]
> [error] CaseInsensitiveMap(dedupPrimitiveFields)
> [error] ^
> [info] scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] 
> <: Map[String,OrcFiltersBase.this.OrcPrimitiveField]?
> [info] false
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala:64:
>  type mismatch;
> [error] found : Iterable[(String, String)]
> [error] required: java.util.Map[String,String]
> [error] new CaseInsensitiveStringMap(withoutPath.asJava)
> [error] ^
> [info] Iterable[(String, String)] <: java.util.Map[String,String]?
> [error] 7 errors found{code}
>  
> The + function in CaseInsensitiveStringMap missing, resulting in {{+}} 
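
A minimal sketch of the Scala 2.13 breakage above (names are illustrative only, not Spark's actual implementation):

{code:scala}
// Under Scala 2.13, `m += kv` on a var desugars to `m = m + kv`, and the
// inherited Map.+ returns a plain immutable.Map, so assigning the result back
// into a CaseInsensitiveMap-typed var no longer compiles. A wrapper whose own
// `+` returns the wrapper type keeps the `+=` call sites compiling.
final case class OptionsMap(underlying: Map[String, String]) {
  def +(kv: (String, String)): OptionsMap =
    OptionsMap(underlying + (kv._1.toLowerCase -> kv._2))
  def get(key: String): Option[String] = underlying.get(key.toLowerCase)
}

object PlusEqualsOn213 {
  def main(args: Array[String]): Unit = {
    var extraOptions = OptionsMap(Map.empty)
    extraOptions += ("path" -> "/tmp/out")   // expands to extraOptions = extraOptions + (...)
    println(extraOptions.get("PATH"))        // Some(/tmp/out)
  }
}
{code}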

[jira] [Commented] (SPARK-32776) Limit in streaming should not be optimized away by PropagateEmptyRelation

2020-09-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189128#comment-17189128
 ] 

Hyukjin Kwon commented on SPARK-32776:
--

Sure, thanks [~kabhwan].

> Limit in streaming should not be optimized away by PropagateEmptyRelation
> -
>
> Key: SPARK-32776
> URL: https://issues.apache.org/jira/browse/SPARK-32776
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Liwen Sun
>Assignee: Liwen Sun
>Priority: Major
> Fix For: 3.1.0, 3.0.2
>
>
> Right now, the limit operator in a streaming query may get optimized away 
> when the relation is empty. This can be problematic for stateful streaming, 
> as this empty batch will not write any state store files, and the next batch 
> will fail when trying to read these state store files and throw a file not 
> found error.
> We should not let PropagateEmptyRelation optimize away the Limit operator for 
> streaming queries.
> This ticket is intended to apply a small and safe fix for 
> PropagateEmptyRelation. A fundamental fix that can prevent this from 
> happening again in the future and in other optimizer rules is more desirable, 
> but that's a much larger task.
>  
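
A minimal sketch of the kind of guard described above (illustrative only, not the actual Spark rule; it assumes the Catalyst Rule/LocalRelation/LocalLimit APIs):

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.{LocalLimit, LocalRelation, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch: fold a Limit over an empty relation only for batch plans.
// Streaming plans are left untouched so the "empty" micro-batch still runs
// and commits its state store files.
object PruneEmptyLimitForBatchOnly extends Rule[LogicalPlan] {
  private def emptyBatchRelation(p: LogicalPlan): Boolean = p match {
    case r: LocalRelation => r.data.isEmpty && !r.isStreaming
    case _ => false
  }

  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Safe only for batch: a limit over an empty batch relation is a no-op.
    case LocalLimit(_, child) if emptyBatchRelation(child) => child
  }
}
{code}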



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32752) Alias breaks for interval typed literals

2020-09-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32752:


Assignee: (was: Apache Spark)

> Alias breaks for interval typed literals
> 
>
> Key: SPARK-32752
> URL: https://issues.apache.org/jira/browse/SPARK-32752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Cases we found:
> {code:java}
> +-- !query
> +select interval '1 day' as day
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +no viable alternative at input 'as'(line 1, pos 24)
> +
> +== SQL ==
> +select interval '1 day' as day
> +^^^
> +
> +
> +-- !query
> +select interval '1 day' day
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +Error parsing ' 1 day day' to interval, unrecognized number 'day'(line 1, 
> pos 16)
> +
> +== SQL ==
> +select interval '1 day' day
> +^^^
> +
> +
> +-- !query
> +select interval '1-2' year as y
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +Error parsing ' 1-2 year' to interval, invalid value '1-2'(line 1, pos 16)
> +
> +== SQL ==
> +select interval '1-2' year as y
> +^^^
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32752) Alias breaks for interval typed literals

2020-09-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32752:


Assignee: Apache Spark

> Alias breaks for interval typed literals
> 
>
> Key: SPARK-32752
> URL: https://issues.apache.org/jira/browse/SPARK-32752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> Cases we found:
> {code:java}
> +-- !query
> +select interval '1 day' as day
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +no viable alternative at input 'as'(line 1, pos 24)
> +
> +== SQL ==
> +select interval '1 day' as day
> +^^^
> +
> +
> +-- !query
> +select interval '1 day' day
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +Error parsing ' 1 day day' to interval, unrecognized number 'day'(line 1, 
> pos 16)
> +
> +== SQL ==
> +select interval '1 day' day
> +^^^
> +
> +
> +-- !query
> +select interval '1-2' year as y
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +Error parsing ' 1-2 year' to interval, invalid value '1-2'(line 1, pos 16)
> +
> +== SQL ==
> +select interval '1-2' year as y
> +^^^
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32752) Alias breaks for interval typed literals

2020-09-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189126#comment-17189126
 ] 

Apache Spark commented on SPARK-32752:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29627

> Alias breaks for interval typed literals
> 
>
> Key: SPARK-32752
> URL: https://issues.apache.org/jira/browse/SPARK-32752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Cases we found:
> {code:java}
> +-- !query
> +select interval '1 day' as day
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +no viable alternative at input 'as'(line 1, pos 24)
> +
> +== SQL ==
> +select interval '1 day' as day
> +^^^
> +
> +
> +-- !query
> +select interval '1 day' day
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +Error parsing ' 1 day day' to interval, unrecognized number 'day'(line 1, 
> pos 16)
> +
> +== SQL ==
> +select interval '1 day' day
> +^^^
> +
> +
> +-- !query
> +select interval '1-2' year as y
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +Error parsing ' 1-2 year' to interval, invalid value '1-2'(line 1, pos 16)
> +
> +== SQL ==
> +select interval '1-2' year as y
> +^^^
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32776) Limit in streaming should not be optimized away by PropagateEmptyRelation

2020-09-02 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189125#comment-17189125
 ] 

Jungtaek Lim commented on SPARK-32776:
--

It sounds safer to mark 3.0.x fixes as 3.0.2 while the vote is open - it's 
relatively easy to find the issues marked as 3.0.2 and correct them if the 
vote doesn't go the predicted way.

> Limit in streaming should not be optimized away by PropagateEmptyRelation
> -
>
> Key: SPARK-32776
> URL: https://issues.apache.org/jira/browse/SPARK-32776
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Liwen Sun
>Assignee: Liwen Sun
>Priority: Major
> Fix For: 3.1.0, 3.0.2
>
>
> Right now, the limit operator in a streaming query may get optimized away 
> when the relation is empty. This can be problematic for stateful streaming, 
> as this empty batch will not write any state store files, and the next batch 
> will fail when trying to read these state store files and throw a file not 
> found error.
> We should not let PropagateEmptyRelation optimize away the Limit operator for 
> streaming queries.
> This ticket is intended to apply a small and safe fix for 
> PropagateEmptyRelation. A fundamental fix that can prevent this from 
> happening again in the future and in other optimizer rules is more desirable, 
> but that's a much larger task.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32776) Limit in streaming should not be optimized away by PropagateEmptyRelation

2020-09-02 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-32776:
-
Fix Version/s: (was: 3.0.1)
   3.0.2

> Limit in streaming should not be optimized away by PropagateEmptyRelation
> -
>
> Key: SPARK-32776
> URL: https://issues.apache.org/jira/browse/SPARK-32776
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Liwen Sun
>Assignee: Liwen Sun
>Priority: Major
> Fix For: 3.1.0, 3.0.2
>
>
> Right now, the limit operator in a streaming query may get optimized away 
> when the relation is empty. This can be problematic for stateful streaming, 
> as this empty batch will not write any state store files, and the next batch 
> will fail when trying to read these state store files and throw a file not 
> found error.
> We should not let PropagateEmptyRelation optimize away the Limit operator for 
> streaming queries.
> This ticket is intended to apply a small and safe fix for 
> PropagateEmptyRelation. A fundamental fix that can prevent this from 
> happening again in the future and in other optimizer rules is more desirable, 
> but that's a much larger task.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32768) Add Parquet Timestamp output configuration to docs

2020-09-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32768.
--
Resolution: Not A Problem

> Add Parquet Timestamp output configuration to docs
> --
>
> Key: SPARK-32768
> URL: https://issues.apache.org/jira/browse/SPARK-32768
> Project: Spark
>  Issue Type: Documentation
>  Components: docs
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0
>Reporter: Ron DeFreitas
>Priority: Minor
>  Labels: docs-missing, parquet
>
> {{Spark 2.3.0 added the spark.sql.parquet.outputTimestampType configuration 
> option for controlling the underlying datatype used when writing Timestamp 
> column types into parquet files. This option is helpful for compatibility 
> with external systems that need to read the output from Spark.}}
> {{This was never exposed in the documentation. Fix should be applied to docs 
> for both the next 3.x release and 2.4.x release.}}
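
For reference, a minimal example of the option being documented (the value names are taken from SQLConf and should be double-checked against the release being documented):

{code:scala}
// Write timestamps as TIMESTAMP_MICROS instead of the default (INT96 in the
// affected versions) so external readers without INT96 support can consume
// the files. Accepted values: INT96, TIMESTAMP_MICROS, TIMESTAMP_MILLIS.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

spark.range(5)
  .selectExpr("current_timestamp() AS ts")
  .write.mode("overwrite")
  .parquet("/tmp/ts_micros")
{code}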



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32776) Limit in streaming should not be optimized away by PropagateEmptyRelation

2020-09-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32776:


Assignee: Liwen Sun

> Limit in streaming should not be optimized away by PropagateEmptyRelation
> -
>
> Key: SPARK-32776
> URL: https://issues.apache.org/jira/browse/SPARK-32776
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Liwen Sun
>Assignee: Liwen Sun
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> Right now, the limit operator in a streaming query may get optimized away 
> when the relation is empty. This can be problematic for stateful streaming, 
> as this empty batch will not write any state store files, and the next batch 
> will fail when trying to read these state store files and throw a file not 
> found error.
> We should not let PropagateEmptyRelation optimize away the Limit operator for 
> streaming queries.
> This ticket is intended to apply a small and safe fix for 
> PropagateEmptyRelation. A fundamental fix that can prevent this from 
> happening again in the future and in other optimizer rules is more desirable, 
> but that's a much larger task.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32776) Limit in streaming should not be optimized away by PropagateEmptyRelation

2020-09-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32776.
--
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

> Limit in streaming should not be optimized away by PropagateEmptyRelation
> -
>
> Key: SPARK-32776
> URL: https://issues.apache.org/jira/browse/SPARK-32776
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Liwen Sun
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> Right now, the limit operator in a streaming query may get optimized away 
> when the relation is empty. This can be problematic for stateful streaming, 
> as this empty batch will not write any state store files, and the next batch 
> will fail when trying to read these state store files and throw a file not 
> found error.
> We should not let PropagateEmptyRelation optimize away the Limit operator for 
> streaming queries.
> This ticket is intended to apply a small and safe fix for 
> PropagateEmptyRelation. A fundamental fix that can prevent this from 
> happening again in the future and in other optimizer rules is more desirable, 
> but that's a much larger task.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32776) Limit in streaming should not be optimized away by PropagateEmptyRelation

2020-09-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189091#comment-17189091
 ] 

Hyukjin Kwon commented on SPARK-32776:
--

Fixed in https://github.com/apache/spark/pull/29623

> Limit in streaming should not be optimized away by PropagateEmptyRelation
> -
>
> Key: SPARK-32776
> URL: https://issues.apache.org/jira/browse/SPARK-32776
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Liwen Sun
>Priority: Major
>
> Right now, the limit operator in a streaming query may get optimized away 
> when the relation is empty. This can be problematic for stateful streaming, 
> as this empty batch will not write any state store files, and the next batch 
> will fail when trying to read these state store files and throw a file not 
> found error.
> We should not let PropagateEmptyRelation optimize away the Limit operator for 
> streaming queries.
> This ticket is intended to apply a small and safe fix for 
> PropagateEmptyRelation. A fundamental fix that can prevent this from 
> happening again in the future and in other optimizer rules is more desirable, 
> but that's a much larger task.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32317) Parquet file loading with different schema(Decimal(N, P)) in files is not working as expected

2020-09-02 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated SPARK-32317:
---
Labels:   (was: easyfix)

> Parquet file loading with different schema(Decimal(N, P)) in files is not 
> working as expected
> -
>
> Key: SPARK-32317
> URL: https://issues.apache.org/jira/browse/SPARK-32317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Its failing in all environments that I tried.
>Reporter: Krish
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi,
>  
> We generate parquet files that are partitioned on Date on a daily basis, and 
> we sometimes send updates to historical data. What we noticed is that, due to 
> a configuration error, the patch data schema is inconsistent with earlier files.
> Assume we have files generated with a schema having ID and AMOUNT as fields. 
> Historical data has a schema like ID INT, AMOUNT DECIMAL(15,6), and the files 
> we send as updates have AMOUNT as DECIMAL(15,2). 
>  
> With two different schemas in a Date partition, when we load the data of a 
> Date into Spark, the data loads but the amount gets mangled.
>  
> file1.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,6)
>  Content:
>  1,19500.00
>  2,198.34
> file2.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,2)
>  Content:
>  1,19500.00
>  3,198.34
> Load these two files together:
> df3 = spark.read.parquet("output/")
> df3.show()  # we can see the amount getting mangled here
> +---+--------+
> | ID|  AMOUNT|
> +---+--------+
> |  1|    1.95|
> |  3|0.019834|
> |  1|19500.00|
> |  2|  198.34|
> +---+--------+
> Options Tried:
> We tried to give the schema as String for all fields, but that didn't work:
> df3 = spark.read.format("parquet").schema(schema).load("output/")
> Error: "org.apache.spark.sql.execution.QueryExecutionException: Parquet 
> column cannot be converted in file file*.snappy.parquet. Column: 
> [AMOUNT], Expected: string, Found: INT64"
>  
> I know schema merging works if it finds a few extra columns in one file, but 
> the fields that are in common need to have the same schema. That might not 
> work here.
>  
> Looking for a workaround here. Or if there is an option I haven't tried, 
> please point me to it.
>  
> With schema merging I got the error below:
> An error occurred while calling o2272.parquet. : 
> org.apache.spark.SparkException: Failed merging schema: root |-- ID: string 
> (nullable = true) |-- AMOUNT: decimal(15,6) (nullable = true) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:100)
>  at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5$adapted(SchemaMergeUtils.scala:95)
>  at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.mergeSchemasInParallel(SchemaMergeUtils.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:485)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.inferSchema(ParquetUtils.scala:107)
>  at 
> org.apache.spark.sql.execution.datasources.v2.parquet.ParquetTable.inferSchema(ParquetTable.scala:44)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
>  at scala.Option.orElse(Option.scala:447) at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
>  at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:141)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:225)
>  at scala.Option.map(Option.scala:230) at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:206) at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:674) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> 
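
For what it's worth, a minimal sketch of the arithmetic behind the mangled values above (assuming Parquet's unscaled-integer encoding of decimals):

{code:scala}
// DECIMAL values are stored as unscaled integers plus a scale taken from the
// schema. Reading a scale-2 file with a scale-6 schema shifts the decimal
// point by four places, which reproduces the numbers in the report.
val writtenScale2 = Seq(1950000L, 19834L)        // 19500.00 and 198.34 at scale 2
val misreadAtScale6 = writtenScale2.map(BigDecimal(_, 6))
println(misreadAtScale6)                         // List(1.950000, 0.019834)
{code}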

[jira] [Updated] (SPARK-32317) Parquet file loading with different schema(Decimal(N, P)) in files is not working as expected

2020-09-02 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated SPARK-32317:
---
Component/s: (was: PySpark)
 SQL

> Parquet file loading with different schema(Decimal(N, P)) in files is not 
> working as expected
> -
>
> Key: SPARK-32317
> URL: https://issues.apache.org/jira/browse/SPARK-32317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Its failing in all environments that I tried.
>Reporter: Krish
>Priority: Major
>  Labels: easyfix
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi,
>  
> We generate parquet files that are partitioned on Date on a daily basis, and 
> we sometimes send updates to historical data. What we noticed is that, due to 
> a configuration error, the patch data schema is inconsistent with earlier files.
> Assume we have files generated with a schema having ID and AMOUNT as fields. 
> Historical data has a schema like ID INT, AMOUNT DECIMAL(15,6), and the files 
> we send as updates have AMOUNT as DECIMAL(15,2). 
>  
> With two different schemas in a Date partition, when we load the data of a 
> Date into Spark, the data loads but the amount gets mangled.
>  
> file1.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,6)
>  Content:
>  1,19500.00
>  2,198.34
> file2.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,2)
>  Content:
>  1,19500.00
>  3,198.34
> Load these two files together:
> df3 = spark.read.parquet("output/")
> df3.show()  # we can see the amount getting mangled here
> +---+--------+
> | ID|  AMOUNT|
> +---+--------+
> |  1|    1.95|
> |  3|0.019834|
> |  1|19500.00|
> |  2|  198.34|
> +---+--------+
> Options Tried:
> We tried to give the schema as String for all fields, but that didn't work:
> df3 = spark.read.format("parquet").schema(schema).load("output/")
> Error: "org.apache.spark.sql.execution.QueryExecutionException: Parquet 
> column cannot be converted in file file*.snappy.parquet. Column: 
> [AMOUNT], Expected: string, Found: INT64"
>  
> I know schema merging works if it finds a few extra columns in one file, but 
> the fields that are in common need to have the same schema. That might not 
> work here.
>  
> Looking for a workaround here. Or if there is an option I haven't tried, 
> please point me to it.
>  
> With schema merging I got the error below:
> An error occurred while calling o2272.parquet. : 
> org.apache.spark.SparkException: Failed merging schema: root |-- ID: string 
> (nullable = true) |-- AMOUNT: decimal(15,6) (nullable = true) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:100)
>  at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5$adapted(SchemaMergeUtils.scala:95)
>  at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.mergeSchemasInParallel(SchemaMergeUtils.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:485)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.inferSchema(ParquetUtils.scala:107)
>  at 
> org.apache.spark.sql.execution.datasources.v2.parquet.ParquetTable.inferSchema(ParquetTable.scala:44)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
>  at scala.Option.orElse(Option.scala:447) at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
>  at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:141)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:225)
>  at scala.Option.map(Option.scala:230) at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:206) at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:674) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> 

[jira] [Assigned] (SPARK-32777) Aggregation support aggregate function with multiple foldable expressions.

2020-09-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32777:


Assignee: (was: Apache Spark)

> Aggregation support aggregate function with multiple foldable expressions.
> --
>
> Key: SPARK-32777
> URL: https://issues.apache.org/jira/browse/SPARK-32777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark SQL has a bug, shown below:
> {code:java}
> spark.sql(
>   " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3)")
>   .show()
> +-----------------+--------------------+
> |count(DISTINCT 2)|count(DISTINCT 2, 3)|
> +-----------------+--------------------+
> |                1|                   1|
> +-----------------+--------------------+
> spark.sql(
>   " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3, 2)")
>   .show()
> +-----------------+--------------------+
> |count(DISTINCT 2)|count(DISTINCT 3, 2)|
> +-----------------+--------------------+
> |                1|                   0|
> +-----------------+--------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32777) Aggregation support aggregate function with multiple foldable expressions.

2020-09-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189044#comment-17189044
 ] 

Apache Spark commented on SPARK-32777:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/29626

> Aggregation support aggregate function with multiple foldable expressions.
> --
>
> Key: SPARK-32777
> URL: https://issues.apache.org/jira/browse/SPARK-32777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark SQL has a bug, shown below:
> {code:java}
> spark.sql(
>   " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3)")
>   .show()
> +-----------------+--------------------+
> |count(DISTINCT 2)|count(DISTINCT 2, 3)|
> +-----------------+--------------------+
> |                1|                   1|
> +-----------------+--------------------+
> spark.sql(
>   " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3, 2)")
>   .show()
> +-----------------+--------------------+
> |count(DISTINCT 2)|count(DISTINCT 3, 2)|
> +-----------------+--------------------+
> |                1|                   0|
> +-----------------+--------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32777) Aggregation support aggregate function with multiple foldable expressions.

2020-09-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32777:


Assignee: Apache Spark

> Aggregation support aggregate function with multiple foldable expressions.
> --
>
> Key: SPARK-32777
> URL: https://issues.apache.org/jira/browse/SPARK-32777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> Spark SQL has a bug, shown below:
> {code:java}
> spark.sql(
>   " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3)")
>   .show()
> +-----------------+--------------------+
> |count(DISTINCT 2)|count(DISTINCT 2, 3)|
> +-----------------+--------------------+
> |                1|                   1|
> +-----------------+--------------------+
> spark.sql(
>   " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3, 2)")
>   .show()
> +-----------------+--------------------+
> |count(DISTINCT 2)|count(DISTINCT 3, 2)|
> +-----------------+--------------------+
> |                1|                   0|
> +-----------------+--------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32777) Aggregation support aggregate function with multiple foldable expressions.

2020-09-02 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-32777:
---
Description: 
Spark SQL has a bug, shown below:

{code:java}
spark.sql(
  " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3, 2)")
  .show()
+-----------------+--------------------+
|count(DISTINCT 2)|count(DISTINCT 3, 2)|
+-----------------+--------------------+
|                1|                   0|
+-----------------+--------------------+
{code}



> Aggregation support aggregate function with multiple foldable expressions.
> --
>
> Key: SPARK-32777
> URL: https://issues.apache.org/jira/browse/SPARK-32777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark SQL has a bug, shown below:
> {code:java}
> spark.sql(
>   " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3, 2)")
>   .show()
> +-----------------+--------------------+
> |count(DISTINCT 2)|count(DISTINCT 3, 2)|
> +-----------------+--------------------+
> |                1|                   0|
> +-----------------+--------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32770) Add missing imports

2020-09-02 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong resolved SPARK-32770.
--
Resolution: Won't Fix

> Add missing imports
> ---
>
> Key: SPARK-32770
> URL: https://issues.apache.org/jira/browse/SPARK-32770
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32770) Add missing imports

2020-09-02 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189030#comment-17189030
 ] 

Rohit Mishra commented on SPARK-32770:
--

[~fokko], While you are working on the PR, can you please add a description?

> Add missing imports
> ---
>
> Key: SPARK-32770
> URL: https://issues.apache.org/jira/browse/SPARK-32770
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32777) Aggregation support aggregate function with multiple foldable expressions.

2020-09-02 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-32777:
--

 Summary: Aggregation support aggregate function with multiple 
foldable expressions.
 Key: SPARK-32777
 URL: https://issues.apache.org/jira/browse/SPARK-32777
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: jiaan.geng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32777) Aggregation support aggregate function with multiple foldable expressions.

2020-09-02 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-32777:
---
Description: 
Spark SQL has a bug, shown below:

{code:java}
spark.sql(
  " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3)")
  .show()
+-----------------+--------------------+
|count(DISTINCT 2)|count(DISTINCT 2, 3)|
+-----------------+--------------------+
|                1|                   1|
+-----------------+--------------------+

spark.sql(
  " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3, 2)")
  .show()
+-----------------+--------------------+
|count(DISTINCT 2)|count(DISTINCT 3, 2)|
+-----------------+--------------------+
|                1|                   0|
+-----------------+--------------------+
{code}



  was:
Spark SQL has a bug, shown below:

{code:java}
spark.sql(
  " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3, 2)")
  .show()
+-----------------+--------------------+
|count(DISTINCT 2)|count(DISTINCT 3, 2)|
+-----------------+--------------------+
|                1|                   0|
+-----------------+--------------------+
{code}




> Aggregation support aggregate function with multiple foldable expressions.
> --
>
> Key: SPARK-32777
> URL: https://issues.apache.org/jira/browse/SPARK-32777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark SQL has a bug, shown below:
> {code:java}
> spark.sql(
>   " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3)")
>   .show()
> +-----------------+--------------------+
> |count(DISTINCT 2)|count(DISTINCT 2, 3)|
> +-----------------+--------------------+
> |                1|                   1|
> +-----------------+--------------------+
> spark.sql(
>   " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3, 2)")
>   .show()
> +-----------------+--------------------+
> |count(DISTINCT 2)|count(DISTINCT 3, 2)|
> +-----------------+--------------------+
> |                1|                   0|
> +-----------------+--------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24528) Add support to read multiple sorted bucket files for data source v1

2020-09-02 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189007#comment-17189007
 ] 

Cheng Su commented on SPARK-24528:
--

Changed the title of the Jira to `Add support to read multiple sorted bucket files 
for data source v1` to describe the actual issue more accurately.

> Add support to read multiple sorted bucket files for data source v1
> ---
>
> Key: SPARK-24528
> URL: https://issues.apache.org/jira/browse/SPARK-24528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ohad Raviv
>Priority: Major
>
> Closely related to SPARK-24410, we're trying to optimize a very common use 
> case we have: getting the most recently updated row by id from a fact table.
> We're saving the table bucketed to skip the shuffle stage, but we still 
> "waste" time on the Sort operator even though the data is already sorted.
> here's a good example:
> {code:java}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("key", "t1")
> .saveAsTable("a1"){code}
> {code:java}
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, 
> key#24L, t1, t1#25L, t2, t2#26L))])
> +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, 
> t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))])
> +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, 
> Format: Parquet, Location: ...{code}
>  
> and here's a bad example, but more realistic:
> {code:java}
> sparkSession.sql("set spark.sql.shuffle.partitions=2")
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, 
> key#32L, t1, t1#33L, t2, t2#34L))])
> +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, 
> t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))])
> +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0
> +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, 
> Format: Parquet, Location: ...
> {code}
>  
> I've traced the problem to DataSourceScanExec#235:
> {code:java}
> val sortOrder = if (sortColumns.nonEmpty) {
>   // In case of bucketing, its possible to have multiple files belonging to 
> the
>   // same bucket in a given relation. Each of these files are locally sorted
>   // but those files combined together are not globally sorted. Given that,
>   // the RDD partition will not be sorted even if the relation has sort 
> columns set
>   // Current solution is to check if all the buckets have a single file in it
>   val files = selectedPartitions.flatMap(partition => partition.files)
>   val bucketToFilesGrouping =
> files.map(_.getPath.getName).groupBy(file => 
> BucketingUtils.getBucketId(file))
>   val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= 
> 1){code}
> So obviously the code avoids dealing with this situation for now.
> Could you think of a way to solve this or bypass it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24528) Add support to read multiple sorted bucket files for data source v1

2020-09-02 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-24528:
-
Summary: Add support to read multiple sorted bucket files for data source 
v1  (was: Missing optimization for Aggregations/Windowing on a bucketed table)

> Add support to read multiple sorted bucket files for data source v1
> ---
>
> Key: SPARK-24528
> URL: https://issues.apache.org/jira/browse/SPARK-24528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ohad Raviv
>Priority: Major
>
> Closely related to SPARK-24410, we're trying to optimize a very common use case of ours: getting the most updated row by id from a fact table.
> We're saving the table bucketed to skip the shuffle stage, but we still waste time on the Sort operator even though the data is already sorted.
> Here's a good example:
> {code:java}
> sparkSession.range(N).selectExpr(
>     "id as key",
>     "id % 2 as t1",
>     "id % 3 as t2")
>   .repartition(col("key"))
>   .write
>   .mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("key", "t1")
>   .saveAsTable("a1"){code}
> {code:java}
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))])
> +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))])
>    +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, Format: Parquet, Location: ...{code}
>  
> And here's a bad, but more realistic, example:
> {code:java}
> sparkSession.sql("set spark.sql.shuffle.partitions=2")
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, 
> key#32L, t1, t1#33L, t2, t2#34L))])
> +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, 
> t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))])
> +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0
> +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, 
> Format: Parquet, Location: ...
> {code}
>  
> I've traced the problem to DataSourceScanExec#235:
> {code:java}
> val sortOrder = if (sortColumns.nonEmpty) {
>   // In case of bucketing, its possible to have multiple files belonging to the
>   // same bucket in a given relation. Each of these files are locally sorted
>   // but those files combined together are not globally sorted. Given that,
>   // the RDD partition will not be sorted even if the relation has sort columns set
>   // Current solution is to check if all the buckets have a single file in it
>   val files = selectedPartitions.flatMap(partition => partition.files)
>   val bucketToFilesGrouping =
>     files.map(_.getPath.getName).groupBy(file => BucketingUtils.getBucketId(file))
>   val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= 1){code}
> So the code currently just avoids dealing with this situation.
> Could you think of a way to solve this or bypass it?
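
Not a fix for this Jira, just a hedged workaround sketch while multi-file sorted bucket reads are unsupported: if every bucket ends up as exactly one file, the single-file check quoted above passes and the extra Sort disappears. One way to aim for that is to repartition into exactly `numBuckets` partitions on the bucket column before the bucketed write. This assumes repartitioning and bucketing agree on the hash partitioning, so verify the file layout and the plan on your Spark version; the table name below is illustrative.

{code:java}
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

object OneFilePerBucketWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("one-file-per-bucket-sketch")
      .master("local[*]")
      .getOrCreate()

    val numBuckets = 3

    // Same shape of data as in the repro above, but repartitioned into exactly
    // `numBuckets` partitions on the bucket column before the bucketed write,
    // aiming for a single file per bucket.
    spark.range(1000000L)
      .selectExpr("id as key", "id % 2 as t1", "id % 3 as t2")
      .repartition(numBuckets, col("key"))
      .write
      .mode(SaveMode.Overwrite)
      .bucketBy(numBuckets, "key")
      .sortBy("key", "t1")
      .saveAsTable("a1_single_file_buckets")

    // If the layout really is one file per bucket, this plan should contain
    // no Sort between the FileScan and the aggregates.
    spark.sql("select max(struct(t1, *)) from a1_single_file_buckets group by key").explain()

    spark.stop()
  }
}
{code}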



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org