[jira] [Created] (SPARK-33092) Support subexpression elimination in ProjectExec

2020-10-07 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-33092:
---

 Summary: Support subexpression elimination in ProjectExec
 Key: SPARK-33092
 URL: https://issues.apache.org/jira/browse/SPARK-33092
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


Users frequently write repeated expressions in projections. Currently, in 
ProjectExec we don't support subexpression elimination in whole-stage codegen. 
We can support it to reduce redundant evaluation.
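For illustration, a hypothetical query of the kind that would benefit (the column
names are made up): the same subexpression appears in several output columns of a
projection and is currently re-evaluated per column under whole-stage codegen.

{code:java}
// Hypothetical example only: both output columns repeat the same str_to_map(...)
// subexpression, so ProjectExec currently evaluates it once per column per row;
// with subexpression elimination it could be evaluated once and reused.
val df = spark.range(10)
  .selectExpr("concat('a=', cast(id as string), '&b=', cast(id * 2 as string)) AS kv")
df.selectExpr(
  "str_to_map(kv, '&', '=')['a'] AS a",
  "str_to_map(kv, '&', '=')['b'] AS b"
).show()
{code}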






[jira] [Resolved] (SPARK-33074) Classify dialect exceptions in JDBC v2 Table Catalog

2020-10-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33074.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29952
[https://github.com/apache/spark/pull/29952]

> Classify dialect exceptions in JDBC v2 Table Catalog
> 
>
> Key: SPARK-33074
> URL: https://issues.apache.org/jira/browse/SPARK-33074
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> The current implementation of v2.jdbc.JDBCTableCatalog does not honor the
> exceptions defined by org.apache.spark.sql.connector.catalog.TableCatalog, such as
> * NoSuchNamespaceException
> * NoSuchTableException
> * TableAlreadyExistsException
> Instead it throws either a dialect exception or the generic AnalysisException.
> Since we split the forming of dialect-specific statements from their execution, we
> should extend the dialect APIs and ask them how to convert their exceptions to
> TableCatalog exceptions.
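As an illustration of the direction (a hedged sketch, not the API the eventual fix
adds), each dialect could expose a hook that maps its driver-specific SQLExceptions
onto the TableCatalog exceptions listed above; the SQLState value below is only an
example.

{code:java}
// Sketch only: translate a dialect's SQLException into a TableCatalog exception.
// "42P01" (undefined table) is an illustrative PostgreSQL SQLState; a real dialect
// would map whatever states/codes its driver actually raises.
import java.sql.SQLException
import org.apache.spark.sql.catalyst.analysis.NoSuchTableException

def classifyException[T](db: String, table: String)(body: => T): T = {
  try body
  catch {
    case e: SQLException if e.getSQLState == "42P01" =>
      throw new NoSuchTableException(db, table)
    case e: SQLException => throw e // otherwise keep the original dialect exception
  }
}
{code}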






[jira] [Assigned] (SPARK-33074) Classify dialect exceptions in JDBC v2 Table Catalog

2020-10-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33074:
---

Assignee: Maxim Gekk

> Classify dialect exceptions in JDBC v2 Table Catalog
> 
>
> Key: SPARK-33074
> URL: https://issues.apache.org/jira/browse/SPARK-33074
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> The current implementation of v2.jdbc.JDBCTableCatalog does not honor the
> exceptions defined by org.apache.spark.sql.connector.catalog.TableCatalog, such as
> * NoSuchNamespaceException
> * NoSuchTableException
> * TableAlreadyExistsException
> Instead it throws either a dialect exception or the generic AnalysisException.
> Since we split the forming of dialect-specific statements from their execution, we
> should extend the dialect APIs and ask them how to convert their exceptions to
> TableCatalog exceptions.






[jira] [Commented] (SPARK-33091) Avoid using map instead of foreach to avoid potential side effect at callers of OrcUtils.readCatalystSchema

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210015#comment-17210015
 ] 

Apache Spark commented on SPARK-33091:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29974

> Avoid using map instead of foreach to avoid potential side effect at callers 
> of OrcUtils.readCatalystSchema
> ---
>
> Key: SPARK-33091
> URL: https://issues.apache.org/jira/browse/SPARK-33091
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This is a kind of follow-up of SPARK-32646. A new JIRA was filed to control
> the fixed versions properly.
> When you use {{map}}, it might be lazily evaluated and never executed. To avoid
> this, we should use {{foreach}} instead.
> See also SPARK-16694






[jira] [Assigned] (SPARK-33091) Avoid using map instead of foreach to avoid potential side effect at callers of OrcUtils.readCatalystSchema

2020-10-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33091:


Assignee: Apache Spark

> Avoid using map instead of foreach to avoid potential side effect at callers 
> of OrcUtils.readCatalystSchema
> ---
>
> Key: SPARK-33091
> URL: https://issues.apache.org/jira/browse/SPARK-33091
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> This is a kind of follow-up of SPARK-32646. A new JIRA was filed to control
> the fixed versions properly.
> When you use {{map}}, it might be lazily evaluated and never executed. To avoid
> this, we should use {{foreach}} instead.
> See also SPARK-16694






[jira] [Assigned] (SPARK-33091) Avoid using map instead of foreach to avoid potential side effect at callers of OrcUtils.readCatalystSchema

2020-10-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33091:


Assignee: (was: Apache Spark)

> Avoid using map instead of foreach to avoid potential side effect at callers 
> of OrcUtils.readCatalystSchema
> ---
>
> Key: SPARK-33091
> URL: https://issues.apache.org/jira/browse/SPARK-33091
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This is a kind of follow-up of SPARK-32646. A new JIRA was filed to control
> the fixed versions properly.
> When you use {{map}}, it might be lazily evaluated and never executed. To avoid
> this, we should use {{foreach}} instead.
> See also SPARK-16694






[jira] [Commented] (SPARK-33091) Avoid using map instead of foreach to avoid potential side effect at callers of OrcUtils.readCatalystSchema

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210014#comment-17210014
 ] 

Apache Spark commented on SPARK-33091:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29974

> Avoid using map instead of foreach to avoid potential side effect at callers 
> of OrcUtils.readCatalystSchema
> ---
>
> Key: SPARK-33091
> URL: https://issues.apache.org/jira/browse/SPARK-33091
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This is a kind of follow-up of SPARK-32646. A new JIRA was filed to control
> the fixed versions properly.
> When you use {{map}}, it might be lazily evaluated and never executed. To avoid
> this, we should use {{foreach}} instead.
> See also SPARK-16694






[jira] [Updated] (SPARK-33091) Avoid using map instead of foreach to avoid potential side effect at callers of OrcUtils.readCatalystSchema

2020-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33091:
-
Summary: Avoid using map instead of foreach to avoid potential side effect 
at callers of OrcUtils.readCatalystSchema  (was: Avoid using map instead of 
foreach to avoid potential side effect at callee of OrcUtils.readCatalystSchema)

> Avoid using map instead of foreach to avoid potential side effect at callers 
> of OrcUtils.readCatalystSchema
> ---
>
> Key: SPARK-33091
> URL: https://issues.apache.org/jira/browse/SPARK-33091
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This is a kind of follow-up of SPARK-32646. A new JIRA was filed to control
> the fixed versions properly.
> When you use {{map}}, it might be lazily evaluated and never executed. To avoid
> this, we should use {{foreach}} instead.
> See also SPARK-16694






[jira] [Updated] (SPARK-33091) Avoid using map instead of foreach to avoid potential side effect at callee of OrcUtils.readCatalystSchema

2020-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33091:
-
Summary: Avoid using map instead of foreach to avoid potential side effect 
at callee of OrcUtils.readCatalystSchema  (was: Avoid using map instead of 
foreach to avoid potential side effect at OrcUtils.readCatalystSchema)

> Avoid using map instead of foreach to avoid potential side effect at callee 
> of OrcUtils.readCatalystSchema
> --
>
> Key: SPARK-33091
> URL: https://issues.apache.org/jira/browse/SPARK-33091
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This is a kind of follow-up of SPARK-32646. A new JIRA was filed to control
> the fixed versions properly.
> When you use {{map}}, it might be lazily evaluated and never executed. To avoid
> this, we should use {{foreach}} instead.
> See also SPARK-16694






[jira] [Updated] (SPARK-33091) Avoid using map instead of foreach to avoid potential side effect at OrcUtils.readCatalystSchema

2020-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33091:
-
Priority: Minor  (was: Major)

> Avoid using map instead of foreach to avoid potential side effect at 
> OrcUtils.readCatalystSchema
> 
>
> Key: SPARK-33091
> URL: https://issues.apache.org/jira/browse/SPARK-33091
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This is a kind of follow-up of SPARK-32646. A new JIRA was filed to control
> the fixed versions properly.
> When you use {{map}}, it might be lazily evaluated and never executed. To avoid
> this, we should use {{foreach}} instead.
> See also SPARK-16694






[jira] [Created] (SPARK-33091) Avoid using map instead of foreach to avoid potential side effect at OrcUtils.readCatalystSchema

2020-10-07 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-33091:


 Summary: Avoid using map instead of foreach to avoid potential 
side effect at OrcUtils.readCatalystSchema
 Key: SPARK-33091
 URL: https://issues.apache.org/jira/browse/SPARK-33091
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 3.1.0
Reporter: Hyukjin Kwon


This is a kind of follow-up of SPARK-32646. A new JIRA was filed to control the 
fixed versions properly.

When you use {{map}}, it might be lazily evaluated and never executed. To avoid 
this, we should use {{foreach}} instead.

See also SPARK-16694
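A minimal Scala sketch of the hazard in general (not the actual Spark call site):
with a lazy collection, the body passed to {{map}} never runs unless something
consumes the result, while {{foreach}} runs it eagerly.

{code:java}
// Illustration only: the side effect inside map is silently dropped because the
// lazy view is never forced; foreach evaluates eagerly and returns Unit.
val names = Seq("a", "b", "c").view
names.map(n => println(s"map: $n"))          // prints nothing
names.foreach(n => println(s"foreach: $n"))  // prints all three elements
{code}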






[jira] [Updated] (SPARK-33091) Avoid using map instead of foreach to avoid potential side effect at OrcUtils.readCatalystSchema

2020-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33091:
-
Issue Type: Improvement  (was: Bug)

> Avoid using map instead of foreach to avoid potential side effect at 
> OrcUtils.readCatalystSchema
> 
>
> Key: SPARK-33091
> URL: https://issues.apache.org/jira/browse/SPARK-33091
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This is a kind of follow-up of SPARK-32646. A new JIRA was filed to control
> the fixed versions properly.
> When you use {{map}}, it might be lazily evaluated and never executed. To avoid
> this, we should use {{foreach}} instead.
> See also SPARK-16694






[jira] [Resolved] (SPARK-32282) Improve EnsureRequirements.reorderJoinKeys to handle more scenarios such as PartitioningCollection

2020-10-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32282.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29074
[https://github.com/apache/spark/pull/29074]

> Improve EnsureRequirements.reorderJoinKeys to handle more scenarios such as 
> PartitioningCollection
> 
>
> Key: SPARK-32282
> URL: https://issues.apache.org/jira/browse/SPARK-32282
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.1.0
>
>
> EnsureRequirements.reorderJoinKeys can be improved to handle the following
> scenarios:
> # If the keys cannot be reordered to match the left-side HashPartitioning,
> consider the right-side HashPartitioning.
> # Handle PartitioningCollection, which may contain HashPartitioning
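To make the intent concrete, an illustration-only sketch of key reordering
(deliberately simplified, not Spark's actual EnsureRequirements code): given the
expressions of an existing HashPartitioning, reorder the join keys on both sides so
they line up with it and the extra shuffle can be avoided.

{code:java}
// Simplified sketch: if every expected partitioning expression is one of the left
// join keys, reorder both key lists to follow the expected order; otherwise give up
// (the real rule would then try the right-side HashPartitioning or each member of a
// PartitioningCollection). Assumes the key lists contain no duplicates.
def reorderKeys[A](leftKeys: Seq[A], rightKeys: Seq[A], expected: Seq[A]): Option[(Seq[A], Seq[A])] = {
  if (expected.length == leftKeys.length && expected.forall(leftKeys.contains)) {
    val order = expected.map(leftKeys.indexOf)
    Some((order.map(leftKeys(_)), order.map(rightKeys(_))))
  } else {
    None
  }
}
{code}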






[jira] [Comment Edited] (SPARK-32989) Performance regression when selecting from str_to_map

2020-10-07 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209990#comment-17209990
 ] 

Yang Jie edited comment on SPARK-32989 at 10/8/20, 3:59 AM:


I found that if stringToMap uses codegen, the optimization of 
`spark.sql.subexpressionElimination.enabled` will be ignored.


was (Author: luciferyang):
I found that if stringToMap uses codegen, the optimization of 
`spark.sql.subexpressionElimination.enabled` will be ignored.

> Performance regression when selecting from str_to_map
> -
>
> Key: SPARK-32989
> URL: https://issues.apache.org/jira/browse/SPARK-32989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Ondrej Kokes
>Priority: Minor
>
> When I create a map using str_to_map and select more than a single value, I 
> notice a notable performance regression in 3.0.1 compared to 2.4.7. When 
> selecting a single value, the performance is the same. Plans are identical 
> between versions.
> It seems like in 2.x the map from str_to_map is preserved for a given row, 
> but in 3.x it's recalculated for each column. One hint that it might be the 
> case is that when I tried forcing materialisation of said map in 3.x (by a 
> coalesce, don't know if there's a better way), I got the performance roughly 
> to 2.x levels.
> Here's a reproducer (the csv in question gets autogenerated by the python 
> code):
> {code:java}
> $ head regression.csv 
> foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> ... (10M more rows)
> {code}
> {code:python}
> import time
> import os
> import pyspark
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
>
> if __name__ == '__main__':
>     print(pyspark.__version__)
>     spark = SparkSession.builder.getOrCreate()
>     filename = 'regression.csv'
>     if not os.path.isfile(filename):
>         with open(filename, 'wt') as fw:
>             fw.write('foo\n')
>             for _ in range(10_000_000):
>                 fw.write('foo=bar=bak=foo\n')
>     df = spark.read.option('header', True).csv(filename)
>     t = time.time()
>     dd = (df
>         .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
>         .select(
>             f.col('my_map')['foo'],
>         )
>     )
>     dd.write.mode('overwrite').csv('tmp')
>     t2 = time.time()
>     print('selected one', t2 - t)
>     dd = (df
>         .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
>         # .coalesce(100)  # forcing evaluation before selection speeds it up in 3.0.1
>         .select(
>             f.col('my_map')['foo'],
>             f.col('my_map')['bar'],
>             f.col('my_map')['baz'],
>         )
>     )
>     dd.explain(True)
>     dd.write.mode('overwrite').csv('tmp')
>     t3 = time.time()
>     print('selected three', t3 - t2)
> {code}
> Results for 2.4.7 and 3.0.1, both installed from PyPI, Python 3.7, macOS 
> (times are in seconds)
> {code:java}
> # 3.0.1
> # selected one 6.375471830368042  
> 
> # selected three 14.847578048706055
> # 2.4.7
> # selected one 6.679579019546509  
> 
> # selected three 6.5622029304504395  
> {code}






[jira] [Commented] (SPARK-32989) Performance regression when selecting from str_to_map

2020-10-07 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209990#comment-17209990
 ] 

Yang Jie commented on SPARK-32989:
--

I found that if stringToMap uses codegen, the optimization of 
`spark.sql.subexpressionElimination.enabled` will be ignored.
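One quick way to check this locally (both configuration keys exist in Spark 3.0.x;
the comparison itself is only a suggestion) is to rerun the reproducer with
whole-stage codegen disabled and subexpression elimination left on, then compare the
timings:

{code:java}
// Rerun the reproducer under both settings and compare the 'selected three' timings.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.conf.set("spark.sql.subexpressionElimination.enabled", "true")
{code}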

> Performance regression when selecting from str_to_map
> -
>
> Key: SPARK-32989
> URL: https://issues.apache.org/jira/browse/SPARK-32989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Ondrej Kokes
>Priority: Minor
>
> When I create a map using str_to_map and select more than a single value, I 
> notice a notable performance regression in 3.0.1 compared to 2.4.7. When 
> selecting a single value, the performance is the same. Plans are identical 
> between versions.
> It seems like in 2.x the map from str_to_map is preserved for a given row, 
> but in 3.x it's recalculated for each column. One hint that it might be the 
> case is that when I tried forcing materialisation of said map in 3.x (by a 
> coalesce, don't know if there's a better way), I got the performance roughly 
> to 2.x levels.
> Here's a reproducer (the csv in question gets autogenerated by the python 
> code):
> {code:java}
> $ head regression.csv 
> foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> ... (10M more rows)
> {code}
> {code:python}
> import time
> import os
> import pyspark
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
>
> if __name__ == '__main__':
>     print(pyspark.__version__)
>     spark = SparkSession.builder.getOrCreate()
>     filename = 'regression.csv'
>     if not os.path.isfile(filename):
>         with open(filename, 'wt') as fw:
>             fw.write('foo\n')
>             for _ in range(10_000_000):
>                 fw.write('foo=bar=bak=foo\n')
>     df = spark.read.option('header', True).csv(filename)
>     t = time.time()
>     dd = (df
>         .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
>         .select(
>             f.col('my_map')['foo'],
>         )
>     )
>     dd.write.mode('overwrite').csv('tmp')
>     t2 = time.time()
>     print('selected one', t2 - t)
>     dd = (df
>         .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
>         # .coalesce(100)  # forcing evaluation before selection speeds it up in 3.0.1
>         .select(
>             f.col('my_map')['foo'],
>             f.col('my_map')['bar'],
>             f.col('my_map')['baz'],
>         )
>     )
>     dd.explain(True)
>     dd.write.mode('overwrite').csv('tmp')
>     t3 = time.time()
>     print('selected three', t3 - t2)
> {code}
> Results for 2.4.7 and 3.0.1, both installed from PyPI, Python 3.7, macOS 
> (times are in seconds)
> {code:java}
> # 3.0.1
> # selected one 6.375471830368042  
> 
> # selected three 14.847578048706055
> # 2.4.7
> # selected one 6.679579019546509  
> 
> # selected three 6.5622029304504395  
> {code}






[jira] [Comment Edited] (SPARK-32989) Performance regression when selecting from str_to_map

2020-10-07 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209979#comment-17209979
 ] 

Yang Jie edited comment on SPARK-32989 at 10/8/20, 3:24 AM:


[~ondrej] You're right. It will execute N times with codegen (SPARK-30356) when 
selecting N columns that use the stringToMap expression, compared to selecting one 
column, cc [~Qin Yao] [~cloud_fan]


was (Author: luciferyang):
[~ondrej] You're right. It will execute n times with codegen (SPARK-30356) when 
selecting n columns that use the stringToMap expression, cc [~Qin Yao] [~cloud_fan]

> Performance regression when selecting from str_to_map
> -
>
> Key: SPARK-32989
> URL: https://issues.apache.org/jira/browse/SPARK-32989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Ondrej Kokes
>Priority: Minor
>
> When I create a map using str_to_map and select more than a single value, I 
> notice a notable performance regression in 3.0.1 compared to 2.4.7. When 
> selecting a single value, the performance is the same. Plans are identical 
> between versions.
> It seems like in 2.x the map from str_to_map is preserved for a given row, 
> but in 3.x it's recalculated for each column. One hint that it might be the 
> case is that when I tried forcing materialisation of said map in 3.x (by a 
> coalesce, don't know if there's a better way), I got the performance roughly 
> to 2.x levels.
> Here's a reproducer (the csv in question gets autogenerated by the python 
> code):
> {code:java}
> $ head regression.csv 
> foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> ... (10M more rows)
> {code}
> {code:python}
> import time
> import os
> import pyspark
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
>
> if __name__ == '__main__':
>     print(pyspark.__version__)
>     spark = SparkSession.builder.getOrCreate()
>     filename = 'regression.csv'
>     if not os.path.isfile(filename):
>         with open(filename, 'wt') as fw:
>             fw.write('foo\n')
>             for _ in range(10_000_000):
>                 fw.write('foo=bar=bak=foo\n')
>     df = spark.read.option('header', True).csv(filename)
>     t = time.time()
>     dd = (df
>         .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
>         .select(
>             f.col('my_map')['foo'],
>         )
>     )
>     dd.write.mode('overwrite').csv('tmp')
>     t2 = time.time()
>     print('selected one', t2 - t)
>     dd = (df
>         .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
>         # .coalesce(100)  # forcing evaluation before selection speeds it up in 3.0.1
>         .select(
>             f.col('my_map')['foo'],
>             f.col('my_map')['bar'],
>             f.col('my_map')['baz'],
>         )
>     )
>     dd.explain(True)
>     dd.write.mode('overwrite').csv('tmp')
>     t3 = time.time()
>     print('selected three', t3 - t2)
> {code}
> Results for 2.4.7 and 3.0.1, both installed from PyPI, Python 3.7, macOS 
> (times are in seconds)
> {code:java}
> # 3.0.1
> # selected one 6.375471830368042  
> 
> # selected three 14.847578048706055
> # 2.4.7
> # selected one 6.679579019546509  
> 
> # selected three 6.5622029304504395  
> {code}






[jira] [Comment Edited] (SPARK-32989) Performance regression when selecting from str_to_map

2020-10-07 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209979#comment-17209979
 ] 

Yang Jie edited comment on SPARK-32989 at 10/8/20, 3:23 AM:


[~ondrej] You're right. It will execute n times with codegen (SPARK-30356) when 
selecting n columns that use the stringToMap expression, cc [~Qin Yao] [~cloud_fan]


was (Author: luciferyang):
[~ondrej] You're right. It will execute n times with codegen (SPARK-30356) when 
selecting n columns that use the stringToMap expression, cc [~Qin Yao]

> Performance regression when selecting from str_to_map
> -
>
> Key: SPARK-32989
> URL: https://issues.apache.org/jira/browse/SPARK-32989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Ondrej Kokes
>Priority: Minor
>
> When I create a map using str_to_map and select more than a single value, I 
> notice a notable performance regression in 3.0.1 compared to 2.4.7. When 
> selecting a single value, the performance is the same. Plans are identical 
> between versions.
> It seems like in 2.x the map from str_to_map is preserved for a given row, 
> but in 3.x it's recalculated for each column. One hint that it might be the 
> case is that when I tried forcing materialisation of said map in 3.x (by a 
> coalesce, don't know if there's a better way), I got the performance roughly 
> to 2.x levels.
> Here's a reproducer (the csv in question gets autogenerated by the python 
> code):
> {code:java}
> $ head regression.csv 
> foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> ... (10M more rows)
> {code}
> {code:python}
> import time
> import os
> import pyspark
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
>
> if __name__ == '__main__':
>     print(pyspark.__version__)
>     spark = SparkSession.builder.getOrCreate()
>     filename = 'regression.csv'
>     if not os.path.isfile(filename):
>         with open(filename, 'wt') as fw:
>             fw.write('foo\n')
>             for _ in range(10_000_000):
>                 fw.write('foo=bar=bak=foo\n')
>     df = spark.read.option('header', True).csv(filename)
>     t = time.time()
>     dd = (df
>         .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
>         .select(
>             f.col('my_map')['foo'],
>         )
>     )
>     dd.write.mode('overwrite').csv('tmp')
>     t2 = time.time()
>     print('selected one', t2 - t)
>     dd = (df
>         .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
>         # .coalesce(100)  # forcing evaluation before selection speeds it up in 3.0.1
>         .select(
>             f.col('my_map')['foo'],
>             f.col('my_map')['bar'],
>             f.col('my_map')['baz'],
>         )
>     )
>     dd.explain(True)
>     dd.write.mode('overwrite').csv('tmp')
>     t3 = time.time()
>     print('selected three', t3 - t2)
> {code}
> Results for 2.4.7 and 3.0.1, both installed from PyPI, Python 3.7, macOS 
> (times are in seconds)
> {code:java}
> # 3.0.1
> # selected one 6.375471830368042  
> 
> # selected three 14.847578048706055
> # 2.4.7
> # selected one 6.679579019546509  
> 
> # selected three 6.5622029304504395  
> {code}






[jira] [Comment Edited] (SPARK-32989) Performance regression when selecting from str_to_map

2020-10-07 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209979#comment-17209979
 ] 

Yang Jie edited comment on SPARK-32989 at 10/8/20, 3:22 AM:


[~ondrej] You're right. It will execute n times with codegen (SPARK-30356) when 
selecting n columns that use the stringToMap expression, cc [~Qin Yao]


was (Author: luciferyang):
[~ondrej] You're right. It will execute n times with codegen (SPARK-30356) when 
selecting n columns that use the stringToMap expression.

> Performance regression when selecting from str_to_map
> -
>
> Key: SPARK-32989
> URL: https://issues.apache.org/jira/browse/SPARK-32989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Ondrej Kokes
>Priority: Minor
>
> When I create a map using str_to_map and select more than a single value, I 
> notice a notable performance regression in 3.0.1 compared to 2.4.7. When 
> selecting a single value, the performance is the same. Plans are identical 
> between versions.
> It seems like in 2.x the map from str_to_map is preserved for a given row, 
> but in 3.x it's recalculated for each column. One hint that it might be the 
> case is that when I tried forcing materialisation of said map in 3.x (by a 
> coalesce, don't know if there's a better way), I got the performance roughly 
> to 2.x levels.
> Here's a reproducer (the csv in question gets autogenerated by the python 
> code):
> {code:java}
> $ head regression.csv 
> foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> ... (10M more rows)
> {code}
> {code:python}
> import time
> import os
> import pyspark
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
>
> if __name__ == '__main__':
>     print(pyspark.__version__)
>     spark = SparkSession.builder.getOrCreate()
>     filename = 'regression.csv'
>     if not os.path.isfile(filename):
>         with open(filename, 'wt') as fw:
>             fw.write('foo\n')
>             for _ in range(10_000_000):
>                 fw.write('foo=bar=bak=foo\n')
>     df = spark.read.option('header', True).csv(filename)
>     t = time.time()
>     dd = (df
>         .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
>         .select(
>             f.col('my_map')['foo'],
>         )
>     )
>     dd.write.mode('overwrite').csv('tmp')
>     t2 = time.time()
>     print('selected one', t2 - t)
>     dd = (df
>         .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
>         # .coalesce(100)  # forcing evaluation before selection speeds it up in 3.0.1
>         .select(
>             f.col('my_map')['foo'],
>             f.col('my_map')['bar'],
>             f.col('my_map')['baz'],
>         )
>     )
>     dd.explain(True)
>     dd.write.mode('overwrite').csv('tmp')
>     t3 = time.time()
>     print('selected three', t3 - t2)
> {code}
> Results for 2.4.7 and 3.0.1, both installed from PyPI, Python 3.7, macOS 
> (times are in seconds)
> {code:java}
> # 3.0.1
> # selected one 6.375471830368042  
> 
> # selected three 14.847578048706055
> # 2.4.7
> # selected one 6.679579019546509  
> 
> # selected three 6.5622029304504395  
> {code}






[jira] [Commented] (SPARK-32989) Performance regression when selecting from str_to_map

2020-10-07 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209979#comment-17209979
 ] 

Yang Jie commented on SPARK-32989:
--

[~ondrej] You're right. It will execute n times with codegen (SPARK-30356) when 
selecting n columns that use the stringToMap expression.

> Performance regression when selecting from str_to_map
> -
>
> Key: SPARK-32989
> URL: https://issues.apache.org/jira/browse/SPARK-32989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Ondrej Kokes
>Priority: Minor
>
> When I create a map using str_to_map and select more than a single value, I 
> notice a notable performance regression in 3.0.1 compared to 2.4.7. When 
> selecting a single value, the performance is the same. Plans are identical 
> between versions.
> It seems like in 2.x the map from str_to_map is preserved for a given row, 
> but in 3.x it's recalculated for each column. One hint that it might be the 
> case is that when I tried forcing materialisation of said map in 3.x (by a 
> coalesce, don't know if there's a better way), I got the performance roughly 
> to 2.x levels.
> Here's a reproducer (the csv in question gets autogenerated by the python 
> code):
> {code:java}
> $ head regression.csv 
> foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> foo=bar=bak=foo
> ... (10M more rows)
> {code}
> {code:python}
> import time
> import os
> import pyspark
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
>
> if __name__ == '__main__':
>     print(pyspark.__version__)
>     spark = SparkSession.builder.getOrCreate()
>     filename = 'regression.csv'
>     if not os.path.isfile(filename):
>         with open(filename, 'wt') as fw:
>             fw.write('foo\n')
>             for _ in range(10_000_000):
>                 fw.write('foo=bar=bak=foo\n')
>     df = spark.read.option('header', True).csv(filename)
>     t = time.time()
>     dd = (df
>         .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
>         .select(
>             f.col('my_map')['foo'],
>         )
>     )
>     dd.write.mode('overwrite').csv('tmp')
>     t2 = time.time()
>     print('selected one', t2 - t)
>     dd = (df
>         .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
>         # .coalesce(100)  # forcing evaluation before selection speeds it up in 3.0.1
>         .select(
>             f.col('my_map')['foo'],
>             f.col('my_map')['bar'],
>             f.col('my_map')['baz'],
>         )
>     )
>     dd.explain(True)
>     dd.write.mode('overwrite').csv('tmp')
>     t3 = time.time()
>     print('selected three', t3 - t2)
> {code}
> Results for 2.4.7 and 3.0.1, both installed from PyPI, Python 3.7, macOS 
> (times are in seconds)
> {code:java}
> # 3.0.1
> # selected one 6.375471830368042  
> 
> # selected three 14.847578048706055
> # 2.4.7
> # selected one 6.679579019546509  
> 
> # selected three 6.5622029304504395  
> {code}






[jira] [Resolved] (SPARK-33089) avro format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33089.
--
Fix Version/s: 3.1.0
   3.0.2
 Assignee: Yuning Zhang
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/29971

> avro format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> ---
>
> Key: SPARK-33089
> URL: https://issues.apache.org/jira/browse/SPARK-33089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuning Zhang
>Assignee: Yuning Zhang
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> When running:
> {code:java}
> spark.read.format("avro").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.
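For context, a hedged usage sketch of what the fix enables (the Hadoop key and value
below are only an example, not taken from the ticket): options passed to the data
source should now reach the FileSystem that actually reads the Avro files.

{code:java}
// Illustrative only: a Hadoop/file-system setting supplied as a data source option.
val conf = Map("fs.defaultFS" -> "hdfs://namenode:8020")
spark.read.format("avro").options(conf).load("/data/events")
{code}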






[jira] [Resolved] (SPARK-32793) Expose assert_true in Python/Scala APIs and add error message parameter

2020-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32793.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29947
[https://github.com/apache/spark/pull/29947]

> Expose assert_true in Python/Scala APIs and add error message parameter
> ---
>
> Key: SPARK-32793
> URL: https://issues.apache.org/jira/browse/SPARK-32793
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Karen Feng
>Assignee: Karen Feng
>Priority: Minor
> Fix For: 3.1.0
>
>
> # Add RAISEERROR() (or RAISE_ERROR()) to the API
>  # Add Scala/Python/R versions of the API for ASSERT_TRUE()
>  # Add an extra parameter to ASSERT_TRUE() as (cond, message), where the
> `message` parameter is evaluated lazily, only when the condition is not true
>  # Change the implementation of ASSERT_TRUE() so that it is rewritten to IF()
> during optimization.
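A hedged usage sketch of the resulting Scala API (function names follow the list
above; the table and column names are assumptions, and the exact signatures may
differ slightly from what ships):

{code:java}
// Sketch only: assert_true(cond, message) evaluates the message lazily and fails the
// query when the condition is false; raise_error(message) always fails.
import org.apache.spark.sql.functions.{assert_true, raise_error, col, lit}

val orders = spark.table("orders") // made-up table name
orders.select(assert_true(col("amount") >= 0, lit("amount must be non-negative")))
orders.select(raise_error(lit("this branch should be unreachable")))
{code}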






[jira] [Assigned] (SPARK-32793) Expose assert_true in Python/Scala APIs and add error message parameter

2020-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32793:


Assignee: Karen Feng

> Expose assert_true in Python/Scala APIs and add error message parameter
> ---
>
> Key: SPARK-32793
> URL: https://issues.apache.org/jira/browse/SPARK-32793
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Karen Feng
>Assignee: Karen Feng
>Priority: Minor
>
> # Add RAISEERROR() (or RAISE_ERROR()) to the API
>  # Add Scala/Python/R versions of the API for ASSERT_TRUE()
>  # Add an extra parameter to ASSERT_TRUE() as (cond, message), where the
> `message` parameter is evaluated lazily, only when the condition is not true
>  # Change the implementation of ASSERT_TRUE() so that it is rewritten to IF()
> during optimization.






[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209970#comment-17209970
 ] 

Apache Spark commented on SPARK-20202:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29973

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1, 2.2.3, 2.3.4, 2.4.4, 3.0.0, 3.1.0
>Reporter: Owen O'Malley
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>
> Spark can't continue to depend on their fork of Hive and must move to 
> standard Hive versions.






[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209967#comment-17209967
 ] 

Apache Spark commented on SPARK-20202:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29973

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1, 2.2.3, 2.3.4, 2.4.4, 3.0.0, 3.1.0
>Reporter: Owen O'Malley
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>
> Spark can't continue to depend on their fork of Hive and must move to 
> standard Hive versions.






[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209969#comment-17209969
 ] 

Apache Spark commented on SPARK-20202:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29973

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1, 2.2.3, 2.3.4, 2.4.4, 3.0.0, 3.1.0
>Reporter: Owen O'Malley
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>
> Spark can't continue to depend on their fork of Hive and must move to 
> standard Hive versions.






[jira] [Commented] (SPARK-33090) Upgrade Google Guava

2020-10-07 Thread Stephen Coy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209968#comment-17209968
 ] 

Stephen Coy commented on SPARK-33090:
-

I can create a PR for this if you like...

 

> Upgrade Google Guava
> 
>
> Key: SPARK-33090
> URL: https://issues.apache.org/jira/browse/SPARK-33090
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.1
>Reporter: Stephen Coy
>Priority: Major
>
> Hadoop versions newer than 3.2.0 (such as 3.2.1 and 3.3.0) have started using 
> features from newer versions of Google Guava.
> This leads to MethodNotFound exceptions, etc. in Spark builds that specify 
> newer versions of Hadoop. I believe this is due to the use of new methods in 
> com.google.common.base.Preconditions.
> The above versions of Hadoop use guava-27.0-jre, whereas Spark is currently 
> glued to guava-14.0.1.
> I have been running a Spark cluster with the version bumped to guava-29.0-jre 
> without issue.
> Partly due to the way Spark is built, this change is a little more 
> complicated than just changing the version, because newer versions of guava 
> have a new dependency on com.google.guava:failureaccess:1.0.
>  
>  






[jira] [Commented] (SPARK-33082) Remove hive-1.2 workaround code

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209966#comment-17209966
 ] 

Apache Spark commented on SPARK-33082:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29973

> Remove hive-1.2 workaround code
> ---
>
> Key: SPARK-33082
> URL: https://issues.apache.org/jira/browse/SPARK-33082
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.1.0
>
>







[jira] [Commented] (SPARK-33082) Remove hive-1.2 workaround code

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209965#comment-17209965
 ] 

Apache Spark commented on SPARK-33082:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29973

> Remove hive-1.2 workaround code
> ---
>
> Key: SPARK-33082
> URL: https://issues.apache.org/jira/browse/SPARK-33082
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.1.0
>
>







[jira] [Created] (SPARK-33090) Upgrade Google Guava

2020-10-07 Thread Stephen Coy (Jira)
Stephen Coy created SPARK-33090:
---

 Summary: Upgrade Google Guava
 Key: SPARK-33090
 URL: https://issues.apache.org/jira/browse/SPARK-33090
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.1
Reporter: Stephen Coy


Hadoop versions newer than 3.2.0 (such as 3.2.1 and 3.3.0) have started using 
features from newer versions of Google Guava.

This leads to MethodNotFound exceptions, etc. in Spark builds that specify newer 
versions of Hadoop. I believe this is due to the use of new methods in 
com.google.common.base.Preconditions.

The above versions of Hadoop use guava-27.0-jre, whereas Spark is currently 
glued to guava-14.0.1.

I have been running a Spark cluster with the version bumped to guava-29.0-jre 
without issue.

Partly due to the way Spark is built, this change is a little more complicated 
than just changing the version, because newer versions of guava have a new 
dependency on com.google.guava:failureaccess:1.0.

 

 






[jira] [Resolved] (SPARK-32960) Provide better exception on temporary view against DataFrameWriterV2

2020-10-07 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-32960.
--
Resolution: Won't Fix

Superseded by SPARK-33087

> Provide better exception on temporary view against DataFrameWriterV2
> 
>
> Key: SPARK-32960
> URL: https://issues.apache.org/jira/browse/SPARK-32960
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> DataFrameWriterV2 doesn't handle the fallback if catalog.loadTable doesn't 
> provide any Table instance. This ends up leading a temp view to 
> NoSuchTableException.
> It's OK to fail in such a case unless we want to resolve it later like 
> DataFrameWriter.insertInto, but throwing NoSuchTableException is probably 
> confusing, as the view is loaded via catalog.loadTable and fails the capability 
> check, not with NoSuchTableException.
> We could check up front whether the table identifier refers to a temp view, and 
> provide a better exception.
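For illustration (the view name is made up, and the exact error depends on the Spark
version), the confusing case described above looks roughly like this:

{code:java}
// Sketch of the scenario: writing to a temp view through DataFrameWriterV2 fails,
// but with a table-resolution/capability style error rather than a clear
// "cannot write to a temporary view" message.
spark.range(3).createOrReplaceTempView("people_tmp")
spark.range(3).writeTo("people_tmp").append()
{code}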






[jira] [Resolved] (SPARK-33086) Provide static annotations for pyspark.resource.* modules

2020-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33086.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29969
[https://github.com/apache/spark/pull/29969]

> Provide static annotations for pyspark.resource.* modules
> --
>
> Key: SPARK-33086
> URL: https://issues.apache.org/jira/browse/SPARK-33086
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 3.1.0
>
>
> At the point of porting, {{pyspark.resource}} had only dynamic annotations 
> generated using {{stubgen}}.
> Since they are part of a public API, we should provide static annotations 
> instead.






[jira] [Assigned] (SPARK-33086) Provide static annotations for pyspark.resource.* modules

2020-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33086:


Assignee: Maciej Szymkiewicz

> Provide static annotations for pyspark.resource.* modules
> --
>
> Key: SPARK-33086
> URL: https://issues.apache.org/jira/browse/SPARK-33086
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
>
> At the point of porting, {{pyspark.resource}} had only dynamic annotations 
> generated using {{stubgen}}.
> Since they are part of a public API, we should provide static annotations 
> instead.






[jira] [Resolved] (SPARK-31913) StackOverflowError in FileScanRDD

2020-10-07 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-31913.
--
Resolution: Cannot Reproduce

> StackOverflowError in FileScanRDD
> -
>
> Key: SPARK-31913
> URL: https://issues.apache.org/jira/browse/SPARK-31913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Genmao Yu
>Priority: Minor
>
> Reading from FileScanRDD may fail with a StackOverflowError in my 
> environment:
> - There are a large number of empty files in the table partitions.
> - `spark.sql.files.maxPartitionBytes` is set to a large value: 1024MB
> A quick workaround is to set `spark.sql.files.maxPartitionBytes` to a small 
> value, like the default 128MB.
> A better way is to resolve the recursive calls in FileScanRDD.
> {code}
> java.lang.StackOverflowError
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.getSubject(Subject.java:297)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:648)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2828)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2818)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2684)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>   at 
> org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:38)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:640)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:148)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:143)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:326)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
> {code}
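The workaround mentioned in the description, written out as a config sketch (128 MB
is the documented default for this setting):

{code:java}
// Keep maxPartitionBytes at a moderate value so a single partition does not pack an
// enormous number of (mostly empty) files into one recursive read chain.
spark.conf.set("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)
{code}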






[jira] [Commented] (SPARK-31913) StackOverflowError in FileScanRDD

2020-10-07 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209946#comment-17209946
 ] 

Takeshi Yamamuro commented on SPARK-31913:
--

Since this issue looks env-dependent and the PR was automatically closed, I 
will close this.

> StackOverflowError in FileScanRDD
> -
>
> Key: SPARK-31913
> URL: https://issues.apache.org/jira/browse/SPARK-31913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Genmao Yu
>Priority: Minor
>
> Reading from FileScanRDD may fail with a StackOverflowError in my 
> environment:
> - There are a large number of empty files in the table partitions.
> - `spark.sql.files.maxPartitionBytes` is set to a large value: 1024MB
> A quick workaround is to set `spark.sql.files.maxPartitionBytes` to a small 
> value, like the default 128MB.
> A better way is to resolve the recursive calls in FileScanRDD.
> {code}
> java.lang.StackOverflowError
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.getSubject(Subject.java:297)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:648)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2828)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2818)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2684)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>   at 
> org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:38)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:640)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:148)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:143)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:326)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
> {code}






[jira] [Commented] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled

2020-10-07 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209936#comment-17209936
 ] 

Dongjoon Hyun commented on SPARK-28067:
---

[~anuragmantri] For this one, it is not backported to 3.0.0 either.

> Incorrect results in decimal aggregation with whole-stage code gen enabled
> --
>
> Key: SPARK-28067
> URL: https://issues.apache.org/jira/browse/SPARK-28067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Mark Sirek
>Assignee: Sunitha Kambhampati
>Priority: Critical
>  Labels: correctness
> Fix For: 3.1.0
>
>
> The following test case involving a join followed by a sum aggregation 
> returns the wrong answer for the sum:
>  
> {code:java}
> val df = Seq(
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
> "intNum").agg(sum("decNum"))
> scala> df2.show(40,false)
> +-----------+
> |sum(decNum)|
> +-----------+
> |    4000.00|
> +-----------+
>  
> {code}
>  
> The result should be 104000..
> It appears a partial sum is computed for each join key, as the result 
> returned would be the answer for all rows matching intNum === 1.
> If only the rows with intNum === 2 are included, the answer given is null:
>  
> {code:java}
> scala> val df3 = df.filter($"intNum" === lit(2))
>  df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: 
> decimal(38,18), intNum: int]
> scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, 
> "intNum").agg(sum("decNum"))
>  df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
> scala> df4.show(40,false)
> +-----------+
> |sum(decNum)|
> +-----------+
> |       null|
> +-----------+
>  
> {code}
>  
> The correct answer, 10., doesn't fit in 
> the DataType picked for the result, decimal(38,18), so an overflow occurs, 
> which Spark then converts to null.
> The first example, which doesn't filter out the intNum === 1 values, should 
> also return null, indicating overflow, but it doesn't. This may mislead the 
> user into thinking a valid sum was computed.
> If whole-stage code gen is turned off:
> spark.conf.set("spark.sql.codegen.wholeStage", false)
> ... incorrect results are not returned because the overflow is caught as an 
> exception:
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 
> exceeds max precision 38
>  
>  
>  
>  
>  
>  
>  
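> As a sanity check on the expected magnitude (a sketch using only the data shown above):
> {code:java}
> // df has 2 rows with intNum = 1 and 10 rows with intNum = 2, so the self-join
> // on intNum matches 2*2 + 10*10 pairs, and the aggregation sums 104 copies of decNum.
> val matchedRows = 2 * 2 + 10 * 10   // = 104, hence the leading digits 104... in the correct sum
> {code}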



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32978) Incorrect number of dynamic part metric

2020-10-07 Thread Aoyuan Liao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aoyuan Liao updated SPARK-32978:

Description: 
How to reproduce this issue:
{code:sql}
create table dynamic_partition(i bigint, part bigint) using parquet partitioned 
by (part);
insert overwrite table dynamic_partition partition(part) select id, id % 50 as 
part  from range(1);
{code}
The number of dynamic parts should be 50, but it is 800 on the web UI.

  was:
How to reproduce this issue:
{code:sql}
create table dynamic_partition(i bigint, part bigint) using parquet partitioned 
by (part);
insert overwrite table dynamic_partition partition(part) select id, id % 50 as 
part  from range(1);
{code}

The number of dynamic part should be 50, but it is 800.



> Incorrect number of dynamic part metric
> ---
>
> Key: SPARK-32978
> URL: https://issues.apache.org/jira/browse/SPARK-32978
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: screenshot-1.png
>
>
> How to reproduce this issue:
> {code:sql}
> create table dynamic_partition(i bigint, part bigint) using parquet 
> partitioned by (part);
> insert overwrite table dynamic_partition partition(part) select id, id % 50 
> as part  from range(1);
> {code}
> The number of dynamic parts should be 50, but it is 800 on the web UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16859) History Server storage information is missing

2020-10-07 Thread Aoyuan Liao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209932#comment-17209932
 ] 

Aoyuan Liao commented on SPARK-16859:
-

"spark.eventLog.logBlockUpdates.enabled=true" works for me on Spark 3.0.1

> History Server storage information is missing
> -
>
> Key: SPARK-16859
> URL: https://issues.apache.org/jira/browse/SPARK-16859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Andrei Ivanov
>Priority: Major
>  Labels: historyserver, newbie
>
> It looks like the job history Storage tab in the History Server has been broken for 
> completed jobs since *1.6.2*. 
> More specifically, it has been broken since 
> [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].
> I've fixed it for my installation by effectively reverting the above patch 
> ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).
> IMHO, the most straightforward fix would be to implement 
> _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_, making 
> sure it works from _ReplayListenerBus_.
> The downside is that it will still work incorrectly with pre-patch job 
> histories. But then, it hasn't worked since *1.6.2* anyhow.
> PS: I'd really love to have this fixed eventually, but I'm pretty new to 
> Apache Spark and lack hands-on Scala experience, so I'd prefer that it be 
> fixed by someone experienced with roadmap vision. If nobody volunteers I'll 
> try to patch it myself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled

2020-10-07 Thread Anurag Mantripragada (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209923#comment-17209923
 ] 

Anurag Mantripragada commented on SPARK-28067:
--

I just checked that the issue exists in branch-2.4. Since this is a `correctness` 
issue, should we backport it to branch-2.4?
cc: [~cloud_fan], [~dongjoon]

> Incorrect results in decimal aggregation with whole-stage code gen enabled
> --
>
> Key: SPARK-28067
> URL: https://issues.apache.org/jira/browse/SPARK-28067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Mark Sirek
>Assignee: Sunitha Kambhampati
>Priority: Critical
>  Labels: correctness
> Fix For: 3.1.0
>
>
> The following test case involving a join followed by a sum aggregation 
> returns the wrong answer for the sum:
>  
> {code:java}
> val df = Seq(
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
> "intNum").agg(sum("decNum"))
> scala> df2.show(40,false)
> +-----------+
> |sum(decNum)|
> +-----------+
> |    4000.00|
> +-----------+
>  
> {code}
>  
> The result should be 104000..
> It appears a partial sum is computed for each join key, as the result 
> returned would be the answer for all rows matching intNum === 1.
> If only the rows with intNum === 2 are included, the answer given is null:
>  
> {code:java}
> scala> val df3 = df.filter($"intNum" === lit(2))
>  df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: 
> decimal(38,18), intNum: int]
> scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, 
> "intNum").agg(sum("decNum"))
>  df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
> scala> df4.show(40,false)
> +-----------+
> |sum(decNum)|
> +-----------+
> |       null|
> +-----------+
>  
> {code}
>  
> The correct answer, 10., doesn't fit in 
> the DataType picked for the result, decimal(38,18), so an overflow occurs, 
> which Spark then converts to null.
> The first example, which doesn't filter out the intNum === 1 values, should 
> also return null, indicating overflow, but it doesn't. This may mislead the 
> user into thinking a valid sum was computed.
> If whole-stage code gen is turned off:
> spark.conf.set("spark.sql.codegen.wholeStage", false)
> ... incorrect results are not returned because the overflow is caught as an 
> exception:
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 
> exceeds max precision 38
>  
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33081) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209919#comment-17209919
 ] 

Apache Spark commented on SPARK-33081:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29972

> Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of 
> columns (DB2 dialect)
> --
>
> Key: SPARK-33081
> URL: https://issues.apache.org/jira/browse/SPARK-33081
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Override the default SQL strings for:
> * ALTER TABLE UPDATE COLUMN TYPE
> * ALTER TABLE UPDATE COLUMN NULLABILITY
> in the DB2 JDBC dialect according to the official documentation.
> Write DB2 integration tests for JDBC.
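> For illustration, a sketch of the kind of SQL strings the DB2 dialect could emit (DB2 LUW syntax; the table and column names are hypothetical):
> {code:java}
> // ALTER TABLE UPDATE COLUMN TYPE
> val updateColumnType = "ALTER TABLE employees ALTER COLUMN salary SET DATA TYPE DECIMAL(31, 2)"
> // ALTER TABLE UPDATE COLUMN NULLABILITY
> val updateColumnNullability = "ALTER TABLE employees ALTER COLUMN salary DROP NOT NULL"
> {code}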



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33081) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)

2020-10-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33081:


Assignee: (was: Apache Spark)

> Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of 
> columns (DB2 dialect)
> --
>
> Key: SPARK-33081
> URL: https://issues.apache.org/jira/browse/SPARK-33081
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Override the default SQL strings for:
> * ALTER TABLE UPDATE COLUMN TYPE
> * ALTER TABLE UPDATE COLUMN NULLABILITY
> in the DB2 JDBC dialect according to the official documentation.
> Write DB2 integration tests for JDBC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33081) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)

2020-10-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33081:


Assignee: Apache Spark

> Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of 
> columns (DB2 dialect)
> --
>
> Key: SPARK-33081
> URL: https://issues.apache.org/jira/browse/SPARK-33081
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>
> Override the default SQL strings for:
> * ALTER TABLE UPDATE COLUMN TYPE
> * ALTER TABLE UPDATE COLUMN NULLABILITY
> in the DB2 JDBC dialect according to the official documentation.
> Write DB2 integration tests for JDBC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21708) use sbt 1.x

2020-10-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-21708:
-

Assignee: Denis Pyshev

> use sbt 1.x
> ---
>
> Key: SPARK-21708
> URL: https://issues.apache.org/jira/browse/SPARK-21708
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: PJ Fanning
>Assignee: Denis Pyshev
>Priority: Minor
> Fix For: 3.1.0
>
>
> Should improve sbt build times.
> http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html
> According to https://github.com/sbt/sbt/issues/3424, we will need to change 
> the HTTP location where we get the sbt-launch jar.
> Other related issues:
> SPARK-14401
> https://github.com/typesafehub/sbteclipse/issues/343
> https://github.com/jrudolph/sbt-dependency-graph/issues/134
> https://github.com/AlpineNow/junit_xml_listener/issues/6
> https://github.com/spray/sbt-revolver/issues/62
> https://github.com/ihji/sbt-antlr4/issues/14



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21708) use sbt 1.x

2020-10-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-21708.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29286
[https://github.com/apache/spark/pull/29286]

> use sbt 1.x
> ---
>
> Key: SPARK-21708
> URL: https://issues.apache.org/jira/browse/SPARK-21708
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: PJ Fanning
>Priority: Minor
> Fix For: 3.1.0
>
>
> Should improve sbt build times.
> http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html
> According to https://github.com/sbt/sbt/issues/3424, we will need to change 
> the HTTP location where we get the sbt-launch jar.
> Other related issues:
> SPARK-14401
> https://github.com/typesafehub/sbteclipse/issues/343
> https://github.com/jrudolph/sbt-dependency-graph/issues/134
> https://github.com/AlpineNow/junit_xml_listener/issues/6
> https://github.com/spray/sbt-revolver/issues/62
> https://github.com/ihji/sbt-antlr4/issues/14



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209876#comment-17209876
 ] 

Apache Spark commented on SPARK-32001:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29968

> Create Kerberos authentication provider API in JDBC connector
> -
>
> Key: SPARK-32001
> URL: https://issues.apache.org/jira/browse/SPARK-32001
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.1.0
>
>
> Adding an embedded provider for all the possible databases would generate a high 
> maintenance cost on the Spark side.
> Instead, an API can be introduced which would allow further providers to be 
> implemented independently.
> One important requirement I suggest is: JDBC connection providers must 
> be loaded independently, just like delegation token providers.
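> A rough sketch of the idea (the trait name, method signatures, and ServiceLoader wiring below are hypothetical, not the final Spark API):
> {code:java}
> import java.sql.{Connection, Driver}
> 
> // Hypothetical provider contract, analogous to delegation token providers.
> trait ConnectionProvider {
>   def canHandle(driver: Driver, options: Map[String, String]): Boolean
>   def getConnection(driver: Driver, options: Map[String, String]): Connection
> }
> 
> // Implementations would be discovered independently, e.g. registered via
> // META-INF/services entries and loaded with java.util.ServiceLoader at runtime.
> {code}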



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209875#comment-17209875
 ] 

Apache Spark commented on SPARK-32001:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29968

> Create Kerberos authentication provider API in JDBC connector
> -
>
> Key: SPARK-32001
> URL: https://issues.apache.org/jira/browse/SPARK-32001
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.1.0
>
>
> Adding an embedded provider for all the possible databases would generate a high 
> maintenance cost on the Spark side.
> Instead, an API can be introduced which would allow further providers to be 
> implemented independently.
> One important requirement I suggest is: JDBC connection providers must 
> be loaded independently, just like delegation token providers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33089) avro format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209873#comment-17209873
 ] 

Apache Spark commented on SPARK-33089:
--

User 'yuningzh-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/29971

> avro format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> ---
>
> Key: SPARK-33089
> URL: https://issues.apache.org/jira/browse/SPARK-33089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuning Zhang
>Priority: Major
>
> When running:
> {code:java}
> spark.read.format("avro").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.
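> For example (a sketch; the option keys, bucket, and path are illustrative), per-read Hadoop settings such as S3A credentials should reach the file system but currently do not for avro:
> {code:java}
> val conf = Map(
>   "fs.s3a.access.key" -> "...",
>   "fs.s3a.secret.key" -> "...")
> 
> // Affected path: the options never reach the underlying FileSystem for avro.
> val df = spark.read.format("avro").options(conf).load("s3a://bucket/events")
> 
> // Workaround sketch: set the keys on the shared Hadoop configuration instead.
> conf.foreach { case (k, v) => spark.sparkContext.hadoopConfiguration.set(k, v) }
> {code}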



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33089) avro format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33089:


Assignee: (was: Apache Spark)

> avro format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> ---
>
> Key: SPARK-33089
> URL: https://issues.apache.org/jira/browse/SPARK-33089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuning Zhang
>Priority: Major
>
> When running:
> {code:java}
> spark.read.format("avro").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33089) avro format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33089:


Assignee: Apache Spark

> avro format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> ---
>
> Key: SPARK-33089
> URL: https://issues.apache.org/jira/browse/SPARK-33089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuning Zhang
>Assignee: Apache Spark
>Priority: Major
>
> When running:
> {code:java}
> spark.read.format("avro").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33089) avro format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-07 Thread Yuning Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuning Zhang updated SPARK-33089:
-
Description: 
When running:
{code:java}
spark.read.format("avro").options(conf).load(path)
{code}
The underlying file system will not receive the `conf` options.

  was:
When running:
{code:java}
spark.read.format("avro").options(conf).load(path)
{code}
The 


> avro format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> ---
>
> Key: SPARK-33089
> URL: https://issues.apache.org/jira/browse/SPARK-33089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuning Zhang
>Priority: Major
>
> When running:
> {code:java}
> spark.read.format("avro").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33089) avro format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-07 Thread Yuning Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuning Zhang updated SPARK-33089:
-
Description: 
When running:
{code:java}
spark.read.format("avro").options(conf).load(path)
{code}
The 

> avro format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> ---
>
> Key: SPARK-33089
> URL: https://issues.apache.org/jira/browse/SPARK-33089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuning Zhang
>Priority: Major
>
> When running:
> {code:java}
> spark.read.format("avro").options(conf).load(path)
> {code}
> The 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33089) avro format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-07 Thread Yuning Zhang (Jira)
Yuning Zhang created SPARK-33089:


 Summary: avro format does not propagate Hadoop config from DS 
options to underlying HDFS file system
 Key: SPARK-33089
 URL: https://issues.apache.org/jira/browse/SPARK-33089
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuning Zhang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33019) Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

2020-10-07 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209864#comment-17209864
 ] 

Dongjoon Hyun commented on SPARK-33019:
---

[~ste...@apache.org]. The user will still use the v2 committer if they already set the 
conf explicitly. In addition, the user can still use the v2 committer if they want.
>  you can still use v2 committer

We only prevent users from blindly expecting the same behavior during migration 
from Apache Spark 3.0 to Apache Spark 3.1.
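For example, a user who understands the trade-off can keep pinning the algorithm version explicitly (a sketch; set it before any writes, or pass it at submit time):
{code:java}
// v1 is the safe default; switch "1" to "2" only if you accept the risk described in MAPREDUCE-7282.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "1")
// Equivalent at submit time:
//   --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1
{code}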

> Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default
> -
>
> Key: SPARK-33019
> URL: https://issues.apache.org/jira/browse/SPARK-33019
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.2, 3.1.0
>
>
> By default, Spark should use a safe file output committer algorithm to avoid 
> MAPREDUCE-7282.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-33042) Add a test case to ensure changes to spark.sql.optimizer.maxIterations take effect at runtime

2020-10-07 Thread Yuning Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuning Zhang closed SPARK-33042.


> Add a test case to ensure changes to spark.sql.optimizer.maxIterations take 
> effect at runtime
> -
>
> Key: SPARK-33042
> URL: https://issues.apache.org/jira/browse/SPARK-33042
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuning Zhang
>Assignee: Yuning Zhang
>Priority: Major
> Fix For: 3.1.0
>
>
> Add a test case to ensure changes to `spark.sql.optimizer.maxIterations` 
> take effect at runtime.
> Currently, there is only one related test case: 
> [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala#L156]
> However, this test case only checks the value of the conf can be changed at 
> runtime. It does not check the updated value is actually used by the 
> Optimizer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33074) Classify dialect exceptions in JDBC v2 Table Catalog

2020-10-07 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33074:
---
Parent: SPARK-24907
Issue Type: Sub-task  (was: Improvement)

> Classify dialect exceptions in JDBC v2 Table Catalog
> 
>
> Key: SPARK-33074
> URL: https://issues.apache.org/jira/browse/SPARK-33074
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The current implementation of v2.jdbc.JDBCTableCatalog does not handle the 
> exceptions defined by org.apache.spark.sql.connector.catalog.TableCatalog at 
> all, such as
> * NoSuchNamespaceException
> * NoSuchTableException
> * TableAlreadyExistsException
> It either throws a dialect exception or the generic AnalysisException.
> Since we split the forming of dialect-specific statements from their execution, we 
> should extend the dialect APIs and ask them how to convert their exceptions to 
> TableCatalog exceptions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33082) Remove hive-1.2 workaround code

2020-10-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33082:
-

Assignee: Dongjoon Hyun

> Remove hive-1.2 workaround code
> ---
>
> Key: SPARK-33082
> URL: https://issues.apache.org/jira/browse/SPARK-33082
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33082) Remove hive-1.2 workaround code

2020-10-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33082.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29961
[https://github.com/apache/spark/pull/29961]

> Remove hive-1.2 workaround code
> ---
>
> Key: SPARK-33082
> URL: https://issues.apache.org/jira/browse/SPARK-33082
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33088) Enhance ExecutorPlugin API to include methods for task start and end events

2020-10-07 Thread Samuel Souza (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Samuel Souza updated SPARK-33088:
-
Description: 
On [SPARK-24918|https://issues.apache.org/jira/browse/SPARK-24918]'s 
[SIPP|https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/view#|https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/edit#],
 it was raised to potentially add methods to ExecutorPlugin interface on task 
start and end:

{quote}The basic interface can just be a marker trait, as that allows a plugin 
to monitor general characteristics of the JVM (eg. monitor memory or take 
thread dumps).   Optionally, we could include methods for task start and end 
events.   This would allow more control on monitoring – eg., you could start 
polling thread dumps only if there was a task from a particular stage that had 
been taking too long. But anything task related is a bit trickier to decide the 
right api. Should the task end event also get the failure reason? Should those 
events get called in the same thread as the task runner, or in another thread?
{quote}

The ask is to add exactly that. I've put up a draft PR [in our fork of 
spark|https://github.com/palantir/spark/pull/713] and I'm happy to push it 
upstream. Also happy to receive comments on what's the right interface to 
expose - not opinionated on that front, tried to expose the simplest interface 
for now.

The main reason for this ask is to propagate tracing information from the 
driver to the executors 
([SPARK-21962|https://issues.apache.org/jira/browse/SPARK-21962] has some 
context). On [HADOOP-15566|https://issues.apache.org/jira/browse/HADOOP-15566] 
I see we're discussing how to add tracing to the Apache ecosystem, but my 
problem is slightly different: I want to use this interface to propagate 
tracing information to my framework of choice. If the Hadoop issue gets solved 
we'll have a framework to communicate tracing information inside the Apache 
ecosystem, but it's highly unlikely that all Spark users will use the same 
common framework. Therefore we should still provide plugin interfaces where the 
tracing information can be propagated appropriately.

To give more color, in our case the tracing information is [stored in a thread 
local|https://github.com/palantir/tracing-java/blob/4.9.0/tracing/src/main/java/com/palantir/tracing/Tracer.java#L61],
 therefore it needs to be set in the same thread which is executing the task. 
[*]

While our framework is specific, I imagine such an interface could be useful in 
general. Happy to hear your thoughts about it.

[*] Something I did not mention was how to propagate the tracing information 
from the driver to the executors. For that I intend to use 1. the driver's 
localProperties, which 2. will be eventually propagated to the executors' 
TaskContext, which 3. I'll be able to access from the methods above.

  was:
On https://issues.apache.org/jira/browse/SPARK-24918's 
[SIPP|[https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/view#|https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/edit#]],
 it was raised to potentially add methods to ExecutorPlugin interface on task 
start and end:
{quote}The basic interface can just be a marker trait, as that allows a plugin 
to monitor general characteristics of the JVM (eg. monitor memory or take 
thread dumps).   Optionally, we could include methods for task start and end 
events.   This would allow more control on monitoring -- eg., you could start 
polling thread dumps only if there was a task from a particular stage that had 
been taking too long. But anything task related is a bit trickier to decide the 
right api. Should the task end event also get the failure reason? Should those 
events get called in the same thread as the task runner, or in another thread?
{quote}
The ask is to add exactly that. I've put up a draft PR in our fork of spark 
[here| [https://github.com/palantir/spark/pull/713]] and I'm happy to push it 
upstream. Also happy to receive comments on what's the right interface to 
expose - not opinionated on that front, tried to expose the simplest interface 
for now.

The main reason for this ask is to propagate tracing information from the 
driver to the executors (https://issues.apache.org/jira/browse/SPARK-21962 has 
some context). On https://issues.apache.org/jira/browse/HADOOP-15566 I see 
we're discussing how to add tracing to the Apache ecosystem, but my problem is 
slightly different: I want to use this interface to propagate tracing 
information to my framework of choice. If the Hadoop issue gets solved we'll 
have a framework to communicate tracing information inside the Apache 
ecosystem, but it's highly unlikely that all Spark users will use the same 
common framework. Therefore we should still provide plugin interfaces where the 
tracing 

[jira] [Created] (SPARK-33088) Enhance ExecutorPlugin API to include methods for task start and end events

2020-10-07 Thread Samuel Souza (Jira)
Samuel Souza created SPARK-33088:


 Summary: Enhance ExecutorPlugin API to include methods for task 
start and end events
 Key: SPARK-33088
 URL: https://issues.apache.org/jira/browse/SPARK-33088
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Samuel Souza


On https://issues.apache.org/jira/browse/SPARK-24918's 
[SIPP|https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/view#|https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/edit#],
 it was raised to potentially add methods to ExecutorPlugin interface on task 
start and end:
{quote}The basic interface can just be a marker trait, as that allows a plugin 
to monitor general characteristics of the JVM (eg. monitor memory or take 
thread dumps).   Optionally, we could include methods for task start and end 
events.   This would allow more control on monitoring -- eg., you could start 
polling thread dumps only if there was a task from a particular stage that had 
been taking too long. But anything task related is a bit trickier to decide the 
right api. Should the task end event also get the failure reason? Should those 
events get called in the same thread as the task runner, or in another thread?
{quote}
The ask is to add exactly that. I've put up a draft PR in our fork of spark 
[here|https://github.com/palantir/spark/pull/713] and I'm happy to push it 
upstream. Also happy to receive comments on what's the right interface to 
expose - not opinionated on that front, tried to expose the simplest interface 
for now.

The main reason for this ask is to propagate tracing information from the 
driver to the executors (https://issues.apache.org/jira/browse/SPARK-21962 has 
some context). On https://issues.apache.org/jira/browse/HADOOP-15566 I see 
we're discussing how to add tracing to the Apache ecosystem, but my problem is 
slightly different: I want to use this interface to propagate tracing 
information to my framework of choice. If the Hadoop issue gets solved we'll 
have a framework to communicate tracing information inside the Apache 
ecosystem, but it's highly unlikely that all Spark users will use the same 
common framework. Therefore we should still provide plugin interfaces where the 
tracing information can be propagated appropriately.

To give more color, in our case the tracing information is [stored in a thread 
local|https://github.com/palantir/tracing-java/blob/develop/tracing/src/main/java/com/palantir/tracing/Tracer.java#L61],
 therefore it needs to be set in the same thread which is executing the task. 
[*]

While our framework is specific, I imagine such an interface could be useful in 
general. Happy to hear your thoughts about it.

[*] Something I did not mention was how to propagate the tracing information 
from the driver to the executors. For that I intend to use 1. the driver's 
localProperties, which 2. will be eventually propagated to the executors' 
TaskContext, which 3. I'll be able to access from the methods above.
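A rough sketch of that flow (the executor-side hook is part of the proposal, not an existing API; setLocalProperty/getLocalProperty are existing, while {{sc}} and {{currentTraceId}} are assumed to be in scope):
{code:java}
// Driver side: attach the trace id to jobs submitted from this thread.
sc.setLocalProperty("trace.id", currentTraceId)

// Executor side, inside the proposed task-start hook (runs in the task's thread):
//   val traceId = org.apache.spark.TaskContext.get().getLocalProperty("trace.id")
//   ... hand traceId to the tracing framework's thread-local state ...
{code}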



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33019) Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

2020-10-07 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209766#comment-17209766
 ] 

Steve Loughran commented on SPARK-33019:


Related to this, I'm proposing we add a method which will let the MR engine and 
spark driver work out if a committer can be recovered from -and choose how to 
react if it says "no" - fail or warn + commit another attempt

That way if you want full due diligence you can still use v2 committer, (or EMR 
committer), but get the ability to make failures during the commit phase 
something which triggers a failure. Most of the time, it won't.


> Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default
> -
>
> Key: SPARK-33019
> URL: https://issues.apache.org/jira/browse/SPARK-33019
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.2, 3.1.0
>
>
> By default, Spark should use a safe file output committer algorithm to avoid 
> MAPREDUCE-7282.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27484) Create the streaming writing logical plan node before query is analyzed

2020-10-07 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209755#comment-17209755
 ] 

Dongjoon Hyun commented on SPARK-27484:
---

It seems that [~kabhwan] also hits this issue and documents it in his 
SPARK-32896 PR like the following.
- 
https://github.com/apache/spark/pull/29767/files#diff-d35e8fce09686073f81de598ed657de7R314-R319
{code}
  // Currently we don't create a logical streaming writer node in logical 
plan, so cannot rely
   // on analyzer to resolve it. Directly lookup only for temp view to 
provide clearer message.
   // TODO (SPARK-27484): we should add the writing node before the plan is 
analyzed.
   if 
(df.sparkSession.sessionState.catalog.isTempView(originalMultipartIdentifier)) {
 throw new AnalysisException(s"Temporary view $tableName doesn't 
support streaming write")
   }
{code}

> Create the streaming writing logical plan node before query is analyzed
> ---
>
> Key: SPARK-27484
> URL: https://issues.apache.org/jira/browse/SPARK-27484
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27484) Create the streaming writing logical plan node before query is analyzed

2020-10-07 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209755#comment-17209755
 ] 

Dongjoon Hyun edited comment on SPARK-27484 at 10/7/20, 6:17 PM:
-

It seems that [~kabhwan] also hits this issue and documents it in his 
SPARK-32896 PR like the following.
- 
https://github.com/apache/spark/pull/29767/files#diff-d35e8fce09686073f81de598ed657de7R314-R319
{code}
// Currently we don't create a logical streaming writer node in logical plan, 
so cannot rely
// on analyzer to resolve it. Directly lookup only for temp view to provide 
clearer message.
// TODO (SPARK-27484): we should add the writing node before the plan is 
analyzed.
if 
(df.sparkSession.sessionState.catalog.isTempView(originalMultipartIdentifier)) {
  throw new AnalysisException(s"Temporary view $tableName doesn't support 
streaming write")
}
{code}


was (Author: dongjoon):
It seems that [~kabhwan] also hits this issue and documents it in his 
SPARK-32896 PR like the following.
- 
https://github.com/apache/spark/pull/29767/files#diff-d35e8fce09686073f81de598ed657de7R314-R319
{code}
  // Currently we don't create a logical streaming writer node in logical 
plan, so cannot rely
   // on analyzer to resolve it. Directly lookup only for temp view to 
provide clearer message.
   // TODO (SPARK-27484): we should add the writing node before the plan is 
analyzed.
   if 
(df.sparkSession.sessionState.catalog.isTempView(originalMultipartIdentifier)) {
 throw new AnalysisException(s"Temporary view $tableName doesn't 
support streaming write")
   }
{code}

> Create the streaming writing logical plan node before query is analyzed
> ---
>
> Key: SPARK-27484
> URL: https://issues.apache.org/jira/browse/SPARK-27484
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33087) DataFrameWriterV2 should delegate table resolution to the analyzer

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209712#comment-17209712
 ] 

Apache Spark commented on SPARK-33087:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29970

> DataFrameWriterV2 should delegate table resolution to the analyzer
> --
>
> Key: SPARK-33087
> URL: https://issues.apache.org/jira/browse/SPARK-33087
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33087) DataFrameWriterV2 should delegate table resolution to the analyzer

2020-10-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33087:


Assignee: Apache Spark  (was: Wenchen Fan)

> DataFrameWriterV2 should delegate table resolution to the analyzer
> --
>
> Key: SPARK-33087
> URL: https://issues.apache.org/jira/browse/SPARK-33087
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33087) DataFrameWriterV2 should delegate table resolution to the analyzer

2020-10-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33087:


Assignee: Wenchen Fan  (was: Apache Spark)

> DataFrameWriterV2 should delegate table resolution to the analyzer
> --
>
> Key: SPARK-33087
> URL: https://issues.apache.org/jira/browse/SPARK-33087
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33087) DataFrameWriterV2 should delegate table resolution to the analyzer

2020-10-07 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-33087:
---

 Summary: DataFrameWriterV2 should delegate table resolution to the 
analyzer
 Key: SPARK-33087
 URL: https://issues.apache.org/jira/browse/SPARK-33087
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33005) Kubernetes GA Preparation

2020-10-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33005:
-

Assignee: Dongjoon Hyun

> Kubernetes GA Preparation
> -
>
> Key: SPARK-33005
> URL: https://issues.apache.org/jira/browse/SPARK-33005
> Project: Spark
>  Issue Type: Umbrella
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33005) Kubernetes GA Preparation

2020-10-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33005:
--
Target Version/s: 3.1.0

> Kubernetes GA Preparation
> -
>
> Key: SPARK-33005
> URL: https://issues.apache.org/jira/browse/SPARK-33005
> Project: Spark
>  Issue Type: Umbrella
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32067) Use unique ConfigMap name for executor pod template

2020-10-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32067:
-

Assignee: Stijn De Haes

> Use unique ConfigMap name for executor pod template
> ---
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.1, 3.1.0
>Reporter: James Yu
>Assignee: Stijn De Haes
>Priority: Major
>
> THE BUG:
> The bug is reproducible by spark-submitting two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially, with app2 launching while app1 is still in the middle of 
> ramping up all its executor pods. The unwanted result is that some launched 
> executor pods of app1 end up having app2's executor pod template applied to 
> them.
> The root cause appears to be that app1's podspec-configmap got overwritten by 
> app2 during the overlapping launching periods because both apps use the same 
> ConfigMap (name). This causes some of app1's executor pods, ramped up after 
> app2 is launched, to be inadvertently launched with app2's pod template. 
> The issue can be seen as follows:
> First, after submitting app1, you get these configmaps:
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   9m46s
> default  podspec-configmap  1   12m{code}
> Then submit app2 while app1 is still ramping up its executors. The 
> podspec-configmap is modified by app2.
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   11m43s
> default  app2--driver-conf-map  1   10s
> default  podspec-configmap  1   13m57s{code}
>  
> PROPOSED SOLUTION:
> Properly prefix the podspec-configmap for each submitted app, ideally the 
> same way as the driver configmap:
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   11m43s
> default  app1--podspec-configmap1   13m57s
> default  app2--driver-conf-map  1   10s 
> default  app2--podspec-configmap1   3m{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32067) Use unique ConfigMap name for executor pod template

2020-10-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32067.
---
Fix Version/s: 3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 29934
[https://github.com/apache/spark/pull/29934]

> Use unique ConfigMap name for executor pod template
> ---
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.1, 3.1.0
>Reporter: James Yu
>Assignee: Stijn De Haes
>Priority: Major
> Fix For: 3.1.0, 3.0.2
>
>
> THE BUG:
> The bug is reproducible by spark-submitting two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially, with app2 launching while app1 is still in the middle of 
> ramping up all its executor pods. The unwanted result is that some launched 
> executor pods of app1 end up having app2's executor pod template applied to 
> them.
> The root cause appears to be that app1's podspec-configmap got overwritten by 
> app2 during the overlapping launching periods because both apps use the same 
> ConfigMap (name). This causes some of app1's executor pods, ramped up after 
> app2 is launched, to be inadvertently launched with app2's pod template. 
> The issue can be seen as follows:
> First, after submitting app1, you get these configmaps:
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   9m46s
> default  podspec-configmap  1   12m{code}
> Then submit app2 while app1 is still ramping up its executors. The 
> podspec-configmap is modified by app2.
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   11m43s
> default  app2--driver-conf-map  1   10s
> default  podspec-configmap  1   13m57s{code}
>  
> PROPOSED SOLUTION:
> Properly prefix the podspec-configmap for each submitted app, ideally the 
> same way as the driver configmap:
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   11m43s
> default  app1--podspec-configmap1   13m57s
> default  app2--driver-conf-map  1   10s 
> default  app2--podspec-configmap1   3m{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-32714) Port pyspark-stubs

2020-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32714:
-
Comment: was deleted

(was: User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/29969)

> Port pyspark-stubs
> --
>
> Key: SPARK-32714
> URL: https://issues.apache.org/jira/browse/SPARK-32714
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> Port https://github.com/zero323/pyspark-stubs into PySpark. This was being 
> discussed in dev mailing list. See also 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-32714) Port pyspark-stubs

2020-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32714:
-
Comment: was deleted

(was: User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/29969)

> Port pyspark-stubs
> --
>
> Key: SPARK-32714
> URL: https://issues.apache.org/jira/browse/SPARK-32714
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> Port https://github.com/zero323/pyspark-stubs into PySpark. This was being 
> discussed in dev mailing list. See also 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-32714) Port pyspark-stubs

2020-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32714:
-
Comment: was deleted

(was: https://github.com/apache/spark/pull/29591)

> Port pyspark-stubs
> --
>
> Key: SPARK-32714
> URL: https://issues.apache.org/jira/browse/SPARK-32714
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> Port https://github.com/zero323/pyspark-stubs into PySpark. This was being 
> discussed in dev mailing list. See also 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-32714) Port pyspark-stubs

2020-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32714:
-
Comment: was deleted

(was: User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/29969)

> Port pyspark-stubs
> --
>
> Key: SPARK-32714
> URL: https://issues.apache.org/jira/browse/SPARK-32714
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> Port https://github.com/zero323/pyspark-stubs into PySpark. This was being 
> discussed on the dev mailing list. See also 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33086) Provide static annotations for pyspark.resource.* modules

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209640#comment-17209640
 ] 

Apache Spark commented on SPARK-33086:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/29969

> Provide static annotations for pyspark.resource.* modules
> --
>
> Key: SPARK-33086
> URL: https://issues.apache.org/jira/browse/SPARK-33086
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> At the point of the port, {{pyspark.resource}} had only dynamic annotations 
> generated using {{stubgen}}.
> Since they are a part of a public API, we should provide static annotations 
> instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33086) Provide static annotations for pyspark.resource.* modules

2020-10-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33086:


Assignee: Apache Spark

> Provide static annotations for pyspark.resource.* modules
> --
>
> Key: SPARK-33086
> URL: https://issues.apache.org/jira/browse/SPARK-33086
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Minor
>
> At the point of the port, {{pyspark.resource}} had only dynamic annotations 
> generated using {{stubgen}}.
> Since they are a part of a public API, we should provide static annotations 
> instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33086) Provide static annotations for pyspark.resource.* modules

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209638#comment-17209638
 ] 

Apache Spark commented on SPARK-33086:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/29969

> Provide static annotations for pyspark.resource.* modules
> --
>
> Key: SPARK-33086
> URL: https://issues.apache.org/jira/browse/SPARK-33086
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> At the point of the port, {{pyspark.resource}} had only dynamic annotations 
> generated using {{stubgen}}.
> Since they are a part of a public API, we should provide static annotations 
> instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33086) Provide static annotations for pyspark.resource.* modules

2020-10-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33086:


Assignee: (was: Apache Spark)

> Provide static annotations for pyspark.resource.* modules
> --
>
> Key: SPARK-33086
> URL: https://issues.apache.org/jira/browse/SPARK-33086
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> At the point of the port, {{pyspark.resource}} had only dynamic annotations 
> generated using {{stubgen}}.
> Since they are a part of a public API, we should provide static annotations 
> instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33086) Provide static annotations for pyspark.resource.* modules

2020-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33086:
-
Parent: SPARK-32681
Issue Type: Sub-task  (was: Improvement)

> Provide static annotations for pyspark.resource.* modules
> --
>
> Key: SPARK-33086
> URL: https://issues.apache.org/jira/browse/SPARK-33086
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> At the point of port {{pyspark.resource}} had only dynamic annotations 
> generated using {{stubgen}}.
> Since they are a part of a public API, we should provide static annotations 
> instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33086) Provide static annotations for pyspark.resource.* modules

2020-10-07 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-33086:
--

 Summary: Provide static annotations for pyspark.resource.* modules
 Key: SPARK-33086
 URL: https://issues.apache.org/jira/browse/SPARK-33086
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.1.0
Reporter: Maciej Szymkiewicz


At the point of the port, {{pyspark.resource}} had only dynamic annotations 
generated using {{stubgen}}.

Since they are a part of a public API, we should provide static annotations 
instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32714) Port pyspark-stubs

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209635#comment-17209635
 ] 

Apache Spark commented on SPARK-32714:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/29969

> Port pyspark-stubs
> --
>
> Key: SPARK-32714
> URL: https://issues.apache.org/jira/browse/SPARK-32714
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> Port https://github.com/zero323/pyspark-stubs into PySpark. This was being 
> discussed on the dev mailing list. See also 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32714) Port pyspark-stubs

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209634#comment-17209634
 ] 

Apache Spark commented on SPARK-32714:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/29969

> Port pyspark-stubs
> --
>
> Key: SPARK-32714
> URL: https://issues.apache.org/jira/browse/SPARK-32714
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> Port https://github.com/zero323/pyspark-stubs into PySpark. This was being 
> discussed on the dev mailing list. See also 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32982) Remove hive-1.2 profiles in PIP installation option

2020-10-07 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209633#comment-17209633
 ] 

Hyukjin Kwon commented on SPARK-32982:
--

 https://github.com/apache/spark/pull/29878 was a follow-up.

> Remove hive-1.2 profiles in PIP installation option
> ---
>
> Key: SPARK-32982
> URL: https://issues.apache.org/jira/browse/SPARK-32982
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Hive 1.2 is a fork that we should remove. It's best not to expose this 
> distribution via pip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32714) Port pyspark-stubs

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209628#comment-17209628
 ] 

Apache Spark commented on SPARK-32714:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/29969

> Port pyspark-stubs
> --
>
> Key: SPARK-32714
> URL: https://issues.apache.org/jira/browse/SPARK-32714
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> Port https://github.com/zero323/pyspark-stubs into PySpark. This was being 
> discussed on the dev mailing list. See also 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26499) JdbcUtils.makeGetter does not handle ByteType

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209612#comment-17209612
 ] 

Apache Spark commented on SPARK-26499:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29968

> JdbcUtils.makeGetter does not handle ByteType
> -
>
> Key: SPARK-26499
> URL: https://issues.apache.org/jira/browse/SPARK-26499
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Thomas D'Silva
>Assignee: Thomas D'Silva
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> I am trying to use the DataSource V2 API to read from a JDBC source. While 
> using {{JdbcUtils.resultSetToSparkInternalRows}} to create an internal row 
> from a ResultSet that has a column of type TINYINT, I ran into the following 
> exception:
> {code:java}
> java.lang.IllegalArgumentException: Unsupported type tinyint
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter(JdbcUtils.scala:502)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters$1.apply(JdbcUtils.scala:379)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters$1.apply(JdbcUtils.scala:379)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters(JdbcUtils.scala:379)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.<init>(JdbcUtils.scala:340)
> {code}
> This happens because ByteType is not handled in {{JdbcUtils.makeGetter}}.
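
For illustration, here is a hedged sketch of the shape such a branch could take, assuming the per-type getter style ((ResultSet, InternalRow, Int) => Unit) used by the other cases in JdbcUtils; the names, the extra IntegerType case, and the details are illustrative only, not the actual fix:

{code:scala}
import java.sql.ResultSet
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types._

// Sketch only: the real JdbcUtils.makeGetter covers many more types and may
// differ in details. This shows what a ByteType (JDBC TINYINT) branch could
// look like alongside an existing numeric case.
def makeGetterSketch(dt: DataType): (ResultSet, InternalRow, Int) => Unit = dt match {
  case ByteType =>
    (rs: ResultSet, row: InternalRow, pos: Int) =>
      row.setByte(pos, rs.getByte(pos + 1)) // JDBC columns are 1-based
  case IntegerType =>
    (rs: ResultSet, row: InternalRow, pos: Int) =>
      row.setInt(pos, rs.getInt(pos + 1))
  case _ =>
    throw new IllegalArgumentException(s"Unsupported type ${dt.catalogString}")
}
{code}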



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33085) "Master removed our application" error leads to FAILED driver status instead of KILLED driver status

2020-10-07 Thread t oo (Jira)
t oo created SPARK-33085:


 Summary: "Master removed our application" error leads to FAILED 
driver status instead of KILLED driver status
 Key: SPARK-33085
 URL: https://issues.apache.org/jira/browse/SPARK-33085
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core
Affects Versions: 2.4.6
Reporter: t oo


 

driver-20200930160855-0316 exited with status FAILED

 

I am using the Spark Standalone scheduler with spot EC2 workers. I confirmed that 
the myip.87 EC2 instance was terminated at 2020-09-30 16:16.

*I would expect the overall driver status to be KILLED, but instead it was 
FAILED.* My goal is to interpret a FAILED status as 'don't rerun; a non-transient 
error was hit' and a KILLED/ERROR status as 'yes, rerun; a transient error was 
hit'. But it looks like the FAILED status is being set in the transient-error 
case below:

Below are the driver logs:
{code:java}
2020-09-30 16:12:41,183 [main] INFO  com.yotpo.metorikku.output.writers.file.FileOutputWriter - Writing file to s3a://redacted
2020-09-30 16:12:41,183 [main] INFO  com.yotpo.metorikku.output.writers.file.FileOutputWriter - Writing file to s3a://redacted
2020-09-30 16:16:40,366 [dispatcher-event-loop-15] ERROR org.apache.spark.scheduler.TaskSchedulerImpl - Lost executor 0 on myip.87: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
2020-09-30 16:16:40,372 [dispatcher-event-loop-15] WARN  org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 6.0 (TID 6, myip.87, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
2020-09-30 16:16:40,376 [dispatcher-event-loop-13] WARN  org.apache.spark.storage.BlockManagerMasterEndpoint - No more replicas available for rdd_3_0 !
2020-09-30 16:16:40,398 [dispatcher-event-loop-2] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/0 removed: Worker shutting down
2020-09-30 16:16:40,399 [dispatcher-event-loop-2] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/1 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
2020-09-30 16:16:40,401 [dispatcher-event-loop-5] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/1 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
2020-09-30 16:16:40,402 [dispatcher-event-loop-5] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/2 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
2020-09-30 16:16:40,403 [dispatcher-event-loop-11] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/2 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
2020-09-30 16:16:40,404 [dispatcher-event-loop-11] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/3 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
2020-09-30 16:16:40,405 [dispatcher-event-loop-1] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/3 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
2020-09-30 16:16:40,406 [dispatcher-event-loop-1] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/4 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
2020-09-30 16:16:40,407 [dispatcher-event-loop-12] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/4 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
2020-09-30 16:16:40,408 [dispatcher-event-loop-12] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/5 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
2020-09-30 16:16:40,409 [dispatcher-event-loop-4] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/5 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
2020-09-30 16:16:40,410 [dispatcher-event-loop-5] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/6 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
2020-09-30 16:16:40,420 [dispatcher-event-loop-9] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/6 removed: 

[jira] [Commented] (SPARK-32511) Add dropFields method to Column class

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209504#comment-17209504
 ] 

Apache Spark commented on SPARK-32511:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/29967

> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Assignee: fqaiser94
>Priority: Major
> Fix For: 3.1.0
>
>
> Based on the discussions in the parent ticket (SPARK-22231), add a new 
> {{dropFields}} method to the {{Column}} class. 
> This method should allow users to drop a column nested inside a StructType 
> Column (with similar semantics to the existing {{drop}} method on 
> {{Dataset}}).
> It should also be able to handle deeply nested columns through the same API. 
> This is similar to the {{withField}} method that was recently added in 
> SPARK-31317, and we can likely reuse some of that "infrastructure."
> The public-facing method signature should be something along the following 
> lines: 
> {noformat}
> def dropFields(fieldNames: String*): Column
> {noformat}
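
For illustration, a usage sketch of what such an API could look like on a struct column, based on the signature quoted above; the data, column names, and `spark` session here are assumptions for the example, not part of the ticket:

{code:scala}
import org.apache.spark.sql.functions._

// A single-row DataFrame with a struct column "s" containing fields "a" and "b".
val df = spark.range(1).select(struct(lit(1).as("a"), lit(2).as("b")).as("s"))

// Hypothetical usage: drop the nested field "b" from struct column "s",
// analogous to Dataset.drop but for fields inside a StructType column.
val result = df.withColumn("s", col("s").dropFields("b"))

result.printSchema() // only field "a" should remain under "s"
{code}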



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33084) Add jar support ivy path

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209493#comment-17209493
 ] 

Apache Spark commented on SPARK-33084:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/29966

> Add jar support ivy path
> 
>
> Key: SPARK-33084
> URL: https://issues.apache.org/jira/browse/SPARK-33084
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
> Support ADD JAR with an Ivy path.
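
To make the intent concrete, a hedged sketch of the kind of usage this appears to aim for, assuming an {{ivy://}} URI scheme for {{ADD JAR}}; the coordinates and exact syntax are assumptions, not confirmed by this ticket:

{code:scala}
// Hypothetical usage: resolve a jar from Ivy/Maven coordinates instead of a
// local or HDFS path. The coordinates below are made up for illustration.
spark.sql("ADD JAR ivy://org.example:example-lib:1.0.0")

// Today the equivalent requires a concrete path to an already-downloaded jar:
spark.sql("ADD JAR /path/to/example-lib-1.0.0.jar")
{code}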



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33084) Add jar support ivy path

2020-10-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33084:


Assignee: Apache Spark

> Add jar support ivy path
> 
>
> Key: SPARK-33084
> URL: https://issues.apache.org/jira/browse/SPARK-33084
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> Support ADD JAR with an Ivy path.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33084) Add jar support ivy path

2020-10-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33084:


Assignee: (was: Apache Spark)

> Add jar support ivy path
> 
>
> Key: SPARK-33084
> URL: https://issues.apache.org/jira/browse/SPARK-33084
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
> Support ADD JAR with an Ivy path.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33036) Refactor RewriteCorrelatedScalarSubquery code to replace exprIds in a bottom-up manner

2020-10-07 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-33036.
--
Fix Version/s: 3.1.0
 Assignee: Takeshi Yamamuro
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/29913

> Refactor RewriteCorrelatedScalarSubquery code to replace exprIds in a 
> bottom-up manner
> --
>
> Key: SPARK-33036
> URL: https://issues.apache.org/jira/browse/SPARK-33036
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.1.0
>
>
> This PR aims to refactor code in `RewriteCorrelatedScalarSubquery` to replace 
> `ExprId`s in a bottom-up manner instead of a top-down one.
> This PR comes from the talk with @cloud-fan in 
> https://github.com/apache/spark/pull/29585#discussion_r490371252.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33084) Add jar support ivy path

2020-10-07 Thread angerszhu (Jira)
angerszhu created SPARK-33084:
-

 Summary: Add jar support ivy path
 Key: SPARK-33084
 URL: https://issues.apache.org/jira/browse/SPARK-33084
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.1.0
Reporter: angerszhu


Support ADD JAR with an Ivy path.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33002) Post-port removal of non-API stubs

2020-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33002:


Assignee: Maciej Szymkiewicz

> Post-port removal of non-API stubs
> --
>
> Key: SPARK-33002
> URL: https://issues.apache.org/jira/browse/SPARK-33002
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> To simplify the initial port, we merged all existing stubs.
> However, some of these cover non-API components and are usually dynamically 
> annotated (generated with stubgen).
> This includes modules like {{serializers}}, {{utils}}, {{shell}}, {{worker}}, 
> etc.
> These can be safely removed as:
> - MyPy can infer types from the source where a stub is not present.
> - They no longer provide value when the corresponding modules are present in 
> the same directory structure.
> - Annotations are here primarily to help end users, not Spark developers, and 
> many of the annotations cannot be meaningfully refined.
> It should also reduce the overhead of maintaining annotations (especially in 
> places where we don't guarantee signature stability).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33002) Post-port removal of non-API stubs

2020-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33002.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29879
[https://github.com/apache/spark/pull/29879]

> Post-port removal of non-API stubs
> --
>
> Key: SPARK-33002
> URL: https://issues.apache.org/jira/browse/SPARK-33002
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> To simplify the initial port, we merged all existing stubs.
> However, some of these cover non-API components and are usually dynamically 
> annotated (generated with stubgen).
> This includes modules like {{serializers}}, {{utils}}, {{shell}}, {{worker}}, 
> etc.
> These can be safely removed as:
> - MyPy can infer types from the source where a stub is not present.
> - They no longer provide value when the corresponding modules are present in 
> the same directory structure.
> - Annotations are here primarily to help end users, not Spark developers, and 
> many of the annotations cannot be meaningfully refined.
> It should also reduce the overhead of maintaining annotations (especially in 
> places where we don't guarantee signature stability).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33003) Add type hints guidelines to the documentation

2020-10-07 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209430#comment-17209430
 ] 

Hyukjin Kwon commented on SPARK-33003:
--

If you think it's useful to write some guides for users as well, it likely is. 
Please feel free to go ahead. :-)

> Add type hints guidelines to the documentation
> ---
>
> Key: SPARK-33003
> URL: https://issues.apache.org/jira/browse/SPARK-33003
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33003) Add type hints guidelines to the documentation

2020-10-07 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209428#comment-17209428
 ] 

Hyukjin Kwon commented on SPARK-33003:
--

You mean the latter is a must-have(?). Yeah, I think just doing it for devs 
is enough for now.

> Add type hints guidelines to the documentation
> ---
>
> Key: SPARK-33003
> URL: https://issues.apache.org/jira/browse/SPARK-33003
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path

2020-10-07 Thread Sachin Pasalkar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209425#comment-17209425
 ] 

Sachin Pasalkar commented on SPARK-30542:
-

[~kabhwan] Can't we make this configurable?

> Two Spark structured streaming jobs cannot write to same base path
> --
>
> Key: SPARK-30542
> URL: https://issues.apache.org/jira/browse/SPARK-30542
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Sivakumar
>Priority: Major
>
> Hi All,
> Spark Structured Streaming doesn't allow two structured streaming jobs to 
> write data to the same base directory, which is possible when using DStreams.
> Because the _spark_metadata directory is created by default for one job, a 
> second job cannot use the same directory as its base path; the 
> _spark_metadata directory has already been created by the other job, so it 
> throws an exception.
> Is there any workaround for this, other than creating separate base paths for 
> both jobs?
> Is it possible to create the _spark_metadata directory elsewhere, or to 
> disable it without any data loss?
> If I had to change the base path for both jobs, my whole framework would be 
> impacted, so I don't want to do that.
>  
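
For reference, a minimal sketch of the workaround the reporter wants to avoid: giving each query its own base path and checkpoint location, so each creates its own _spark_metadata directory. The paths, the toy "rate" source, and the `spark` session are assumptions for illustration only:

{code:scala}
// Two toy streaming inputs; in practice these would be the two jobs' sources.
val df1 = spark.readStream.format("rate").load()
val df2 = spark.readStream.format("rate").load()

// Each query writes under its own base path, so the _spark_metadata
// directories do not collide.
val q1 = df1.writeStream
  .format("parquet")
  .option("checkpointLocation", "/tmp/checkpoints/job1")
  .start("/tmp/events/job1") // base path for job 1

val q2 = df2.writeStream
  .format("parquet")
  .option("checkpointLocation", "/tmp/checkpoints/job2")
  .start("/tmp/events/job2") // base path for job 2
{code}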



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33016) Potential SQLMetrics missed which might cause WEB UI display issue while AQE is on.

2020-10-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33016:


Assignee: Apache Spark

> Potential SQLMetrics missed which might cause WEB UI display issue while AQE 
> is on.
> ---
>
> Key: SPARK-33016
> URL: https://issues.apache.org/jira/browse/SPARK-33016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Assignee: Apache Spark
>Priority: Minor
>
> In the current AQE execution, the following scenario might cause SQLMetrics 
> to be incorrectly overridden.
>  # Stage A and B are created, and UI updated thru event 
> onAdaptiveExecutionUpdate.
>  # Stage A and B are running. Subquery in stage A keep updating metrics thru 
> event onAdaptiveSQLMetricUpdate.
>  # Stage B completes, while stage A's subquery is still running, updating 
> metrics.
>  # Completion of stage B triggers new stage creation and UI update thru event 
> onAdaptiveExecutionUpdate again (just like step 1).
>  
> But it's very hard to reproduce this issue, since it only happens under high 
> concurrency. For the fix, I suggest that we might be able to keep all 
> duplicated metrics instead of overwriting them every time.
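
This is not Spark internals, only a toy illustration of the suggested direction: record every reported metric update instead of keeping a last-writer-wins map, so a late execution update cannot clobber values reported earlier. All names below are made up:

{code:scala}
import scala.collection.mutable

// Toy model of the idea; structure and naming are illustrative only.
final case class MetricUpdate(accumulatorId: Long, value: Long)

class MetricsBuffer {
  // Keep every update rather than overwriting a single map entry, so
  // concurrent update/replace events cannot silently drop earlier values.
  private val updates = mutable.ArrayBuffer.empty[MetricUpdate]

  def record(u: MetricUpdate): Unit = synchronized { updates += u }

  // Aggregate lazily when a value is requested (here: latest per accumulator).
  def latest: Map[Long, Long] = synchronized {
    updates.groupBy(_.accumulatorId).map { case (id, us) => id -> us.last.value }
  }
}
{code}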



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33016) Potential SQLMetrics missed which might cause WEB UI display issue while AQE is on.

2020-10-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33016:


Assignee: (was: Apache Spark)

> Potential SQLMetrics missed which might cause WEB UI display issue while AQE 
> is on.
> ---
>
> Key: SPARK-33016
> URL: https://issues.apache.org/jira/browse/SPARK-33016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> In the current AQE execution, the following scenario might cause SQLMetrics 
> to be incorrectly overridden.
>  # Stage A and B are created, and UI updated thru event 
> onAdaptiveExecutionUpdate.
>  # Stage A and B are running. Subquery in stage A keep updating metrics thru 
> event onAdaptiveSQLMetricUpdate.
>  # Stage B completes, while stage A's subquery is still running, updating 
> metrics.
>  # Completion of stage B triggers new stage creation and UI update thru event 
> onAdaptiveExecutionUpdate again (just like step 1).
>  
> But it's very hard to reproduce this issue, since it only happens under high 
> concurrency. For the fix, I suggest that we might be able to keep all 
> duplicated metrics instead of overwriting them every time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33016) Potential SQLMetrics missed which might cause WEB UI display issue while AQE is on.

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209423#comment-17209423
 ] 

Apache Spark commented on SPARK-33016:
--

User 'leanken' has created a pull request for this issue:
https://github.com/apache/spark/pull/29965

> Potential SQLMetrics missed which might cause WEB UI display issue while AQE 
> is on.
> ---
>
> Key: SPARK-33016
> URL: https://issues.apache.org/jira/browse/SPARK-33016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> In the current AQE execution, the following scenario might cause SQLMetrics 
> to be incorrectly overridden.
>  # Stage A and B are created, and UI updated thru event 
> onAdaptiveExecutionUpdate.
>  # Stage A and B are running. Subquery in stage A keep updating metrics thru 
> event onAdaptiveSQLMetricUpdate.
>  # Stage B completes, while stage A's subquery is still running, updating 
> metrics.
>  # Completion of stage B triggers new stage creation and UI update thru event 
> onAdaptiveExecutionUpdate again (just like step 1).
>  
> But it's very hard to reproduce this issue, since it only happens under high 
> concurrency. For the fix, I suggest that we might be able to keep all 
> duplicated metrics instead of overwriting them every time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33016) Potential SQLMetrics missed which might cause WEB UI display issue while AQE is on.

2020-10-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209424#comment-17209424
 ] 

Apache Spark commented on SPARK-33016:
--

User 'leanken' has created a pull request for this issue:
https://github.com/apache/spark/pull/29965

> Potential SQLMetrics missed which might cause WEB UI display issue while AQE 
> is on.
> ---
>
> Key: SPARK-33016
> URL: https://issues.apache.org/jira/browse/SPARK-33016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> In the current AQE execution, the following scenario might cause SQLMetrics 
> to be incorrectly overridden.
>  # Stage A and B are created, and UI updated thru event 
> onAdaptiveExecutionUpdate.
>  # Stage A and B are running. Subquery in stage A keep updating metrics thru 
> event onAdaptiveSQLMetricUpdate.
>  # Stage B completes, while stage A's subquery is still running, updating 
> metrics.
>  # Completion of stage B triggers new stage creation and UI update thru event 
> onAdaptiveExecutionUpdate again (just like step 1).
>  
> But it's very hard to reproduce this issue, since it only happens under high 
> concurrency. For the fix, I suggest that we might be able to keep all 
> duplicated metrics instead of overwriting them every time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


