[jira] [Commented] (SPARK-24554) Add MapType Support for Arrow in PySpark

2020-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233307#comment-17233307
 ] 

Apache Spark commented on SPARK-24554:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/30393

> Add MapType Support for Arrow in PySpark
> 
>
> Key: SPARK-24554
> URL: https://issues.apache.org/jira/browse/SPARK-24554
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Priority: Major
>  Labels: bulk-closed
>
> Add support for MapType in Arrow related classes in Scala/Java and pyarrow 
> functionality in Python.
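
For context, a minimal PySpark sketch of the kind of round trip this covers once the Arrow converters handle map columns (the config key and the pandas-side representation of the map values are assumptions, not something fixed by this ticket):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import create_map, lit

spark = (SparkSession.builder
         # assumed Spark 3.x key; older releases used spark.sql.execution.arrow.enabled
         .config("spark.sql.execution.arrow.pyspark.enabled", "true")
         .getOrCreate())

# A single MapType(StringType, LongType) column built from the id column.
df = spark.range(3).select(create_map(lit("id"), "id").alias("m"))

# With MapType handled in the Arrow-related classes, this conversion can stay on
# the Arrow path instead of falling back to row-by-row conversion.
pdf = df.toPandas()
print(pdf["m"].iloc[0])
{code}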



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24554) Add MapType Support for Arrow in PySpark

2020-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24554:


Assignee: Apache Spark

> Add MapType Support for Arrow in PySpark
> 
>
> Key: SPARK-24554
> URL: https://issues.apache.org/jira/browse/SPARK-24554
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Major
>  Labels: bulk-closed
>
> Add support for MapType in Arrow related classes in Scala/Java and pyarrow 
> functionality in Python.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24554) Add MapType Support for Arrow in PySpark

2020-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24554:


Assignee: (was: Apache Spark)

> Add MapType Support for Arrow in PySpark
> 
>
> Key: SPARK-24554
> URL: https://issues.apache.org/jira/browse/SPARK-24554
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Priority: Major
>  Labels: bulk-closed
>
> Add support for MapType in Arrow related classes in Scala/Java and pyarrow 
> functionality in Python.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33465) RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead

2020-11-16 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-33465:
---

 Summary: RDD.takeOrdered should get rid of usage of reduce or use 
treeReduce instead
 Key: SPARK-33465
 URL: https://issues.apache.org/jira/browse/SPARK-33465
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


{{RDD.takeOrdered}} sorts the elements in each partition and puts them into a {{BoundedPriorityQueue}}, so in the resulting RDD each partition holds exactly one priority queue. The API then calls {{RDD.reduce}} to merge these queues, but since each partition contains only a single queue, a full {{reduce}} over the elements (where each element is a queue) does not make much sense.

We should either simplify the {{RDD.reduce}} call in {{RDD.takeOrdered}} or replace it with {{treeReduce}}, which can actually perform partial reduction in this case.
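
A rough PySpark sketch of the {{treeReduce}}-shaped variant (illustrative only; the real implementation is the Scala {{RDD.takeOrdered}}, and the helper name below is hypothetical):

{code:python}
import heapq

def take_ordered_via_tree(rdd, num, key=lambda x: x):
    # One bounded "queue" per partition: the num smallest elements of that partition.
    def per_partition_top(iterator):
        yield heapq.nsmallest(num, iterator, key=key)

    # Merge two per-partition results, keeping only num elements.
    def merge(a, b):
        return heapq.nsmallest(num, a + b, key=key)

    # treeReduce merges the per-partition results in a tree pattern instead of a
    # single reduce that ships every partition's queue straight to the driver.
    return rdd.mapPartitions(per_partition_top).treeReduce(merge)

# e.g. take_ordered_via_tree(sc.parallelize(range(100), 8), 5) == [0, 1, 2, 3, 4]
{code}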



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33465) RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead

2020-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33465:


Assignee: L. C. Hsieh  (was: Apache Spark)

> RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead
> ---
>
> Key: SPARK-33465
> URL: https://issues.apache.org/jira/browse/SPARK-33465
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> {{RDD.takeOrdered}} API sorts elements in each partition and puts the sorted 
> elements into a `BoundedPriorityQueue`.  So actually in the result RDD each 
> partition has only one priority queue. Then the API calls {{RDD.reduce}} API 
> to reduce the elements. But as mentioned before the RDD has only one queue at 
> each partition, it doesn't make sense to call reduce to reduce elements (here 
> the element is queue).
> We should either simplify {{RDD.reduce}} call in {{RDD.takeOrdered}} or 
> replace it with {{treeReduce}} which can actually do partially reducing for 
> this case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33465) RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead

2020-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233285#comment-17233285
 ] 

Apache Spark commented on SPARK-33465:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/30392

> RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead
> ---
>
> Key: SPARK-33465
> URL: https://issues.apache.org/jira/browse/SPARK-33465
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> {{RDD.takeOrdered}} API sorts elements in each partition and puts the sorted 
> elements into a `BoundedPriorityQueue`.  So actually in the result RDD each 
> partition has only one priority queue. Then the API calls {{RDD.reduce}} API 
> to reduce the elements. But as mentioned before the RDD has only one queue at 
> each partition, it doesn't make sense to call reduce to reduce elements (here 
> the element is queue).
> We should either simplify {{RDD.reduce}} call in {{RDD.takeOrdered}} or 
> replace it with {{treeReduce}} which can actually do partially reducing for 
> this case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33465) RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead

2020-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33465:


Assignee: Apache Spark  (was: L. C. Hsieh)

> RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead
> ---
>
> Key: SPARK-33465
> URL: https://issues.apache.org/jira/browse/SPARK-33465
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> {{RDD.takeOrdered}} API sorts elements in each partition and puts the sorted 
> elements into a `BoundedPriorityQueue`.  So actually in the result RDD each 
> partition has only one priority queue. Then the API calls {{RDD.reduce}} API 
> to reduce the elements. But as mentioned before the RDD has only one queue at 
> each partition, it doesn't make sense to call reduce to reduce elements (here 
> the element is queue).
> We should either simplify {{RDD.reduce}} call in {{RDD.takeOrdered}} or 
> replace it with {{treeReduce}} which can actually do partially reducing for 
> this case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33465) RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead

2020-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233284#comment-17233284
 ] 

Apache Spark commented on SPARK-33465:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/30392

> RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead
> ---
>
> Key: SPARK-33465
> URL: https://issues.apache.org/jira/browse/SPARK-33465
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> {{RDD.takeOrdered}} API sorts elements in each partition and puts the sorted 
> elements into a `BoundedPriorityQueue`.  So actually in the result RDD each 
> partition has only one priority queue. Then the API calls {{RDD.reduce}} API 
> to reduce the elements. But as mentioned before the RDD has only one queue at 
> each partition, it doesn't make sense to call reduce to reduce elements (here 
> the element is queue).
> We should either simplify {{RDD.reduce}} call in {{RDD.takeOrdered}} or 
> replace it with {{treeReduce}} which can actually do partially reducing for 
> this case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33395) Spark reading data in scientific notation

2020-11-16 Thread Nilesh Patil (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233279#comment-17233279
 ] 

Nilesh Patil edited comment on SPARK-33395 at 11/17/20, 5:57 AM:
-

With a decimal type, the scale and precision are fixed across all rows, whereas here each row may have a different scale and precision. So decimal will not solve our problem.


was (Author: nileshpatil1992):
With a decimal type, the scale and precision are fixed across all rows, whereas here each row may have a different scale and precision.

> Spark reading data in scientific notation
> -
>
> Key: SPARK-33395
> URL: https://issues.apache.org/jira/browse/SPARK-33395
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.8, 3.1.0
>Reporter: Nilesh Patil
>Priority: Major
>
> File is having below data
> DAta
> 1200404151072.1211
> 1200404151073
> 1200404151074.1232323
> 1200404151075.124344
> 1200404151076.12
> 1200404151077.12343
> 1200404151078.12
> 1200404151079.12544545454554
> 1251080.123444
> 1
>  
> Spark is reading with scientific notation as we wanted to read data as it is 
> available in file with accurate datatype not with string datatype.
> ++
> | DAta|
> ++
> |1.200404151072121E12|
> | 1.200404151073E12|
> |1.200404151074123...|
> |1.200404151075124...|
> | 1.20040415107612E12|
> |1.200404151077123...|
> | 1.20040415107812E12|
> |1.200404151079125...|
> | 1251080.123445|
> | 1.0E28|
> +
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33395) Spark reading data in scientific notation

2020-11-16 Thread Nilesh Patil (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233279#comment-17233279
 ] 

Nilesh Patil commented on SPARK-33395:
--

With a decimal type, the scale and precision are fixed across all rows, whereas here each row may have a different scale and precision.

> Spark reading data in scientific notation
> -
>
> Key: SPARK-33395
> URL: https://issues.apache.org/jira/browse/SPARK-33395
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.8, 3.1.0
>Reporter: Nilesh Patil
>Priority: Major
>
> File is having below data
> DAta
> 1200404151072.1211
> 1200404151073
> 1200404151074.1232323
> 1200404151075.124344
> 1200404151076.12
> 1200404151077.12343
> 1200404151078.12
> 1200404151079.12544545454554
> 1251080.123444
> 1
>  
> Spark is reading with scientific notation as we wanted to read data as it is 
> available in file with accurate datatype not with string datatype.
> ++
> | DAta|
> ++
> |1.200404151072121E12|
> | 1.200404151073E12|
> |1.200404151074123...|
> |1.200404151075124...|
> | 1.20040415107612E12|
> |1.200404151077123...|
> | 1.20040415107812E12|
> |1.200404151079125...|
> | 1251080.123445|
> | 1.0E28|
> +
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33395) Spark reading data in scientific notation

2020-11-16 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233276#comment-17233276
 ] 

Takeshi Yamamuro commented on SPARK-33395:
--

How about using a decimal type instead? 
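
For example, a minimal PySpark sketch with an explicit decimal schema (the precision/scale and file name here are only placeholders):

{code:python}
from pyspark.sql.types import StructType, StructField, DecimalType

# Read the column as decimal(38, 14) instead of letting inference pick double;
# 38/14 are assumed bounds wide enough for the sample values in this ticket.
schema = StructType([StructField("DAta", DecimalType(38, 14), True)])
df = spark.read.csv("data.csv", header=True, schema=schema)
df.show(truncate=False)   # values print in plain decimal form, not scientific notation
{code}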

> Spark reading data in scientific notation
> -
>
> Key: SPARK-33395
> URL: https://issues.apache.org/jira/browse/SPARK-33395
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.8, 3.1.0
>Reporter: Nilesh Patil
>Priority: Major
>
> File is having below data
> DAta
> 1200404151072.1211
> 1200404151073
> 1200404151074.1232323
> 1200404151075.124344
> 1200404151076.12
> 1200404151077.12343
> 1200404151078.12
> 1200404151079.12544545454554
> 1251080.123444
> 1
>  
> Spark is reading with scientific notation as we wanted to read data as it is 
> available in file with accurate datatype not with string datatype.
> ++
> | DAta|
> ++
> |1.200404151072121E12|
> | 1.200404151073E12|
> |1.200404151074123...|
> |1.200404151075124...|
> | 1.20040415107612E12|
> |1.200404151077123...|
> | 1.20040415107812E12|
> |1.200404151079125...|
> | 1251080.123445|
> | 1.0E28|
> +
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33395) Spark reading data in scientific notation

2020-11-16 Thread Nilesh Patil (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233271#comment-17233271
 ] 

Nilesh Patil commented on SPARK-33395:
--

Is there any method implementation that would preserve the datatype but display the data in its original form? This issue has been reported by three of our clients in production, so we would like to treat it as a high priority.

> Spark reading data in scientific notation
> -
>
> Key: SPARK-33395
> URL: https://issues.apache.org/jira/browse/SPARK-33395
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.8, 3.1.0
>Reporter: Nilesh Patil
>Priority: Major
>
> File is having below data
> DAta
> 1200404151072.1211
> 1200404151073
> 1200404151074.1232323
> 1200404151075.124344
> 1200404151076.12
> 1200404151077.12343
> 1200404151078.12
> 1200404151079.12544545454554
> 1251080.123444
> 1
>  
> Spark is reading with scientific notation as we wanted to read data as it is 
> available in file with accurate datatype not with string datatype.
> ++
> | DAta|
> ++
> |1.200404151072121E12|
> | 1.200404151073E12|
> |1.200404151074123...|
> |1.200404151075124...|
> | 1.20040415107612E12|
> |1.200404151077123...|
> | 1.20040415107812E12|
> |1.200404151079125...|
> | 1251080.123445|
> | 1.0E28|
> +
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33395) Spark reading data in scientific notation

2020-11-16 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-33395:
-
Issue Type: Improvement  (was: Bug)

> Spark reading data in scientific notation
> -
>
> Key: SPARK-33395
> URL: https://issues.apache.org/jira/browse/SPARK-33395
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.8, 3.1.0
>Reporter: Nilesh Patil
>Priority: Critical
>
> File is having below data
> DAta
> 1200404151072.1211
> 1200404151073
> 1200404151074.1232323
> 1200404151075.124344
> 1200404151076.12
> 1200404151077.12343
> 1200404151078.12
> 1200404151079.12544545454554
> 1251080.123444
> 1
>  
> Spark is reading with scientific notation as we wanted to read data as it is 
> available in file with accurate datatype not with string datatype.
> ++
> | DAta|
> ++
> |1.200404151072121E12|
> | 1.200404151073E12|
> |1.200404151074123...|
> |1.200404151075124...|
> | 1.20040415107612E12|
> |1.200404151077123...|
> | 1.20040415107812E12|
> |1.200404151079125...|
> | 1251080.123445|
> | 1.0E28|
> +
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33395) Spark reading data in scientific notation

2020-11-16 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233267#comment-17233267
 ] 

Takeshi Yamamuro commented on SPARK-33395:
--

Hm, but we cannot avoid the rounding in this case, I think. Any ideas? Either way, I think this is expected behaviour, so I will change "Bug" -> "Improvement" now.

> Spark reading data in scientific notation
> -
>
> Key: SPARK-33395
> URL: https://issues.apache.org/jira/browse/SPARK-33395
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.8, 3.1.0
>Reporter: Nilesh Patil
>Priority: Critical
>
> File is having below data
> DAta
> 1200404151072.1211
> 1200404151073
> 1200404151074.1232323
> 1200404151075.124344
> 1200404151076.12
> 1200404151077.12343
> 1200404151078.12
> 1200404151079.12544545454554
> 1251080.123444
> 1
>  
> Spark is reading with scientific notation as we wanted to read data as it is 
> available in file with accurate datatype not with string datatype.
> ++
> | DAta|
> ++
> |1.200404151072121E12|
> | 1.200404151073E12|
> |1.200404151074123...|
> |1.200404151075124...|
> | 1.20040415107612E12|
> |1.200404151077123...|
> | 1.20040415107812E12|
> |1.200404151079125...|
> | 1251080.123445|
> | 1.0E28|
> +
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33395) Spark reading data in scientific notation

2020-11-16 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-33395:
-
Priority: Major  (was: Critical)

> Spark reading data in scientific notation
> -
>
> Key: SPARK-33395
> URL: https://issues.apache.org/jira/browse/SPARK-33395
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.8, 3.1.0
>Reporter: Nilesh Patil
>Priority: Major
>
> File is having below data
> DAta
> 1200404151072.1211
> 1200404151073
> 1200404151074.1232323
> 1200404151075.124344
> 1200404151076.12
> 1200404151077.12343
> 1200404151078.12
> 1200404151079.12544545454554
> 1251080.123444
> 1
>  
> Spark is reading with scientific notation as we wanted to read data as it is 
> available in file with accurate datatype not with string datatype.
> ++
> | DAta|
> ++
> |1.200404151072121E12|
> | 1.200404151073E12|
> |1.200404151074123...|
> |1.200404151075124...|
> | 1.20040415107612E12|
> |1.200404151077123...|
> | 1.20040415107812E12|
> |1.200404151079125...|
> | 1251080.123445|
> | 1.0E28|
> +
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33395) Spark reading data in scientific notation

2020-11-16 Thread Nilesh Patil (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233264#comment-17233264
 ] 

Nilesh Patil commented on SPARK-33395:
--

Yes, the inferred type is double, but we want the data exactly as it appears in the file, neither in scientific notation nor rounded.

> Spark reading data in scientific notation
> -
>
> Key: SPARK-33395
> URL: https://issues.apache.org/jira/browse/SPARK-33395
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.8, 3.1.0
>Reporter: Nilesh Patil
>Priority: Critical
>
> File is having below data
> DAta
> 1200404151072.1211
> 1200404151073
> 1200404151074.1232323
> 1200404151075.124344
> 1200404151076.12
> 1200404151077.12343
> 1200404151078.12
> 1200404151079.12544545454554
> 1251080.123444
> 1
>  
> Spark is reading with scientific notation as we wanted to read data as it is 
> available in file with accurate datatype not with string datatype.
> ++
> | DAta|
> ++
> |1.200404151072121E12|
> | 1.200404151073E12|
> |1.200404151074123...|
> |1.200404151075124...|
> | 1.20040415107612E12|
> |1.200404151077123...|
> | 1.20040415107812E12|
> |1.200404151079125...|
> | 1251080.123445|
> | 1.0E28|
> +
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33395) Spark reading data in scientific notation

2020-11-16 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-33395:
-
Affects Version/s: 3.1.0
   2.4.8

> Spark reading data in scientific notation
> -
>
> Key: SPARK-33395
> URL: https://issues.apache.org/jira/browse/SPARK-33395
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4, 2.4.8, 3.1.0
>Reporter: Nilesh Patil
>Priority: Critical
>
> File is having below data
> DAta
> 1200404151072.1211
> 1200404151073
> 1200404151074.1232323
> 1200404151075.124344
> 1200404151076.12
> 1200404151077.12343
> 1200404151078.12
> 1200404151079.12544545454554
> 1251080.123444
> 1
>  
> Spark is reading with scientific notation as we wanted to read data as it is 
> available in file with accurate datatype not with string datatype.
> ++
> | DAta|
> ++
> |1.200404151072121E12|
> | 1.200404151073E12|
> |1.200404151074123...|
> |1.200404151075124...|
> | 1.20040415107612E12|
> |1.200404151077123...|
> | 1.20040415107812E12|
> |1.200404151079125...|
> | 1251080.123445|
> | 1.0E28|
> +
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33395) Spark reading data in scientific notation

2020-11-16 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-33395:
-
Component/s: (was: Spark Core)
 SQL

> Spark reading data in scientific notation
> -
>
> Key: SPARK-33395
> URL: https://issues.apache.org/jira/browse/SPARK-33395
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.8, 3.1.0
>Reporter: Nilesh Patil
>Priority: Critical
>
> File is having below data
> DAta
> 1200404151072.1211
> 1200404151073
> 1200404151074.1232323
> 1200404151075.124344
> 1200404151076.12
> 1200404151077.12343
> 1200404151078.12
> 1200404151079.12544545454554
> 1251080.123444
> 1
>  
> Spark is reading with scientific notation as we wanted to read data as it is 
> available in file with accurate datatype not with string datatype.
> ++
> | DAta|
> ++
> |1.200404151072121E12|
> | 1.200404151073E12|
> |1.200404151074123...|
> |1.200404151075124...|
> | 1.20040415107612E12|
> |1.200404151077123...|
> | 1.20040415107812E12|
> |1.200404151079125...|
> | 1251080.123445|
> | 1.0E28|
> +
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33395) Spark reading data in scientific notation

2020-11-16 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233263#comment-17233263
 ] 

Takeshi Yamamuro commented on SPARK-33395:
--

The inferred type is double, so the values are approximate. What do you suggest here? Do you think we should use decimal in this case instead?

> Spark reading data in scientific notation
> -
>
> Key: SPARK-33395
> URL: https://issues.apache.org/jira/browse/SPARK-33395
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4
>Reporter: Nilesh Patil
>Priority: Critical
>
> File is having below data
> DAta
> 1200404151072.1211
> 1200404151073
> 1200404151074.1232323
> 1200404151075.124344
> 1200404151076.12
> 1200404151077.12343
> 1200404151078.12
> 1200404151079.12544545454554
> 1251080.123444
> 1
>  
> Spark is reading with scientific notation as we wanted to read data as it is 
> available in file with accurate datatype not with string datatype.
> ++
> | DAta|
> ++
> |1.200404151072121E12|
> | 1.200404151073E12|
> |1.200404151074123...|
> |1.200404151075124...|
> | 1.20040415107612E12|
> |1.200404151077123...|
> | 1.20040415107812E12|
> |1.200404151079125...|
> | 1251080.123445|
> | 1.0E28|
> +
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33407) Simplify the exception message from Python UDFs

2020-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33407.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30309
[https://github.com/apache/spark/pull/30309]

> Simplify the exception message from Python UDFs
> ---
>
> Key: SPARK-33407
> URL: https://issues.apache.org/jira/browse/SPARK-33407
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, the exception message is as below:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../python/pyspark/sql/dataframe.py", line 427, in show
> print(self._jdf.showString(n, 20, vertical))
>   File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, 
> in __call__
>   File "/.../python/pyspark/sql/utils.py", line 127, in deco
> raise_from(converted)
>   File "", line 3, in raise_from
> pyspark.sql.utils.PythonException:
>   An exception was thrown from Python worker in the executor:
> Traceback (most recent call last):
>   File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
> process()
>   File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
> serializer.dump_stream(out_iter, outfile)
>   File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in 
> dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in 
> dump_stream
> for obj in iterator:
>   File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in 
> _batched
> for item in iterator:
>   File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
> result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in 
> udfs)
>   File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in 
> result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in 
> udfs)
>   File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in 
> return lambda *a: f(*a)
>   File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
> return f(*args, **kwargs)
>   File "", line 3, in divide_by_zero
> ZeroDivisionError: division by zero
> {code}
> Actually, in almost all cases, users only care about {{ZeroDivisionError: division by zero}}. We don't really have to show the internal stuff in 99% of cases.
> We could just make it short, for example,
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../python/pyspark/sql/dataframe.py", line 427, in show
> print(self._jdf.showString(n, 20, vertical))
>   File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, 
> in __call__
>   File "/.../python/pyspark/sql/utils.py", line 127, in deco
> raise_from(converted)
>   File "", line 3, in raise_from
> pyspark.sql.utils.PythonException:
>   An exception was thrown from Python worker in the executor:
> Traceback (most recent call last):
>   File "", line 3, in divide_by_zero
> ZeroDivisionError: division by zero
> {code}
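
A rough Python sketch of the frame-filtering idea (the function and the hidden-file list are illustrative, not the actual patch):

{code:python}
import traceback

# Frames coming from these PySpark-internal files are hidden from the user-facing
# traceback (illustrative list).
HIDDEN_FILES = ("pyspark/worker.py", "pyspark/serializers.py", "pyspark/util.py")

def simplify_traceback(exc):
    """Format exc's traceback, dropping frames from PySpark-internal files."""
    frames = [f for f in traceback.extract_tb(exc.__traceback__)
              if not any(h in f.filename for h in HIDDEN_FILES)]
    lines = ["Traceback (most recent call last):\n"]
    lines += traceback.format_list(frames)
    lines += traceback.format_exception_only(type(exc), exc)
    return "".join(lines)

# Applied to the ZeroDivisionError above, only the user's divide_by_zero frame
# and the final error line would remain.
{code}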



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33407) Simplify the exception message from Python UDFs

2020-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33407:


Assignee: Hyukjin Kwon

> Simplify the exception message from Python UDFs
> ---
>
> Key: SPARK-33407
> URL: https://issues.apache.org/jira/browse/SPARK-33407
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently, the exception message is as below:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../python/pyspark/sql/dataframe.py", line 427, in show
> print(self._jdf.showString(n, 20, vertical))
>   File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, 
> in __call__
>   File "/.../python/pyspark/sql/utils.py", line 127, in deco
> raise_from(converted)
>   File "", line 3, in raise_from
> pyspark.sql.utils.PythonException:
>   An exception was thrown from Python worker in the executor:
> Traceback (most recent call last):
>   File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
> process()
>   File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
> serializer.dump_stream(out_iter, outfile)
>   File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in 
> dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in 
> dump_stream
> for obj in iterator:
>   File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in 
> _batched
> for item in iterator:
>   File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
> result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in 
> udfs)
>   File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in 
> result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in 
> udfs)
>   File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in 
> return lambda *a: f(*a)
>   File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
> return f(*args, **kwargs)
>   File "", line 3, in divide_by_zero
> ZeroDivisionError: division by zero
> {code}
> Actually, in almost all cases, users only care about {{ZeroDivisionError: division by zero}}. We don't really have to show the internal stuff in 99% of cases.
> We could just make it short, for example,
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../python/pyspark/sql/dataframe.py", line 427, in show
> print(self._jdf.showString(n, 20, vertical))
>   File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, 
> in __call__
>   File "/.../python/pyspark/sql/utils.py", line 127, in deco
> raise_from(converted)
>   File "", line 3, in raise_from
> pyspark.sql.utils.PythonException:
>   An exception was thrown from Python worker in the executor:
> Traceback (most recent call last):
>   File "", line 3, in divide_by_zero
> ZeroDivisionError: division by zero
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33464) Add/remove (un)necessary cache and restructure GitHub Actions yaml

2020-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33464:


Assignee: (was: Apache Spark)

> Add/remove (un)necessary cache and restructure GitHub Actions yaml
> --
>
> Key: SPARK-33464
> URL: https://issues.apache.org/jira/browse/SPARK-33464
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, GitHub Actions build has some unnecessary cache/commands. For 
> example, if you run SBT only .m2 cache is not needed. We should clean up and 
> re-organize.
> Also, we should add {{~/.sbt}} into cache. See 
> https://github.com/sbt/sbt/issues/3681



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33464) Add/remove (un)necessary cache and restructure GitHub Actions yaml

2020-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233254#comment-17233254
 ] 

Apache Spark commented on SPARK-33464:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/30391

> Add/remove (un)necessary cache and restructure GitHub Actions yaml
> --
>
> Key: SPARK-33464
> URL: https://issues.apache.org/jira/browse/SPARK-33464
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, GitHub Actions build has some unnecessary cache/commands. For 
> example, if you run SBT only .m2 cache is not needed. We should clean up and 
> re-organize.
> Also, we should add {{~/.sbt}} into cache. See 
> https://github.com/sbt/sbt/issues/3681



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33464) Add/remove (un)necessary cache and restructure GitHub Actions yaml

2020-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33464:


Assignee: Apache Spark

> Add/remove (un)necessary cache and restructure GitHub Actions yaml
> --
>
> Key: SPARK-33464
> URL: https://issues.apache.org/jira/browse/SPARK-33464
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Currently, GitHub Actions build has some unnecessary cache/commands. For 
> example, if you run SBT only .m2 cache is not needed. We should clean up and 
> re-organize.
> Also, we should add {{~/.sbt}} into cache. See 
> https://github.com/sbt/sbt/issues/3681



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33464) Add/remove (un)necessary cache and restructure GitHub Actions yaml

2020-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33464:
-
Description: 
Currently, the GitHub Actions build has some unnecessary caches/commands. For example, if you only run SBT, the .m2 cache is not needed. We should clean up and re-organize.
Also, we should add {{~/.sbt}} to the cache. See https://github.com/sbt/sbt/issues/3681

  was:Currently, GitHub Actions build has some unnecessary cache/commands. For 
example, if you run SBT only .m2 cache is not needed. We should clean up and 
re-organize.


> Add/remove (un)necessary cache and restructure GitHub Actions yaml
> --
>
> Key: SPARK-33464
> URL: https://issues.apache.org/jira/browse/SPARK-33464
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, GitHub Actions build has some unnecessary cache/commands. For 
> example, if you run SBT only .m2 cache is not needed. We should clean up and 
> re-organize.
> Also, we should add {{~/.sbt}} into cache. See 
> https://github.com/sbt/sbt/issues/3681



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33464) Add/remove (un)necessary cache and restructure GitHub Actions yaml

2020-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33464:
-
Summary: Add/remove (un)necessary cache and restructure GitHub Actions yaml 
 (was: Remove unnecessary cache and restructure)

> Add/remove (un)necessary cache and restructure GitHub Actions yaml
> --
>
> Key: SPARK-33464
> URL: https://issues.apache.org/jira/browse/SPARK-33464
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, GitHub Actions build has some unnecessary cache/commands. For 
> example, if you run SBT only .m2 cache is not needed. We should clean up and 
> re-organize.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33395) Spark reading data in scientific notation

2020-11-16 Thread Nilesh Patil (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233249#comment-17233249
 ] 

Nilesh Patil commented on SPARK-33395:
--

[~zhangway], 

Hi, I am expecting output like below.

DAta
1200404151072.1211
1200404151073
1200404151074.1232323
1200404151075.124344
1200404151076.12
1200404151077.12343
1200404151078.12
1200404151079.12544545454554
1251080.123444
1

 

The code we are using:

dataset = sparkSession.read().option("header", true).option("multiLine", true)
.option("inferSchema", true).csv(filePathSeq);

 

> Spark reading data in scientific notation
> -
>
> Key: SPARK-33395
> URL: https://issues.apache.org/jira/browse/SPARK-33395
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4
>Reporter: Nilesh Patil
>Priority: Critical
>
> File is having below data
> DAta
> 1200404151072.1211
> 1200404151073
> 1200404151074.1232323
> 1200404151075.124344
> 1200404151076.12
> 1200404151077.12343
> 1200404151078.12
> 1200404151079.12544545454554
> 1251080.123444
> 1
>  
> Spark is reading with scientific notation as we wanted to read data as it is 
> available in file with accurate datatype not with string datatype.
> ++
> | DAta|
> ++
> |1.200404151072121E12|
> | 1.200404151073E12|
> |1.200404151074123...|
> |1.200404151075124...|
> | 1.20040415107612E12|
> |1.200404151077123...|
> | 1.20040415107812E12|
> |1.200404151079125...|
> | 1251080.123445|
> | 1.0E28|
> +
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33464) Remove unnecessary cache and restructure

2020-11-16 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-33464:


 Summary: Remove unnecessary cache and restructure
 Key: SPARK-33464
 URL: https://issues.apache.org/jira/browse/SPARK-33464
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 2.4.8, 3.0.2, 3.1.0
Reporter: Hyukjin Kwon


Currently, the GitHub Actions build has some unnecessary caches/commands. For example, if you only run SBT, the .m2 cache is not needed. We should clean up and re-organize.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33454) Add GitHub Action job for Hadoop 2

2020-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33454:
-
Parent: SPARK-32244
Issue Type: Sub-task  (was: New Feature)

> Add GitHub Action job for Hadoop 2
> --
>
> Key: SPARK-33454
> URL: https://issues.apache.org/jira/browse/SPARK-33454
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>
> This issue aims to prevent accidental compilation errors with the Hadoop 2 profile.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.

2020-11-16 Thread Prashant Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233227#comment-17233227
 ] 

Prashant Sharma commented on SPARK-30985:
-

Thanks [~dongjoon], you have resolved the confusion I had. Indeed, this is what 
was intended.

> Propagate SPARK_CONF_DIR files to driver and exec pods.
> ---
>
> Key: SPARK-30985
> URL: https://issues.apache.org/jira/browse/SPARK-30985
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
> Fix For: 3.1.0
>
>
> SPARK_CONF_DIR hosts configuration files like, 
>  1) spark-defaults.conf - containing all the spark properties.
>  2) log4j.properties - Logger configuration.
>  3) spark-env.sh - Environment variables to be setup at driver and executor.
>  4) core-site.xml - Hadoop related configuration.
>  5) fairscheduler.xml - Spark's fair scheduling policy at the job level.
>  6) metrics.properties - Spark metrics.
>  7) Any user specific - library or framework specific configuration file.
> Traditionally, SPARK_CONF_DIR has been the home to all user specific 
> configuration files.
> So this feature will let the user-specific configuration files be mounted on 
> the driver and executor pods' SPARK_CONF_DIR.
> Please review the attached design doc, for more details.
>  
> [Google docs link|https://bit.ly/spark-30985]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33379) The link address of ‘’this page’’ in docs/pyspark-migration-guide.md is incorrect

2020-11-16 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-33379.
--
Resolution: Not A Problem

> The link address of ‘’this page’’ in docs/pyspark-migration-guide.md is 
> incorrect
> -
>
> Key: SPARK-33379
> URL: https://issues.apache.org/jira/browse/SPARK-33379
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: 董可伦
>Priority: Major
>  Labels: documentation
> Attachments: SPARK-33379.png
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
>  
>  
>   The link address of ‘’this page’’ in +docs/pyspark-migration-guide.md+ is 
> incorrect, and it will show ""Not Found The requested URL was not found on 
> this server.""
> [https://github.com/apache/spark/blob/master/docs/pyspark-migration-guide.md]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33209) Clean up unit test file UnsupportedOperationsSuite.scala

2020-11-16 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-33209:


Assignee: Cheng Su

> Clean up unit test file UnsupportedOperationsSuite.scala
> 
>
> Key: SPARK-33209
> URL: https://issues.apache.org/jira/browse/SPARK-33209
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
>
> As a follow-up from [https://github.com/apache/spark/pull/30076], there is a lot of copy-paste in the unit test file UnsupportedOperationsSuite.scala for checking different join types (inner, outer, semi) with similar code structure. It would be helpful to clean this up and refactor to reuse code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33209) Clean up unit test file UnsupportedOperationsSuite.scala

2020-11-16 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-33209.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30347
[https://github.com/apache/spark/pull/30347]

> Clean up unit test file UnsupportedOperationsSuite.scala
> 
>
> Key: SPARK-33209
> URL: https://issues.apache.org/jira/browse/SPARK-33209
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
> Fix For: 3.1.0
>
>
> As a follow-up from [https://github.com/apache/spark/pull/30076], there is a lot of copy-paste in the unit test file UnsupportedOperationsSuite.scala for checking different join types (inner, outer, semi) with similar code structure. It would be helpful to clean this up and refactor to reuse code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33445) Can't parse decimal type from csv file

2020-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33445.
---
Resolution: Cannot Reproduce

I'll close this for now, [~bullsoverbears]. However, you are welcome to reopen this with a reproducible example. Thanks for reporting.

> Can't parse decimal type from csv file
> --
>
> Key: SPARK-33445
> URL: https://issues.apache.org/jira/browse/SPARK-33445
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 2.4.7, 3.0.0
>Reporter: Punit Shah
>Priority: Major
> Attachments: tsd.csv
>
>
> The attached file is a one column csv file containing decimals.
> Execute: {color:#de350b}mydf2 = spark_session.read.csv("tsd.csv", 
> header=True, inferSchema=True){color}
> Then invoking {color:#de350b}mydf2.schema{color} will result in error:
> {color:#ff8b00}ValueError: Could not parse datatype: decimal(6,-7){color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33445) Can't parse decimal type from csv file

2020-11-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233191#comment-17233191
 ] 

Dongjoon Hyun commented on SPARK-33445:
---

PySpark is also working.
{code:java}
>>> mydf2 = spark.read.csv("tsd.csv", header=True, inferSchema=True)
>>> mydf2.schema
StructType(List(StructField(Epoch Miliseconds,DoubleType,true)))
>>> sc.version
'3.0.1' {code}
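
For anyone still hitting the negative-scale decimal on 2.4.x, a minimal workaround sketch (the column name is taken from the attached file; the rest is an assumption) is to skip inference and pass an explicit schema:

{code:python}
from pyspark.sql.types import StructType, StructField, DoubleType

# Read the single column as double directly, so schema inference never produces
# the unparsable decimal(6,-7) type.
schema = StructType([StructField("Epoch Miliseconds", DoubleType(), True)])
mydf2 = spark.read.csv("tsd.csv", header=True, schema=schema)
mydf2.printSchema()
{code}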

> Can't parse decimal type from csv file
> --
>
> Key: SPARK-33445
> URL: https://issues.apache.org/jira/browse/SPARK-33445
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 2.4.7, 3.0.0
>Reporter: Punit Shah
>Priority: Major
> Attachments: tsd.csv
>
>
> The attached file is a one column csv file containing decimals.
> Execute: {color:#de350b}mydf2 = spark_session.read.csv("tsd.csv", 
> header=True, inferSchema=True){color}
> Then invoking {color:#de350b}mydf2.schema{color} will result in error:
> {color:#ff8b00}ValueError: Could not parse datatype: decimal(6,-7){color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33445) Can't parse decimal type from csv file

2020-11-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233189#comment-17233189
 ] 

Dongjoon Hyun commented on SPARK-33445:
---

[~bullsoverbears]. For me, I don't see `ValueError`. Are the affected versions 
correct?

*Apache Spark 2.4.7*
{code}
scala> spark.version
res0: String = 2.4.7

scala> spark.read.option("header", "true").option("inferSchema", 
"true").csv("tsd.csv").printSchema
root
 |-- Epoch Miliseconds: decimal(6,-7) (nullable = true)
{code}

*Apache Spark 3.0.0*
{code}
scala> spark.version
res0: String = 3.0.0

scala> spark.read.option("header", "true").option("inferSchema", 
"true").csv("tsd.csv").printSchema
root
 |-- Epoch Miliseconds: double (nullable = true)
{code}

*Apache Spark 3.0.1*
{code}
scala> spark.version
res0: String = 3.0.1

scala> spark.read.option("header", "true").option("inferSchema", 
"true").csv("tsd.csv").printSchema
root
 |-- Epoch Miliseconds: double (nullable = true)
{code}

> Can't parse decimal type from csv file
> --
>
> Key: SPARK-33445
> URL: https://issues.apache.org/jira/browse/SPARK-33445
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 2.4.7, 3.0.0
>Reporter: Punit Shah
>Priority: Major
> Attachments: tsd.csv
>
>
> The attached file is a one column csv file containing decimals.
> Execute: {color:#de350b}mydf2 = spark_session.read.csv("tsd.csv", 
> header=True, inferSchema=True){color}
> Then invoking {color:#de350b}mydf2.schema{color} will result in error:
> {color:#ff8b00}ValueError: Could not parse datatype: decimal(6,-7){color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33445) Can't parse decimal type from csv file

2020-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33445:
--
Component/s: (was: PySpark)

> Can't parse decimal type from csv file
> --
>
> Key: SPARK-33445
> URL: https://issues.apache.org/jira/browse/SPARK-33445
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 2.4.7, 3.0.0
>Reporter: Punit Shah
>Priority: Major
> Attachments: tsd.csv
>
>
> The attached file is a one column csv file containing decimals.
> Execute: {color:#de350b}mydf2 = spark_session.read.csv("tsd.csv", 
> header=True, inferSchema=True){color}
> Then invoking {color:#de350b}mydf2.schema{color} will result in error:
> {color:#ff8b00}ValueError: Could not parse datatype: decimal(6,-7){color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33449) Add cache for Parquet Metadata

2020-11-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233184#comment-17233184
 ] 

Dongjoon Hyun commented on SPARK-33449:
---

Is this targeting Apache Spark 3.1, [~yumwang]?

> Add cache for Parquet Metadata
> --
>
> Key: SPARK-33449
> URL: https://issues.apache.org/jira/browse/SPARK-33449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: Get Parquet metadata.png
>
>
> Getting Parquet metadata may take a lot of time; maybe we can cache it. Presto supports it:
> https://github.com/prestodb/presto/pull/15276



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33399) Normalize output partitioning and sortorder with respect to aliases to avoid unneeded exchange/sort nodes

2020-11-16 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-33399.
--
Fix Version/s: 3.1.0
 Assignee: Prakhar Jain
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/30300

> Normalize output partitioning and sortorder with respect to aliases to avoid 
> unneeded exchange/sort nodes
> -
>
> Key: SPARK-33399
> URL: https://issues.apache.org/jira/browse/SPARK-33399
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.0, 3.0.1
>Reporter: Prakhar Jain
>Assignee: Prakhar Jain
>Priority: Major
> Fix For: 3.1.0
>
>
> Spark introduces unneeded exchanges if there is a Project after Inner join. 
> Example:
>  
> {noformat}
> spark.range(10).repartition($"id").createTempView("t1")
> spark.range(20).repartition($"id").createTempView("t2")
> spark.range(30).repartition($"id").createTempView("t3")
> val planned = sql(
>"""
>  |SELECT t2id, t3.id as t3id
>  |FROM (
>  |SELECT t1.id as t1id, t2.id as t2id
>  |FROM t1, t2
>  |WHERE t1.id = t2.id
>  |) t12, t3
>  |WHERE t1id = t3.id
>""".stripMargin).queryExecution.executedPlan
> *(9) Project [t2id#1034L, id#1004L AS t3id#1035L]
> +- *(9) SortMergeJoin [t1id#1033L], [id#1004L], Inner
>:- *(6) Sort [t1id#1033L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(t1id#1033L, 5), true, [id=#1343] 
> <---
>: +- *(5) Project [id#996L AS t1id#1033L, id#1000L AS t2id#1034L]
>:+- *(5) SortMergeJoin [id#996L], [id#1000L], Inner
>:   :- *(2) Sort [id#996L ASC NULLS FIRST], false, 0
>:   :  +- Exchange hashpartitioning(id#996L, 5), true, [id=#1329]
>:   : +- *(1) Range (0, 10, step=1, splits=2)
>:   +- *(4) Sort [id#1000L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#1000L, 5), true, [id=#1335]
>: +- *(3) Range (0, 20, step=1, splits=2)
>+- *(8) Sort [id#1004L ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(id#1004L, 5), true, [id=#1349]
>  +- *(7) Range (0, 30, step=1, splits=2){noformat}
> The marked exchange in the above plan can be removed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23499) Mesos Cluster Dispatcher should support priority queues to submit drivers

2020-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-23499:
-

Assignee: Pascal GILLET

> Mesos Cluster Dispatcher should support priority queues to submit drivers
> -
>
> Key: SPARK-23499
> URL: https://issues.apache.org/jira/browse/SPARK-23499
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1
>Reporter: Pascal GILLET
>Assignee: Pascal GILLET
>Priority: Major
> Attachments: Screenshot from 2018-02-28 17-22-47.png
>
>
> As for Yarn, Mesos users should be able to specify priority queues to define 
> a workload management policy for queued drivers in the Mesos Cluster 
> Dispatcher.
> Submitted drivers are *currently* kept in order of their submission: the 
> first driver added to the queue will be the first one to be executed (FIFO).
> Each driver could have a "priority" associated with it. A driver with high 
> priority is served (Mesos resources) before a driver with low priority. If 
> two drivers have the same priority, they are served according to their submit 
> date in the queue.
> To set up such priority queues, the following changes are proposed:
>  * The Mesos Cluster Dispatcher can optionally be configured with the 
> _spark.mesos.dispatcher.queue.[QueueName]_ property. This property takes a 
> float as value. This adds a new queue named _QueueName_ for submitted drivers 
> with the specified priority.
>  Higher numbers indicate higher priority.
>  The user can then specify multiple queues.
>  * A driver can be submitted to a specific queue with 
> _spark.mesos.dispatcher.queue_. This property takes the name of a queue 
> previously declared in the dispatcher as value.
> By default, the dispatcher has a single "default" queue with 0.0 priority 
> (cannot be overridden). If none of the properties above are specified, the 
> behavior is the same as the current one (i.e. simple FIFO).
> Additionally, it is possible to implement a consistent and overall workload 
> management policy throughout the lifecycle of drivers by mapping these 
> priority queues to weighted Mesos roles if any (i.e. from the QUEUED state in 
> the dispatcher to the final states in the Mesos cluster), and by specifying a 
> _spark.mesos.role_ along with a _spark.mesos.dispatcher.queue_ when 
> submitting an application.
> For example, with the URGENT Mesos role:
> {code:java}
> # Conf on the dispatcher side
> spark.mesos.dispatcher.queue.URGENT=1.0
> # Conf on the driver side
> spark.mesos.dispatcher.queue=URGENT
> spark.mesos.role=URGENT
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23499) Mesos Cluster Dispatcher should support priority queues to submit drivers

2020-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-23499.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30352
[https://github.com/apache/spark/pull/30352]

> Mesos Cluster Dispatcher should support priority queues to submit drivers
> -
>
> Key: SPARK-23499
> URL: https://issues.apache.org/jira/browse/SPARK-23499
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1
>Reporter: Pascal GILLET
>Assignee: Pascal GILLET
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: Screenshot from 2018-02-28 17-22-47.png
>
>
> As for Yarn, Mesos users should be able to specify priority queues to define 
> a workload management policy for queued drivers in the Mesos Cluster 
> Dispatcher.
> Submitted drivers are *currently* kept in order of their submission: the 
> first driver added to the queue will be the first one to be executed (FIFO).
> Each driver could have a "priority" associated with it. A driver with high 
> priority is served (Mesos resources) before a driver with low priority. If 
> two drivers have the same priority, they are served according to their submit 
> date in the queue.
> To set up such priority queues, the following changes are proposed:
>  * The Mesos Cluster Dispatcher can optionally be configured with the 
> _spark.mesos.dispatcher.queue.[QueueName]_ property. This property takes a 
> float as value. This adds a new queue named _QueueName_ for submitted drivers 
> with the specified priority.
>  Higher numbers indicate higher priority.
>  The user can then specify multiple queues.
>  * A driver can be submitted to a specific queue with 
> _spark.mesos.dispatcher.queue_. This property takes the name of a queue 
> previously declared in the dispatcher as value.
> By default, the dispatcher has a single "default" queue with 0.0 priority 
> (cannot be overridden). If none of the properties above are specified, the 
> behavior is the same as the current one (i.e. simple FIFO).
> Additionally, it is possible to implement a consistent and overall workload 
> management policy throughout the lifecycle of drivers by mapping these 
> priority queues to weighted Mesos roles if any (i.e. from the QUEUED state in 
> the dispatcher to the final states in the Mesos cluster), and by specifying a 
> _spark.mesos.role_ along with a _spark.mesos.dispatcher.queue_ when 
> submitting an application.
> For example, with the URGENT Mesos role:
> {code:java}
> # Conf on the dispatcher side
> spark.mesos.dispatcher.queue.URGENT=1.0
> # Conf on the driver side
> spark.mesos.dispatcher.queue=URGENT
> spark.mesos.role=URGENT
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33463) Spark Thrift Server, keep Job Id when using incremental collect

2020-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233126#comment-17233126
 ] 

Apache Spark commented on SPARK-33463:
--

User 'gumartinm' has created a pull request for this issue:
https://github.com/apache/spark/pull/30390

> Spark Thrift Server, keep Job Id when using incremental collect
> ---
>
> Key: SPARK-33463
> URL: https://issues.apache.org/jira/browse/SPARK-33463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.7, 3.0.1
>Reporter: Gustavo Martin
>Priority: Major
> Fix For: 3.0.1
>
>
> When enabling *spark.sql.thriftServer.incrementalCollect* Job Ids get lost 
> and tracing queries in Spark Thrift Server ends up being too complicated.
> By fixing the Job Id, queries and Spark jobs are unequivocally related to each 
> other.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33463) Spark Thrift Server, keep Job Id when using incremental collect

2020-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33463:


Assignee: Apache Spark

> Spark Thrift Server, keep Job Id when using incremental collect
> ---
>
> Key: SPARK-33463
> URL: https://issues.apache.org/jira/browse/SPARK-33463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.7, 3.0.1
>Reporter: Gustavo Martin
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.0.1
>
>
> When enabling *spark.sql.thriftServer.incrementalCollect* Job Ids get lost 
> and tracing queries in Spark Thrift Server ends up being too complicated.
> By fixing the Job Id, queries and Spark jobs are unequivocally related to each 
> other.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33463) Spark Thrift Server, keep Job Id when using incremental collect

2020-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33463:


Assignee: (was: Apache Spark)

> Spark Thrift Server, keep Job Id when using incremental collect
> ---
>
> Key: SPARK-33463
> URL: https://issues.apache.org/jira/browse/SPARK-33463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.7, 3.0.1
>Reporter: Gustavo Martin
>Priority: Major
> Fix For: 3.0.1
>
>
> When enabling *spark.sql.thriftServer.incrementalCollect* Job Ids get lost 
> and tracing queries in Spark Thrift Server ends up being too complicated.
> By fixing the Job Id, queries and Spark jobs are unequivocally related to each 
> other.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33463) Spark Thrift Server, keep Job Id when using incremental collect

2020-11-16 Thread Gustavo Martin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gustavo Martin updated SPARK-33463:
---
Fix Version/s: (was: 3.0.1)
   3.1.0

> Spark Thrift Server, keep Job Id when using incremental collect
> ---
>
> Key: SPARK-33463
> URL: https://issues.apache.org/jira/browse/SPARK-33463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.7, 3.0.1
>Reporter: Gustavo Martin
>Priority: Major
> Fix For: 3.1.0
>
>
> When enabling *spark.sql.thriftServer.incrementalCollect* Job Ids get lost 
> and tracing queries in Spark Thrift Server ends up being too complicated.
> By fixing the Job Id, queries and Spark jobs are unequivocally related to each 
> other.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33463) Spark Thrift Server, keep Job Id when using incremental collect

2020-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233125#comment-17233125
 ] 

Apache Spark commented on SPARK-33463:
--

User 'gumartinm' has created a pull request for this issue:
https://github.com/apache/spark/pull/30390

> Spark Thrift Server, keep Job Id when using incremental collect
> ---
>
> Key: SPARK-33463
> URL: https://issues.apache.org/jira/browse/SPARK-33463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.7, 3.0.1
>Reporter: Gustavo Martin
>Priority: Major
> Fix For: 3.0.1
>
>
> When enabling *spark.sql.thriftServer.incrementalCollect* Job Ids get lost 
> and tracing queries in Spark Thrift Server ends up being too complicated.
> By fixing the Job Id, queries and Spark jobs are unequivocally related to each 
> other.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33463) Spark Thrift Server, keep Job Id when using incremental collect

2020-11-16 Thread Gustavo Martin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233124#comment-17233124
 ] 

Gustavo Martin commented on SPARK-33463:


PR: https://github.com/apache/spark/pull/30390

> Spark Thrift Server, keep Job Id when using incremental collect
> ---
>
> Key: SPARK-33463
> URL: https://issues.apache.org/jira/browse/SPARK-33463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.7, 3.0.1
>Reporter: Gustavo Martin
>Priority: Major
> Fix For: 3.0.1
>
>
> When enabling *spark.sql.thriftServer.incrementalCollect* Job Ids get lost 
> and tracing queries in Spark Thrift Server ends up being too complicated.
> By fixing the Job Id, queries and Spark jobs are unequivocally related to each 
> other.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33463) Spark Thrift Server, keep Job Id when using incremental collect

2020-11-16 Thread Gustavo Martin (Jira)
Gustavo Martin created SPARK-33463:
--

 Summary: Spark Thrift Server, keep Job Id when using incremental 
collect
 Key: SPARK-33463
 URL: https://issues.apache.org/jira/browse/SPARK-33463
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 2.4.7, 2.3.4
Reporter: Gustavo Martin
 Fix For: 3.0.1


When enabling *spark.sql.thriftServer.incrementalCollect* Job Ids get lost and 
tracing queries in Spark Thrift Server ends up being too complicated.

By fixing the Job Id, queries and Spark jobs are unequivocally related to each 
other.






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33183) Bug in optimizer rule EliminateSorts

2020-11-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233058#comment-17233058
 ] 

Dongjoon Hyun edited comment on SPARK-33183 at 11/16/20, 8:23 PM:
--

I added SPARK-23973 as "is caused by". If so, Apache Spark 2.3 seems to be 
okay.


was (Author: dongjoon):
I added SPARK-23973 as "is caused by".

> Bug in optimizer rule EliminateSorts
> 
>
> Key: SPARK-33183
> URL: https://issues.apache.org/jira/browse/SPARK-33183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 
> 2.4.8, 3.0.2, 3.1.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> Currently, the rule {{EliminateSorts}} removes a global sort node if its 
> child plan already satisfies the required sort order without checking if the 
> child plan's ordering is local or global. For example, in the following 
> scenario, the first sort shouldn't be removed because it has a stronger 
> guarantee than the second sort even if the sort orders are the same for both 
> sorts. 
> {code:java}
> Sort(orders, global = True, ...)
>   Sort(orders, global = False, ...){code}
>  
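
A minimal DataFrame sketch of that plan shape (an illustration of the description, not the regression test from the fix): the outer orderBy is a global sort whose child is only sorted within partitions, so eliminating it would leave the result ordered per partition only.

{code:scala}
import org.apache.spark.sql.functions.col

// Child: local (per-partition) sort; parent: global sort on the same column.
val df = spark.range(0, 100, 1, 4).toDF("id")
val locallySorted  = df.sortWithinPartitions(col("id"))   // Sort(global = false)
val globallySorted = locallySorted.orderBy(col("id"))     // Sort(global = true)

// With the bug, the optimizer may drop the outer global Sort here.
globallySorted.explain(true)
{code}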



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33183) Bug in optimizer rule EliminateSorts

2020-11-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233058#comment-17233058
 ] 

Dongjoon Hyun edited comment on SPARK-33183 at 11/16/20, 8:23 PM:
--

I added SPARK-23973 as "is caused by". If so, Apache Spark 2.3 seems to be 
okay. Please let me know if this affects older Spark versions.


was (Author: dongjoon):
I added SPARK-23973 as "is caused by". If so, Apache Spark 2.3 seems to be 
okay.

> Bug in optimizer rule EliminateSorts
> 
>
> Key: SPARK-33183
> URL: https://issues.apache.org/jira/browse/SPARK-33183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 
> 2.4.8, 3.0.2, 3.1.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> Currently, the rule {{EliminateSorts}} removes a global sort node if its 
> child plan already satisfies the required sort order without checking if the 
> child plan's ordering is local or global. For example, in the following 
> scenario, the first sort shouldn't be removed because it has a stronger 
> guarantee than the second sort even if the sort orders are the same for both 
> sorts. 
> {code:java}
> Sort(orders, global = True, ...)
>   Sort(orders, global = False, ...){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33183) Bug in optimizer rule EliminateSorts

2020-11-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233058#comment-17233058
 ] 

Dongjoon Hyun commented on SPARK-33183:
---

I added SPARK-23973 as "is caused by".

> Bug in optimizer rule EliminateSorts
> 
>
> Key: SPARK-33183
> URL: https://issues.apache.org/jira/browse/SPARK-33183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> Currently, the rule {{EliminateSorts}} removes a global sort node if its 
> child plan already satisfies the required sort order without checking if the 
> child plan's ordering is local or global. For example, in the following 
> scenario, the first sort shouldn't be removed because it has a stronger 
> guarantee than the second sort even if the sort orders are the same for both 
> sorts. 
> {code:java}
> Sort(orders, global = True, ...)
>   Sort(orders, global = False, ...){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33183) Bug in optimizer rule EliminateSorts

2020-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33183:
--
Affects Version/s: 2.4.0
   2.4.1
   2.4.2
   2.4.3
   2.4.4
   2.4.5
   2.4.6
   2.4.7

> Bug in optimizer rule EliminateSorts
> 
>
> Key: SPARK-33183
> URL: https://issues.apache.org/jira/browse/SPARK-33183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 
> 2.4.8, 3.0.2, 3.1.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> Currently, the rule {{EliminateSorts}} removes a global sort node if its 
> child plan already satisfies the required sort order without checking if the 
> child plan's ordering is local or global. For example, in the following 
> scenario, the first sort shouldn't be removed because it has a stronger 
> guarantee than the second sort even if the sort orders are the same for both 
> sorts. 
> {code:java}
> Sort(orders, global = True, ...)
>   Sort(orders, global = False, ...){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31450) Make ExpressionEncoder thread safe

2020-11-16 Thread Navin Viswanath (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233054#comment-17233054
 ] 

Navin Viswanath commented on SPARK-31450:
-

[~hvanhovell]  [~dongjoon] I was in the process of migrating some code from 
Spark 2.4 to Spark 3 and noticed that this required a change in our code. We 
use the following process to go from a Thrift type T to InternalRow (reading 
Thrift files on HDFS into a DataFrame):
 # We construct a Spark schema by inspecting the thrift metadata.
 # We convert a thrift object to a GenericRow using the thrift metadata to read 
columns.
 # We then construct an ExpressionEncoder[Row] and use it to create an 
InternalRow as follows:
{code:java}
val schema: StructType = ... // infer thrift schema
val encoder: ExpressionEncoder[Row] = RowEncoder(schema)
val genericRow: GenericRow = toGenericRow(thriftObject, schema)
val internalRow: InternalRow = encoder.toRow(genericRow)
{code}

The above steps are used to implement
{code:java}
protected def buildReader(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
{code}
in trait org.apache.spark.sql.execution.datasources.FileFormat where we need an 
Iterator[InternalRow].

With the change in this ticket, I would have to replace 
{code:java}
val internalRow: InternalRow = encoder.toRow(genericRow)  
{code}
with
{code:java}
val serializer = encoder.createSerializer()
val internalRow: InternalRow = serializer(genericRow){code}
Since this is marked as an internal API in the PR, I was wondering if there is 
a way to implement this so that it is compatible with both Spark 2.4 and Spark 
3. 

My goal is to not require a code change if possible. It seems to me that since 
I know the schema of the thrift type it should be possible to construct an 
InternalRow, but I don't see a way to do this in the code base.
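
Not an authoritative answer, but one possible stopgap is to pick the API reflectively so the same source compiles and runs against both versions; both toRow and createSerializer are internal, so this is a hedged sketch under that caveat rather than a supported approach.

{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Sketch: prefer Spark 3's createSerializer() when it exists, fall back to
// Spark 2.4's toRow otherwise. Note that in Spark 3 a serializer instance is
// stateful and should not be shared across threads.
def rowSerializer(encoder: ExpressionEncoder[Row]): Row => InternalRow = {
  val createSerializer =
    try Some(encoder.getClass.getMethod("createSerializer"))
    catch { case _: NoSuchMethodException => None }

  createSerializer match {
    case Some(m) =>
      // Spark 3.x: Serializer[Row] is itself a Row => InternalRow function.
      m.invoke(encoder).asInstanceOf[Row => InternalRow]
    case None =>
      // Spark 2.4.x: toRow(t: T) erases to toRow(Object).
      val toRow = encoder.getClass.getMethod("toRow", classOf[AnyRef])
      row => toRow.invoke(encoder, row).asInstanceOf[InternalRow]
  }
}
{code}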

> Make ExpressionEncoder thread safe
> --
>
> Key: SPARK-31450
> URL: https://issues.apache.org/jira/browse/SPARK-31450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.0.0
>
>
> ExpressionEncoder is currently not thread-safe because it contains stateful 
> objects that are required for converting objects to internal rows and vice 
> versa. We have been working around this by (excessively) cloning 
> ExpressionEncoders which is not free. I propose that we move the stateful 
> bits of the expression encoder into two helper classes that will take care of 
> the conversions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33435) DSv2: REFRESH TABLE should invalidate caches

2020-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33435:
--
Labels: DSv2 correctness  (was: DSv2)

> DSv2: REFRESH TABLE should invalidate caches
> 
>
> Key: SPARK-33435
> URL: https://issues.apache.org/jira/browse/SPARK-33435
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: DSv2, correctness
> Fix For: 3.0.2, 3.1.0
>
>
> Currently, in DSv2 {{RefreshTableExec}}, we only invalidate the metadata cache 
> but not all the caches that reference the table to be refreshed. This may 
> cause correctness issues if these caches go stale and get queried later.
> Note that since we don't support caching a v2 table yet, we can't recache the 
> table itself at the moment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33462) ResourceProfile use Int for memory in ExecutorResourcesOrDefaults

2020-11-16 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-33462:
-

 Summary: ResourceProfile use Int for memory in 
ExecutorResourcesOrDefaults
 Key: SPARK-33462
 URL: https://issues.apache.org/jira/browse/SPARK-33462
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Thomas Graves


A follow-up to SPARK-33288: since memory is in MB, we can just store it as an Int 
rather than a Long in ExecutorResourcesOrDefaults.

 

See https://github.com/apache/spark/pull/30375#issuecomment-728270233



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33460) Accessing map values should fail if key is not found.

2020-11-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33460.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30386
[https://github.com/apache/spark/pull/30386]

> Accessing map values should fail if key is not found.
> -
>
> Key: SPARK-33460
> URL: https://issues.apache.org/jira/browse/SPARK-33460
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Assignee: Leanken.Lin
>Priority: Major
> Fix For: 3.1.0
>
>
> When ANSI mode is enabled, accessing map values should fail with an exception 
> if the key does not exist, but currently it returns null.
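
For illustration, a hedged sketch of the behaviour the description is about; the exact error raised after the fix is decided in the linked pull request.

{code:scala}
// ANSI mode on (Spark 3.x config key).
spark.conf.set("spark.sql.ansi.enabled", "true")

// Key 3 is absent from the map. Before the fix this still yields NULL under
// ANSI mode; after the fix it should fail at runtime instead.
spark.sql("SELECT map(1, 'a', 2, 'b')[3]").show()
{code}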



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33460) Accessing map values should fail if key is not found.

2020-11-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33460:
---

Assignee: Leanken.Lin

> Accessing map values should fail if key is not found.
> -
>
> Key: SPARK-33460
> URL: https://issues.apache.org/jira/browse/SPARK-33460
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Assignee: Leanken.Lin
>Priority: Major
>
> When ANSI mode is enabled, accessing map values should fail with an exception 
> if the key does not exist, but currently it returns null.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32222) Add K8s IT for conf propagation

2020-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32222:
--
Summary: Add K8s IT for conf propagation  (was: Add integration tests)

> Add K8s IT for conf propagation
> ---
>
> Key: SPARK-32222
> URL: https://issues.apache.org/jira/browse/SPARK-32222
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
>
> An integration test that places a configuration file in SPARK_CONF_DIR and 
> verifies it is loaded on the executors in both client and cluster deploy 
> modes. 
> For this, a log4j.properties file is a good candidate for testing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33453) Unify v1 and v2 SHOW PARTITIONS tests

2020-11-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33453.
-
Fix Version/s: 3.1.0
 Assignee: Maxim Gekk
   Resolution: Fixed

> Unify v1 and v2 SHOW PARTITIONS tests
> -
>
> Key: SPARK-33453
> URL: https://issues.apache.org/jira/browse/SPARK-33453
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> Gather common tests for DSv1 and DSv2 SHOW PARTITIONS command to a common 
> test. Mix this trait to datasource specific test suites.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32222) Add integration tests

2020-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32222:
-

Assignee: Prashant Sharma

> Add integration tests
> -
>
> Key: SPARK-32222
> URL: https://issues.apache.org/jira/browse/SPARK-32222
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
>
> An integration test that places a configuration file in SPARK_CONF_DIR and 
> verifies it is loaded on the executors in both client and cluster deploy 
> modes. 
> For this, a log4j.properties file is a good candidate for testing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33461) Propagating SPARK_CONF_DIR in K8s and tests

2020-11-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232864#comment-17232864
 ] 

Dongjoon Hyun commented on SPARK-33461:
---

I reorganized this for you as an example, [~prashant]. I hope that is what you 
wanted~ Please let me know if you want to reorganize.

> Propagating SPARK_CONF_DIR in K8s and tests
> ---
>
> Key: SPARK-33461
> URL: https://issues.apache.org/jira/browse/SPARK-33461
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
>
> Foundational work for propagating SPARK_CONF_DIR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33461) Propagating SPARK_CONF_DIR in K8s and tests

2020-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33461:
--
Fix Version/s: (was: 3.1.0)

> Propagating SPARK_CONF_DIR in K8s and tests
> ---
>
> Key: SPARK-33461
> URL: https://issues.apache.org/jira/browse/SPARK-33461
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
>
> Foundational work for propagating SPARK_CONF_DIR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33461) Propagating SPARK_CONF_DIR in K8s and tests

2020-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33461:
--
Summary: Propagating SPARK_CONF_DIR in K8s and tests  (was: Foundational 
work for propagating SPARK_CONF_DIR)

> Propagating SPARK_CONF_DIR in K8s and tests
> ---
>
> Key: SPARK-33461
> URL: https://issues.apache.org/jira/browse/SPARK-33461
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
> Fix For: 3.1.0
>
>
> Foundational work for propagating SPARK_CONF_DIR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.

2020-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30985:
--
Parent: SPARK-33461
Issue Type: Sub-task  (was: Improvement)

> Propagate SPARK_CONF_DIR files to driver and exec pods.
> ---
>
> Key: SPARK-30985
> URL: https://issues.apache.org/jira/browse/SPARK-30985
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
> Fix For: 3.1.0
>
>
> SPARK_CONF_DIR hosts configuration files like, 
>  1) spark-defaults.conf - containing all the spark properties.
>  2) log4j.properties - Logger configuration.
>  3) spark-env.sh - Environment variables to be setup at driver and executor.
>  4) core-site.xml - Hadoop related configuration.
>  5) fairscheduler.xml - Spark's fair scheduling policy at the job level.
>  6) metrics.properties - Spark metrics.
>  7) Any user specific - library or framework specific configuration file.
> Traditionally, SPARK_CONF_DIR has been the home to all user-specific 
> configuration files.
> So this feature will let user-specific configuration files be mounted in 
> the driver and executor pods' SPARK_CONF_DIR.
> Please review the attached design doc, for more details.
>  
> [Google docs link|https://bit.ly/spark-30985]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32223) Support adding a user provided config map.

2020-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32223:
--
Parent: SPARK-33461  (was: SPARK-30985)

> Support adding a user provided config map.
> --
>
> Key: SPARK-32223
> URL: https://issues.apache.org/jira/browse/SPARK-32223
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> The semantics of this will be discussed and added soon.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33461) Foundational work for propagating SPARK_CONF_DIR

2020-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33461:
--
Parent: (was: SPARK-30985)
Issue Type: Improvement  (was: Sub-task)

> Foundational work for propagating SPARK_CONF_DIR
> 
>
> Key: SPARK-33461
> URL: https://issues.apache.org/jira/browse/SPARK-33461
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
> Fix For: 3.1.0
>
>
> Foundational work for propagating SPARK_CONF_DIR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32221) Avoid possible errors due to incorrect file size or type supplied in spark conf.

2020-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32221:
--
Parent: SPARK-33461  (was: SPARK-30985)

> Avoid possible errors due to incorrect file size or type supplied in spark 
> conf.
> 
>
> Key: SPARK-32221
> URL: https://issues.apache.org/jira/browse/SPARK-32221
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> This would avoid failures in case the files are a bit large or a user places 
> a binary file inside the SPARK_CONF_DIR, neither of which is supported at the 
> moment.
> The reason is that the underlying etcd store limits the size of each entry to 
> only 1 MiB. Once etcd is upgraded in all the popular k8s clusters, we can hope 
> to overcome this limitation; e.g., the etcd version documented at 
> [https://etcd.io/docs/v3.4.0/dev-guide/limit/] allows a higher limit on each 
> entry.
> Even if that does not happen, there are other ways to overcome this 
> limitation; for example, we can have config files split across multiple 
> ConfigMaps. That needs to be discussed and prioritised; this issue takes the 
> straightforward approach of skipping files that cannot be accommodated within 
> the 1 MiB limit and warning the user about it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32222) Add integration tests

2020-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32222:
--
Parent: SPARK-33461  (was: SPARK-30985)

> Add integration tests
> -
>
> Key: SPARK-32222
> URL: https://issues.apache.org/jira/browse/SPARK-32222
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> An integration test that places a configuration file in SPARK_CONF_DIR and 
> verifies it is loaded on the executors in both client and cluster deploy 
> modes. 
> For this, a log4j.properties file is a good candidate for testing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33461) Foundational work for propagating SPARK_CONF_DIR

2020-11-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232860#comment-17232860
 ] 

Dongjoon Hyun commented on SPARK-33461:
---

The PR is merged by SPARK-30985, not SPARK-33461.

The JIRA ID is important, [~prashant]. You can make this issue an umbrella 
instead.

> Foundational work for propagating SPARK_CONF_DIR
> 
>
> Key: SPARK-33461
> URL: https://issues.apache.org/jira/browse/SPARK-33461
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
> Fix For: 3.1.0
>
>
> Foundational work for propagating SPARK_CONF_DIR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-33461) Foundational work for propagating SPARK_CONF_DIR

2020-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-33461:
---

> Foundational work for propagating SPARK_CONF_DIR
> 
>
> Key: SPARK-33461
> URL: https://issues.apache.org/jira/browse/SPARK-33461
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
> Fix For: 3.1.0
>
>
> Foundational work for propagating SPARK_CONF_DIR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.

2020-11-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232852#comment-17232852
 ] 

Dongjoon Hyun commented on SPARK-30985:
---

Hi, [~prashant]. You should not reopen this. We cannot change the commit log.

If you really need an umbrella JIRA, please create one and move this and all 
subtasks into that new umbrella JIRA.

> Propagate SPARK_CONF_DIR files to driver and exec pods.
> ---
>
> Key: SPARK-30985
> URL: https://issues.apache.org/jira/browse/SPARK-30985
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
> Fix For: 3.1.0
>
>
> SPARK_CONF_DIR hosts configuration files like, 
>  1) spark-defaults.conf - containing all the spark properties.
>  2) log4j.properties - Logger configuration.
>  3) spark-env.sh - Environment variables to be setup at driver and executor.
>  4) core-site.xml - Hadoop related configuration.
>  5) fairscheduler.xml - Spark's fair scheduling policy at the job level.
>  6) metrics.properties - Spark metrics.
>  7) Any user specific - library or framework specific configuration file.
> Traditionally, SPARK_CONF_DIR has been the home to all user-specific 
> configuration files.
> So this feature will let user-specific configuration files be mounted in 
> the driver and executor pods' SPARK_CONF_DIR.
> Please review the attached design doc, for more details.
>  
> [Google docs link|https://bit.ly/spark-30985]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.

2020-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30985.
---
  Assignee: Prashant Sharma
Resolution: Fixed

> Propagate SPARK_CONF_DIR files to driver and exec pods.
> ---
>
> Key: SPARK-30985
> URL: https://issues.apache.org/jira/browse/SPARK-30985
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
> Fix For: 3.1.0
>
>
> SPARK_CONF_DIR hosts configuration files like, 
>  1) spark-defaults.conf - containing all the spark properties.
>  2) log4j.properties - Logger configuration.
>  3) spark-env.sh - Environment variables to be setup at driver and executor.
>  4) core-site.xml - Hadoop related configuration.
>  5) fairscheduler.xml - Spark's fair scheduling policy at the job level.
>  6) metrics.properties - Spark metrics.
>  7) Any user specific - library or framework specific configuration file.
> Traditionally, SPARK_CONF_DIR has been the home to all user-specific 
> configuration files.
> So this feature will let user-specific configuration files be mounted in 
> the driver and executor pods' SPARK_CONF_DIR.
> Please review the attached design doc, for more details.
>  
> [Google docs link|https://bit.ly/spark-30985]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33389) make internal classes of SparkSession always using active SQLConf

2020-11-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33389.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30299
[https://github.com/apache/spark/pull/30299]

> make internal classes of SparkSession always using active SQLConf
> -
>
> Key: SPARK-33389
> URL: https://issues.apache.org/jira/browse/SPARK-33389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lu Lu
>Assignee: Lu Lu
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33389) make internal classes of SparkSession always using active SQLConf

2020-11-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33389:
---

Assignee: Lu Lu

> make internal classes of SparkSession always using active SQLConf
> -
>
> Key: SPARK-33389
> URL: https://issues.apache.org/jira/browse/SPARK-33389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lu Lu
>Assignee: Lu Lu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33143) Make SocketAuthServer socket timeout configurable

2020-11-16 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232824#comment-17232824
 ] 

Gabor Somogyi commented on SPARK-33143:
---

[~mszurap] the OS and network guys are still working on it but one thing seems 
sure.

It has nothing to do with the RDD size. It's reproducible w/ relatively small 
RDDs.

> Make SocketAuthServer socket timeout configurable
> -
>
> Key: SPARK-33143
> URL: https://issues.apache.org/jira/browse/SPARK-33143
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Miklos Szurap
>Priority: Major
>
> In SPARK-21551 the socket timeout for the Pyspark applications has been 
> increased from 3 to 15 seconds. However, it is still hardcoded.
> In certain situations even the 15 seconds is not enough, so it should be made 
> configurable. 
> This is requested after seeing it in real-life workload failures.
> Also it has been suggested and requested in an earlier comment in 
> [SPARK-18649|https://issues.apache.org/jira/browse/SPARK-18649?focusedCommentId=16493498=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16493498]
> In 
> Spark 2.4 it is under
> [PythonRDD.scala|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L899]
> in Spark 3.x the code has been moved to
> [SocketAuthServer.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/security/SocketAuthServer.scala#L51]
> {code}
> serverSocket.setSoTimeout(15000)
> {code}
> Please include this in both 2.4 and 3.x branches.
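
A hedged sketch of what the configurable variant might look like; the config key name below is an assumption for illustration, and the real name is whatever the linked pull request settles on.

{code:scala}
import java.net.ServerSocket
import org.apache.spark.SparkConf

// Sketch only: read the timeout from a config key instead of hardcoding 15 s.
// "spark.python.authSocketTimeout" is an assumed name, not a confirmed one.
val conf = new SparkConf()
val timeoutMs = conf.getTimeAsMs("spark.python.authSocketTimeout", "15s")

val serverSocket = new ServerSocket(0)
serverSocket.setSoTimeout(timeoutMs.toInt)
{code}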



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33143) Make SocketAuthServer socket timeout configurable

2020-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232818#comment-17232818
 ] 

Apache Spark commented on SPARK-33143:
--

User 'gaborgsomogyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/30389

> Make SocketAuthServer socket timeout configurable
> -
>
> Key: SPARK-33143
> URL: https://issues.apache.org/jira/browse/SPARK-33143
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Miklos Szurap
>Priority: Major
>
> In SPARK-21551 the socket timeout for the Pyspark applications has been 
> increased from 3 to 15 seconds. However, it is still hardcoded.
> In certain situations even the 15 seconds is not enough, so it should be made 
> configurable. 
> This is requested after seeing it in real-life workload failures.
> Also it has been suggested and requested in an earlier comment in 
> [SPARK-18649|https://issues.apache.org/jira/browse/SPARK-18649?focusedCommentId=16493498=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16493498]
> In 
> Spark 2.4 it is under
> [PythonRDD.scala|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L899]
> in Spark 3.x the code has been moved to
> [SocketAuthServer.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/security/SocketAuthServer.scala#L51]
> {code}
> serverSocket.setSoTimeout(15000)
> {code}
> Please include this in both 2.4 and 3.x branches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33143) Make SocketAuthServer socket timeout configurable

2020-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33143:


Assignee: (was: Apache Spark)

> Make SocketAuthServer socket timeout configurable
> -
>
> Key: SPARK-33143
> URL: https://issues.apache.org/jira/browse/SPARK-33143
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Miklos Szurap
>Priority: Major
>
> In SPARK-21551 the socket timeout for the Pyspark applications has been 
> increased from 3 to 15 seconds. However, it is still hardcoded.
> In certain situations even the 15 seconds is not enough, so it should be made 
> configurable. 
> This is requested after seeing it in real-life workload failures.
> Also it has been suggested and requested in an earlier comment in 
> [SPARK-18649|https://issues.apache.org/jira/browse/SPARK-18649?focusedCommentId=16493498=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16493498]
> In 
> Spark 2.4 it is under
> [PythonRDD.scala|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L899]
> in Spark 3.x the code has been moved to
> [SocketAuthServer.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/security/SocketAuthServer.scala#L51]
> {code}
> serverSocket.setSoTimeout(15000)
> {code}
> Please include this in both 2.4 and 3.x branches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33143) Make SocketAuthServer socket timeout configurable

2020-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33143:


Assignee: Apache Spark

> Make SocketAuthServer socket timeout configurable
> -
>
> Key: SPARK-33143
> URL: https://issues.apache.org/jira/browse/SPARK-33143
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Miklos Szurap
>Assignee: Apache Spark
>Priority: Major
>
> In SPARK-21551 the socket timeout for the Pyspark applications has been 
> increased from 3 to 15 seconds. However, it is still hardcoded.
> In certain situations even the 15 seconds is not enough, so it should be made 
> configurable. 
> This is requested after seeing it in real-life workload failures.
> Also it has been suggested and requested in an earlier comment in 
> [SPARK-18649|https://issues.apache.org/jira/browse/SPARK-18649?focusedCommentId=16493498=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16493498]
> In 
> Spark 2.4 it is under
> [PythonRDD.scala|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L899]
> in Spark 3.x the code has been moved to
> [SocketAuthServer.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/security/SocketAuthServer.scala#L51]
> {code}
> serverSocket.setSoTimeout(15000)
> {code}
> Please include this in both 2.4 and 3.x branches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32222) Add integration tests

2020-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32222:


Assignee: (was: Apache Spark)

> Add integration tests
> -
>
> Key: SPARK-32222
> URL: https://issues.apache.org/jira/browse/SPARK-32222
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> An integration test that places a configuration file in SPARK_CONF_DIR and 
> verifies it is loaded on the executors in both client and cluster deploy 
> modes. 
> For this, a log4j.properties file is a good candidate for testing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32222) Add integration tests

2020-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232762#comment-17232762
 ] 

Apache Spark commented on SPARK-32222:
--

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/30388

> Add integration tests
> -
>
> Key: SPARK-32222
> URL: https://issues.apache.org/jira/browse/SPARK-32222
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> An integration test that places a configuration file in SPARK_CONF_DIR and 
> verifies it is loaded on the executors in both client and cluster deploy 
> modes. 
> For this, a log4j.properties file is a good candidate for testing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32222) Add integration tests

2020-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32222:


Assignee: Apache Spark

> Add integration tests
> -
>
> Key: SPARK-32222
> URL: https://issues.apache.org/jira/browse/SPARK-32222
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Apache Spark
>Priority: Major
>
> An integration test that places a configuration file in SPARK_CONF_DIR and 
> verifies it is loaded on the executors in both client and cluster deploy 
> modes. 
> For this, a log4j.properties file is a good candidate for testing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17914) Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp

2020-11-16 Thread Pablo Cocko (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-17914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232671#comment-17232671
 ] 

Pablo Cocko commented on SPARK-17914:
-

I think that the solution only applies if there are more than 6 fractional-second 
digits.

https://github.com/apache/spark/pull/18252/commits/2f232a7bda28fb42759ee35923044f886a1ff19e

> Spark SQL casting to TimestampType with nanosecond results in incorrect 
> timestamp
> -
>
> Key: SPARK-17914
> URL: https://issues.apache.org/jira/browse/SPARK-17914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Oksana Romankova
>Assignee: Anton Okolnychyi
>Priority: Major
> Fix For: 2.2.0, 2.3.0
>
>
> In some cases, when timestamps contain nanoseconds they will be parsed 
> incorrectly. 
> Examples: 
> "2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.034567"
> "2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.345678"
> The issue seems to be happening in DateTimeUtils.stringToTimestamp(), which 
> assumes that only a 6-digit fraction of a second will be passed.
> Given that, I would suggest either discarding nanoseconds automatically, or 
> throwing an exception prompting the user to pre-format timestamps to 
> microsecond precision before casting to TimestampType.
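
For reference, a small Scala snippet that exercises the cast described above; how
the extra fractional digits are handled depends on the Spark version, so the
output is shown rather than asserted.
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Sketch only: cast strings with more than six fractional-second digits to
// TimestampType and inspect the result. On affected versions the extra digits
// shift the microsecond value; later versions truncate to microseconds.
val spark = SparkSession.builder().appName("nanos-cast").master("local[1]").getOrCreate()
import spark.implicits._

Seq("2016-05-14T15:12:14.0034567Z", "2016-05-14T15:12:14.000345678Z")
  .toDF("ts_string")
  .select(col("ts_string"), col("ts_string").cast("timestamp").as("ts"))
  .show(truncate = false)
{code}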



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33443) LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ]

2020-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33443:


Assignee: (was: Apache Spark)

> LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ]
> 
>
> Key: SPARK-33443
> URL: https://issues.apache.org/jira/browse/SPARK-33443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> The current implementation of LEAD/LAG does not support IGNORE/RESPECT NULLS, but 
> mainstream databases support this syntax.
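
A short Scala sketch of the requested syntax against a tiny in-memory table; on
Spark versions that predate this change the IGNORE NULLS clause fails to parse,
so this illustrates the desired behaviour rather than a query that runs
everywhere. The table and column names are made up for the example.
{code}
import org.apache.spark.sql.SparkSession

// Sketch only: IGNORE NULLS makes LAG skip NULL rows when picking the
// preceding value for each row.
val spark = SparkSession.builder().appName("lag-ignore-nulls").master("local[1]").getOrCreate()
import spark.implicits._

Seq((1, Some(10)), (2, None), (3, Some(30)), (4, None))
  .toDF("ts", "value")
  .createOrReplaceTempView("events")

spark.sql(
  """SELECT ts,
    |       LAG(value) IGNORE NULLS OVER (ORDER BY ts) AS prev_non_null
    |FROM events
    |ORDER BY ts""".stripMargin
).show()
{code}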



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33443) LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ]

2020-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232664#comment-17232664
 ] 

Apache Spark commented on SPARK-33443:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/30387

> LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ]
> 
>
> Key: SPARK-33443
> URL: https://issues.apache.org/jira/browse/SPARK-33443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> The current implementation of LEAD/LAG does not support IGNORE/RESPECT NULLS, but 
> mainstream databases support this syntax.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33443) LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ]

2020-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33443:


Assignee: Apache Spark

> LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ]
> 
>
> Key: SPARK-33443
> URL: https://issues.apache.org/jira/browse/SPARK-33443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> The current implementation of LEAD/LAG does not support IGNORE/RESPECT NULLS, but 
> mainstream databases support this syntax.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33443) LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ]

2020-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232665#comment-17232665
 ] 

Apache Spark commented on SPARK-33443:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/30387

> LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ]
> 
>
> Key: SPARK-33443
> URL: https://issues.apache.org/jira/browse/SPARK-33443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> The current implementation of LEAD/LAG does not support IGNORE/RESPECT NULLS, but 
> mainstream databases support this syntax.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33443) LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ]

2020-11-16 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-33443:
---
Description: The current implementation of LEAD/LAG does not support IGNORE/RESPECT 
NULLS, but mainstream databases support this syntax.  (was: The current 
implement of LEAD/LAG could not support IGNORE/RESPECT NULLS, but the 
mainstream database support this syntax.)

> LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ]
> 
>
> Key: SPARK-33443
> URL: https://issues.apache.org/jira/browse/SPARK-33443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> The current implementation of LEAD/LAG does not support IGNORE/RESPECT NULLS, but 
> mainstream databases support this syntax.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33460) Accessing map values should fail if key is not found.

2020-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33460:


Assignee: Apache Spark

> Accessing map values should fail if key is not found.
> -
>
> Key: SPARK-33460
> URL: https://issues.apache.org/jira/browse/SPARK-33460
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Assignee: Apache Spark
>Priority: Major
>
> When ANSI mode is enabled, accessing map values should fail with an exception if 
> the key does not exist, but currently it returns null.
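
A minimal Scala sketch of the behaviour under discussion, assuming it is governed
by the existing spark.sql.ansi.enabled flag; on builds without this change the
lookup still returns NULL even with ANSI mode on.
{code}
import org.apache.spark.sql.SparkSession

// Sketch only: with ANSI mode enabled, looking up a key that is not present in
// the map is expected to raise an error instead of silently returning NULL.
val spark = SparkSession.builder().appName("ansi-map-access").master("local[1]").getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")

// Key 'b' is absent: expected to fail under ANSI mode, NULL otherwise.
spark.sql("SELECT map('a', 1)['b'] AS missing_value").show()
{code}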



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33460) Accessing map values should fail if key is not found.

2020-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232658#comment-17232658
 ] 

Apache Spark commented on SPARK-33460:
--

User 'leanken' has created a pull request for this issue:
https://github.com/apache/spark/pull/30386

> Accessing map values should fail if key is not found.
> -
>
> Key: SPARK-33460
> URL: https://issues.apache.org/jira/browse/SPARK-33460
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> When ANSI mode is enabled, accessing map values should fail with an exception if 
> the key does not exist, but currently it returns null.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33460) Accessing map values should fail if key is not found.

2020-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33460:


Assignee: (was: Apache Spark)

> Accessing map values should fail if key is not found.
> -
>
> Key: SPARK-33460
> URL: https://issues.apache.org/jira/browse/SPARK-33460
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> When ANSI mode is enabled, accessing map values should fail with an exception if 
> the key does not exist, but currently it returns null.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.

2020-11-16 Thread Prashant Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232631#comment-17232631
 ] 

Prashant Sharma edited comment on SPARK-30985 at 11/16/20, 9:21 AM:


Reopening this JIRA, as it was closed because I incorrectly created a PR 
targeting the umbrella JIRA instead of the subtask: 


was (Author: prashant_):
Reopening this JIRA as, this is Umbrella jira and I created a PR incorrectly 
targeting the umbrella JIRA instead of the subtask : 

> Propagate SPARK_CONF_DIR files to driver and exec pods.
> ---
>
> Key: SPARK-30985
> URL: https://issues.apache.org/jira/browse/SPARK-30985
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
> Fix For: 3.1.0
>
>
> SPARK_CONF_DIR hosts configuration files such as:
>  1) spark-defaults.conf - containing all the Spark properties.
>  2) log4j.properties - logger configuration.
>  3) spark-env.sh - environment variables to be set up on the driver and executors.
>  4) core-site.xml - Hadoop-related configuration.
>  5) fairscheduler.xml - Spark's fair scheduling policy at the job level.
>  6) metrics.properties - Spark metrics.
>  7) Any user-specific, library-specific, or framework-specific configuration file.
> Traditionally, SPARK_CONF_DIR has been the home of all user-specific 
> configuration files.
> This feature will let these user-specific configuration files be mounted on 
> the driver and executor pods' SPARK_CONF_DIR.
> Please review the attached design doc for more details.
>  
> [Google docs link|https://bit.ly/spark-30985]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.

2020-11-16 Thread Prashant Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma reassigned SPARK-30985:
---

Assignee: (was: Prashant Sharma)

> Propagate SPARK_CONF_DIR files to driver and exec pods.
> ---
>
> Key: SPARK-30985
> URL: https://issues.apache.org/jira/browse/SPARK-30985
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
> Fix For: 3.1.0
>
>
> SPARK_CONF_DIR hosts configuration files such as:
>  1) spark-defaults.conf - containing all the Spark properties.
>  2) log4j.properties - logger configuration.
>  3) spark-env.sh - environment variables to be set up on the driver and executors.
>  4) core-site.xml - Hadoop-related configuration.
>  5) fairscheduler.xml - Spark's fair scheduling policy at the job level.
>  6) metrics.properties - Spark metrics.
>  7) Any user-specific, library-specific, or framework-specific configuration file.
> Traditionally, SPARK_CONF_DIR has been the home of all user-specific 
> configuration files.
> This feature will let these user-specific configuration files be mounted on 
> the driver and executor pods' SPARK_CONF_DIR.
> Please review the attached design doc for more details.
>  
> [Google docs link|https://bit.ly/spark-30985]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33461) Foundational work for propagating SPARK_CONF_DIR

2020-11-16 Thread Prashant Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma reassigned SPARK-33461:
---

Assignee: Prashant Sharma

> Foundational work for propagating SPARK_CONF_DIR
> 
>
> Key: SPARK-33461
> URL: https://issues.apache.org/jira/browse/SPARK-33461
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
>
> Foundational work for propagating SPARK_CONF_DIR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33461) Foundational work for propagating SPARK_CONF_DIR

2020-11-16 Thread Prashant Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma resolved SPARK-33461.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

> Foundational work for propagating SPARK_CONF_DIR
> 
>
> Key: SPARK-33461
> URL: https://issues.apache.org/jira/browse/SPARK-33461
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
> Fix For: 3.1.0
>
>
> Foundational work for propagating SPARK_CONF_DIR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33461) Foundational work for propagating SPARK_CONF_DIR

2020-11-16 Thread Prashant Sharma (Jira)
Prashant Sharma created SPARK-33461:
---

 Summary: Foundational work for propagating SPARK_CONF_DIR
 Key: SPARK-33461
 URL: https://issues.apache.org/jira/browse/SPARK-33461
 Project: Spark
  Issue Type: Sub-task
  Components: Kubernetes
Affects Versions: 3.1.0
Reporter: Prashant Sharma


Foundational work for propagating SPARK_CONF_DIR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.

2020-11-16 Thread Prashant Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma reopened SPARK-30985:
-

Reopening this JIRA, as this is an umbrella JIRA and I incorrectly created a PR 
targeting the umbrella JIRA instead of the subtask: 

> Propagate SPARK_CONF_DIR files to driver and exec pods.
> ---
>
> Key: SPARK-30985
> URL: https://issues.apache.org/jira/browse/SPARK-30985
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
> Fix For: 3.1.0
>
>
> SPARK_CONF_DIR hosts configuration files such as:
>  1) spark-defaults.conf - containing all the Spark properties.
>  2) log4j.properties - logger configuration.
>  3) spark-env.sh - environment variables to be set up on the driver and executors.
>  4) core-site.xml - Hadoop-related configuration.
>  5) fairscheduler.xml - Spark's fair scheduling policy at the job level.
>  6) metrics.properties - Spark metrics.
>  7) Any user-specific, library-specific, or framework-specific configuration file.
> Traditionally, SPARK_CONF_DIR has been the home of all user-specific 
> configuration files.
> This feature will let these user-specific configuration files be mounted on 
> the driver and executor pods' SPARK_CONF_DIR.
> Please review the attached design doc for more details.
>  
> [Google docs link|https://bit.ly/spark-30985]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.

2020-11-16 Thread Prashant Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232628#comment-17232628
 ] 

Prashant Sharma commented on SPARK-30985:
-

[~dongjoon] Hmm, my mistake. As you said, I added the subtasks after creating the 
PR and this JIRA. I will re-open this JIRA, create a subtask, and resolve it.

> Propagate SPARK_CONF_DIR files to driver and exec pods.
> ---
>
> Key: SPARK-30985
> URL: https://issues.apache.org/jira/browse/SPARK-30985
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
> Fix For: 3.1.0
>
>
> SPARK_CONF_DIR hosts configuration files such as:
>  1) spark-defaults.conf - containing all the Spark properties.
>  2) log4j.properties - logger configuration.
>  3) spark-env.sh - environment variables to be set up on the driver and executors.
>  4) core-site.xml - Hadoop-related configuration.
>  5) fairscheduler.xml - Spark's fair scheduling policy at the job level.
>  6) metrics.properties - Spark metrics.
>  7) Any user-specific, library-specific, or framework-specific configuration file.
> Traditionally, SPARK_CONF_DIR has been the home of all user-specific 
> configuration files.
> This feature will let these user-specific configuration files be mounted on 
> the driver and executor pods' SPARK_CONF_DIR.
> Please review the attached design doc for more details.
>  
> [Google docs link|https://bit.ly/spark-30985]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.

2020-11-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232601#comment-17232601
 ] 

Dongjoon Hyun commented on SPARK-30985:
---

[~prashant]. You should not make a PR with the umbrella JIRA. The umbrella Jira 
is resolved when all the subtask JIRAs are resolved. If you make a PR with the 
umbrella Jira ID, the Spark merge script resolves it.

 

Since this seems to have been created a long time ago, the Jira management is up 
to you. The above comment is just my suggestion.

> Propagate SPARK_CONF_DIR files to driver and exec pods.
> ---
>
> Key: SPARK-30985
> URL: https://issues.apache.org/jira/browse/SPARK-30985
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
> Fix For: 3.1.0
>
>
> SPARK_CONF_DIR hosts configuration files such as:
>  1) spark-defaults.conf - containing all the Spark properties.
>  2) log4j.properties - logger configuration.
>  3) spark-env.sh - environment variables to be set up on the driver and executors.
>  4) core-site.xml - Hadoop-related configuration.
>  5) fairscheduler.xml - Spark's fair scheduling policy at the job level.
>  6) metrics.properties - Spark metrics.
>  7) Any user-specific, library-specific, or framework-specific configuration file.
> Traditionally, SPARK_CONF_DIR has been the home of all user-specific 
> configuration files.
> This feature will let these user-specific configuration files be mounted on 
> the driver and executor pods' SPARK_CONF_DIR.
> Please review the attached design doc for more details.
>  
> [Google docs link|https://bit.ly/spark-30985]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.

2020-11-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232601#comment-17232601
 ] 

Dongjoon Hyun edited comment on SPARK-30985 at 11/16/20, 8:14 AM:
--

[~prashant]. You should not make a PR with the umbrella JIRA. The umbrella Jira 
should be resolved when all the subtask JIRAs are resolved. However, if you 
make a PR with the umbrella Jira ID, the Spark merge script resolves it.

 

Since this seems to have been created a long time ago, the Jira management is up 
to you. The above comment is just my suggestion.


was (Author: dongjoon):
[~prashant]. You should not make a PR with the umbrella JIRA. The umbrella Jira 
is resolved when the all subtask JIRAs are resolved. If you make a PR with the 
umbrella Jira ID, Spark merge script resolves it.

 

Since this seems to be created long time ago, the Jira management is up to you. 
The above comment is just my suggestion.

> Propagate SPARK_CONF_DIR files to driver and exec pods.
> ---
>
> Key: SPARK-30985
> URL: https://issues.apache.org/jira/browse/SPARK-30985
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
> Fix For: 3.1.0
>
>
> SPARK_CONF_DIR hosts configuration files such as:
>  1) spark-defaults.conf - containing all the Spark properties.
>  2) log4j.properties - logger configuration.
>  3) spark-env.sh - environment variables to be set up on the driver and executors.
>  4) core-site.xml - Hadoop-related configuration.
>  5) fairscheduler.xml - Spark's fair scheduling policy at the job level.
>  6) metrics.properties - Spark metrics.
>  7) Any user-specific, library-specific, or framework-specific configuration file.
> Traditionally, SPARK_CONF_DIR has been the home of all user-specific 
> configuration files.
> This feature will let these user-specific configuration files be mounted on 
> the driver and executor pods' SPARK_CONF_DIR.
> Please review the attached design doc for more details.
>  
> [Google docs link|https://bit.ly/spark-30985]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.

2020-11-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232600#comment-17232600
 ] 

Dongjoon Hyun edited comment on SPARK-30985 at 11/16/20, 8:08 AM:
--

~It's a little weird to convert a merged JIRA into an umbrella.~ Hmm. It seems 
that I was confused about the history.


was (Author: dongjoon):
It's a little weird to convert a merged JIRA into an umbrella.

> Propagate SPARK_CONF_DIR files to driver and exec pods.
> ---
>
> Key: SPARK-30985
> URL: https://issues.apache.org/jira/browse/SPARK-30985
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
> Fix For: 3.1.0
>
>
> SPARK_CONF_DIR hosts configuration files such as:
>  1) spark-defaults.conf - containing all the Spark properties.
>  2) log4j.properties - logger configuration.
>  3) spark-env.sh - environment variables to be set up on the driver and executors.
>  4) core-site.xml - Hadoop-related configuration.
>  5) fairscheduler.xml - Spark's fair scheduling policy at the job level.
>  6) metrics.properties - Spark metrics.
>  7) Any user-specific, library-specific, or framework-specific configuration file.
> Traditionally, SPARK_CONF_DIR has been the home of all user-specific 
> configuration files.
> This feature will let these user-specific configuration files be mounted on 
> the driver and executor pods' SPARK_CONF_DIR.
> Please review the attached design doc for more details.
>  
> [Google docs link|https://bit.ly/spark-30985]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


