[jira] [Commented] (SPARK-24554) Add MapType Support for Arrow in PySpark
[ https://issues.apache.org/jira/browse/SPARK-24554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233307#comment-17233307 ]

Apache Spark commented on SPARK-24554:
--------------------------------------

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/30393

> Add MapType Support for Arrow in PySpark
> ----------------------------------------
>
> Key: SPARK-24554
> URL: https://issues.apache.org/jira/browse/SPARK-24554
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark, SQL
> Affects Versions: 2.3.1
> Reporter: Bryan Cutler
> Priority: Major
> Labels: bulk-closed
>
> Add support for MapType in Arrow related classes in Scala/Java and pyarrow
> functionality in Python.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24554) Add MapType Support for Arrow in PySpark
[ https://issues.apache.org/jira/browse/SPARK-24554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24554: Assignee: Apache Spark
[jira] [Assigned] (SPARK-24554) Add MapType Support for Arrow in PySpark
[ https://issues.apache.org/jira/browse/SPARK-24554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24554: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-33465) RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead
L. C. Hsieh created SPARK-33465:
-----------------------------------

Summary: RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead
Key: SPARK-33465
URL: https://issues.apache.org/jira/browse/SPARK-33465
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.1.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh

The {{RDD.takeOrdered}} API sorts the elements in each partition and puts the sorted elements into a `BoundedPriorityQueue`. So in the resulting RDD each partition holds only one priority queue. The API then calls {{RDD.reduce}} to reduce the elements. But since, as mentioned, the RDD has only one queue in each partition, it doesn't make sense to call reduce to reduce the elements (here each element is a queue).

We should either simplify the {{RDD.reduce}} call in {{RDD.takeOrdered}} or replace it with {{treeReduce}}, which can actually perform partial reduction in this case.
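The idea in the description above can be sketched in plain Python. This is only an illustration under stated assumptions, not Spark's Scala implementation: partitions are modelled as ordinary lists, a sorted list of at most k items stands in for `BoundedPriorityQueue`, and the helper names (`top_k_per_partition`, `merge_queues`, `flat_reduce`, `tree_reduce`) are hypothetical. It contrasts merging the per-partition queues one after another with merging them pairwise, level by level.

```python
import heapq
from functools import reduce


def top_k_per_partition(partition, k):
    """Return the k smallest elements of one partition, sorted ascending.

    Stands in for the single bounded queue that each partition produces.
    """
    return heapq.nsmallest(k, partition)


def merge_queues(left, right, k):
    """Merge two sorted, bounded lists and keep only the k smallest items."""
    return list(heapq.merge(left, right))[:k]


def flat_reduce(queues, k):
    """Merge every queue sequentially in one place (one queue per partition)."""
    if not queues:
        return []
    return reduce(lambda a, b: merge_queues(a, b, k), queues)


def tree_reduce(queues, k):
    """Merge queues pairwise, level by level, so partial merges can be spread out."""
    level = list(queues)
    if not level:
        return []
    while len(level) > 1:
        merged = [
            merge_queues(level[i], level[i + 1], k)
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2 == 1:
            # An unpaired queue is carried over to the next level unchanged.
            merged.append(level[-1])
        level = merged
    return level[0]


if __name__ == "__main__":
    partitions = [[9, 1, 5], [7, 3], [8, 2, 6], [4, 0]]
    queues = [top_k_per_partition(p, 3) for p in partitions]
    print(flat_reduce(queues, 3))
    print(tree_reduce(queues, 3))
```

Both strategies return the same result; the difference is only where the merging work happens, which is the point the issue raises.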
[jira] [Assigned] (SPARK-33465) RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead
[ https://issues.apache.org/jira/browse/SPARK-33465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33465: Assignee: L. C. Hsieh (was: Apache Spark)
[jira] [Commented] (SPARK-33465) RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead
[ https://issues.apache.org/jira/browse/SPARK-33465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233285#comment-17233285 ] Apache Spark commented on SPARK-33465: User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/30392
[jira] [Assigned] (SPARK-33465) RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead
[ https://issues.apache.org/jira/browse/SPARK-33465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33465: Assignee: Apache Spark (was: L. C. Hsieh)
[jira] [Commented] (SPARK-33465) RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead
[ https://issues.apache.org/jira/browse/SPARK-33465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233284#comment-17233284 ] Apache Spark commented on SPARK-33465: User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/30392
[jira] [Comment Edited] (SPARK-33395) Spark reading data in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233279#comment-17233279 ]

Nilesh Patil edited comment on SPARK-33395 at 11/17/20, 5:57 AM:
-----------------------------------------------------------------

In a decimal type, scale and precision are constant for all rows, whereas in our data each row may have a different scale and precision. So decimal will not solve our problem.

was (Author: nileshpatil1992):
In a decimal type, scale and precision are constant for all rows, whereas in our data each row may have a different scale and precision.

> Spark reading data in scientific notation
> -----------------------------------------
>
> Key: SPARK-33395
> URL: https://issues.apache.org/jira/browse/SPARK-33395
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.4, 2.4.8, 3.1.0
> Reporter: Nilesh Patil
> Priority: Major
>
> The file has the data below:
> DAta
> 1200404151072.1211
> 1200404151073
> 1200404151074.1232323
> 1200404151075.124344
> 1200404151076.12
> 1200404151077.12343
> 1200404151078.12
> 1200404151079.12544545454554
> 1251080.123444
> 1
>
> Spark reads the values in scientific notation, but we want to read the data
> exactly as it appears in the file, with an accurate data type rather than as
> strings.
> +--------------------+
> |                DAta|
> +--------------------+
> |1.200404151072121E12|
> |   1.200404151073E12|
> |1.200404151074123...|
> |1.200404151075124...|
> | 1.20040415107612E12|
> |1.200404151077123...|
> | 1.20040415107812E12|
> |1.200404151079125...|
> |      1251080.123445|
> |              1.0E28|
> +--------------------+
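The trade-off discussed in this thread (double is approximate, while a column-wide decimal type fixes one scale for every row) can be illustrated outside Spark. The following is a minimal sketch using Python's standard `decimal` module, not Spark's `DecimalType`; the helper names (`as_float_text`, `as_decimal_text`, `with_fixed_scale`) are hypothetical, and Python renders floats differently from the JVM, so the scientific notation itself is not reproduced here, only the loss of digits.

```python
from decimal import Decimal

# A few sample values copied from the issue description.
RAW_VALUES = [
    "1200404151072.1211",
    "1200404151073",
    "1200404151079.12544545454554",
    "1251080.123444",
]


def as_float_text(raw):
    """Parse as a binary double and render it back; long literals lose digits."""
    return repr(float(raw))


def as_decimal_text(raw):
    """Parse as an arbitrary-precision decimal; each literal keeps its own scale."""
    return str(Decimal(raw))


def with_fixed_scale(raw, scale):
    """Force a single scale on the value, as one column-wide decimal type would."""
    return str(Decimal(raw).quantize(Decimal(1).scaleb(-scale)))


if __name__ == "__main__":
    for raw in RAW_VALUES:
        print(raw, as_float_text(raw), as_decimal_text(raw), with_fixed_scale(raw, 4))
```

With a per-value decimal, every literal round-trips unchanged. With one fixed scale, shorter values are padded with zeros and longer ones are rounded, which is the limitation raised in the comment above.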
[jira] [Commented] (SPARK-33395) Spark reading data in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233279#comment-17233279 ] Nilesh Patil commented on SPARK-33395: In a decimal type, scale and precision are constant for all rows, whereas in our data each row may have a different scale and precision.
[jira] [Commented] (SPARK-33395) Spark reading data in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233276#comment-17233276 ] Takeshi Yamamuro commented on SPARK-33395: How about using a decimal type instead?
[jira] [Commented] (SPARK-33395) Spark reading data in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233271#comment-17233271 ] Nilesh Patil commented on SPARK-33395: Could we have a method that preserves the data type but displays the data in its original form? Something like this. This issue has been reported by three of our clients in production, so we would like it to be treated as high priority.
[jira] [Updated] (SPARK-33395) Spark reading data in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-33395: Issue Type: Improvement (was: Bug)
[jira] [Commented] (SPARK-33395) Spark reading data in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233267#comment-17233267 ] Takeshi Yamamuro commented on SPARK-33395: hm, but we cannot avoid the rounding in this case, I think. Any idea? Either way, I think this is an expected behaviour, so I will change "Bug" -> "Improvement" now.
[jira] [Updated] (SPARK-33395) Spark reading data in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-33395: Priority: Major (was: Critical)
[jira] [Commented] (SPARK-33395) Spark reading data in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233264#comment-17233264 ] Nilesh Patil commented on SPARK-33395: Yes, the inferred type is double, but we want the data exactly as it appears in the file, neither in scientific notation nor rounded.
[jira] [Updated] (SPARK-33395) Spark reading data in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-33395: Affects Version/s: 3.1.0 2.4.8
[jira] [Updated] (SPARK-33395) Spark reading data in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-33395: Component/s: (was: Spark Core) SQL
[jira] [Commented] (SPARK-33395) Spark reading data in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233263#comment-17233263 ] Takeshi Yamamuro commented on SPARK-33395: The inferred type is double, so they are approximate values. What do you suggest here? You think we should use decimal in this case instead?
[jira] [Resolved] (SPARK-33407) Simplify the exception message from Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-33407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-33407.
----------------------------------
Fix Version/s: 3.1.0
Resolution: Fixed

Issue resolved by pull request 30309
[https://github.com/apache/spark/pull/30309]

> Simplify the exception message from Python UDFs
> -----------------------------------------------
>
> Key: SPARK-33407
> URL: https://issues.apache.org/jira/browse/SPARK-33407
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.1.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 3.1.0
>
> Currently, the exception message is as below:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in
>   File "/.../python/pyspark/sql/dataframe.py", line 427, in show
>     print(self._jdf.showString(n, 20, vertical))
>   File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
>   File "/.../python/pyspark/sql/utils.py", line 127, in deco
>     raise_from(converted)
>   File "", line 3, in raise_from
> pyspark.sql.utils.PythonException:
>   An exception was thrown from Python worker in the executor:
> Traceback (most recent call last):
>   File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
>     process()
>   File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
>     serializer.dump_stream(out_iter, outfile)
>   File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
>     self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
>     for obj in iterator:
>   File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
>     for item in iterator:
>   File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
>     result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
>   File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in
>     result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
>   File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in
>     return lambda *a: f(*a)
>   File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
>     return f(*args, **kwargs)
>   File "", line 3, in divide_by_zero
> ZeroDivisionError: division by zero
> {code}
> Actually, in almost all cases, users only care about
> {{ZeroDivisionError: division by zero}}. We don't really have to show the
> internal stuff in 99% of cases.
> We could just make it short, for example,
> {code}
> Traceback (most recent call last):
>   File "", line 1, in
>   File "/.../python/pyspark/sql/dataframe.py", line 427, in show
>     print(self._jdf.showString(n, 20, vertical))
>   File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
>   File "/.../python/pyspark/sql/utils.py", line 127, in deco
>     raise_from(converted)
>   File "", line 3, in raise_from
> pyspark.sql.utils.PythonException:
>   An exception was thrown from Python worker in the executor:
> Traceback (most recent call last):
>   File "", line 3, in divide_by_zero
> ZeroDivisionError: division by zero
> {code}
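The shortening proposed above can be sketched with Python's standard `traceback` module. This is a hedged illustration of the general idea (dropping framework-internal frames and keeping only the user function's frames), not the change made in the pull request. The names `USER_SOURCE`, `load_user_function`, `worker_call` and `simplified_traceback` are hypothetical, and the `"<udf>"` file name is an assumption used here only to tell user code apart from the stand-in "worker" code.

```python
import traceback

# Hypothetical user function, compiled under the file name "<udf>" so that its
# frames can be told apart from the surrounding "framework" frames.
USER_SOURCE = """
def divide_by_zero(x):
    return x / 0
"""


def load_user_function():
    """Compile the user source and return the function object it defines."""
    namespace = {}
    exec(compile(USER_SOURCE, "<udf>", "exec"), namespace)
    return namespace["divide_by_zero"]


def worker_call(func, *args):
    """Stand-in for internal worker code that invokes the user function."""
    return func(*args)


def simplified_traceback(exc, user_filename="<udf>"):
    """Format the exception, keeping only frames that come from user code."""
    frames = [
        frame
        for frame in traceback.extract_tb(exc.__traceback__)
        if frame.filename == user_filename
    ]
    lines = ["Traceback (most recent call last):\n"]
    lines += traceback.format_list(frames)
    lines += traceback.format_exception_only(type(exc), exc)
    return "".join(lines)


if __name__ == "__main__":
    try:
        worker_call(load_user_function(), 1)
    except ZeroDivisionError as error:
        print(simplified_traceback(error))
```

Filtering by file name is only one possible rule; a real implementation would need a reliable way to decide which frames count as internal.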
[jira] [Assigned] (SPARK-33407) Simplify the exception message from Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-33407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33407: Assignee: Hyukjin Kwon
[jira] [Assigned] (SPARK-33464) Add/remove (un)necessary cache and restructure GitHub Actions yaml
[ https://issues.apache.org/jira/browse/SPARK-33464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33464: Assignee: (was: Apache Spark) > Add/remove (un)necessary cache and restructure GitHub Actions yaml > -- > > Key: SPARK-33464 > URL: https://issues.apache.org/jira/browse/SPARK-33464 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, GitHub Actions build has some unnecessary cache/commands. For > example, if you run SBT only .m2 cache is not needed. We should clean up and > re-organize. > Also, we should add {{~/.sbt}} into cache. See > https://github.com/sbt/sbt/issues/3681 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33464) Add/remove (un)necessary cache and restructure GitHub Actions yaml
[ https://issues.apache.org/jira/browse/SPARK-33464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233254#comment-17233254 ] Apache Spark commented on SPARK-33464: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/30391 > Add/remove (un)necessary cache and restructure GitHub Actions yaml > -- > > Key: SPARK-33464 > URL: https://issues.apache.org/jira/browse/SPARK-33464 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, GitHub Actions build has some unnecessary cache/commands. For > example, if you run SBT only .m2 cache is not needed. We should clean up and > re-organize. > Also, we should add {{~/.sbt}} into cache. See > https://github.com/sbt/sbt/issues/3681 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33464) Add/remove (un)necessary cache and restructure GitHub Actions yaml
[ https://issues.apache.org/jira/browse/SPARK-33464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33464: Assignee: Apache Spark > Add/remove (un)necessary cache and restructure GitHub Actions yaml > -- > > Key: SPARK-33464 > URL: https://issues.apache.org/jira/browse/SPARK-33464 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Currently, GitHub Actions build has some unnecessary cache/commands. For > example, if you run SBT only .m2 cache is not needed. We should clean up and > re-organize. > Also, we should add {{~/.sbt}} into cache. See > https://github.com/sbt/sbt/issues/3681 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33464) Add/remove (un)necessary cache and restructure GitHub Actions yaml
[ https://issues.apache.org/jira/browse/SPARK-33464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-33464:
---------------------------------

    Description:
Currently, the GitHub Actions build has some unnecessary caches and commands. For example, if you run only SBT, the .m2 cache is not needed. We should clean up and re-organize.
Also, we should add {{~/.sbt}} to the cache. See https://github.com/sbt/sbt/issues/3681

    was:
Currently, the GitHub Actions build has some unnecessary caches and commands. For example, if you run only SBT, the .m2 cache is not needed. We should clean up and re-organize.

> Add/remove (un)necessary cache and restructure GitHub Actions yaml
> ------------------------------------------------------------------
>
> Key: SPARK-33464
> URL: https://issues.apache.org/jira/browse/SPARK-33464
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra
> Affects Versions: 2.4.8, 3.0.2, 3.1.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> Currently, the GitHub Actions build has some unnecessary caches and commands. For example, if you run only SBT, the .m2 cache is not needed. We should clean up and re-organize.
> Also, we should add {{~/.sbt}} to the cache. See https://github.com/sbt/sbt/issues/3681
[jira] [Updated] (SPARK-33464) Add/remove (un)necessary cache and restructure GitHub Actions yaml
[ https://issues.apache.org/jira/browse/SPARK-33464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33464: - Summary: Add/remove (un)necessary cache and restructure GitHub Actions yaml (was: Remove unnecessary cache and restructure) > Add/remove (un)necessary cache and restructure GitHub Actions yaml > -- > > Key: SPARK-33464 > URL: https://issues.apache.org/jira/browse/SPARK-33464 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, GitHub Actions build has some unnecessary cache/commands. For > example, if you run SBT only .m2 cache is not needed. We should clean up and > re-organize. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33395) Spark reading data in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-33395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233249#comment-17233249 ]

Nilesh Patil commented on SPARK-33395:
--------------------------------------

Hi [~zhangway],

I am expecting output like the below.

DAta
1200404151072.1211
1200404151073
1200404151074.1232323
1200404151075.124344
1200404151076.12
1200404151077.12343
1200404151078.12
1200404151079.12544545454554
1251080.123444
1

The code we are using:

dataset = sparkSession.option("header",true).option("multiLine", true)
    .option("inferSchema",true)
    .csv(filePathSeq);

> Spark reading data in scientific notation
> -----------------------------------------
>
> Key: SPARK-33395
> URL: https://issues.apache.org/jira/browse/SPARK-33395
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.4
> Reporter: Nilesh Patil
> Priority: Critical
>
> The file has the data below:
> DAta
> 1200404151072.1211
> 1200404151073
> 1200404151074.1232323
> 1200404151075.124344
> 1200404151076.12
> 1200404151077.12343
> 1200404151078.12
> 1200404151079.12544545454554
> 1251080.123444
> 1
>
> Spark reads the values in scientific notation, whereas we want to read the data exactly as it appears in the file, with an accurate numeric datatype rather than a string datatype.
> +--------------------+
> |                DAta|
> +--------------------+
> |1.200404151072121E12|
> |   1.200404151073E12|
> |1.200404151074123...|
> |1.200404151075124...|
> | 1.20040415107612E12|
> |1.200404151077123...|
> | 1.20040415107812E12|
> |1.200404151079125...|
> |      1251080.123445|
> |              1.0E28|
> +--------------------+
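As background for the report above, the sketch below uses plain Python (not Spark) to show why a column inferred as a double cannot hold every value in the sample file exactly, while a decimal type can. The helper name survives_as_double is an assumption made for this illustration. Whether an explicit schema with a decimal column is the right fix for this ticket is not confirmed anywhere in the thread.

```python
from decimal import Decimal


def survives_as_double(raw):
    """Return True if the text value is unchanged after a trip through a 64-bit float."""
    as_double = float(raw)  # what a double column would store
    return Decimal(repr(as_double)) == Decimal(raw)


if __name__ == "__main__":
    for raw in ["1200404151073", "1200404151079.12544545454554"]:
        print(raw, "-> double keeps value:", survives_as_double(raw),
              "| decimal keeps text:", str(Decimal(raw)) == raw)
```

A double carries roughly 15 to 17 significant decimal digits, so the 27-digit value cannot round-trip. The scientific notation in the table is only how the double is printed.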
[jira] [Created] (SPARK-33464) Remove unnecessary cache and restructure
Hyukjin Kwon created SPARK-33464:
---------------------------------

    Summary: Remove unnecessary cache and restructure
    Key: SPARK-33464
    URL: https://issues.apache.org/jira/browse/SPARK-33464
    Project: Spark
    Issue Type: Sub-task
    Components: Project Infra
    Affects Versions: 2.4.8, 3.0.2, 3.1.0
    Reporter: Hyukjin Kwon

Currently, the GitHub Actions build has some unnecessary caches and commands. For example, if you run only SBT, the .m2 cache is not needed. We should clean up and re-organize.
[jira] [Updated] (SPARK-33454) Add GitHub Action job for Hadoop 2
[ https://issues.apache.org/jira/browse/SPARK-33454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33454: - Parent: SPARK-32244 Issue Type: Sub-task (was: New Feature) > Add GitHub Action job for Hadoop 2 > -- > > Key: SPARK-33454 > URL: https://issues.apache.org/jira/browse/SPARK-33454 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > > > This issue aims to prevent accidental compilation error with Hadoop 2 profile -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.
[ https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233227#comment-17233227 ]

Prashant Sharma commented on SPARK-30985:
-----------------------------------------

Thanks [~dongjoon], you have resolved the confusion I had. Indeed, this is what was intended.

> Propagate SPARK_CONF_DIR files to driver and exec pods.
> -------------------------------------------------------
>
> Key: SPARK-30985
> URL: https://issues.apache.org/jira/browse/SPARK-30985
> Project: Spark
> Issue Type: Sub-task
> Components: Kubernetes
> Affects Versions: 3.1.0
> Reporter: Prashant Sharma
> Assignee: Prashant Sharma
> Priority: Major
> Fix For: 3.1.0
>
> SPARK_CONF_DIR hosts configuration files such as:
> 1) spark-defaults.conf - containing all the Spark properties.
> 2) log4j.properties - Logger configuration.
> 3) spark-env.sh - Environment variables to be set up at the driver and executors.
> 4) core-site.xml - Hadoop-related configuration.
> 5) fairscheduler.xml - Spark's fair scheduling policy at the job level.
> 6) metrics.properties - Spark metrics.
> 7) Any user-specific, library-specific, or framework-specific configuration file.
> Traditionally, SPARK_CONF_DIR has been the home of all user-specific configuration files.
> So this feature will let the user-specific configuration files be mounted on the driver and executor pods' SPARK_CONF_DIR.
> Please review the attached design doc for more details.
>
> [Google docs link|https://bit.ly/spark-30985]
[jira] [Resolved] (SPARK-33379) The link address of 'this page' in docs/pyspark-migration-guide.md is incorrect
[ https://issues.apache.org/jira/browse/SPARK-33379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro resolved SPARK-33379.
--------------------------------------

    Resolution: Not A Problem

> The link address of 'this page' in docs/pyspark-migration-guide.md is incorrect
> -------------------------------------------------------------------------------
>
> Key: SPARK-33379
> URL: https://issues.apache.org/jira/browse/SPARK-33379
> Project: Spark
> Issue Type: Documentation
> Components: PySpark
> Affects Versions: 3.0.1
> Reporter: 董可伦
> Priority: Major
> Labels: documentation
> Attachments: SPARK-33379.png
>
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> The link address of 'this page' in +docs/pyspark-migration-guide.md+ is incorrect, and it will show "Not Found The requested URL was not found on this server."
> https://github.com/apache/spark/blob/master/docs/pyspark-migration-guide.md
[jira] [Assigned] (SPARK-33209) Clean up unit test file UnsupportedOperationsSuite.scala
[ https://issues.apache.org/jira/browse/SPARK-33209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim reassigned SPARK-33209:
------------------------------------

    Assignee: Cheng Su

> Clean up unit test file UnsupportedOperationsSuite.scala
> --------------------------------------------------------
>
> Key: SPARK-33209
> URL: https://issues.apache.org/jira/browse/SPARK-33209
> Project: Spark
> Issue Type: Sub-task
> Components: SQL, Structured Streaming
> Affects Versions: 3.1.0
> Reporter: Cheng Su
> Assignee: Cheng Su
> Priority: Trivial
>
> As a follow-up to https://github.com/apache/spark/pull/30076, there is a lot of copy-pasted code in the unit test file UnsupportedOperationsSuite.scala that checks different join types (inner, outer, semi) with a similar code structure.
> It would be helpful to clean this up and refactor to reuse code.
[jira] [Resolved] (SPARK-33209) Clean up unit test file UnsupportedOperationsSuite.scala
[ https://issues.apache.org/jira/browse/SPARK-33209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim resolved SPARK-33209.
----------------------------------

    Fix Version/s: 3.1.0
    Resolution: Fixed

Issue resolved by pull request 30347
[https://github.com/apache/spark/pull/30347]

> Clean up unit test file UnsupportedOperationsSuite.scala
> --------------------------------------------------------
>
> Key: SPARK-33209
> URL: https://issues.apache.org/jira/browse/SPARK-33209
> Project: Spark
> Issue Type: Sub-task
> Components: SQL, Structured Streaming
> Affects Versions: 3.1.0
> Reporter: Cheng Su
> Assignee: Cheng Su
> Priority: Trivial
> Fix For: 3.1.0
>
> As a follow-up to https://github.com/apache/spark/pull/30076, there is a lot of copy-pasted code in the unit test file UnsupportedOperationsSuite.scala that checks different join types (inner, outer, semi) with a similar code structure.
> It would be helpful to clean this up and refactor to reuse code.
[jira] [Resolved] (SPARK-33445) Can't parse decimal type from csv file
[ https://issues.apache.org/jira/browse/SPARK-33445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33445. --- Resolution: Cannot Reproduce I'll close this for now, [~bullsoverbears] . However, you are welcome to reopen this with the reproducible example. Thanks for reporting. > Can't parse decimal type from csv file > -- > > Key: SPARK-33445 > URL: https://issues.apache.org/jira/browse/SPARK-33445 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 2.4.7, 3.0.0 >Reporter: Punit Shah >Priority: Major > Attachments: tsd.csv > > > The attached file is a one column csv file containing decimals. > Execute: {color:#de350b}mydf2 = spark_session.read.csv("tsd.csv", > header=True, inferSchema=True){color} > Then invoking {color:#de350b}mydf2.schema{color} will result in error: > {color:#ff8b00}ValueError: Could not parse datatype: decimal(6,-7){color} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33445) Can't parse decimal type from csv file
[ https://issues.apache.org/jira/browse/SPARK-33445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233191#comment-17233191 ]

Dongjoon Hyun commented on SPARK-33445:
---------------------------------------

PySpark is also working.
{code:java}
>>> mydf2 = spark.read.csv("tsd.csv", header=True, inferSchema=True)
>>> mydf2.schema
StructType(List(StructField(Epoch Miliseconds,DoubleType,true)))
>>> sc.version
'3.0.1'
{code}

> Can't parse decimal type from csv file
> --------------------------------------
>
> Key: SPARK-33445
> URL: https://issues.apache.org/jira/browse/SPARK-33445
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.6, 2.4.7, 3.0.0
> Reporter: Punit Shah
> Priority: Major
> Attachments: tsd.csv
>
> The attached file is a one column csv file containing decimals.
> Execute: {color:#de350b}mydf2 = spark_session.read.csv("tsd.csv", header=True, inferSchema=True){color}
> Then invoking {color:#de350b}mydf2.schema{color} will result in error:
> {color:#ff8b00}ValueError: Could not parse datatype: decimal(6,-7){color}
[jira] [Commented] (SPARK-33445) Can't parse decimal type from csv file
[ https://issues.apache.org/jira/browse/SPARK-33445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233189#comment-17233189 ]

Dongjoon Hyun commented on SPARK-33445:
---------------------------------------

[~bullsoverbears]. For me, I don't see `ValueError`. Are the affected versions correct?

*Apache Spark 2.4.7*
{code}
scala> spark.version
res0: String = 2.4.7

scala> spark.read.option("header", "true").option("inferSchema", "true").csv("tsd.csv").printSchema
root
 |-- Epoch Miliseconds: decimal(6,-7) (nullable = true)
{code}

*Apache Spark 3.0.0*
{code}
scala> spark.version
res0: String = 3.0.0

scala> spark.read.option("header", "true").option("inferSchema", "true").csv("tsd.csv").printSchema
root
 |-- Epoch Miliseconds: double (nullable = true)
{code}

*Apache Spark 3.0.1*
{code}
scala> spark.version
res0: String = 3.0.1

scala> spark.read.option("header", "true").option("inferSchema", "true").csv("tsd.csv").printSchema
root
 |-- Epoch Miliseconds: double (nullable = true)
{code}

> Can't parse decimal type from csv file
> --------------------------------------
>
> Key: SPARK-33445
> URL: https://issues.apache.org/jira/browse/SPARK-33445
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.6, 2.4.7, 3.0.0
> Reporter: Punit Shah
> Priority: Major
> Attachments: tsd.csv
>
> The attached file is a one column csv file containing decimals.
> Execute: {color:#de350b}mydf2 = spark_session.read.csv("tsd.csv", header=True, inferSchema=True){color}
> Then invoking {color:#de350b}mydf2.schema{color} will result in error:
> {color:#ff8b00}ValueError: Could not parse datatype: decimal(6,-7){color}
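For readers wondering how a type like decimal(6,-7) can arise at all, the plain-Python sketch below shows that a number written in exponent form has six significant digits and a negative scale. The contents of tsd.csv are not shown in this thread, so the sample value 1.23456E+12 is purely hypothetical, and the helper precision_and_scale is an illustrative name rather than anything Spark uses.

```python
from decimal import Decimal


def precision_and_scale(text):
    """Return (precision, scale) of a finite decimal literal, in the SQL sense.

    Precision is the number of digits in the coefficient. Scale is the number of
    digits after the decimal point, so it is negative when the exponent is positive.
    """
    _sign, digits, exponent = Decimal(text).as_tuple()
    return len(digits), -exponent


if __name__ == "__main__":
    for sample in ["12.5", "1605552000000", "1.23456E+12"]:  # hypothetical values
        print(sample, precision_and_scale(sample))
```

The last sample is stored as the coefficient 123456 times 10 to the 7th power, which is where a precision of 6 and a scale of -7 come from.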
[jira] [Updated] (SPARK-33445) Can't parse decimal type from csv file
[ https://issues.apache.org/jira/browse/SPARK-33445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33445: -- Component/s: (was: PySpark) > Can't parse decimal type from csv file > -- > > Key: SPARK-33445 > URL: https://issues.apache.org/jira/browse/SPARK-33445 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 2.4.7, 3.0.0 >Reporter: Punit Shah >Priority: Major > Attachments: tsd.csv > > > The attached file is a one column csv file containing decimals. > Execute: {color:#de350b}mydf2 = spark_session.read.csv("tsd.csv", > header=True, inferSchema=True){color} > Then invoking {color:#de350b}mydf2.schema{color} will result in error: > {color:#ff8b00}ValueError: Could not parse datatype: decimal(6,-7){color} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33449) Add cache for Parquet Metadata
[ https://issues.apache.org/jira/browse/SPARK-33449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233184#comment-17233184 ]

Dongjoon Hyun commented on SPARK-33449:
---------------------------------------

Is this targeting Apache Spark 3.1, [~yumwang]?

> Add cache for Parquet Metadata
> ------------------------------
>
> Key: SPARK-33449
> URL: https://issues.apache.org/jira/browse/SPARK-33449
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Yuming Wang
> Priority: Major
> Attachments: Get Parquet metadata.png
>
> Getting Parquet metadata may take a lot of time; maybe we can cache it. Presto supports this:
> https://github.com/prestodb/presto/pull/15276
[jira] [Resolved] (SPARK-33399) Normalize output partitioning and sortorder with respect to aliases to avoid unneeded exchange/sort nodes
[ https://issues.apache.org/jira/browse/SPARK-33399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro resolved SPARK-33399.
--------------------------------------

    Fix Version/s: 3.1.0
    Assignee: Prakhar Jain
    Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/30300

> Normalize output partitioning and sortorder with respect to aliases to avoid unneeded exchange/sort nodes
> ---------------------------------------------------------------------------------------------------------
>
> Key: SPARK-33399
> URL: https://issues.apache.org/jira/browse/SPARK-33399
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 2.4.7, 3.0.0, 3.0.1
> Reporter: Prakhar Jain
> Assignee: Prakhar Jain
> Priority: Major
> Fix For: 3.1.0
>
> Spark introduces unneeded exchanges if there is a Project after Inner join.
> Example:
>
> {noformat}
> spark.range(10).repartition($"id").createTempView("t1")
> spark.range(20).repartition($"id").createTempView("t2")
> spark.range(30).repartition($"id").createTempView("t3")
> val planned = sql(
>   """
>     |SELECT t2id, t3.id as t3id
>     |FROM (
>     |    SELECT t1.id as t1id, t2.id as t2id
>     |    FROM t1, t2
>     |    WHERE t1.id = t2.id
>     |) t12, t3
>     |WHERE t1id = t3.id
>   """.stripMargin).queryExecution.executedPlan
>
> *(9) Project [t2id#1034L, id#1004L AS t3id#1035L]
> +- *(9) SortMergeJoin [t1id#1033L], [id#1004L], Inner
>    :- *(6) Sort [t1id#1033L ASC NULLS FIRST], false, 0
>    :  +- Exchange hashpartitioning(t1id#1033L, 5), true, [id=#1343]   <---
>    :     +- *(5) Project [id#996L AS t1id#1033L, id#1000L AS t2id#1034L]
>    :        +- *(5) SortMergeJoin [id#996L], [id#1000L], Inner
>    :           :- *(2) Sort [id#996L ASC NULLS FIRST], false, 0
>    :           :  +- Exchange hashpartitioning(id#996L, 5), true, [id=#1329]
>    :           :     +- *(1) Range (0, 10, step=1, splits=2)
>    :           +- *(4) Sort [id#1000L ASC NULLS FIRST], false, 0
>    :              +- Exchange hashpartitioning(id#1000L, 5), true, [id=#1335]
>    :                 +- *(3) Range (0, 20, step=1, splits=2)
>    +- *(8) Sort [id#1004L ASC NULLS FIRST], false, 0
>       +- Exchange hashpartitioning(id#1004L, 5), true, [id=#1349]
>          +- *(7) Range (0, 30, step=1, splits=2)
> {noformat}
> The marked exchange in the above plan can be removed.
[jira] [Assigned] (SPARK-23499) Mesos Cluster Dispatcher should support priority queues to submit drivers
[ https://issues.apache.org/jira/browse/SPARK-23499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-23499:
-------------------------------------

    Assignee: Pascal GILLET

> Mesos Cluster Dispatcher should support priority queues to submit drivers
> -------------------------------------------------------------------------
>
> Key: SPARK-23499
> URL: https://issues.apache.org/jira/browse/SPARK-23499
> Project: Spark
> Issue Type: Improvement
> Components: Mesos
> Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1
> Reporter: Pascal GILLET
> Assignee: Pascal GILLET
> Priority: Major
> Attachments: Screenshot from 2018-02-28 17-22-47.png
>
> As for Yarn, Mesos users should be able to specify priority queues to define a workload management policy for queued drivers in the Mesos Cluster Dispatcher.
> Submitted drivers are *currently* kept in order of their submission: the first driver added to the queue will be the first one to be executed (FIFO).
> Each driver could have a "priority" associated with it. A driver with high priority is served (Mesos resources) before a driver with low priority. If two drivers have the same priority, they are served according to their submit date in the queue.
> To set up such priority queues, the following changes are proposed:
> * The Mesos Cluster Dispatcher can optionally be configured with the _spark.mesos.dispatcher.queue.[QueueName]_ property. This property takes a float as value. This adds a new queue named _QueueName_ for submitted drivers with the specified priority.
> Higher numbers indicate higher priority.
> The user can then specify multiple queues.
> * A driver can be submitted to a specific queue with _spark.mesos.dispatcher.queue_. This property takes the name of a queue previously declared in the dispatcher as value.
> By default, the dispatcher has a single "default" queue with 0.0 priority (cannot be overridden). If none of the properties above are specified, the behavior is the same as the current one (i.e. simple FIFO).
> Additionally, it is possible to implement a consistent and overall workload management policy throughout the lifecycle of drivers by mapping these priority queues to weighted Mesos roles if any (i.e. from the QUEUED state in the dispatcher to the final states in the Mesos cluster), and by specifying a _spark.mesos.role_ along with a _spark.mesos.dispatcher.queue_ when submitting an application.
> For example, with the URGENT Mesos role:
> {code:java}
> # Conf on the dispatcher side
> spark.mesos.dispatcher.queue.URGENT=1.0
> # Conf on the driver side
> spark.mesos.dispatcher.queue=URGENT
> spark.mesos.role=URGENT
> {code}
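The ordering rule described above (higher priority first, submission order within the same priority, and a fixed "default" queue at 0.0) can be sketched in a few lines of Python. This is an illustration of the policy only, not the dispatcher's implementation: the class DriverQueue and its method names are hypothetical, and a running counter stands in for the submit date.

```python
import heapq
import itertools


class DriverQueue:
    """Toy model of the proposed policy; names here are illustrative assumptions."""

    def __init__(self, queue_priorities):
        # "default" is written last so a user-supplied value cannot override 0.0.
        self._priorities = {**queue_priorities, "default": 0.0}
        self._heap = []
        self._counter = itertools.count()  # stands in for the submit date

    def submit(self, driver_id, queue="default"):
        priority = self._priorities[queue]  # KeyError for an undeclared queue
        # heapq is a min-heap, so the priority is negated to serve high values first;
        # the counter breaks ties in submission (FIFO) order.
        heapq.heappush(self._heap, (-priority, next(self._counter), driver_id))

    def next_driver(self):
        return heapq.heappop(self._heap)[2]


if __name__ == "__main__":
    q = DriverQueue({"URGENT": 1.0})
    q.submit("a")
    q.submit("b", queue="URGENT")
    q.submit("c")
    q.submit("d", queue="URGENT")
    print([q.next_driver() for _ in range(4)])
```

With no queues declared, every driver lands in "default" and the counter alone decides the order, which matches the simple FIFO behavior the description says is preserved.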
[jira] [Resolved] (SPARK-23499) Mesos Cluster Dispatcher should support priority queues to submit drivers
[ https://issues.apache.org/jira/browse/SPARK-23499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-23499.
-----------------------------------

    Fix Version/s: 3.1.0
    Resolution: Fixed

Issue resolved by pull request 30352
[https://github.com/apache/spark/pull/30352]

> Mesos Cluster Dispatcher should support priority queues to submit drivers
> -------------------------------------------------------------------------
>
> Key: SPARK-23499
> URL: https://issues.apache.org/jira/browse/SPARK-23499
> Project: Spark
> Issue Type: Improvement
> Components: Mesos
> Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1
> Reporter: Pascal GILLET
> Assignee: Pascal GILLET
> Priority: Major
> Fix For: 3.1.0
>
> Attachments: Screenshot from 2018-02-28 17-22-47.png
>
> As for Yarn, Mesos users should be able to specify priority queues to define a workload management policy for queued drivers in the Mesos Cluster Dispatcher.
> Submitted drivers are *currently* kept in order of their submission: the first driver added to the queue will be the first one to be executed (FIFO).
> Each driver could have a "priority" associated with it. A driver with high priority is served (Mesos resources) before a driver with low priority. If two drivers have the same priority, they are served according to their submit date in the queue.
> To set up such priority queues, the following changes are proposed:
> * The Mesos Cluster Dispatcher can optionally be configured with the _spark.mesos.dispatcher.queue.[QueueName]_ property. This property takes a float as value. This adds a new queue named _QueueName_ for submitted drivers with the specified priority.
> Higher numbers indicate higher priority.
> The user can then specify multiple queues.
> * A driver can be submitted to a specific queue with _spark.mesos.dispatcher.queue_. This property takes the name of a queue previously declared in the dispatcher as value.
> By default, the dispatcher has a single "default" queue with 0.0 priority (cannot be overridden). If none of the properties above are specified, the behavior is the same as the current one (i.e. simple FIFO).
> Additionally, it is possible to implement a consistent and overall workload management policy throughout the lifecycle of drivers by mapping these priority queues to weighted Mesos roles if any (i.e. from the QUEUED state in the dispatcher to the final states in the Mesos cluster), and by specifying a _spark.mesos.role_ along with a _spark.mesos.dispatcher.queue_ when submitting an application.
> For example, with the URGENT Mesos role:
> {code:java}
> # Conf on the dispatcher side
> spark.mesos.dispatcher.queue.URGENT=1.0
> # Conf on the driver side
> spark.mesos.dispatcher.queue=URGENT
> spark.mesos.role=URGENT
> {code}
[jira] [Commented] (SPARK-33463) Spark Thrift Server, keep Job Id when using incremental collect
[ https://issues.apache.org/jira/browse/SPARK-33463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233126#comment-17233126 ]

Apache Spark commented on SPARK-33463:
--------------------------------------

User 'gumartinm' has created a pull request for this issue:
https://github.com/apache/spark/pull/30390

> Spark Thrift Server, keep Job Id when using incremental collect
> ---------------------------------------------------------------
>
> Key: SPARK-33463
> URL: https://issues.apache.org/jira/browse/SPARK-33463
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.4, 2.4.7, 3.0.1
> Reporter: Gustavo Martin
> Priority: Major
> Fix For: 3.0.1
>
> When *spark.sql.thriftServer.incrementalCollect* is enabled, Job Ids get lost and tracing queries in Spark Thrift Server becomes too complicated.
> By fixing the Job Id, queries and Spark jobs are unambiguously related to each other.
[jira] [Assigned] (SPARK-33463) Spark Thrift Server, keep Job Id when using incremental collect
[ https://issues.apache.org/jira/browse/SPARK-33463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33463: Assignee: Apache Spark > Spark Thrift Server, keep Job Id when using incremental collect > --- > > Key: SPARK-33463 > URL: https://issues.apache.org/jira/browse/SPARK-33463 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.7, 3.0.1 >Reporter: Gustavo Martin >Assignee: Apache Spark >Priority: Major > Fix For: 3.0.1 > > > When enabling *spark.sql.thriftServer.incrementalCollect* Job Ids get lost > and tracing queries in Spark Thrift Server ends up being too complicated. > By fixing the Job Id, queries and Spark jobs are unequivocally related one to > each other. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33463) Spark Thrift Server, keep Job Id when using incremental collect
[ https://issues.apache.org/jira/browse/SPARK-33463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33463: Assignee: (was: Apache Spark) > Spark Thrift Server, keep Job Id when using incremental collect > --- > > Key: SPARK-33463 > URL: https://issues.apache.org/jira/browse/SPARK-33463 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.7, 3.0.1 >Reporter: Gustavo Martin >Priority: Major > Fix For: 3.0.1 > > > When enabling *spark.sql.thriftServer.incrementalCollect* Job Ids get lost > and tracing queries in Spark Thrift Server ends up being too complicated. > By fixing the Job Id, queries and Spark jobs are unequivocally related one to > each other. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33463) Spark Thrift Server, keep Job Id when using incremental collect
[ https://issues.apache.org/jira/browse/SPARK-33463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gustavo Martin updated SPARK-33463: --- Fix Version/s: (was: 3.0.1) 3.1.0 > Spark Thrift Server, keep Job Id when using incremental collect > --- > > Key: SPARK-33463 > URL: https://issues.apache.org/jira/browse/SPARK-33463 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.7, 3.0.1 >Reporter: Gustavo Martin >Priority: Major > Fix For: 3.1.0 > > > When enabling *spark.sql.thriftServer.incrementalCollect* Job Ids get lost > and tracing queries in Spark Thrift Server ends up being too complicated. > By fixing the Job Id, queries and Spark jobs are unequivocally related one to > each other. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33463) Spark Thrift Server, keep Job Id when using incremental collect
[ https://issues.apache.org/jira/browse/SPARK-33463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233125#comment-17233125 ] Apache Spark commented on SPARK-33463: -- User 'gumartinm' has created a pull request for this issue: https://github.com/apache/spark/pull/30390 > Spark Thrift Server, keep Job Id when using incremental collect > --- > > Key: SPARK-33463 > URL: https://issues.apache.org/jira/browse/SPARK-33463 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.7, 3.0.1 >Reporter: Gustavo Martin >Priority: Major > Fix For: 3.0.1 > > > When enabling *spark.sql.thriftServer.incrementalCollect* Job Ids get lost > and tracing queries in Spark Thrift Server ends up being too complicated. > By fixing the Job Id, queries and Spark jobs are unequivocally related one to > each other. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33463) Spark Thrift Server, keep Job Id when using incremental collect
[ https://issues.apache.org/jira/browse/SPARK-33463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233124#comment-17233124 ] Gustavo Martin commented on SPARK-33463: PR: https://github.com/apache/spark/pull/30390 > Spark Thrift Server, keep Job Id when using incremental collect > --- > > Key: SPARK-33463 > URL: https://issues.apache.org/jira/browse/SPARK-33463 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.7, 3.0.1 >Reporter: Gustavo Martin >Priority: Major > Fix For: 3.0.1 > > > When enabling *spark.sql.thriftServer.incrementalCollect* Job Ids get lost > and tracing queries in Spark Thrift Server ends up being too complicated. > By fixing the Job Id, queries and Spark jobs are unequivocally related one to > each other. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33463) Spark Thrift Server, keep Job Id when using incremental collect
Gustavo Martin created SPARK-33463: -- Summary: Spark Thrift Server, keep Job Id when using incremental collect Key: SPARK-33463 URL: https://issues.apache.org/jira/browse/SPARK-33463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 2.4.7, 2.3.4 Reporter: Gustavo Martin Fix For: 3.0.1 When enabling *spark.sql.thriftServer.incrementalCollect* Job Ids get lost and tracing queries in Spark Thrift Server ends up being too complicated. By fixing the Job Id, queries and Spark jobs are unequivocally related one to each other. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33183) Bug in optimizer rule EliminateSorts
[ https://issues.apache.org/jira/browse/SPARK-33183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233058#comment-17233058 ] Dongjoon Hyun edited comment on SPARK-33183 at 11/16/20, 8:23 PM: -- I added SPARK-23973 as "is caused by". If then, Apache Spark 2.3 seems to be okay. was (Author: dongjoon): I added SPARK-23973 as "is caused by". > Bug in optimizer rule EliminateSorts > > > Key: SPARK-33183 > URL: https://issues.apache.org/jira/browse/SPARK-33183 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, > 2.4.8, 3.0.2, 3.1.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Labels: correctness > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > Currently, the rule {{EliminateSorts}} removes a global sort node if its > child plan already satisfies the required sort order without checking if the > child plan's ordering is local or global. For example, in the following > scenario, the first sort shouldn't be removed because it has a stronger > guarantee than the second sort even if the sort orders are the same for both > sorts. > {code:java} > Sort(orders, global = True, ...) > Sort(orders, global = False, ...){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33183) Bug in optimizer rule EliminateSorts
[ https://issues.apache.org/jira/browse/SPARK-33183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233058#comment-17233058 ] Dongjoon Hyun edited comment on SPARK-33183 at 11/16/20, 8:23 PM: -- I added SPARK-23973 as "is caused by". If then, Apache Spark 2.3 seems to be okay. Please let me know if this affects older Sparks. was (Author: dongjoon): I added SPARK-23973 as "is caused by". If then, Apache Spark 2.3 seems to be okay. > Bug in optimizer rule EliminateSorts > > > Key: SPARK-33183 > URL: https://issues.apache.org/jira/browse/SPARK-33183 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, > 2.4.8, 3.0.2, 3.1.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Labels: correctness > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > Currently, the rule {{EliminateSorts}} removes a global sort node if its > child plan already satisfies the required sort order without checking if the > child plan's ordering is local or global. For example, in the following > scenario, the first sort shouldn't be removed because it has a stronger > guarantee than the second sort even if the sort orders are the same for both > sorts. > {code:java} > Sort(orders, global = True, ...) > Sort(orders, global = False, ...){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33183) Bug in optimizer rule EliminateSorts
[ https://issues.apache.org/jira/browse/SPARK-33183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233058#comment-17233058 ] Dongjoon Hyun commented on SPARK-33183: --- I added SPARK-23973 as "is caused by". > Bug in optimizer rule EliminateSorts > > > Key: SPARK-33183 > URL: https://issues.apache.org/jira/browse/SPARK-33183 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Labels: correctness > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > Currently, the rule {{EliminateSorts}} removes a global sort node if its > child plan already satisfies the required sort order without checking if the > child plan's ordering is local or global. For example, in the following > scenario, the first sort shouldn't be removed because it has a stronger > guarantee than the second sort even if the sort orders are the same for both > sorts. > {code:java} > Sort(orders, global = True, ...) > Sort(orders, global = False, ...){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33183) Bug in optimizer rule EliminateSorts
[ https://issues.apache.org/jira/browse/SPARK-33183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33183: -- Affects Version/s: 2.4.0 2.4.1 2.4.2 2.4.3 2.4.4 2.4.5 2.4.6 2.4.7 > Bug in optimizer rule EliminateSorts > > > Key: SPARK-33183 > URL: https://issues.apache.org/jira/browse/SPARK-33183 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, > 2.4.8, 3.0.2, 3.1.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Labels: correctness > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > Currently, the rule {{EliminateSorts}} removes a global sort node if its > child plan already satisfies the required sort order without checking if the > child plan's ordering is local or global. For example, in the following > scenario, the first sort shouldn't be removed because it has a stronger > guarantee than the second sort even if the sort orders are the same for both > sorts. > {code:java} > Sort(orders, global = True, ...) > Sort(orders, global = False, ...){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
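The scenario in the SPARK-33183 description can be made concrete with a toy plan model. The sketch below is not Catalyst code: `Plan`, `Scan`, `Sort` and `SortElimination.eliminate` are hypothetical stand-ins, orderings are compared by plain equality, and the rule shown is only one way to respect the local/global distinction, not necessarily the fix that was merged.

```scala
// Toy plan model; illustrative stand-ins, not Catalyst classes.
sealed trait Plan
case class Scan(table: String) extends Plan
case class Sort(orders: Seq[String], global: Boolean, child: Plan) extends Plan

object SortElimination {
  // Removes a Sort only when its child already gives an equally strong guarantee:
  // a global sort is redundant only over a global sort with the same orders,
  // while a local sort is redundant over either kind.
  def eliminate(plan: Plan): Plan = plan match {
    case Sort(orders, global, child) =>
      val newChild = eliminate(child)
      val redundant = newChild match {
        case Sort(childOrders, childGlobal, _) =>
          childOrders == orders && (childGlobal || !global)
        case _ => false
      }
      if (redundant) newChild else Sort(orders, global, newChild)
    case other => other
  }
}
```

With this rule, `Sort(orders, global = true, Sort(orders, global = false, ...))` is left untouched, which is the behaviour the issue asks for.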
[jira] [Commented] (SPARK-31450) Make ExpressionEncoder thread safe
[ https://issues.apache.org/jira/browse/SPARK-31450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233054#comment-17233054 ]

Navin Viswanath commented on SPARK-31450:
-----------------------------------------

[~hvanhovell] [~dongjoon] I was in the process of migrating some code from Spark 2.4 to Spark 3 and noticed that this required a change in our code. We use the following process to go from a Thrift type T to InternalRow (reading Thrift files on HDFS into a DataFrame):
# We construct a Spark schema by inspecting the Thrift metadata.
# We convert a Thrift object to a GenericRow using the Thrift metadata to read columns.
# We then construct an ExpressionEncoder[Row] and use it to create an InternalRow as follows:
{code:java}
val schema: StructType = ... // infer thrift schema
val encoder: ExpressionEncoder[Row] = RowEncoder(schema)
val genericRow: GenericRow = toGenericRow(thriftObject, schema)
val internalRow: InternalRow = encoder.toRow(genericRow)
{code}
The above steps are used to implement
{code:java}
protected def buildReader(
    sparkSession: SparkSession,
    dataSchema: StructType,
    partitionSchema: StructType,
    requiredSchema: StructType,
    filters: Seq[Filter],
    options: Map[String, String],
    hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
{code}
in trait org.apache.spark.sql.execution.datasources.FileFormat, where we need an Iterator[InternalRow].

With the change in this ticket, I would have to replace
{code:java}
val internalRow: InternalRow = encoder.toRow(genericRow)
{code}
with
{code:java}
val serializer = encoder.createSerializer()
val internalRow: InternalRow = serializer(genericRow)
{code}
Since this is marked as an internal API in the PR, I was wondering if there is a way to implement this so that it is compatible with both Spark 2.4 and Spark 3. My goal is to not require a code change if possible.

It seems to me that since I know the schema of the Thrift type it should be possible to construct an InternalRow, but I don't see a way to do this in the code base.

> Make ExpressionEncoder thread safe
> ----------------------------------
>
> Key: SPARK-31450
> URL: https://issues.apache.org/jira/browse/SPARK-31450
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Herman van Hövell
> Assignee: Herman van Hövell
> Priority: Major
> Fix For: 3.0.0
>
> ExpressionEncoder is currently not thread-safe because it contains stateful
> objects that are required for converting objects to internal rows and vice
> versa. We have been working around this by (excessively) cloning
> ExpressionEncoders, which is not free. I propose that we move the stateful
> bits of the expression encoder into two helper classes that will take care of
> the conversions.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
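One possible way to keep a single code path across both API shapes mentioned in the comment above is to pick the method by reflection at runtime. The sketch below only illustrates that idea on stand-in classes: `OldStyleEncoder`, `NewStyleEncoder` and `EncoderCompat` are hypothetical names, rows are modelled as plain strings, and nothing here is taken from Spark itself. Against the real classes the parameter type passed to `getMethod` would differ (generic parameters erase to `Object`), so treat it as a starting point rather than a drop-in shim.

```scala
import scala.util.Try

// Stand-ins for the two API shapes; NOT the real Spark classes.
class OldStyleEncoder {
  def toRow(value: String): String = s"row($value)"
}
class NewStyleEncoder {
  def createSerializer(): String => String = v => s"row($v)"
}

object EncoderCompat {
  // Returns a serializer function using whichever method the encoder exposes.
  def serializerFor(encoder: AnyRef): String => String = {
    val cls = encoder.getClass
    Try(cls.getMethod("createSerializer")).toOption match {
      case Some(method) =>
        // Newer shape: the encoder hands back a reusable function.
        method.invoke(encoder).asInstanceOf[String => String]
      case None =>
        // Older shape: call toRow on the encoder for every value.
        val toRow = cls.getMethod("toRow", classOf[String])
        v => toRow.invoke(encoder, v).asInstanceOf[String]
    }
  }
}
```

The lookup happens once per encoder, so the reflective cost is paid when the serializer is built rather than per row in the first branch; the fallback branch still invokes reflectively on each call.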
[jira] [Updated] (SPARK-33435) DSv2: REFRESH TABLE should invalidate caches
[ https://issues.apache.org/jira/browse/SPARK-33435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33435: -- Labels: DSv2 correctness (was: DSv2) > DSv2: REFRESH TABLE should invalidate caches > > > Key: SPARK-33435 > URL: https://issues.apache.org/jira/browse/SPARK-33435 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: DSv2, correctness > Fix For: 3.0.2, 3.1.0 > > > Currently, in DSv2 {{RefreshTableExec}}, we only invalidate metadata cache > but not all the caches that referencing the table to be refreshed. This may > cause correctness issue if these caches go stale and get queried later. > Note that since we don't support caching a v2 table yet, we can't recache the > table itself at the moment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33462) ResourceProfile use Int for memory in ExecutorResourcesOrDefaults
Thomas Graves created SPARK-33462: - Summary: ResourceProfile use Int for memory in ExecutorResourcesOrDefaults Key: SPARK-33462 URL: https://issues.apache.org/jira/browse/SPARK-33462 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.0 Reporter: Thomas Graves A follow-up from SPARK-33288: since memory is in MB, we can store it as an Int rather than a Long in ExecutorResourcesOrDefaults. See https://github.com/apache/spark/pull/30375#issuecomment-728270233 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33460) Accessing map values should fail if key is not found.
[ https://issues.apache.org/jira/browse/SPARK-33460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33460. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30386 [https://github.com/apache/spark/pull/30386] > Accessing map values should fail if key is not found. > - > > Key: SPARK-33460 > URL: https://issues.apache.org/jira/browse/SPARK-33460 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Assignee: Leanken.Lin >Priority: Major > Fix For: 3.1.0 > > > When ANSI mode is enabled, accessing map values should fail with an exception if the key does not exist, but currently it returns null. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33460) Accessing map values should fail if key is not found.
[ https://issues.apache.org/jira/browse/SPARK-33460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33460: --- Assignee: Leanken.Lin > Accessing map values should fail if key is not found. > - > > Key: SPARK-33460 > URL: https://issues.apache.org/jira/browse/SPARK-33460 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Assignee: Leanken.Lin >Priority: Major > > When ANSI mode is enabled, accessing map values should fail with an exception if the key does not exist, but currently it returns null. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
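The behaviour requested in SPARK-33460 can be pictured with a small stand-alone model. This is a sketch under stated assumptions, not Spark's implementation: `MapAccess.getValue` and its `ansiEnabled` parameter are hypothetical, and `Option` stands in for a nullable SQL value.

```scala
object MapAccess {
  // Looks up `key`; a missing key yields None (SQL NULL) unless ANSI mode is on,
  // in which case the lookup fails with an exception.
  def getValue[K, V](m: Map[K, V], key: K, ansiEnabled: Boolean): Option[V] =
    m.get(key) match {
      case found @ Some(_) => found
      case None if ansiEnabled =>
        throw new NoSuchElementException(s"Key $key does not exist.")
      case None => None
    }
}
```

A present key behaves the same in both modes; only the missing-key case changes from a silent null to a failure.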
[jira] [Updated] (SPARK-32222) Add K8s IT for conf propagation
[ https://issues.apache.org/jira/browse/SPARK-32222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32222: -- Summary: Add K8s IT for conf propagation (was: Add integration tests) > Add K8s IT for conf propagation > --- > > Key: SPARK-32222 > URL: https://issues.apache.org/jira/browse/SPARK-32222 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > > An integration test by placing a configuration file in SPARK_CONF_DIR, and > verifying it is loaded on the executors in both client and cluster deploy > mode. > For this, a log4j.properties file is a good candidate for testing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33453) Unify v1 and v2 SHOW PARTITIONS tests
[ https://issues.apache.org/jira/browse/SPARK-33453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33453. - Fix Version/s: 3.1.0 Assignee: Maxim Gekk Resolution: Fixed > Unify v1 and v2 SHOW PARTITIONS tests > - > > Key: SPARK-33453 > URL: https://issues.apache.org/jira/browse/SPARK-33453 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.1.0 > > > Gather common tests for DSv1 and DSv2 SHOW PARTITIONS command to a common > test. Mix this trait to datasource specific test suites. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32222) Add integration tests
[ https://issues.apache.org/jira/browse/SPARK-32222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32222: - Assignee: Prashant Sharma > Add integration tests > - > > Key: SPARK-32222 > URL: https://issues.apache.org/jira/browse/SPARK-32222 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > > An integration test by placing a configuration file in SPARK_CONF_DIR, and > verifying it is loaded on the executors in both client and cluster deploy > mode. > For this, a log4j.properties file is a good candidate for testing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33461) Propagating SPARK_CONF_DIR in K8s and tests
[ https://issues.apache.org/jira/browse/SPARK-33461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232864#comment-17232864 ] Dongjoon Hyun commented on SPARK-33461: --- I reorganized this for you as an example, [~prashant]. I hope that is what you wanted~ Please let me know if you want to reorganize. > Propagating SPARK_CONF_DIR in K8s and tests > --- > > Key: SPARK-33461 > URL: https://issues.apache.org/jira/browse/SPARK-33461 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > > Foundational work for propagating SPARK_CONF_DIR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33461) Propagating SPARK_CONF_DIR in K8s and tests
[ https://issues.apache.org/jira/browse/SPARK-33461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33461: -- Fix Version/s: (was: 3.1.0) > Propagating SPARK_CONF_DIR in K8s and tests > --- > > Key: SPARK-33461 > URL: https://issues.apache.org/jira/browse/SPARK-33461 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > > Foundational work for propagating SPARK_CONF_DIR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33461) Propagating SPARK_CONF_DIR in K8s and tests
[ https://issues.apache.org/jira/browse/SPARK-33461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33461: -- Summary: Propagating SPARK_CONF_DIR in K8s and tests (was: Foundational work for propagating SPARK_CONF_DIR) > Propagating SPARK_CONF_DIR in K8s and tests > --- > > Key: SPARK-33461 > URL: https://issues.apache.org/jira/browse/SPARK-33461 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > Fix For: 3.1.0 > > > Foundational work for propagating SPARK_CONF_DIR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.
[ https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30985: -- Parent: SPARK-33461 Issue Type: Sub-task (was: Improvement) > Propagate SPARK_CONF_DIR files to driver and exec pods. > --- > > Key: SPARK-30985 > URL: https://issues.apache.org/jira/browse/SPARK-30985 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > Fix For: 3.1.0 > > > SPARK_CONF_DIR hosts configuration files like, > 1) spark-defaults.conf - containing all the spark properties. > 2) log4j.properties - Logger configuration. > 3) spark-env.sh - Environment variables to be setup at driver and executor. > 4) core-site.xml - Hadoop related configuration. > 5) fairscheduler.xml - Spark's fair scheduling policy at the job level. > 6) metrics.properties - Spark metrics. > 7) Any user specific - library or framework specific configuration file. > Traditionally, SPARK_CONF_DIR has been the home to all user specific > configuration files. > So this feature, will let the user specific configuration files be mounted on > the driver and executor pods' SPARK_CONF_DIR. > Please review the attached design doc, for more details. > > [Google docs link|https://bit.ly/spark-30985] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32223) Support adding a user provided config map.
[ https://issues.apache.org/jira/browse/SPARK-32223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32223: -- Parent: SPARK-33461 (was: SPARK-30985) > Support adding a user provided config map. > -- > > Key: SPARK-32223 > URL: https://issues.apache.org/jira/browse/SPARK-32223 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Priority: Major > > The semantics of this will be discussed and added soon. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33461) Foundational work for propagating SPARK_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-33461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33461: -- Parent: (was: SPARK-30985) Issue Type: Improvement (was: Sub-task) > Foundational work for propagating SPARK_CONF_DIR > > > Key: SPARK-33461 > URL: https://issues.apache.org/jira/browse/SPARK-33461 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > Fix For: 3.1.0 > > > Foundational work for propagating SPARK_CONF_DIR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32221) Avoid possible errors due to incorrect file size or type supplied in spark conf.
[ https://issues.apache.org/jira/browse/SPARK-32221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32221: -- Parent: SPARK-33461 (was: SPARK-30985) > Avoid possible errors due to incorrect file size or type supplied in spark > conf. > > > Key: SPARK-32221 > URL: https://issues.apache.org/jira/browse/SPARK-32221 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Priority: Major > > This would avoid failures when the files are too large or a user places a binary file inside SPARK_CONF_DIR, neither of which is supported at the moment. > The reason is that the underlying etcd store limits the size of each entry to 1 MiB. Once etcd is upgraded in all the popular k8s clusters, we can hope to overcome this limitation; e.g. the version of etcd described at [https://etcd.io/docs/v3.4.0/dev-guide/limit/] allows a higher limit on each entry. > Even if that does not happen, there are other ways to overcome this limitation; for example, we can split config files across multiple configMaps. That needs to be discussed and prioritised. This issue takes the straightforward approach of skipping files that cannot be accommodated within the 1 MiB limit and warning the user about them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
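The "skip and warn" approach described in SPARK-32221 might look roughly like the sketch below. It is an illustration only: `ConfFileFilter`, `ConfFile` and `keepMountable` are hypothetical names, the check is applied per file (a cumulative budget across all files would be another reading of the description), and binary-file detection is left out.

```scala
object ConfFileFilter {
  // The 1 MiB per-entry limit cited in the issue text.
  val MaxBytes: Long = 1024L * 1024L

  final case class ConfFile(name: String, sizeBytes: Long)

  // Returns the files that fit within the limit and prints a warning
  // to stderr for each file that is skipped.
  def keepMountable(files: Seq[ConfFile]): Seq[ConfFile] = {
    val (kept, skipped) = files.partition(_.sizeBytes <= MaxBytes)
    skipped.foreach { f =>
      Console.err.println(
        s"WARNING: skipping ${f.name} (${f.sizeBytes} bytes exceeds $MaxBytes)")
    }
    kept
  }
}
```

Skipping keeps the submission from failing outright, at the cost of a configuration file silently missing on the pods unless the warning is noticed.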
[jira] [Updated] (SPARK-32222) Add integration tests
[ https://issues.apache.org/jira/browse/SPARK-32222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32222: -- Parent: SPARK-33461 (was: SPARK-30985) > Add integration tests > - > > Key: SPARK-32222 > URL: https://issues.apache.org/jira/browse/SPARK-32222 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Priority: Major > > An integration test by placing a configuration file in SPARK_CONF_DIR, and > verifying it is loaded on the executors in both client and cluster deploy > mode. > For this, a log4j.properties file is a good candidate for testing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33461) Foundational work for propagating SPARK_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-33461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232860#comment-17232860 ] Dongjoon Hyun commented on SPARK-33461: --- The PR is merged by SPARK-30985, not SPARK-33461. The JIRA ID is important, [~prashant]. You can make this issue as an umbrella instead. > Foundational work for propagating SPARK_CONF_DIR > > > Key: SPARK-33461 > URL: https://issues.apache.org/jira/browse/SPARK-33461 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > Fix For: 3.1.0 > > > Foundational work for propagating SPARK_CONF_DIR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-33461) Foundational work for propagating SPARK_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-33461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-33461: --- > Foundational work for propagating SPARK_CONF_DIR > > > Key: SPARK-33461 > URL: https://issues.apache.org/jira/browse/SPARK-33461 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > Fix For: 3.1.0 > > > Foundational work for propagating SPARK_CONF_DIR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.
[ https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232852#comment-17232852 ] Dongjoon Hyun commented on SPARK-30985: --- Hi, [~prashant]. You should not reopen this. We cannot change the commit log. If you really need an umbrella JIRA, please create one and move this and all subtasks into that new umbrella JIRA. > Propagate SPARK_CONF_DIR files to driver and exec pods. > --- > > Key: SPARK-30985 > URL: https://issues.apache.org/jira/browse/SPARK-30985 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Priority: Major > Fix For: 3.1.0 > > > SPARK_CONF_DIR hosts configuration files like, > 1) spark-defaults.conf - containing all the spark properties. > 2) log4j.properties - Logger configuration. > 3) spark-env.sh - Environment variables to be setup at driver and executor. > 4) core-site.xml - Hadoop related configuration. > 5) fairscheduler.xml - Spark's fair scheduling policy at the job level. > 6) metrics.properties - Spark metrics. > 7) Any user specific - library or framework specific configuration file. > Traditionally, SPARK_CONF_DIR has been the home to all user specific > configuration files. > So this feature, will let the user specific configuration files be mounted on > the driver and executor pods' SPARK_CONF_DIR. > Please review the attached design doc, for more details. > > [Google docs link|https://bit.ly/spark-30985] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.
[ https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30985. --- Assignee: Prashant Sharma Resolution: Fixed > Propagate SPARK_CONF_DIR files to driver and exec pods. > --- > > Key: SPARK-30985 > URL: https://issues.apache.org/jira/browse/SPARK-30985 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > Fix For: 3.1.0 > > > SPARK_CONF_DIR hosts configuration files like, > 1) spark-defaults.conf - containing all the spark properties. > 2) log4j.properties - Logger configuration. > 3) spark-env.sh - Environment variables to be setup at driver and executor. > 4) core-site.xml - Hadoop related configuration. > 5) fairscheduler.xml - Spark's fair scheduling policy at the job level. > 6) metrics.properties - Spark metrics. > 7) Any user specific - library or framework specific configuration file. > Traditionally, SPARK_CONF_DIR has been the home to all user specific > configuration files. > So this feature, will let the user specific configuration files be mounted on > the driver and executor pods' SPARK_CONF_DIR. > Please review the attached design doc, for more details. > > [Google docs link|https://bit.ly/spark-30985] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33389) make internal classes of SparkSession always using active SQLConf
[ https://issues.apache.org/jira/browse/SPARK-33389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33389. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30299 [https://github.com/apache/spark/pull/30299] > make internal classes of SparkSession always using active SQLConf > - > > Key: SPARK-33389 > URL: https://issues.apache.org/jira/browse/SPARK-33389 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Lu Lu >Assignee: Lu Lu >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33389) make internal classes of SparkSession always using active SQLConf
[ https://issues.apache.org/jira/browse/SPARK-33389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33389: --- Assignee: Lu Lu > make internal classes of SparkSession always using active SQLConf > - > > Key: SPARK-33389 > URL: https://issues.apache.org/jira/browse/SPARK-33389 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Lu Lu >Assignee: Lu Lu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33143) Make SocketAuthServer socket timeout configurable
[ https://issues.apache.org/jira/browse/SPARK-33143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232824#comment-17232824 ] Gabor Somogyi commented on SPARK-33143: --- [~mszurap] the OS and network guys are still working on it but one thing seems sure. It has nothing to do with the RDD size. It's reproducible w/ relatively small RDDs. > Make SocketAuthServer socket timeout configurable > - > > Key: SPARK-33143 > URL: https://issues.apache.org/jira/browse/SPARK-33143 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.4.7, 3.0.1 >Reporter: Miklos Szurap >Priority: Major > > In SPARK-21551 the socket timeout for the Pyspark applications has been > increased from 3 to 15 seconds. However it is still hardcoded. > In certain situations even the 15 seconds is not enough, so it should be made > configurable. > This is requested after seeing it in real-life workload failures. > Also it has been suggested and requested in an earlier comment in > [SPARK-18649|https://issues.apache.org/jira/browse/SPARK-18649?focusedCommentId=16493498=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16493498] > In > Spark 2.4 it is under > [PythonRDD.scala|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L899] > in Spark 3.x the code has been moved to > [SocketAuthServer.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/security/SocketAuthServer.scala#L51] > {code} > serverSocket.setSoTimeout(15000) > {code} > Please include this in both 2.4 and 3.x branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33143) Make SocketAuthServer socket timeout configurable
[ https://issues.apache.org/jira/browse/SPARK-33143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232818#comment-17232818 ] Apache Spark commented on SPARK-33143: -- User 'gaborgsomogyi' has created a pull request for this issue: https://github.com/apache/spark/pull/30389 > Make SocketAuthServer socket timeout configurable > - > > Key: SPARK-33143 > URL: https://issues.apache.org/jira/browse/SPARK-33143 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.4.7, 3.0.1 >Reporter: Miklos Szurap >Priority: Major > > In SPARK-21551 the socket timeout for the Pyspark applications has been > increased from 3 to 15 seconds. However it is still hardcoded. > In certain situations even the 15 seconds is not enough, so it should be made > configurable. > This is requested after seeing it in real-life workload failures. > Also it has been suggested and requested in an earlier comment in > [SPARK-18649|https://issues.apache.org/jira/browse/SPARK-18649?focusedCommentId=16493498=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16493498] > In > Spark 2.4 it is under > [PythonRDD.scala|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L899] > in Spark 3.x the code has been moved to > [SocketAuthServer.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/security/SocketAuthServer.scala#L51] > {code} > serverSocket.setSoTimeout(15000) > {code} > Please include this in both 2.4 and 3.x branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33143) Make SocketAuthServer socket timeout configurable
[ https://issues.apache.org/jira/browse/SPARK-33143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33143: Assignee: (was: Apache Spark) > Make SocketAuthServer socket timeout configurable > - > > Key: SPARK-33143 > URL: https://issues.apache.org/jira/browse/SPARK-33143 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.4.7, 3.0.1 >Reporter: Miklos Szurap >Priority: Major > > In SPARK-21551 the socket timeout for the Pyspark applications has been > increased from 3 to 15 seconds. However it is still hardcoded. > In certain situations even the 15 seconds is not enough, so it should be made > configurable. > This is requested after seeing it in real-life workload failures. > Also it has been suggested and requested in an earlier comment in > [SPARK-18649|https://issues.apache.org/jira/browse/SPARK-18649?focusedCommentId=16493498=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16493498] > In > Spark 2.4 it is under > [PythonRDD.scala|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L899] > in Spark 3.x the code has been moved to > [SocketAuthServer.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/security/SocketAuthServer.scala#L51] > {code} > serverSocket.setSoTimeout(15000) > {code} > Please include this in both 2.4 and 3.x branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33143) Make SocketAuthServer socket timeout configurable
[ https://issues.apache.org/jira/browse/SPARK-33143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33143: Assignee: Apache Spark > Make SocketAuthServer socket timeout configurable > - > > Key: SPARK-33143 > URL: https://issues.apache.org/jira/browse/SPARK-33143 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.4.7, 3.0.1 >Reporter: Miklos Szurap >Assignee: Apache Spark >Priority: Major > > In SPARK-21551 the socket timeout for the Pyspark applications has been > increased from 3 to 15 seconds. However it is still hardcoded. > In certain situations even the 15 seconds is not enough, so it should be made > configurable. > This is requested after seeing it in real-life workload failures. > Also it has been suggested and requested in an earlier comment in > [SPARK-18649|https://issues.apache.org/jira/browse/SPARK-18649?focusedCommentId=16493498=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16493498] > In > Spark 2.4 it is under > [PythonRDD.scala|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L899] > in Spark 3.x the code has been moved to > [SocketAuthServer.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/security/SocketAuthServer.scala#L51] > {code} > serverSocket.setSoTimeout(15000) > {code} > Please include this in both 2.4 and 3.x branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
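The request in SPARK-33143 (replace the hardcoded `serverSocket.setSoTimeout(15000)` with a configurable value) can be sketched in Python as below. This is a minimal sketch, not the change made in the pull request: the configuration key `example.authSocketTimeoutSeconds` and both function names are hypothetical, and only the 15-second default comes from the issue text.

```python
import socket

DEFAULT_TIMEOUT_SECONDS = 15.0  # mirrors the hardcoded 15000 ms quoted in the issue


def resolve_timeout(conf, key="example.authSocketTimeoutSeconds"):
    """Return the timeout from conf (a dict), falling back to the 15 s default.

    The key name is a placeholder for illustration, not a real Spark property.
    """
    raw = conf.get(key)
    if raw is None:
        return DEFAULT_TIMEOUT_SECONDS
    timeout = float(raw)
    if timeout <= 0:
        raise ValueError(f"{key} must be positive, got {raw!r}")
    return timeout


def open_server_socket(conf):
    """Bind an ephemeral localhost port whose accept() honours the timeout."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.settimeout(resolve_timeout(conf))
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    return server
```

Keeping the default at 15 seconds means existing deployments behave as before unless the value is set explicitly.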
[jira] [Assigned] (SPARK-32222) Add integration tests
[ https://issues.apache.org/jira/browse/SPARK-32222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32222: Assignee: (was: Apache Spark) > Add integration tests > - > > Key: SPARK-32222 > URL: https://issues.apache.org/jira/browse/SPARK-32222 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Priority: Major > > Add an integration test that places a configuration file in SPARK_CONF_DIR and > verifies it is loaded on the executors in both client and cluster deploy > mode. > For this, a log4j.properties file is a good candidate for testing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32222) Add integration tests
[ https://issues.apache.org/jira/browse/SPARK-32222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232762#comment-17232762 ] Apache Spark commented on SPARK-32222: -- User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/30388 > Add integration tests > - > > Key: SPARK-32222 > URL: https://issues.apache.org/jira/browse/SPARK-32222 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Priority: Major > > Add an integration test that places a configuration file in SPARK_CONF_DIR and > verifies it is loaded on the executors in both client and cluster deploy > mode. > For this, a log4j.properties file is a good candidate for testing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32222) Add integration tests
[ https://issues.apache.org/jira/browse/SPARK-32222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32222: Assignee: Apache Spark > Add integration tests > - > > Key: SPARK-32222 > URL: https://issues.apache.org/jira/browse/SPARK-32222 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Apache Spark >Priority: Major > > Add an integration test that places a configuration file in SPARK_CONF_DIR and > verifies it is loaded on the executors in both client and cluster deploy > mode. > For this, a log4j.properties file is a good candidate for testing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17914) Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp
[ https://issues.apache.org/jira/browse/SPARK-17914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232671#comment-17232671 ] Pablo Cocko commented on SPARK-17914: - I think the solution only applies if the fractional seconds have more than 6 digits. https://github.com/apache/spark/pull/18252/commits/2f232a7bda28fb42759ee35923044f886a1ff19e > Spark SQL casting to TimestampType with nanosecond results in incorrect > timestamp > - > > Key: SPARK-17914 > URL: https://issues.apache.org/jira/browse/SPARK-17914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Oksana Romankova >Assignee: Anton Okolnychyi >Priority: Major > Fix For: 2.2.0, 2.3.0 > > > In some cases when timestamps contain nanoseconds they will be parsed > incorrectly. > Examples: > "2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.034567" > "2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.345678" > The issue seems to be happening in DateTimeUtils.stringToTimestamp(). It > assumes that only a 6-digit fraction of a second will be passed. > With this being the case I would suggest either discarding nanoseconds > automatically, or throwing an exception prompting the user to pre-format timestamps to > microsecond precision before casting to Timestamp. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
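The first option suggested in the SPARK-17914 description, discarding digits beyond microsecond precision before parsing, can be sketched as follows. This is an illustration only and not the code in `DateTimeUtils.stringToTimestamp()`; the function names are hypothetical, and the sketch assumes the input always carries a fractional part and a trailing `Z`.

```python
import re
from datetime import datetime

# First "." followed by digits: the fractional-second part of the timestamp.
_FRACTION = re.compile(r"\.(\d+)")


def to_microsecond_precision(text):
    """Truncate (or right-pad with zeros) the fraction to exactly 6 digits."""
    def fix(match):
        digits = match.group(1)
        return "." + digits[:6].ljust(6, "0")
    return _FRACTION.sub(fix, text, count=1)


def parse_utc(text):
    """Parse a timestamp such as 2016-05-14T15:12:14.0034567Z."""
    return datetime.strptime(
        to_microsecond_precision(text), "%Y-%m-%dT%H:%M:%S.%fZ"
    )
```

With truncation, the two inputs from the report keep their leading zeros: `.0034567` becomes `.003456` and `.000345678` becomes `.000345`, rather than the shifted values shown in the issue.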
[jira] [Assigned] (SPARK-33443) LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ]
[ https://issues.apache.org/jira/browse/SPARK-33443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33443: Assignee: (was: Apache Spark) > LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ] > > > Key: SPARK-33443 > URL: https://issues.apache.org/jira/browse/SPARK-33443 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > The current implementation of LEAD/LAG doesn't support IGNORE/RESPECT NULLS, but > mainstream databases support this syntax. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33443) LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ]
[ https://issues.apache.org/jira/browse/SPARK-33443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232664#comment-17232664 ] Apache Spark commented on SPARK-33443: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/30387 > LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ] > > > Key: SPARK-33443 > URL: https://issues.apache.org/jira/browse/SPARK-33443 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > The current implementation of LEAD/LAG doesn't support IGNORE/RESPECT NULLS, but > mainstream databases support this syntax. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33443) LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ]
[ https://issues.apache.org/jira/browse/SPARK-33443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33443: Assignee: Apache Spark > LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ] > > > Key: SPARK-33443 > URL: https://issues.apache.org/jira/browse/SPARK-33443 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > The current implementation of LEAD/LAG doesn't support IGNORE/RESPECT NULLS, but > mainstream databases support this syntax. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33443) LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ]
[ https://issues.apache.org/jira/browse/SPARK-33443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232665#comment-17232665 ] Apache Spark commented on SPARK-33443: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/30387 > LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ] > > > Key: SPARK-33443 > URL: https://issues.apache.org/jira/browse/SPARK-33443 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > The current implementation of LEAD/LAG doesn't support IGNORE/RESPECT NULLS, but > mainstream databases support this syntax. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33443) LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ]
[ https://issues.apache.org/jira/browse/SPARK-33443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-33443: --- Description: The current implementation of LEAD/LAG doesn't support IGNORE/RESPECT NULLS, but mainstream databases support this syntax. (was: The current implement of LEAD/LAG could not support IGNORE/RESPECT NULLS, but the mainstream database support this syntax.) > LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ] > > > Key: SPARK-33443 > URL: https://issues.apache.org/jira/browse/SPARK-33443 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > The current implementation of LEAD/LAG doesn't support IGNORE/RESPECT NULLS, but > mainstream databases support this syntax. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
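To make the difference between RESPECT NULLS and IGNORE NULLS in SPARK-33443 concrete, the semantics of LAG over a single ordered partition can be modelled in plain Python. This is a sketch of the SQL semantics as commonly defined, not Spark's implementation; the function name `lag` and its parameters are chosen for illustration, with `None` standing in for SQL NULL.

```python
def lag(values, offset=1, ignore_nulls=False, default=None):
    """Model LAG(value, offset) over one ordered partition.

    With ignore_nulls=True, None entries are skipped when counting back
    `offset` rows, which is the IGNORE NULLS behaviour.
    """
    if offset < 1:
        raise ValueError("offset must be at least 1")
    result = []
    seen = []  # earlier rows that are eligible to be returned
    for value in values:
        result.append(seen[-offset] if len(seen) >= offset else default)
        if value is not None or not ignore_nulls:
            seen.append(value)
    return result
```

For the column `[1, None, 3, None, 5]`, the default (RESPECT NULLS) gives `[None, 1, None, 3, None]`, while `ignore_nulls=True` gives `[None, 1, 1, 3, 3]`.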
[jira] [Assigned] (SPARK-33460) Accessing map values should fail if key is not found.
[ https://issues.apache.org/jira/browse/SPARK-33460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33460: Assignee: Apache Spark > Accessing map values should fail if key is not found. > - > > Key: SPARK-33460 > URL: https://issues.apache.org/jira/browse/SPARK-33460 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Assignee: Apache Spark >Priority: Major > > When ANSI mode is enabled, accessing map values should fail with an exception if > the key does not exist, but currently it returns null. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33460) Accessing map values should fail if key is not found.
[ https://issues.apache.org/jira/browse/SPARK-33460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232658#comment-17232658 ] Apache Spark commented on SPARK-33460: -- User 'leanken' has created a pull request for this issue: https://github.com/apache/spark/pull/30386 > Accessing map values should fail if key is not found. > - > > Key: SPARK-33460 > URL: https://issues.apache.org/jira/browse/SPARK-33460 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Priority: Major > > When ANSI mode is enabled, accessing map values should fail with an exception if > the key does not exist, but currently it returns null. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33460) Accessing map values should fail if key is not found.
[ https://issues.apache.org/jira/browse/SPARK-33460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33460: Assignee: (was: Apache Spark) > Accessing map values should fail if key is not found. > - > > Key: SPARK-33460 > URL: https://issues.apache.org/jira/browse/SPARK-33460 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Priority: Major > > When ANSI mode is enabled, accessing map values should fail with an exception if > the key does not exist, but currently it returns null. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
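The behaviour change requested in SPARK-33460 can be modelled with a small Python function: a lookup that returns null for a missing key by default and raises when a strict (ANSI) flag is set. This is a sketch of the described semantics only; the name `map_value`, the `ansi_enabled` parameter, and the choice of `KeyError` are assumptions for illustration and do not mirror Spark's classes or error types.

```python
def map_value(mapping, key, ansi_enabled=False):
    """Look up key in mapping.

    Without ANSI mode a missing key yields None (the current behaviour
    described in the issue); with ANSI mode it raises instead.
    """
    if key in mapping:
        return mapping[key]
    if ansi_enabled:
        raise KeyError(f"Key {key!r} does not exist.")
    return None
```

Checking membership first, rather than testing the result for `None`, keeps a stored null value distinguishable from a missing key.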
[jira] [Comment Edited] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.
[ https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232631#comment-17232631 ] Prashant Sharma edited comment on SPARK-30985 at 11/16/20, 9:21 AM: Reopening this JIRA: it was closed because I created a PR that incorrectly targeted the umbrella JIRA instead of the subtask. was (Author: prashant_): Reopening this JIRA as, this is Umbrella jira and I created a PR incorrectly targeting the umbrella JIRA instead of the subtask : > Propagate SPARK_CONF_DIR files to driver and exec pods. > --- > > Key: SPARK-30985 > URL: https://issues.apache.org/jira/browse/SPARK-30985 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Priority: Major > Fix For: 3.1.0 > > > SPARK_CONF_DIR hosts configuration files like, > 1) spark-defaults.conf - containing all the spark properties. > 2) log4j.properties - Logger configuration. > 3) spark-env.sh - Environment variables to be setup at driver and executor. > 4) core-site.xml - Hadoop related configuration. > 5) fairscheduler.xml - Spark's fair scheduling policy at the job level. > 6) metrics.properties - Spark metrics. > 7) Any user specific - library or framework specific configuration file. > Traditionally, SPARK_CONF_DIR has been the home to all user specific > configuration files. > So this feature, will let the user specific configuration files be mounted on > the driver and executor pods' SPARK_CONF_DIR. > Please review the attached design doc, for more details. > > [Google docs link|https://bit.ly/spark-30985] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.
[ https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma reassigned SPARK-30985: --- Assignee: (was: Prashant Sharma) > Propagate SPARK_CONF_DIR files to driver and exec pods. > --- > > Key: SPARK-30985 > URL: https://issues.apache.org/jira/browse/SPARK-30985 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Priority: Major > Fix For: 3.1.0 > > > SPARK_CONF_DIR hosts configuration files like, > 1) spark-defaults.conf - containing all the spark properties. > 2) log4j.properties - Logger configuration. > 3) spark-env.sh - Environment variables to be setup at driver and executor. > 4) core-site.xml - Hadoop related configuration. > 5) fairscheduler.xml - Spark's fair scheduling policy at the job level. > 6) metrics.properties - Spark metrics. > 7) Any user specific - library or framework specific configuration file. > Traditionally, SPARK_CONF_DIR has been the home to all user specific > configuration files. > So this feature, will let the user specific configuration files be mounted on > the driver and executor pods' SPARK_CONF_DIR. > Please review the attached design doc, for more details. > > [Google docs link|https://bit.ly/spark-30985] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33461) Foundational work for propagating SPARK_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-33461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma reassigned SPARK-33461: --- Assignee: Prashant Sharma > Foundational work for propagating SPARK_CONF_DIR > > > Key: SPARK-33461 > URL: https://issues.apache.org/jira/browse/SPARK-33461 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > > Foundational work for propagating SPARK_CONF_DIR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33461) Foundational work for propagating SPARK_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-33461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma resolved SPARK-33461. - Fix Version/s: 3.1.0 Resolution: Fixed > Foundational work for propagating SPARK_CONF_DIR > > > Key: SPARK-33461 > URL: https://issues.apache.org/jira/browse/SPARK-33461 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > Fix For: 3.1.0 > > > Foundational work for propagating SPARK_CONF_DIR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33461) Foundational work for propagating SPARK_CONF_DIR
Prashant Sharma created SPARK-33461: --- Summary: Foundational work for propagating SPARK_CONF_DIR Key: SPARK-33461 URL: https://issues.apache.org/jira/browse/SPARK-33461 Project: Spark Issue Type: Sub-task Components: Kubernetes Affects Versions: 3.1.0 Reporter: Prashant Sharma Foundational work for propagating SPARK_CONF_DIR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.
[ https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma reopened SPARK-30985: - Reopening this JIRA because it is the umbrella JIRA and I created a PR that incorrectly targeted it instead of the subtask. > Propagate SPARK_CONF_DIR files to driver and exec pods. > --- > > Key: SPARK-30985 > URL: https://issues.apache.org/jira/browse/SPARK-30985 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > Fix For: 3.1.0 > > > SPARK_CONF_DIR hosts configuration files like, > 1) spark-defaults.conf - containing all the spark properties. > 2) log4j.properties - Logger configuration. > 3) spark-env.sh - Environment variables to be setup at driver and executor. > 4) core-site.xml - Hadoop related configuration. > 5) fairscheduler.xml - Spark's fair scheduling policy at the job level. > 6) metrics.properties - Spark metrics. > 7) Any user specific - library or framework specific configuration file. > Traditionally, SPARK_CONF_DIR has been the home to all user specific > configuration files. > So this feature, will let the user specific configuration files be mounted on > the driver and executor pods' SPARK_CONF_DIR. > Please review the attached design doc, for more details. > > [Google docs link|https://bit.ly/spark-30985] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.
[ https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232628#comment-17232628 ] Prashant Sharma commented on SPARK-30985: - [~dongjoon] Hm.. my mistake. As you said, I added subtasks after creating the PR and this JIRA. I will re-open this JIRA and create a subtask and resolve it. > Propagate SPARK_CONF_DIR files to driver and exec pods. > --- > > Key: SPARK-30985 > URL: https://issues.apache.org/jira/browse/SPARK-30985 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > Fix For: 3.1.0 > > > SPARK_CONF_DIR hosts configuration files like, > 1) spark-defaults.conf - containing all the spark properties. > 2) log4j.properties - Logger configuration. > 3) spark-env.sh - Environment variables to be setup at driver and executor. > 4) core-site.xml - Hadoop related configuration. > 5) fairscheduler.xml - Spark's fair scheduling policy at the job level. > 6) metrics.properties - Spark metrics. > 7) Any user specific - library or framework specific configuration file. > Traditionally, SPARK_CONF_DIR has been the home to all user specific > configuration files. > So this feature, will let the user specific configuration files be mounted on > the driver and executor pods' SPARK_CONF_DIR. > Please review the attached design doc, for more details. > > [Google docs link|https://bit.ly/spark-30985] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.
[ https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232601#comment-17232601 ] Dongjoon Hyun commented on SPARK-30985: --- [~prashant], you should not make a PR against the umbrella JIRA. The umbrella JIRA is resolved when all the subtask JIRAs are resolved. If you make a PR with the umbrella JIRA ID, the Spark merge script resolves it. Since this seems to have been created a long time ago, the JIRA management is up to you. The above comment is just my suggestion. > Propagate SPARK_CONF_DIR files to driver and exec pods. > --- > > Key: SPARK-30985 > URL: https://issues.apache.org/jira/browse/SPARK-30985 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > Fix For: 3.1.0 > > > SPARK_CONF_DIR hosts configuration files like, > 1) spark-defaults.conf - containing all the spark properties. > 2) log4j.properties - Logger configuration. > 3) spark-env.sh - Environment variables to be setup at driver and executor. > 4) core-site.xml - Hadoop related configuration. > 5) fairscheduler.xml - Spark's fair scheduling policy at the job level. > 6) metrics.properties - Spark metrics. > 7) Any user specific - library or framework specific configuration file. > Traditionally, SPARK_CONF_DIR has been the home to all user specific > configuration files. > So this feature, will let the user specific configuration files be mounted on > the driver and executor pods' SPARK_CONF_DIR. > Please review the attached design doc, for more details. > > [Google docs link|https://bit.ly/spark-30985] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.
[ https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232601#comment-17232601 ] Dongjoon Hyun edited comment on SPARK-30985 at 11/16/20, 8:14 AM: -- [~prashant]. You should not make a PR with the umbrella JIRA. The umbrella Jira should be resolved when the all subtask JIRAs are resolved. However, if you make a PR with the umbrella Jira ID, Spark merge script resolves it. Since this seems to be created long time ago, the Jira management is up to you. The above comment is just my suggestion. was (Author: dongjoon): [~prashant]. You should not make a PR with the umbrella JIRA. The umbrella Jira is resolved when the all subtask JIRAs are resolved. If you make a PR with the umbrella Jira ID, Spark merge script resolves it. Since this seems to be created long time ago, the Jira management is up to you. The above comment is just my suggestion. > Propagate SPARK_CONF_DIR files to driver and exec pods. > --- > > Key: SPARK-30985 > URL: https://issues.apache.org/jira/browse/SPARK-30985 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > Fix For: 3.1.0 > > > SPARK_CONF_DIR hosts configuration files like, > 1) spark-defaults.conf - containing all the spark properties. > 2) log4j.properties - Logger configuration. > 3) spark-env.sh - Environment variables to be setup at driver and executor. > 4) core-site.xml - Hadoop related configuration. > 5) fairscheduler.xml - Spark's fair scheduling policy at the job level. > 6) metrics.properties - Spark metrics. > 7) Any user specific - library or framework specific configuration file. > Traditionally, SPARK_CONF_DIR has been the home to all user specific > configuration files. > So this feature, will let the user specific configuration files be mounted on > the driver and executor pods' SPARK_CONF_DIR. 
> Please review the attached design doc for more details. > > [Google docs link|https://bit.ly/spark-30985] >
[jira] [Comment Edited] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.
[ https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232600#comment-17232600 ] Dongjoon Hyun edited comment on SPARK-30985 at 11/16/20, 8:08 AM: -- ~It's a little weird to convert a merged JIRA into an umbrella.~ Hmm. It seems that I was confused about the history. was (Author: dongjoon): It's a little weird to convert a merged JIRA into an umbrella. > Propagate SPARK_CONF_DIR files to driver and exec pods. > --- > > Key: SPARK-30985 > URL: https://issues.apache.org/jira/browse/SPARK-30985 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > Fix For: 3.1.0 > > > SPARK_CONF_DIR hosts configuration files such as: > 1) spark-defaults.conf - contains all the Spark properties. > 2) log4j.properties - Logger configuration. > 3) spark-env.sh - Environment variables to be set up at driver and executor. > 4) core-site.xml - Hadoop-related configuration. > 5) fairscheduler.xml - Spark's fair scheduling policy at the job level. > 6) metrics.properties - Spark metrics. > 7) Any user-specific, library-specific, or framework-specific configuration file. > Traditionally, SPARK_CONF_DIR has been the home of all user-specific > configuration files. > So this feature will let user-specific configuration files be mounted on > the driver and executor pods' SPARK_CONF_DIR. > Please review the attached design doc for more details. > > [Google docs link|https://bit.ly/spark-30985] >