[jira] [Commented] (SPARK-23758) MLlib 2.4 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-23758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889370#comment-16889370 ] Dongjoon Hyun commented on SPARK-23758: --- Thank you so much, [~josephkb]! > MLlib 2.4 Roadmap > - > > Key: SPARK-23758 > URL: https://issues.apache.org/jira/browse/SPARK-23758 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > h1. Roadmap process > This roadmap is a master list for MLlib improvements we are working on during > this release. This includes ML-related changes in PySpark and SparkR. > *What is planned for the next release?* > * This roadmap lists issues which at least one Committer has prioritized. > See details below in "Instructions for committers." > * This roadmap only lists larger or more critical issues. > *How can contributors influence this roadmap?* > * If you believe an issue should be in this roadmap, please discuss the issue > on JIRA and/or the dev mailing list. Make sure to ping Committers since at > least one must agree to shepherd the issue. > * For general discussions, use this JIRA or the dev mailing list. For > specific issues, please comment on those issues or the mailing list. > * Vote for & watch issues which are important to you. 
> ** MLlib, sorted by: [Votes | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC] > or [Watchers | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC] > ** SparkR, sorted by: [Votes | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC] > or [Watchers | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC] > h2. Target Version and Priority > This section describes the meaning of Target Version and Priority. > || Category | Target Version | Priority | Shepherd | Put on roadmap? | In > next release? 
|| > | [1 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Blocker%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] > | next release | Blocker | *must* | *must* | *must* | > | [2 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Critical%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] > | next release | Critical | *must* | yes, unless small | *best effort* | > | [3 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Major%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] > | next release | Major | *must* | optional | *best effort* | > | [4 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Minor%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] > | next release | Minor | optional | no | maybe | > | [5 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Trivial%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] > | next release | Trivial | optional | no | maybe | > | [6 | > 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20"Target%20Version%2Fs"%20in%20(EMPTY)%20AND%20Shepherd%20not%20in%20(EMPTY)%20ORDER%20BY%20priority%20DESC] > | (empty) | (any) | yes | no | maybe | > | [7 | >
[jira] [Commented] (SPARK-28155) Improve SQL optimizer's predicate pushdown performance for cascading joins
[ https://issues.apache.org/jira/browse/SPARK-28155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889362#comment-16889362 ] Dongjoon Hyun commented on SPARK-28155: --- This is committed with a wrong JIRA ID, `SPARK-28155`. > Improve SQL optimizer's predicate pushdown performance for cascading joins > -- > > Key: SPARK-28155 > URL: https://issues.apache.org/jira/browse/SPARK-28155 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yesheng Ma >Assignee: Yesheng Ma >Priority: Major > Fix For: 3.0.0 > > > The current Catalyst optimizer's predicate pushdown is divided into two > separate rules: PushDownPredicate and PushThroughJoin. This is not efficient > for optimizing cascading joins such as TPC-DS q64, where a whole default > batch is re-executed just because of this. We need a more efficient approach that > pushes down predicates as far as possible in a single pass. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
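The equivalence that predicate pushdown relies on can be sketched outside Spark. The following toy example (plain Python, not Catalyst code) shows that filtering one side of a join before joining yields the same rows as filtering the join output, while the join itself touches far fewer rows:

```python
# Toy illustration of predicate pushdown: filtering the left input
# before a join is equivalent to filtering the join output, but the
# join processes far fewer rows. (Plain Python, not Catalyst.)
left = [(i, "L") for i in range(1000)]
right = [(i, "R") for i in range(1000)]

def hash_join(a, b):
    """Equi-join two (key, value) lists on the key."""
    index = {}
    for k, v in b:
        index.setdefault(k, []).append(v)
    return [(k, va, vb) for k, va in a for vb in index.get(k, [])]

def pred(k):
    return k < 10

# Filter after the join: builds all 1000 joined rows, then keeps 10.
filtered_after = [row for row in hash_join(left, right) if pred(row[0])]
# Predicate pushed below the join: the join only ever sees 10 left rows.
pushed_down = hash_join([(k, v) for k, v in left if pred(k)], right)

assert sorted(filtered_after) == sorted(pushed_down)
```

An optimizer that needs several rule applications (or a re-run of a whole rule batch) to reach the pushed-down form pays that cost per join level, which is what the single-pass approach avoids.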
[jira] [Commented] (SPARK-28433) Incorrect assertion in scala test for aarch64 platform
[ https://issues.apache.org/jira/browse/SPARK-28433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889263#comment-16889263 ] Dongjoon Hyun commented on SPARK-28433: --- I removed `2.4.3` from the affected versions because there is no test using those assertions in `branch-2.4`. > Incorrect assertion in scala test for aarch64 platform > -- > > Key: SPARK-28433 > URL: https://issues.apache.org/jira/browse/SPARK-28433 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: huangtianhua >Assignee: huangtianhua >Priority: Minor > Fix For: 3.0.0 > > > We ran the unit tests of Spark on an aarch64 server; two SQL Scala tests > failed: > - SPARK-26021: NaN and -0.0 in grouping expressions *** FAILED *** >2143289344 equaled 2143289344 (DataFrameAggregateSuite.scala:732) > - NaN and -0.0 in window partition keys *** FAILED *** >2143289344 equaled 2143289344 (DataFrameWindowFunctionsSuite.scala:704) > We found that the values of floatToRawIntBits(0.0f / 0.0f) and > floatToRawIntBits(Float.NaN) are the same (2143289344) on aarch64. At first we > thought it was something about the JDK or Scala, but after discussing with the jdk-dev and > Scala communities (see > https://users.scala-lang.org/t/the-value-of-floattorawintbits-0-0f-0-0f-is-different-on-x86-64-and-aarch64-platforms/4845 > ), we believe the value depends on the architecture.
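The underlying issue — that two different NaN-producing expressions need not share one raw bit pattern — can be illustrated outside the JVM. A minimal Python analogue of `floatToRawIntBits` (a sketch, not the Spark test code):

```python
import math
import struct

def float_to_raw_int_bits(x: float) -> int:
    """Raw IEEE-754 single-precision bit pattern of x, analogous to
    Java's java.lang.Float.floatToRawIntBits."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def is_nan_bits(bits: int) -> bool:
    # Every NaN has an all-ones exponent and a non-zero mantissa; the
    # sign bit and payload bits are NOT standardized, which is why
    # comparing raw bits of two NaNs is architecture-dependent.
    return (bits >> 23) & 0xFF == 0xFF and bits & 0x7FFFFF != 0

nan_a = float("nan")
nan_b = math.inf - math.inf  # another expression that produces a NaN

assert is_nan_bits(float_to_raw_int_bits(nan_a))
assert is_nan_bits(float_to_raw_int_bits(nan_b))
# Whether float_to_raw_int_bits(nan_a) == float_to_raw_int_bits(nan_b)
# is platform-dependent, so a portable test must not assert either way.
```

A robust assertion checks NaN-ness (or normalizes via a `floatToIntBits`-style canonicalization) instead of asserting that two raw bit patterns differ.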
[jira] [Updated] (SPARK-28433) Incorrect assertion in scala test for aarch64 platform
[ https://issues.apache.org/jira/browse/SPARK-28433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28433: -- Affects Version/s: (was: 2.4.3)
[jira] [Resolved] (SPARK-28433) Incorrect assertion in scala test for aarch64 platform
[ https://issues.apache.org/jira/browse/SPARK-28433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28433. --- Resolution: Fixed Assignee: huangtianhua Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/25186
[jira] [Resolved] (SPARK-23758) MLlib 2.4 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-23758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-23758. --- Resolution: Done
[jira] [Updated] (SPARK-23758) MLlib 2.4 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-23758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-23758: -- Affects Version/s: (was: 3.0.0) 2.4.0
[jira] [Commented] (SPARK-23758) MLlib 2.4 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-23758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889240#comment-16889240 ] Joseph K. Bradley commented on SPARK-23758: --- Ah sorry, we stopped using this. I'll close it.
[jira] [Created] (SPARK-28456) Add a public API `Encoder.copyEncoder` to allow creating Encoder without touching Scala reflections
Shixiong Zhu created SPARK-28456: Summary: Add a public API `Encoder.copyEncoder` to allow creating Encoder without touching Scala reflections Key: SPARK-28456 URL: https://issues.apache.org/jira/browse/SPARK-28456 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.3 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Because `Encoder` is not thread-safe, the user cannot reuse an `Encoder` across multiple `Dataset`s. However, creating an `Encoder` for a complicated class is slow due to Scala reflection. To reduce the cost of Encoder creation, right now I usually use the private API `ExpressionEncoder.copy` as follows: {code} object FooEncoder { private lazy val _encoder: ExpressionEncoder[Foo] = ExpressionEncoder[Foo]() implicit def encoder: ExpressionEncoder[Foo] = _encoder.copy() } {code} This ticket proposes a new method `copyEncoder` in `Encoder` so that the above code can be rewritten using public APIs: {code} object FooEncoder { private lazy val _encoder: Encoder[Foo] = Encoders.product[Foo]() implicit def encoder: Encoder[Foo] = _encoder.copyEncoder() } {code} Regarding the method name: - Why not use `copy`? It conflicts with `case class`'s copy. - Why not use `clone`? It conflicts with `Object.clone`.
[jira] [Created] (SPARK-28455) Executor may be timed out too soon because of overflow in tracking code
Marcelo Vanzin created SPARK-28455: -- Summary: Executor may be timed out too soon because of overflow in tracking code Key: SPARK-28455 URL: https://issues.apache.org/jira/browse/SPARK-28455 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Marcelo Vanzin This affects the new code added in SPARK-27963 (so normal dynamic allocation is fine). There's an overflow issue in that code that may cause executors to be timed out early with the default configuration.
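A plausible shape of this class of bug (a hypothetical illustration, not the actual SPARK-27963 code): computing a deadline as `now + timeoutMs` in 64-bit signed arithmetic wraps negative when the timeout is an effectively-infinite sentinel, so an idle executor looks expired immediately. Python integers don't overflow, so the wraparound is simulated:

```python
INT64_MAX = 2**63 - 1

def add_i64(a: int, b: int) -> int:
    """Simulate Java/Scala wrapping 64-bit signed addition."""
    s = (a + b) & (2**64 - 1)
    return s - 2**64 if s >= 2**63 else s

now_ms = 1_563_000_000_000   # some wall-clock time in milliseconds
timeout_ms = INT64_MAX       # hypothetical "never time out" sentinel

deadline = add_i64(now_ms, timeout_ms)  # naive now + timeout wraps negative
assert deadline < now_ms                # executor appears expired at once

# A common fix is to saturate the addition instead of letting it wrap:
safe_deadline = min(now_ms + timeout_ms, INT64_MAX)
assert safe_deadline >= now_ms
```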
[jira] [Commented] (SPARK-28454) Validate LongType in _make_type_verifier
[ https://issues.apache.org/jira/browse/SPARK-28454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889204#comment-16889204 ] AY commented on SPARK-28454: [https://github.com/apache/spark/pull/25117] - related PR. > Validate LongType in _make_type_verifier > > > Key: SPARK-28454 > URL: https://issues.apache.org/jira/browse/SPARK-28454 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.3 >Reporter: AY >Priority: Major > > {{pyspark.sql.types._make_type_verifier}} doesn't validate the LongType value > range.
[jira] [Created] (SPARK-28454) Validate LongType in _make_type_verifier
AY created SPARK-28454: -- Summary: Validate LongType in _make_type_verifier Key: SPARK-28454 URL: https://issues.apache.org/jira/browse/SPARK-28454 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.3 Reporter: AY {{pyspark.sql.types._make_type_verifier}} doesn't validate the LongType value range.
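What the missing check amounts to: LongType is a signed 64-bit integer, so a verifier should reject Python ints outside that range. A sketch of such a check (a hypothetical helper, not the actual `_make_type_verifier` internals):

```python
LONG_MIN, LONG_MAX = -(2**63), 2**63 - 1

def verify_long(obj):
    """Reject values that cannot be stored in a signed 64-bit LongType.
    (Illustrative sketch; error messages are hypothetical.)"""
    if not isinstance(obj, int):
        raise TypeError(f"LongType can not accept object {obj!r}")
    if not (LONG_MIN <= obj <= LONG_MAX):
        raise ValueError(f"object of LongType out of range: {obj}")
    return obj

verify_long(2**63 - 1)    # largest value that fits
try:
    verify_long(2**63)    # one past the range: should be rejected
except ValueError:
    pass
```

Without such a check, an out-of-range Python int passes schema verification and only fails (or silently corrupts) later, during serialization on the JVM side.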
[jira] [Created] (SPARK-28453) Support recursive view syntax
Peter Toth created SPARK-28453: -- Summary: Support recursive view syntax Key: SPARK-28453 URL: https://issues.apache.org/jira/browse/SPARK-28453 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Peter Toth PostgreSQL supports recursive view syntax: {noformat} CREATE RECURSIVE VIEW nums (n) AS VALUES (1) UNION ALL SELECT n+1 FROM nums WHERE n < 5 {noformat}
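The view above enumerates 1 through 5. Its fixpoint semantics can be sketched in a few lines of Python (an illustration of `WITH RECURSIVE` evaluation, not Spark code):

```python
def recursive_view_nums(limit: int = 5):
    """Evaluate the nums view: the anchor member VALUES (1) seeds the
    working set, then the recursive member SELECT n+1 FROM nums WHERE
    n < limit is applied repeatedly until no new rows are produced
    (UNION ALL semantics, so no deduplication is needed)."""
    result = []
    working = [1]  # anchor member: VALUES (1)
    while working:
        result.extend(working)
        working = [n + 1 for n in working if n < limit]  # recursive member
    return result

assert recursive_view_nums() == [1, 2, 3, 4, 5]
```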
[jira] [Commented] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888987#comment-16888987 ] Dongjoon Hyun commented on SPARK-28444: --- BTW, I also agree with [~skonto]. I believe the tests will pass because we don't use new features of 1.14. K8s itself and the K8s client library should provide the compatibility. > Bump Kubernetes Client Version to 4.3.0 > --- > > Key: SPARK-28444 > URL: https://issues.apache.org/jira/browse/SPARK-28444 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Patrick Winter >Priority: Major > > Spark is currently using Kubernetes client version 4.1.2. This client > does not support the current Kubernetes version 1.14, as can be seen in the > [compatibility matrix|https://github.com/fabric8io/kubernetes-client#compatibility-matrix]. > Therefore the Kubernetes client should be bumped to version 4.3.0.
[jira] [Commented] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888986#comment-16888986 ] Dongjoon Hyun commented on SPARK-28444: --- According to that matrix, it looks reasonable to make a PR because it's not covered officially, [~patrick-winter-swisscard].
[jira] [Commented] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888975#comment-16888975 ] Patrick Winter commented on SPARK-28444: I agree. I will be on holiday for the next few weeks, but will keep you updated once we find out more. Thanks for your help!
[jira] [Assigned] (SPARK-28440) Use TestingUtils to compare floating point values
[ https://issues.apache.org/jira/browse/SPARK-28440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28440: - Assignee: Ievgen Prokhorenko > Use TestingUtils to compare floating point values > - > > Key: SPARK-28440 > URL: https://issues.apache.org/jira/browse/SPARK-28440 > Project: Spark > Issue Type: Improvement > Components: MLlib, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Ievgen Prokhorenko >Priority: Minor > Fix For: 3.0.0
[jira] [Comment Edited] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888970#comment-16888970 ] Stavros Kontopoulos edited comment on SPARK-28444 at 7/19/19 3:26 PM: -- Right now on master we have 4.1.2 [https://github.com/apache/spark/blob/453cbf3dd8df5ec4da844c93eb6000610b551541/resource-managers/kubernetes/core/pom.xml#L32]. Afaik this is the same version for 2.4.3. Something else is not right. was (Author: skonto): Right now on master we have 4.1.2 [https://github.com/apache/spark/blob/453cbf3dd8df5ec4da844c93eb6000610b551541/resource-managers/kubernetes/core/pom.xml#L32]. Afaik this is the same version for 2.4.3.
[jira] [Comment Edited] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888970#comment-16888970 ] Stavros Kontopoulos edited comment on SPARK-28444 at 7/19/19 3:26 PM: -- Right now on master we have 4.1.2 [https://github.com/apache/spark/blob/453cbf3dd8df5ec4da844c93eb6000610b551541/resource-managers/kubernetes/core/pom.xml#L32]. Afaik this is the same version for 2.4.3. was (Author: skonto): Right now on master we have 4.1.2 [https://github.com/apache/spark/blob/453cbf3dd8df5ec4da844c93eb6000610b551541/resource-managers/kubernetes/core/pom.xml#L32]. Did you try 3.0.0?
[jira] [Comment Edited] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888970#comment-16888970 ] Stavros Kontopoulos edited comment on SPARK-28444 at 7/19/19 3:25 PM: -- Right now on master we have [4.1.2|https://github.com/apache/spark/blob/453cbf3dd8df5ec4da844c93eb6000610b551541/resource-managers/kubernetes/core/pom.xml#L32] was (Author: skonto): Right now on master we have [4.1.2|https://github.com/apache/spark/blob/453cbf3dd8df5ec4da844c93eb6000610b551541/resource-managers/kubernetes/core/pom.xml#L32]
[jira] [Comment Edited] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888970#comment-16888970 ] Stavros Kontopoulos edited comment on SPARK-28444 at 7/19/19 3:25 PM: -- Right now on master we have 4.1.2 [https://github.com/apache/spark/blob/453cbf3dd8df5ec4da844c93eb6000610b551541/resource-managers/kubernetes/core/pom.xml#L32]. Did you try 3.0.0? was (Author: skonto): Right now on master we have [4.1.2|https://github.com/apache/spark/blob/453cbf3dd8df5ec4da844c93eb6000610b551541/resource-managers/kubernetes/core/pom.xml#L32]
[jira] [Commented] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888970#comment-16888970 ] Stavros Kontopoulos commented on SPARK-28444: - Right now on master we have [4.1.2|https://github.com/apache/spark/blob/453cbf3dd8df5ec4da844c93eb6000610b551541/resource-managers/kubernetes/core/pom.xml#L32]
[jira] [Commented] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888965#comment-16888965 ] Patrick Winter commented on SPARK-28444: Using the Kubernetes client in a standalone application works with both versions 4.1.2 and 4.3.0. We used the same credentials as for Spark.
[jira] [Updated] (SPARK-28441) PythonUDF used in correlated scalar subquery causes
[ https://issues.apache.org/jira/browse/SPARK-28441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-28441: Summary: PythonUDF used in correlated scalar subquery causes (was: udf(max(udf(column))) throws java.lang.UnsupportedOperationException: Cannot evaluate expression: udf(null)) > PythonUDF used in correlated scalar subquery causes > > > Key: SPARK-28441 > URL: https://issues.apache.org/jira/browse/SPARK-28441 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Minor > > I found this when doing https://issues.apache.org/jira/browse/SPARK-28277 > > {code:java} > >>> @pandas_udf("string", PandasUDFType.SCALAR) > ... def noop(x): > ... return x.apply(str) > ... > >>> spark.udf.register("udf", noop) > > >>> spark.sql("CREATE OR REPLACE TEMPORARY VIEW t1 as select * from values > >>> (\"one\", 1), (\"two\", 2),(\"three\", 3),(\"one\", NULL) as t1(k, v)") > DataFrame[] > >>> spark.sql("CREATE OR REPLACE TEMPORARY VIEW t2 as select * from values > >>> (\"one\", 1), (\"two\", 22),(\"one\", 5),(\"one\", NULL), (NULL, 5) as > >>> t2(k, v)") > DataFrame[] > >>> spark.sql("SELECT t1.k FROM t1 WHERE t1.v <= (SELECT > >>> udf(max(udf(t2.v))) FROM t2 WHERE udf(t2.k) = udf(t1.k))").show() > py4j.protocol.Py4JJavaError: An error occurred while calling o65.showString. > : java.lang.UnsupportedOperationException: Cannot evaluate expression: > udf(null) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.eval(Expression.scala:296) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.eval$(Expression.scala:295) > at > org.apache.spark.sql.catalyst.expressions.PythonUDF.eval(PythonUDF.scala:52) > {code}
[jira] [Updated] (SPARK-28441) PythonUDF used in correlated scalar subquery causes UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SPARK-28441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-28441: Summary: PythonUDF used in correlated scalar subquery causes UnsupportedOperationException (was: PythonUDF used in correlated scalar subquery causes )
[jira] [Updated] (SPARK-28441) PythonUDF used in correlated scalar subquery causes UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SPARK-28441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-28441: Priority: Major (was: Minor)
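The stack trace in SPARK-28441 points at Catalyst's Unevaluable trait. As a rough illustration of the failure mode (plain Python with hypothetical class names, not Spark's actual Catalyst code): a PythonUDF expression is a placeholder that cannot be evaluated on the JVM side, so any pass that eagerly calls eval() on it, as the correlated-subquery rewrite apparently does here, blows up with exactly this kind of exception.

```python
class UnsupportedOperationError(Exception):
    pass

class Expression:
    def eval(self, row):
        raise NotImplementedError

class Literal(Expression):
    def __init__(self, value):
        self.value = value
    def __repr__(self):
        return f"Literal({self.value!r})"
    def eval(self, row):
        return self.value

class Unevaluable(Expression):
    """Placeholder expressions that cannot be evaluated on the JVM side;
    in Catalyst, PythonUDF is one of them."""
    def eval(self, row):
        raise UnsupportedOperationError(f"Cannot evaluate expression: {self}")

class PythonUDF(Unevaluable):
    def __init__(self, name, child):
        self.name = name
        self.child = child
    def __repr__(self):
        return f"{self.name}({self.child!r})"

def constant_fold(expr):
    # A naive optimizer pass that eagerly evaluates every expression;
    # handing it an unevaluable node reproduces the reported exception.
    return Literal(expr.eval(row=None))

try:
    constant_fold(PythonUDF("udf", Literal(None)))
except UnsupportedOperationError as e:
    print(e)  # Cannot evaluate expression: udf(Literal(None))
```

The fix on the Spark side would be to keep such nodes out of positions the optimizer tries to evaluate, rather than to make them evaluable.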
[jira] [Comment Edited] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1671#comment-1671 ] Stavros Kontopoulos edited comment on SPARK-28444 at 7/19/19 1:46 PM: -- I am not sure this is a k8s client version issue; it seems more like a credentials issue. But let's find out. Have you tried to update the k8s client? Can you verify whether you can or can't create pods with a simple app (outside Spark) using the fabric8io k8s client in different versions? Does it work with minikube 1.14? was (Author: skonto): I am not sure this is a k8s client version issue; it is more like a credentials issue. Have you tried to update the k8s client? Can you verify whether you can or can't create pods with a simple app (outside Spark) using the fabric8io k8s client in different versions? Does it work with minikube 1.14?
[jira] [Comment Edited] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1671#comment-1671 ] Stavros Kontopoulos edited comment on SPARK-28444 at 7/19/19 1:46 PM: -- I am not sure this is a k8s client version issue; it is more like a credentials issue. Have you tried to update the k8s client? Can you verify whether you can or can't create pods with a simple app (outside Spark) using the fabric8io k8s client in different versions? Does it work with minikube 1.14? was (Author: skonto): I am not sure this is a k8s client version issue; it is more like a credentials issue. Have you tried to update the k8s client? Can you verify whether you can or can't create pods with a simple app (outside Spark) using the fabric8io k8s client in different versions?
[jira] [Comment Edited] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1671#comment-1671 ] Stavros Kontopoulos edited comment on SPARK-28444 at 7/19/19 1:45 PM: -- I am not sure this is a k8s client version issue; it is more like a credentials issue. Have you tried to update the k8s client? Can you verify whether you can or can't create pods with a simple app (outside Spark) using the fabric8io k8s client in different versions? was (Author: skonto): I am not sure this is a k8s client version issue; it is more like a credentials issue. Have you tried to update the k8s client? Can you verify whether you can or can't create pods with a simple app (outside Spark) using the fabric8ios client in different versions?
[jira] [Comment Edited] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1671#comment-1671 ] Stavros Kontopoulos edited comment on SPARK-28444 at 7/19/19 1:45 PM: -- I am not sure this is a k8s client version issue; it is more like a credentials issue. Have you tried to update the k8s client? Can you verify whether you can or can't create pods with a simple app (outside Spark) using the fabric8ios client in different versions? was (Author: skonto): I am not sure this is a k8s client version issue; it is more like a credentials issue. Have you tried to update the k8s client? Can you verify whether you can create pods with a simple app using the fabric8ios client in different versions?
[jira] [Comment Edited] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1671#comment-1671 ] Stavros Kontopoulos edited comment on SPARK-28444 at 7/19/19 1:44 PM: -- I am not sure this is a k8s client version issue; it is more like a credentials issue. Have you tried to update the k8s client? Can you verify whether you can create pods with a simple app using the fabric8ios client in different versions? was (Author: skonto): I am not sure this is a k8s client version issue; it is more like a credentials issue. Have you tried to update the k8s client?
[jira] [Commented] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1671#comment-1671 ] Stavros Kontopoulos commented on SPARK-28444: - I am not sure this is a k8s client version issue; it is more like a credentials issue. Have you tried to update the k8s client?
[jira] [Created] (SPARK-28452) CSV datasource writer do not support maxCharsPerColumn option
Weichen Xu created SPARK-28452: -- Summary: CSV datasource writer do not support maxCharsPerColumn option Key: SPARK-28452 URL: https://issues.apache.org/jira/browse/SPARK-28452 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Reporter: Weichen Xu The CSV datasource reader supports the maxCharsPerColumn option, but the CSV datasource writer does not. Should we make the CSV datasource writer also support maxCharsPerColumn, so that the reader and writer have consistent behavior on this option? Otherwise a user may write a DataFrame to CSV successfully but then fail to load it.
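The round-trip failure described above can be sketched in a few lines of plain Python (hypothetical helper names; this is not Spark's or univocity's CSV code): a writer that applies no per-column limit can emit a field that a reader enforcing maxCharsPerColumn then rejects.

```python
def write_csv_row(fields):
    """Writer applies no maxCharsPerColumn check, mirroring the reported gap."""
    return ",".join(fields)

def read_csv_row(line, max_chars_per_column=20):
    """Reader rejects any field longer than the configured limit."""
    fields = line.split(",")
    for field in fields:
        if len(field) > max_chars_per_column:
            # Error text is illustrative, not the parser's real message.
            raise ValueError(
                f"length of parsed field ({len(field)}) exceeds "
                f"maxCharsPerColumn ({max_chars_per_column})")
    return fields

line = write_csv_row(["ok", "x" * 100])  # write succeeds...
try:
    read_csv_row(line)                   # ...but the round trip fails
except ValueError as e:
    print(e)
```

Making the writer honor the same limit (or at least warn) would close this asymmetry.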
[jira] [Commented] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1653#comment-1653 ] Patrick Winter commented on SPARK-28444: We have been hitting this one previously, but are now already using a user account (.kube/config) on the submit machine, so this does not seem to be the issue.
[jira] [Commented] (SPARK-28289) Convert and port 'union.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1647#comment-1647 ] Yiheng Wang commented on SPARK-28289: - Here's the PR: [https://github.com/apache/spark/pull/25202] [~hyukjin.kwon] > Convert and port 'union.sql' into UDF test base > --- > > Key: SPARK-28289 > URL: https://issues.apache.org/jira/browse/SPARK-28289 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major
[jira] [Created] (SPARK-28451) substr returns different values
Yuming Wang created SPARK-28451: --- Summary: substr returns different values Key: SPARK-28451 URL: https://issues.apache.org/jira/browse/SPARK-28451 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Yuming Wang PostgreSQL: {noformat} postgres=# select substr('1234567890', -1, 5); substr 123 (1 row) postgres=# select substr('1234567890', 1, -1); ERROR: negative substring length not allowed {noformat} Spark SQL: {noformat} spark-sql> select substr('1234567890', -1, 5); 0 spark-sql> select substr('1234567890', 1, -1); spark-sql> {noformat}
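To make the divergence concrete, here is a small Python model of PostgreSQL's substr(string, start, count) semantics (an illustration, not either engine's actual implementation): PostgreSQL treats positions as 1-based, lets a non-positive start "eat into" the count rather than counting from the end of the string, and rejects a negative count outright. Spark instead interprets -1 as the last character (hence the "0") and silently returns an empty string for a negative length.

```python
def pg_substr(s, start, count=None):
    """Model of PostgreSQL substr: 1-based positions, clipped to the string."""
    if count is not None and count < 0:
        raise ValueError("negative substring length not allowed")
    # Last requested position (inclusive); no count means "to end of string".
    end = len(s) if count is None else start + count - 1
    first = max(start, 1)      # positions before 1 consume count but yield nothing
    last = min(end, len(s))
    return s[first - 1:last] if last >= first else ""

print(pg_substr('1234567890', -1, 5))  # '123', matching the PostgreSQL output above
```

With start=-1 and count=5 the requested positions are -1..3; clipping to the string leaves 1..3, i.e. '123'.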
[jira] [Commented] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1639#comment-1639 ] Stavros Kontopoulos commented on SPARK-28444: - Probably you are hitting this one: https://issues.apache.org/jira/browse/SPARK-26833
[jira] [Commented] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1633#comment-1633 ] Patrick Winter commented on SPARK-28444: Running spark-submit does unfortunately not give much more information: Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2.onFailure(WatchConnectionManager.java:201) at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:543) at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:185) at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 19/07/19 12:43:20 INFO ShutdownHookManager: Shutdown hook called 19/07/19 12:43:20 INFO ShutdownHookManager: Deleting directory /tmp/spark-54cf5aa1-7a66-4bb4-8d88-96ac7d2076e2 Running the jar directly we get a little more: 19/07/19 12:45:27 INFO SparkContext: Running Spark version 2.4.2 19/07/19 12:45:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 19/07/19 12:45:27 INFO SparkContext: Submitted application: bigdataAnalyticsPoC 19/07/19 12:45:28 INFO SecurityManager: Changing view acls to: root 19/07/19 12:45:28 INFO SecurityManager: Changing modify acls to: root 19/07/19 12:45:28 INFO SecurityManager: Changing view acls groups to: 19/07/19 12:45:28 INFO SecurityManager: Changing modify acls groups to: 19/07/19 12:45:28 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set() 19/07/19 12:45:28 INFO Utils: Successfully started service 'sparkDriver' on port 40288. 19/07/19 12:45:28 INFO SparkEnv: Registering MapOutputTracker 19/07/19 12:45:28 INFO SparkEnv: Registering BlockManagerMaster 19/07/19 12:45:28 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 19/07/19 12:45:28 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up 19/07/19 12:45:28 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-f46c28fd-5c19-441e-9f62-c7d392e2c29a 19/07/19 12:45:28 INFO MemoryStore: MemoryStore started with capacity 2.1 GB 19/07/19 12:45:28 INFO SparkEnv: Registering OutputCommitCoordinator 19/07/19 12:45:28 INFO Utils: Successfully started service 'SparkUI' on port 4040. 19/07/19 12:45:28 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://spark-submit-client-849vj:4040 19/07/19 12:45:29 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes. 
19/07/19 12:45:29 WARN WatchConnectionManager: Exec Failure: HTTP 403, Status: 403 - null java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 19/07/19 12:45:29 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.) 19/07/19 12:45:30 ERROR SparkContext: Error initializing SparkContext. io.fabric8.kubernetes.client.KubernetesClientException at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2.onFailure(WatchConnectionManager.java:201) at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:543) at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:185) at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 19/07/19 12:45:30 INFO SparkUI: Stopped Spark web UI at http://spark-submit-client-849vj:4040 19/07/19 12:45:30 INFO KubernetesClusterSchedulerBackend: Shutting down all executors 19/07/19 12:45:30 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down 19/07/19 12:45:30 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 19/07/19 12:45:30 INFO
[jira] [Updated] (SPARK-28450) When scan hive data of a not existed partition, it return an error
[ https://issues.apache.org/jira/browse/SPARK-28450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-28450: -- Attachment: image-2019-07-19-20-51-12-861.png > When scan hive data of a not existed partition, it return an error > -- > > Key: SPARK-28450 > URL: https://issues.apache.org/jira/browse/SPARK-28450 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > Attachments: image-2019-07-19-20-51-12-861.png > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28450) When scan hive data of a not existed partition, it return an error
[ https://issues.apache.org/jira/browse/SPARK-28450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-28450: -- Description: When we select data from a non-existent partition of a Hive partitioned table, it returns an error, but it should just return an empty result. !image-2019-07-19-20-51-12-861.png! > When scan hive data of a not existed partition, it return an error > -- > > Key: SPARK-28450 > URL: https://issues.apache.org/jira/browse/SPARK-28450 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > Attachments: image-2019-07-19-20-51-12-861.png > > > When we select data from a non-existent partition of a Hive partitioned table, > it returns an error, but it should just return an empty result. > !image-2019-07-19-20-51-12-861.png! -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28450) When scan hive data of a not existed partition, it return an error
angerszhu created SPARK-28450: - Summary: When scan hive data of a not existed partition, it return an error Key: SPARK-28450 URL: https://issues.apache.org/jira/browse/SPARK-28450 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
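The ticket itself describes the symptom only via an attached screenshot. A hypothetical SQL sketch of the kind of scenario being reported (table, column, and partition names are illustrative, not taken from the ticket):

```sql
-- A partitioned Hive table; the partition below is registered in the
-- metastore, but assume its directory has been removed from the
-- filesystem outside of Spark.
CREATE TABLE logs (msg STRING) PARTITIONED BY (dt STRING) STORED AS PARQUET;
ALTER TABLE logs ADD PARTITION (dt = '2019-07-19');

-- Scanning the missing partition fails with an error, while the
-- reporter argues it should simply return an empty result.
SELECT * FROM logs WHERE dt = '2019-07-19';
```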
[jira] [Updated] (SPARK-28449) Missing escape_string_warning and standard_conforming_strings config
[ https://issues.apache.org/jira/browse/SPARK-28449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28449: Summary: Missing escape_string_warning and standard_conforming_strings config (was: Missing escape_string_warning/standard_conforming_strings config) > Missing escape_string_warning and standard_conforming_strings config > > > Key: SPARK-28449 > URL: https://issues.apache.org/jira/browse/SPARK-28449 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > When on, a warning is issued if a backslash ({{\}}) appears in an ordinary > string literal ({{'...'}} syntax) and {{standard_conforming_strings}} is off. > The default is {{on}}. > Applications that wish to use backslash as escape should be modified to use > escape string syntax ({{E'...'}}), because the default behavior of ordinary > strings is now to treat backslash as an ordinary character, per SQL standard. > This variable can be enabled to help locate code that needs to be changed. > > [https://www.postgresql.org/docs/11/runtime-config-compatible.html] -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28449) Missing escape_string_warning/standard_conforming_strings config
[ https://issues.apache.org/jira/browse/SPARK-28449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28449: Description: When on, a warning is issued if a backslash ({{\}}) appears in an ordinary string literal ({{'...'}} syntax) and {{standard_conforming_strings}} is off. The default is {{on}}. Applications that wish to use backslash as escape should be modified to use escape string syntax ({{E'...'}}), because the default behavior of ordinary strings is now to treat backslash as an ordinary character, per SQL standard. This variable can be enabled to help locate code that needs to be changed. [https://www.postgresql.org/docs/11/runtime-config-compatible.html] was: When on, a warning is issued if a backslash ({{\}}) appears in an ordinary string literal ({{'...'}} syntax) and {{standard_conforming_strings}} is off. The default is {{on}}. Applications that wish to use backslash as escape should be modified to use escape string syntax ({{E'...'}}), because the default behavior of ordinary strings is now to treat backslash as an ordinary character, per SQL standard. This variable can be enabled to help locate code that needs to be changed. > Missing escape_string_warning/standard_conforming_strings config > > > Key: SPARK-28449 > URL: https://issues.apache.org/jira/browse/SPARK-28449 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > When on, a warning is issued if a backslash ({{\}}) appears in an ordinary > string literal ({{'...'}} syntax) and {{standard_conforming_strings}} is off. > The default is {{on}}. > Applications that wish to use backslash as escape should be modified to use > escape string syntax ({{E'...'}}), because the default behavior of ordinary > strings is now to treat backslash as an ordinary character, per SQL standard. > This variable can be enabled to help locate code that needs to be changed. 
> > [https://www.postgresql.org/docs/11/runtime-config-compatible.html] -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28449) Missing escape_string_warning/standard_conforming_strings config
[ https://issues.apache.org/jira/browse/SPARK-28449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28449: Summary: Missing escape_string_warning/standard_conforming_strings config (was: Missing escape_string_warning config) > Missing escape_string_warning/standard_conforming_strings config > > > Key: SPARK-28449 > URL: https://issues.apache.org/jira/browse/SPARK-28449 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > When on, a warning is issued if a backslash ({{\}}) appears in an ordinary > string literal ({{'...'}} syntax) and {{standard_conforming_strings}} is off. > The default is {{on}}. > Applications that wish to use backslash as escape should be modified to use > escape string syntax ({{E'...'}}), because the default behavior of ordinary > strings is now to treat backslash as an ordinary character, per SQL standard. > This variable can be enabled to help locate code that needs to be changed. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28449) Missing escape_string_warning config
Yuming Wang created SPARK-28449: --- Summary: Missing escape_string_warning config Key: SPARK-28449 URL: https://issues.apache.org/jira/browse/SPARK-28449 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang When on, a warning is issued if a backslash ({{\}}) appears in an ordinary string literal ({{'...'}} syntax) and {{standard_conforming_strings}} is off. The default is {{on}}. Applications that wish to use backslash as escape should be modified to use escape string syntax ({{E'...'}}), because the default behavior of ordinary strings is now to treat backslash as an ordinary character, per SQL standard. This variable can be enabled to help locate code that needs to be changed. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
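The paragraph above is quoted from the PostgreSQL manual. A short sketch of how the two settings interact in PostgreSQL today (PostgreSQL behaviour, not current Spark SQL behaviour):

```sql
SET standard_conforming_strings = off;
SET escape_string_warning = on;
SELECT 'a\nb';   -- PostgreSQL warns: nonstandard use of escape in a string literal
SELECT E'a\nb';  -- explicit escape string syntax, no warning

SET standard_conforming_strings = on;
SELECT 'a\nb';   -- backslash is now an ordinary character, no warning
```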
[jira] [Created] (SPARK-28448) Implement ILIKE operator
Yuming Wang created SPARK-28448: --- Summary: Implement ILIKE operator Key: SPARK-28448 URL: https://issues.apache.org/jira/browse/SPARK-28448 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang The key word {{ILIKE}} can be used instead of {{LIKE}} to make the match case-insensitive according to the active locale. This is not in the SQL standard but is a PostgreSQL extension. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
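As the ticket notes, ILIKE is a PostgreSQL extension; a small sketch of the semantics this sub-task would add to Spark SQL (the first two statements show PostgreSQL behaviour today):

```sql
SELECT 'Spark' LIKE 'spark';   -- false: LIKE matches case-sensitively
SELECT 'Spark' ILIKE 'spark';  -- true: ILIKE matches case-insensitively

-- A rough workaround available in Spark SQL before ILIKE exists:
SELECT lower('Spark') LIKE lower('spark');  -- true
```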
[jira] [Updated] (SPARK-28445) Inconsistency between Scala and Python/Panda udfs when groupby with udf() is used
[ https://issues.apache.org/jira/browse/SPARK-28445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28445: - Description: Python: {code} from pyspark.sql.functions import pandas_udf, PandasUDFType @pandas_udf("int", PandasUDFType.SCALAR) def noop(x): return x spark.udf.register("udf", noop) sql(""" CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, null) AS testData(a, b)""") sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1)""").show() {code} {code} : org.apache.spark.sql.AnalysisException: expression 'testdata.`a`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;; Aggregate [udf((a#0 + 1))], [udf((a#0 + 1)) AS udf((a + 1))#10, udf(count(b#1)) AS udf(count(b))#12] +- SubqueryAlias `testdata` +- Project [a#0, b#1] +- SubqueryAlias `testData` +- LocalRelation [a#0, b#1] {code} Scala: {code} spark.udf.register("udf", (input: Int) => input) sql(""" CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, null) AS testData(a, b)""") sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1)""").show() {code} {code} ++-+ |udf((a + 1))|udf(count(b))| ++-+ |null|1| | 3|2| | 4|2| | 2|2| ++-+ {code} was: Python: {code} from pyspark.sql.functions import pandas_udf, PandasUDFType @pandas_udf("int", PandasUDFType.SCALAR) def noop(x): return x spark.udf.register("udf", noop) sql(""" CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, null) AS testData(a, b)""") sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1)""").show() {code} {code} : org.apache.spark.sql.AnalysisException: expression 
'testdata.`a`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;; Aggregate [udf((a#0 + 1))], [udf((a#0 + 1)) AS udf((a + 1))#10, udf(count(b#1)) AS udf(count(b))#12] +- SubqueryAlias `testdata` +- Project [a#0, b#1] +- SubqueryAlias `testData` +- LocalRelation [a#0, b#1] {code} Scala: {code} spark.udf.register("udf", (input: Int) => input) sql(""" CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, null) AS testData(a, b)""") sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1)""").show() {code} {code} ++-+ |udf((a + 1))|udf(count(b))| ++-+ | null| 1| | 3| 2| | 4| 2| | 2| 2| ++-+ {code} > Inconsistency between Scala and Python/Panda udfs when groupby with udf() is > used > - > > Key: SPARK-28445 > URL: https://issues.apache.org/jira/browse/SPARK-28445 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Python: > {code} > from pyspark.sql.functions import pandas_udf, PandasUDFType > @pandas_udf("int", PandasUDFType.SCALAR) > def noop(x): > return x > spark.udf.register("udf", noop) > sql(""" > CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES > (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, > null) > AS testData(a, b)""") > sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + > 1)""").show() > {code} > {code} > : org.apache.spark.sql.AnalysisException: expression 'testdata.`a`' is > neither present in the group by, nor is it an aggregate function. 
Add to > group by or wrap in first() (or first_value) if you don't care which value > you get.;; > Aggregate [udf((a#0 + 1))], [udf((a#0 + 1)) AS udf((a + 1))#10, > udf(count(b#1)) AS udf(count(b))#12] > +- SubqueryAlias `testdata` > +- Project [a#0, b#1] > +- SubqueryAlias `testData` > +- LocalRelation [a#0, b#1] > {code} > Scala: > {code} > spark.udf.register("udf", (input: Int) => input) > sql(""" > CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES > (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, > null) > AS testData(a, b)""") > sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + > 1)""").show() > {code} > {code} > ++-+ > |udf((a +
[jira] [Updated] (SPARK-28445) Inconsistency between Scala and Python/Panda udfs when groupby with udf() is used
[ https://issues.apache.org/jira/browse/SPARK-28445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28445: - Description: Python: {code} from pyspark.sql.functions import pandas_udf, PandasUDFType @pandas_udf("int", PandasUDFType.SCALAR) def noop(x): return x spark.udf.register("udf", noop) sql(""" CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, null) AS testData(a, b)""") sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1)""").show() {code} {code} : org.apache.spark.sql.AnalysisException: expression 'testdata.`a`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;; Aggregate [udf((a#0 + 1))], [udf((a#0 + 1)) AS udf((a + 1))#10, udf(count(b#1)) AS udf(count(b))#12] +- SubqueryAlias `testdata` +- Project [a#0, b#1] +- SubqueryAlias `testData` +- LocalRelation [a#0, b#1] {code} Scala: {code} spark.udf.register("udf", (input: Int) => input) sql(""" CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, null) AS testData(a, b)""") sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1)""").show() {code} {code} ++-+ |udf((a + 1))|udf(count(b))| ++-+ | null| 1| | 3| 2| | 4| 2| | 2| 2| ++-+ {code} was: Python: from pyspark.sql.functions import pandas_udf, PandasUDFType @pandas_udf("int", PandasUDFType.SCALAR) def noop(x): return x spark.udf.register("udf", noop) sql(""" CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, null) AS testData(a, b)""") sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1)""").show() : org.apache.spark.sql.AnalysisException: expression 'testdata.`a`' is neither 
present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;; Aggregate [udf((a#0 + 1))], [udf((a#0 + 1)) AS udf((a + 1))#10, udf(count(b#1)) AS udf(count(b))#12] +- SubqueryAlias `testdata` +- Project [a#0, b#1] +- SubqueryAlias `testData` +- LocalRelation [a#0, b#1] Scala: spark.udf.register("udf", (input: Int) => input) sql(""" CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, null) AS testData(a, b)""") sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1)""").show() ++-+ |udf((a + 1))|udf(count(b))| ++-+ | null| 1| | 3| 2| | 4| 2| | 2| 2| ++-+ > Inconsistency between Scala and Python/Panda udfs when groupby with udf() is > used > - > > Key: SPARK-28445 > URL: https://issues.apache.org/jira/browse/SPARK-28445 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Python: > {code} > from pyspark.sql.functions import pandas_udf, PandasUDFType > @pandas_udf("int", PandasUDFType.SCALAR) > def noop(x): > return x > spark.udf.register("udf", noop) > sql(""" > CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES > (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, > null) > AS testData(a, b)""") > sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + > 1)""").show() > {code} > {code} > : org.apache.spark.sql.AnalysisException: expression 'testdata.`a`' is > neither present in the group by, nor is it an aggregate function. 
Add to > group by or wrap in first() (or first_value) if you don't care which value > you get.;; > Aggregate [udf((a#0 + 1))], [udf((a#0 + 1)) AS udf((a + 1))#10, > udf(count(b#1)) AS udf(count(b))#12] > +- SubqueryAlias `testdata` > +- Project [a#0, b#1] > +- SubqueryAlias `testData` > +- LocalRelation [a#0, b#1] > {code} > Scala: > {code} > spark.udf.register("udf", (input: Int) => input) > sql(""" > CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES > (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, > null) > AS testData(a, b)""") > sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + > 1)""").show() > {code} > {code} > ++-+ > |udf((a + 1))|udf(count(b))| > ++-+ > | null| 1| > | 3| 2| > | 4| 2| > | 2| 2| > ++-+ > {code} -- This
[jira] [Updated] (SPARK-28445) Inconsistency between Scala and Python/Panda udfs when groupby with udf() is used
[ https://issues.apache.org/jira/browse/SPARK-28445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-28445: Summary: Inconsistency between Scala and Python/Panda udfs when groupby with udf() is used (was: Inconsistency between Scala and Python/Panda udfs when groupby udef() is used) > Inconsistency between Scala and Python/Panda udfs when groupby with udf() is > used > - > > Key: SPARK-28445 > URL: https://issues.apache.org/jira/browse/SPARK-28445 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Python: > from pyspark.sql.functions import pandas_udf, PandasUDFType > @pandas_udf("int", PandasUDFType.SCALAR) > def noop(x): > return x > spark.udf.register("udf", noop) > sql(""" > CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES > (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, > null) > AS testData(a, b)""") > sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + > 1)""").show() > : org.apache.spark.sql.AnalysisException: expression 'testdata.`a`' is > neither present in the group by, nor is it an aggregate function. 
Add to > group by or wrap in first() (or first_value) if you don't care which value > you get.;; > Aggregate [udf((a#0 + 1))], [udf((a#0 + 1)) AS udf((a + 1))#10, > udf(count(b#1)) AS udf(count(b))#12] > +- SubqueryAlias `testdata` > +- Project [a#0, b#1] > +- SubqueryAlias `testData` > +- LocalRelation [a#0, b#1] > Scala: > spark.udf.register("udf", (input: Int) => input) > sql(""" > CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES > (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, > null) > AS testData(a, b)""") > sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + > 1)""").show() > ++-+ > |udf((a + 1))|udf(count(b))| > ++-+ > | null| 1| > | 3| 2| > | 4| 2| > | 2| 2| > ++-+ -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28447) ANSI SQL: Unicode escapes in literals
Yuming Wang created SPARK-28447: --- Summary: ANSI SQL: Unicode escapes in literals Key: SPARK-28447 URL: https://issues.apache.org/jira/browse/SPARK-28447 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang [https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/strings.sql#L19-L44] *Feature ID*: F393 -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
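The linked PostgreSQL regression test exercises Unicode escapes in string literals; two representative cases, assuming the PostgreSQL syntax is the target semantics for this sub-task:

```sql
-- 4-digit (\XXXX) and 6-digit (\+XXXXXX) Unicode escapes;
-- both literals below denote the string 'data'
SELECT U&'d\0061t\+000061';

-- UESCAPE selects an alternative escape character
SELECT U&'d!0061t!+000061' UESCAPE '!';
```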
[jira] [Created] (SPARK-28446) Document Kafka Headers support
Lee Dongjin created SPARK-28446: --- Summary: Document Kafka Headers support Key: SPARK-28446 URL: https://issues.apache.org/jira/browse/SPARK-28446 Project: Spark Issue Type: Documentation Components: Documentation, Structured Streaming Affects Versions: 3.0.0 Reporter: Lee Dongjin This issue is a follow up of SPARK-23539. After completing SPARK-23539, the following information about the headers functionality should be noted in Structured Streaming + Kafka Integration Guide: * The requirements to use Headers functionality (i.e., Kafka version). * How to turn on the Headers functionality. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28445) Inconsistency between Scala and Python/Panda udfs when groupby udef() is used
[ https://issues.apache.org/jira/browse/SPARK-28445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-28445: Component/s: PySpark > Inconsistency between Scala and Python/Panda udfs when groupby udef() is used > - > > Key: SPARK-28445 > URL: https://issues.apache.org/jira/browse/SPARK-28445 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Python: > from pyspark.sql.functions import pandas_udf, PandasUDFType > @pandas_udf("int", PandasUDFType.SCALAR) > def noop(x): > return x > spark.udf.register("udf", noop) > sql(""" > CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES > (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, > null) > AS testData(a, b)""") > sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + > 1)""").show() > : org.apache.spark.sql.AnalysisException: expression 'testdata.`a`' is > neither present in the group by, nor is it an aggregate function. 
Add to > group by or wrap in first() (or first_value) if you don't care which value > you get.;; > Aggregate [udf((a#0 + 1))], [udf((a#0 + 1)) AS udf((a + 1))#10, > udf(count(b#1)) AS udf(count(b))#12] > +- SubqueryAlias `testdata` > +- Project [a#0, b#1] > +- SubqueryAlias `testData` > +- LocalRelation [a#0, b#1] > Scala: > spark.udf.register("udf", (input: Int) => input) > sql(""" > CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES > (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, > null) > AS testData(a, b)""") > sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + > 1)""").show() > ++-+ > |udf((a + 1))|udf(count(b))| > ++-+ > | null| 1| > | 3| 2| > | 4| 2| > | 2| 2| > ++-+ -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28445) Inconsistency between Scala and Python/Panda udfs when groupby udef() is used
Stavros Kontopoulos created SPARK-28445: --- Summary: Inconsistency between Scala and Python/Panda udfs when groupby udef() is used Key: SPARK-28445 URL: https://issues.apache.org/jira/browse/SPARK-28445 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Stavros Kontopoulos Python: from pyspark.sql.functions import pandas_udf, PandasUDFType @pandas_udf("int", PandasUDFType.SCALAR) def noop(x): return x spark.udf.register("udf", noop) sql(""" CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, null) AS testData(a, b)""") sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1)""").show() : org.apache.spark.sql.AnalysisException: expression 'testdata.`a`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;; Aggregate [udf((a#0 + 1))], [udf((a#0 + 1)) AS udf((a + 1))#10, udf(count(b#1)) AS udf(count(b))#12] +- SubqueryAlias `testdata` +- Project [a#0, b#1] +- SubqueryAlias `testData` +- LocalRelation [a#0, b#1] Scala: spark.udf.register("udf", (input: Int) => input) sql(""" CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, null) AS testData(a, b)""") sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1)""").show() ++-+ |udf((a + 1))|udf(count(b))| ++-+ | null| 1| | 3| 2| | 4| 2| | 2| 2| ++-+ -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1608#comment-1608 ] Stavros Kontopoulos commented on SPARK-28444: - Hi [~patrick-winter-swisscard]. On our CI we are using v1.15 and tests pass; could you add some log output showing why the pods are not created? We need to be compliant with the compatibility matrix, but we still don't have a good answer to the problem of catching up with k8s; it moves fast. > Bump Kubernetes Client Version to 4.3.0 > --- > > Key: SPARK-28444 > URL: https://issues.apache.org/jira/browse/SPARK-28444 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Patrick Winter >Priority: Major > > Spark is currently using the Kubernetes client version 4.1.2. This client > does not support the current Kubernetes version 1.14, as can be seen on the > [compatibility > matrix|https://github.com/fabric8io/kubernetes-client#compatibility-matrix]. > Therefore the Kubernetes client should be bumped up to version 4.3.0. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1608#comment-1608 ] Stavros Kontopoulos edited comment on SPARK-28444 at 7/19/19 11:44 AM: --- Hi [~patrick-winter-swisscard]. On our CI we are using v1.15 and tests pass; could you add some log output showing why the pods are not created? We need to be compliant with the compatibility matrix, but we still don't have a good answer to the problem of catching up with k8s; it moves fast. was (Author: skonto): Hi [~patrick-winter-swisscard]. On our ci we are using v1.15 and tests pass, could you add some log output showing why pods are not created. We need to be compliant with the compatibility matrix but still we dotn have a good answer to the problem of catching up with k8s, it moves fast. > Bump Kubernetes Client Version to 4.3.0 > --- > > Key: SPARK-28444 > URL: https://issues.apache.org/jira/browse/SPARK-28444 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Patrick Winter >Priority: Major > > Spark is currently using the Kubernetes client version 4.1.2. This client > does not support the current Kubernetes version 1.14, as can be seen on the > [compatibility > matrix|https://github.com/fabric8io/kubernetes-client#compatibility-matrix]. > Therefore the Kubernetes client should be bumped up to version 4.3.0. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23443) Spark with Glue as external catalog
[ https://issues.apache.org/jira/browse/SPARK-23443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888681#comment-16888681 ] Dongjoon Hyun commented on SPARK-23443: --- Does Glue support Spark 2.3+? As of now, AWS Glue Console shows only Spark 2.2 (Scala/Python2). BTW, - Spark 2.2 reached EOL in January 2019 - Spark 2.3 will reach EOL in August 2019 (next month) - Python 2.x will reach EOL in January 2020. (PySpark will deprecate Python2 this year). > Spark with Glue as external catalog > --- > > Key: SPARK-23443 > URL: https://issues.apache.org/jira/browse/SPARK-23443 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Ameen Tayyebi >Priority: Major > > AWS Glue Catalog is an external Hive metastore backed by a web service. It > allows permanent storage of catalog data for BigData use cases. > To find out more information about AWS Glue, please consult: > * AWS Glue - [https://aws.amazon.com/glue/] > * Using Glue as a Metastore catalog for Spark - > [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html] > Today, the integration of Glue and Spark is through the Hive layer. Glue > implements the IMetaStore interface of Hive, and for installations of Spark > that contain Hive, Glue can be used as the metastore. > The feature set that Glue supports does not align 1-1 with the set of > features that the latest version of Spark supports. For example, the Glue > interface supports more advanced partition pruning than the latest version of > Hive embedded in Spark. > To enable a more natural integration with Spark and to allow leveraging > latest features of Glue, without being coupled to Hive, a direct integration > through Spark's own Catalog API is proposed. This Jira tracks this work. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23758) MLlib 2.4 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-23758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888669#comment-16888669 ] Dongjoon Hyun commented on SPARK-23758: --- I moved this to 3.0.0 because open `New Feature` JIRAs should target `3.0.0`. I agree that this looks weird, but I'm not sure I can close this; the roadmap is usually managed by the PMC. Hi [~josephkb], we already have 2.4.3; can we close this issue? > MLlib 2.4 Roadmap > - > > Key: SPARK-23758 > URL: https://issues.apache.org/jira/browse/SPARK-23758 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.0.0 >Reporter: Joseph K. Bradley >Priority: Major > > h1. Roadmap process > This roadmap is a master list for MLlib improvements we are working on during > this release. This includes ML-related changes in PySpark and SparkR. > *What is planned for the next release?* > * This roadmap lists issues which at least one Committer has prioritized. > See details below in "Instructions for committers." > * This roadmap only lists larger or more critical issues. > *How can contributors influence this roadmap?* > * If you believe an issue should be in this roadmap, please discuss the issue > on JIRA and/or the dev mailing list. Make sure to ping Committers since at > least one must agree to shepherd the issue. > * For general discussions, use this JIRA or the dev mailing list. For > specific issues, please comment on those issues or the mailing list. > * Vote for & watch issues which are important to you. 
> ** MLlib, sorted by: [Votes | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC] > or [Watchers | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC] > ** SparkR, sorted by: [Votes | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC] > or [Watchers | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC] > h2. Target Version and Priority > This section describes the meaning of Target Version and Priority. > || Category | Target Version | Priority | Shepherd | Put on roadmap? | In > next release? 
|| > | [1 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Blocker%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] > | next release | Blocker | *must* | *must* | *must* | > | [2 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Critical%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] > | next release | Critical | *must* | yes, unless small | *best effort* | > | [3 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Major%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] > | next release | Major | *must* | optional | *best effort* | > | [4 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Minor%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] > | next release | Minor | optional | no | maybe | > | [5 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Trivial%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] > | next release | Trivial | optional | no | maybe | > | [6 | > 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20"Target%20Version%2Fs"%20in%20(EMPTY)%20AND%20Shepherd%20not%20in%20(EMPTY)%20ORDER%20BY%20priority%20DESC] > | (empty) | (any) | yes | no |
[jira] [Commented] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888616#comment-16888616 ] Patrick Winter commented on SPARK-28444: Our company recently upgraded Kubernetes to 1.14, and since then Spark can no longer create pods; it throws a KubernetesClientException instead. > Bump Kubernetes Client Version to 4.3.0 > --- > > Key: SPARK-28444 > URL: https://issues.apache.org/jira/browse/SPARK-28444 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Patrick Winter >Priority: Major > > Spark is currently using Kubernetes client version 4.1.2. This client > does not support the current Kubernetes version 1.14, as can be seen in the > [compatibility matrix|https://github.com/fabric8io/kubernetes-client#compatibility-matrix]. > Therefore the Kubernetes client should be bumped to version 4.3.0.
[jira] [Created] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
Patrick Winter created SPARK-28444: -- Summary: Bump Kubernetes Client Version to 4.3.0 Key: SPARK-28444 URL: https://issues.apache.org/jira/browse/SPARK-28444 Project: Spark Issue Type: Dependency upgrade Components: Kubernetes Affects Versions: 2.4.3, 3.0.0 Reporter: Patrick Winter Spark is currently using Kubernetes client version 4.1.2. This client does not support the current Kubernetes version 1.14, as can be seen in the [compatibility matrix|https://github.com/fabric8io/kubernetes-client#compatibility-matrix]. Therefore the Kubernetes client should be bumped to version 4.3.0.
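For reference, the requested change amounts to a version bump of the fabric8 client dependency in Spark's Kubernetes module. A sketch against a Maven build (the groupId/artifactId are the real fabric8 coordinates; where exactly Spark's own poms set this version may differ):

{code}
<!-- Illustrative Maven fragment: bump the fabric8 Kubernetes client. -->
<dependency>
  <groupId>io.fabric8</groupId>
  <artifactId>kubernetes-client</artifactId>
  <version>4.3.0</version> <!-- previously 4.1.2 -->
</dependency>
{code}

Per the fabric8 compatibility matrix linked above, 4.1.x predates Kubernetes 1.14 support, which is consistent with the KubernetesClientException reported after the cluster upgrade.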
[jira] [Resolved] (SPARK-28284) Convert and port 'join-empty-relation.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28284. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25127 [https://github.com/apache/spark/pull/25127] > Convert and port 'join-empty-relation.sql' into UDF test base > - > > Key: SPARK-28284 > URL: https://issues.apache.org/jira/browse/SPARK-28284 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > >
[jira] [Assigned] (SPARK-28284) Convert and port 'join-empty-relation.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28284: Assignee: Terry Kim > Convert and port 'join-empty-relation.sql' into UDF test base > - > > Key: SPARK-28284 > URL: https://issues.apache.org/jira/browse/SPARK-28284 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Terry Kim >Priority: Major > >
[jira] [Resolved] (SPARK-28440) Use TestingUtils to compare floating point values
[ https://issues.apache.org/jira/browse/SPARK-28440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28440. --- Resolution: Fixed Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/25191 > Use TestingUtils to compare floating point values > - > > Key: SPARK-28440 > URL: https://issues.apache.org/jira/browse/SPARK-28440 > Project: Spark > Issue Type: Improvement > Components: MLlib, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > >
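For context, the point of using TestingUtils is that exact `==` comparison of doubles is brittle in tests; the helper provides approximate comparison operators such as `~==` with `absTol`/`relTol`. A minimal sketch of the underlying idea (illustrative only, not Spark's actual implementation):

{code}
// Illustrative relative-tolerance comparison in the spirit of
// TestingUtils' ~== operator (not the actual Spark code).
def approxEqual(x: Double, y: Double, relTol: Double = 1e-8): Boolean =
  math.abs(x - y) <= relTol * math.max(math.abs(x), math.abs(y))

assert(approxEqual(0.1 + 0.2, 0.3)) // an exact == comparison would fail here
{code}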
[jira] [Assigned] (SPARK-27707) Performance issue using explode
[ https://issues.apache.org/jira/browse/SPARK-27707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-27707: - Assignee: Liang-Chi Hsieh > Performance issue using explode > --- > > Key: SPARK-27707 > URL: https://issues.apache.org/jira/browse/SPARK-27707 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ohad Raviv >Assignee: Liang-Chi Hsieh >Priority: Major > >
> This is a corner case of SPARK-21657.
> We have a case where we want to explode an array inside a struct and also keep some other columns of the struct. We again encounter a huge performance issue.
> Reproduction code:
> {code}
> // M (the array size) is not defined in the original report; any large value reproduces the issue
> val M = 100000
> val df = spark.sparkContext.parallelize(Seq(("1",
>   Array.fill(M)({
>     val i = math.random
>     (i.toString, (i + 1).toString, (i + 2).toString, (i + 3).toString)
>   }))))
>   .toDF("col", "arr")
>   .selectExpr("col", "struct(col, arr) as st")
>   .selectExpr("col", "st.col as col1", "explode(st.arr) as arr_col")
> df.write.mode("overwrite").save("/tmp/blah")
> {code}
> A workaround is projecting before the explode:
> {code}
> val df = spark.sparkContext.parallelize(Seq(("1",
>   Array.fill(M)({
>     val i = math.random
>     (i.toString, (i + 1).toString, (i + 2).toString, (i + 3).toString)
>   }))))
>   .toDF("col", "arr")
>   .selectExpr("col", "struct(col, arr) as st")
>   .withColumn("col1", $"st.col")
>   .selectExpr("col", "col1", "explode(st.arr) as arr_col")
> df.write.mode("overwrite").save("/tmp/blah")
> {code}
> In this case the optimization done in SPARK-21657:
> {code}
> // prune unrequired references
> case p @ Project(_, g: Generate) if p.references != g.outputSet =>
>   val requiredAttrs = p.references -- g.producedAttributes ++ g.generator.references
>   val newChild = prunedChild(g.child, requiredAttrs)
>   val unrequired = g.generator.references -- p.references
>   val unrequiredIndices = newChild.output.zipWithIndex.filter(t => unrequired.contains(t._1)).map(_._2)
>   p.copy(child = g.copy(child = newChild, unrequiredChildIndex = unrequiredIndices))
> {code}
> doesn't work, because `p.references` contains the whole `st` struct as a reference, not just the projected field.
> This causes the entire struct, including the huge array field, to be duplicated once per array element.
> I know this is kind of a corner case, but it was really non-trivial to understand.
[jira] [Updated] (SPARK-27707) Prune unnecessary nested fields from Generate
[ https://issues.apache.org/jira/browse/SPARK-27707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27707: -- Summary: Prune unnecessary nested fields from Generate (was: Performance issue using explode) > Prune unnecessary nested fields from Generate > - > > Key: SPARK-27707 > URL: https://issues.apache.org/jira/browse/SPARK-27707 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ohad Raviv >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 3.0.0 > >
> This is a corner case of SPARK-21657.
> We have a case where we want to explode an array inside a struct and also keep some other columns of the struct. We again encounter a huge performance issue.
> Reproduction code:
> {code}
> // M (the array size) is not defined in the original report; any large value reproduces the issue
> val M = 100000
> val df = spark.sparkContext.parallelize(Seq(("1",
>   Array.fill(M)({
>     val i = math.random
>     (i.toString, (i + 1).toString, (i + 2).toString, (i + 3).toString)
>   }))))
>   .toDF("col", "arr")
>   .selectExpr("col", "struct(col, arr) as st")
>   .selectExpr("col", "st.col as col1", "explode(st.arr) as arr_col")
> df.write.mode("overwrite").save("/tmp/blah")
> {code}
> A workaround is projecting before the explode:
> {code}
> val df = spark.sparkContext.parallelize(Seq(("1",
>   Array.fill(M)({
>     val i = math.random
>     (i.toString, (i + 1).toString, (i + 2).toString, (i + 3).toString)
>   }))))
>   .toDF("col", "arr")
>   .selectExpr("col", "struct(col, arr) as st")
>   .withColumn("col1", $"st.col")
>   .selectExpr("col", "col1", "explode(st.arr) as arr_col")
> df.write.mode("overwrite").save("/tmp/blah")
> {code}
> In this case the optimization done in SPARK-21657:
> {code}
> // prune unrequired references
> case p @ Project(_, g: Generate) if p.references != g.outputSet =>
>   val requiredAttrs = p.references -- g.producedAttributes ++ g.generator.references
>   val newChild = prunedChild(g.child, requiredAttrs)
>   val unrequired = g.generator.references -- p.references
>   val unrequiredIndices = newChild.output.zipWithIndex.filter(t => unrequired.contains(t._1)).map(_._2)
>   p.copy(child = g.copy(child = newChild, unrequiredChildIndex = unrequiredIndices))
> {code}
> doesn't work, because `p.references` contains the whole `st` struct as a reference, not just the projected field.
> This causes the entire struct, including the huge array field, to be duplicated once per array element.
> I know this is kind of a corner case, but it was really non-trivial to understand.
[jira] [Resolved] (SPARK-27707) Performance issue using explode
[ https://issues.apache.org/jira/browse/SPARK-27707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27707. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24637 [https://github.com/apache/spark/pull/24637] > Performance issue using explode > --- > > Key: SPARK-27707 > URL: https://issues.apache.org/jira/browse/SPARK-27707 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ohad Raviv >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 3.0.0 > >
> This is a corner case of SPARK-21657.
> We have a case where we want to explode an array inside a struct and also keep some other columns of the struct. We again encounter a huge performance issue.
> Reproduction code:
> {code}
> // M (the array size) is not defined in the original report; any large value reproduces the issue
> val M = 100000
> val df = spark.sparkContext.parallelize(Seq(("1",
>   Array.fill(M)({
>     val i = math.random
>     (i.toString, (i + 1).toString, (i + 2).toString, (i + 3).toString)
>   }))))
>   .toDF("col", "arr")
>   .selectExpr("col", "struct(col, arr) as st")
>   .selectExpr("col", "st.col as col1", "explode(st.arr) as arr_col")
> df.write.mode("overwrite").save("/tmp/blah")
> {code}
> A workaround is projecting before the explode:
> {code}
> val df = spark.sparkContext.parallelize(Seq(("1",
>   Array.fill(M)({
>     val i = math.random
>     (i.toString, (i + 1).toString, (i + 2).toString, (i + 3).toString)
>   }))))
>   .toDF("col", "arr")
>   .selectExpr("col", "struct(col, arr) as st")
>   .withColumn("col1", $"st.col")
>   .selectExpr("col", "col1", "explode(st.arr) as arr_col")
> df.write.mode("overwrite").save("/tmp/blah")
> {code}
> In this case the optimization done in SPARK-21657:
> {code}
> // prune unrequired references
> case p @ Project(_, g: Generate) if p.references != g.outputSet =>
>   val requiredAttrs = p.references -- g.producedAttributes ++ g.generator.references
>   val newChild = prunedChild(g.child, requiredAttrs)
>   val unrequired = g.generator.references -- p.references
>   val unrequiredIndices = newChild.output.zipWithIndex.filter(t => unrequired.contains(t._1)).map(_._2)
>   p.copy(child = g.copy(child = newChild, unrequiredChildIndex = unrequiredIndices))
> {code}
> doesn't work, because `p.references` contains the whole `st` struct as a reference, not just the projected field.
> This causes the entire struct, including the huge array field, to be duplicated once per array element.
> I know this is kind of a corner case, but it was really non-trivial to understand.