[jira] [Created] (SPARK-33599) Group exception messages in catalyst/analysis
Allison Wang created SPARK-33599: Summary: Group exception messages in catalyst/analysis Key: SPARK-33599 URL: https://issues.apache.org/jira/browse/SPARK-33599 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Allison Wang Group all exception messages in `catalyst/analysis`. ||Filename||Count|| |Analyzer.scala|1| |CheckAnalysis.scala|1| |FunctionRegistry.scala|5| |ResolveCatalogs.scala|1| |ResolveHints.scala|1| |ResolveSessionCatalog.scala|12| |package.scala|2| |unresolved.scala|43| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
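To make the refactoring concrete, here is a hedged plain-Python sketch (not the actual Scala change; the error names are illustrative) of what "grouping exception messages" means: messages scattered across the analysis files move into a single module of named error constructors, so wording stays consistent and each condition has one definition.

```python
# Sketch: instead of building exception messages inline all over
# Analyzer.scala, CheckAnalysis.scala, etc., each message gets a named
# constructor in one centralized errors module.
class AnalysisException(Exception):
    pass

def unresolved_attribute_error(name, candidates):
    # One consistently worded message per error condition.
    return AnalysisException(
        f"cannot resolve '{name}' given input columns: [{', '.join(candidates)}]")

def function_not_found_error(name):
    return AnalysisException(f"undefined function: '{name}'")

print(unresolved_attribute_error("c", ["a", "b"]))
```

Call sites then raise `unresolved_attribute_error(...)` instead of formatting strings in place.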
[jira] [Updated] (SPARK-33542) Group exception messages in catalyst/catalog
[ https://issues.apache.org/jira/browse/SPARK-33542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-33542: - Description: Group all exception messages in `catalyst/catalog`. ||Filename||Count|| |ExternalCatalog.scala|4| |GlobalTempViewManager.scala|1| |InMemoryCatalog.scala|18| |SessionCatalog.scala|17| |functionResources.scala|1| |interface.scala|4| > Group exception messages in catalyst/catalog > > > Key: SPARK-33542 > URL: https://issues.apache.org/jira/browse/SPARK-33542 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Allison Wang >Priority: Major > > Group all exception messages in `catalyst/catalog`. > ||Filename||Count|| > |ExternalCatalog.scala|4| > |GlobalTempViewManager.scala|1| > |InMemoryCatalog.scala|18| > |SessionCatalog.scala|17| > |functionResources.scala|1| > |interface.scala|4| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33597) Support REGEXP_LIKE for consistency with mainstream databases
[ https://issues.apache.org/jira/browse/SPARK-33597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33597: Assignee: Apache Spark > Support REGEXP_LIKE for consistent with mainstream databases > > > Key: SPARK-33597 > URL: https://issues.apache.org/jira/browse/SPARK-33597 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > There are a lot of mainstream databases support regex function REGEXP_LIKE. > Currently, Spark supports RLike and we just need add a new alias REGEXP_LIKE > for it. > *Oracle*:https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Pattern-matching-Conditions.html#GUID-D2124F3A-C6E4-4CCA-A40E-2FFCABFD8E19 > *Presto*:https://prestodb.io/docs/current/functions/regexp.html > *Vertica*:https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_LIKE.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CRegular%20Expression%20Functions%7C_5 > *Snowflake*:https://docs.snowflake.com/en/sql-reference/functions/regexp_like.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33597) Support REGEXP_LIKE for consistency with mainstream databases
[ https://issues.apache.org/jira/browse/SPARK-33597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33597: Assignee: (was: Apache Spark) > Support REGEXP_LIKE for consistent with mainstream databases > > > Key: SPARK-33597 > URL: https://issues.apache.org/jira/browse/SPARK-33597 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > There are a lot of mainstream databases support regex function REGEXP_LIKE. > Currently, Spark supports RLike and we just need add a new alias REGEXP_LIKE > for it. > *Oracle*:https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Pattern-matching-Conditions.html#GUID-D2124F3A-C6E4-4CCA-A40E-2FFCABFD8E19 > *Presto*:https://prestodb.io/docs/current/functions/regexp.html > *Vertica*:https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_LIKE.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CRegular%20Expression%20Functions%7C_5 > *Snowflake*:https://docs.snowflake.com/en/sql-reference/functions/regexp_like.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33597) Support REGEXP_LIKE for consistency with mainstream databases
[ https://issues.apache.org/jira/browse/SPARK-33597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240530#comment-17240530 ] Apache Spark commented on SPARK-33597: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/30543 > Support REGEXP_LIKE for consistent with mainstream databases > > > Key: SPARK-33597 > URL: https://issues.apache.org/jira/browse/SPARK-33597 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > There are a lot of mainstream databases support regex function REGEXP_LIKE. > Currently, Spark supports RLike and we just need add a new alias REGEXP_LIKE > for it. > *Oracle*:https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Pattern-matching-Conditions.html#GUID-D2124F3A-C6E4-4CCA-A40E-2FFCABFD8E19 > *Presto*:https://prestodb.io/docs/current/functions/regexp.html > *Vertica*:https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_LIKE.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CRegular%20Expression%20Functions%7C_5 > *Snowflake*:https://docs.snowflake.com/en/sql-reference/functions/regexp_like.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33598) Support Java Class with circular references
[ https://issues.apache.org/jira/browse/SPARK-33598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jacklzg updated SPARK-33598: Description: If the target Java data class has a circular reference, Spark will fail fast when creating the Dataset or running Encoders. For example, with a protobuf class, which holds a reference to Descriptor, there is no way to build a dataset from the protobuf class. From this line {code:java}Encoders.bean(ProtoBuffOuterClass.ProtoBuff.class);{code} it throws immediately: {quote}Exception in thread "main" java.lang.UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class class com.google.protobuf.Descriptors$Descriptor{quote} Can we add a parameter, for example, {code:java}Encoders.bean(Class clas, List fieldsToIgnore);{code} or {code:java}Encoders.bean(Class clas, boolean skipCircularRefField);{code} was: If the target Java data class has a circular reference, Spark will fail fast from creating the Dataset or running Encoders. For example, with protobuf class, there is a reference with Descriptor, there is no way to build a dataset from the protobuf class. From this line ``` Encoders.bean(ProtoBuffOuterClass.ProtoBuff.class); ``` It will throw out immediately ``` Exception in thread "main" java.lang.UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class class com.google.protobuf.Descriptors$Descriptor ``` Can we add a parameter, for example, ``` Encoders.bean(Class clas, List fieldsToIgnore); or ``` Encoders.bean(Class clas, boolean skipCircularRefField); > Support Java Class with circular references > --- > > Key: SPARK-33598 > URL: https://issues.apache.org/jira/browse/SPARK-33598 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 2.4.7 >Reporter: jacklzg >Priority: Minor > > If the target Java data class has a circular reference, Spark will fail fast when creating the Dataset or running Encoders. > For example, with a protobuf class, which holds a reference to Descriptor, there is no way to build a dataset from the protobuf class. > From this line > {code:java}Encoders.bean(ProtoBuffOuterClass.ProtoBuff.class);{code} > it throws immediately: > {quote}Exception in thread "main" java.lang.UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class class com.google.protobuf.Descriptors$Descriptor{quote} > Can we add a parameter, for example, > {code:java}Encoders.bean(Class clas, List fieldsToIgnore);{code} > or > {code:java}Encoders.bean(Class clas, boolean skipCircularRefField);{code}
[jira] [Updated] (SPARK-33598) Support Java Class with circular references
[ https://issues.apache.org/jira/browse/SPARK-33598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jacklzg updated SPARK-33598: Description: If the target Java data class has a circular reference, Spark will fail fast when creating the Dataset or running Encoders. For example, with a protobuf class, which holds a reference to Descriptor, there is no way to build a dataset from the protobuf class. From this line {color:#7a869a}Encoders.bean(ProtoBuffOuterClass.ProtoBuff.class);{color} it throws immediately: {quote}Exception in thread "main" java.lang.UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class class com.google.protobuf.Descriptors$Descriptor {quote} Can we add a parameter, for example, {code:java} Encoders.bean(Class clas, List fieldsToIgnore);{code} or {code:java} Encoders.bean(Class clas, boolean skipCircularRefField);{code} was: If the target Java data class has a circular reference, Spark will fail fast from creating the Dataset or running Encoders. For example, with protobuf class, there is a reference with Descriptor, there is no way to build a dataset from the protobuf class. 
>From this line ``` {quote} {code:java} Encoders.bean(ProtoBuffOuterClass.ProtoBuff.class);{code} {quote} ``` It will throw out immediately ``` {quote}Exception in thread "main" java.lang.UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class class com.google.protobuf.Descriptors$Descriptor {quote} ``` Can we add a parameter, for example, ``` {code:java} Encoders.bean(Class clas, List fieldsToIgnore);{code} or ``` {code:java} Encoders.bean(Class clas, boolean skipCircularRefField);{code} > Support Java Class with circular references > --- > > Key: SPARK-33598 > URL: https://issues.apache.org/jira/browse/SPARK-33598 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 2.4.7 >Reporter: jacklzg >Priority: Minor > > If the target Java data class has a circular reference, Spark will fail fast > from creating the Dataset or running Encoders. > > For example, with protobuf class, there is a reference with Descriptor, there > is no way to build a dataset from the protobuf class. > From this line > {color:#7a869a}Encoders.bean(ProtoBuffOuterClass.ProtoBuff.class);{color} > > It will throw out immediately > > {quote}Exception in thread "main" java.lang.UnsupportedOperationException: > Cannot have circular references in bean class, but got the circular reference > of class class com.google.protobuf.Descriptors$Descriptor > {quote} > > Can we add a parameter, for example, > > {code:java} > Encoders.bean(Class clas, List fieldsToIgnore);{code} > > or > > {code:java} > Encoders.bean(Class clas, boolean skipCircularRefField);{code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33598) Support Java Class with circular references
jacklzg created SPARK-33598: --- Summary: Support Java Class with circular references Key: SPARK-33598 URL: https://issues.apache.org/jira/browse/SPARK-33598 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 2.4.7 Reporter: jacklzg If the target Java data class has a circular reference, Spark will fail fast when creating the Dataset or running Encoders. For example, with a protobuf class, which holds a reference to Descriptor, there is no way to build a dataset from the protobuf class. From this line {code:java}Encoders.bean(ProtoBuffOuterClass.ProtoBuff.class);{code} it throws immediately: {quote}Exception in thread "main" java.lang.UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class class com.google.protobuf.Descriptors$Descriptor{quote} Can we add a parameter, for example, {code:java}Encoders.bean(Class clas, List fieldsToIgnore);{code} or {code:java}Encoders.bean(Class clas, boolean skipCircularRefField);{code}
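The fail-fast behavior described above comes from a circular-reference check over the bean's field-type graph. Below is a hedged plain-Python sketch of such a check (a DFS over a class-to-field-types mapping; the type names mirror the protobuf case but are illustrative, not Spark's or protobuf's actual structure). A `skipCircularRefField`-style option would amount to pruning the offending edge instead of raising.

```python
def find_cycle(field_types: dict, root: str) -> list:
    """DFS over a class -> field-type mapping; returns one cyclic path, or []."""
    path, seen = [], set()

    def visit(cls):
        if cls in path:
            # Back-edge found: return the cycle starting at the revisited class.
            return path[path.index(cls):] + [cls]
        if cls in seen or cls not in field_types:
            return []
        path.append(cls)
        for child in field_types[cls]:
            cycle = visit(child)
            if cycle:
                return cycle
        path.pop()
        seen.add(cls)
        return []

    return visit(root)

# Hypothetical layout echoing the report: Descriptor reaches itself again.
types = {"ProtoBuff": ["Descriptor"],
         "Descriptor": ["FileDescriptor"],
         "FileDescriptor": ["Descriptor"]}
print(find_cycle(types, "ProtoBuff"))
```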
[jira] [Created] (SPARK-33597) Support REGEXP_LIKE for consistency with mainstream databases
jiaan.geng created SPARK-33597: -- Summary: Support REGEXP_LIKE for consistency with mainstream databases Key: SPARK-33597 URL: https://issues.apache.org/jira/browse/SPARK-33597 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: jiaan.geng Many mainstream databases support the regex function REGEXP_LIKE. Currently, Spark supports RLike, so we just need to add a new alias REGEXP_LIKE for it. *Oracle*: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Pattern-matching-Conditions.html#GUID-D2124F3A-C6E4-4CCA-A40E-2FFCABFD8E19 *Presto*: https://prestodb.io/docs/current/functions/regexp.html *Vertica*: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_LIKE.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CRegular%20Expression%20Functions%7C_5 *Snowflake*: https://docs.snowflake.com/en/sql-reference/functions/regexp_like.html
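Since REGEXP_LIKE would only alias RLike, its semantics are substring regex matching: the pattern may match anywhere in the value, unlike SQL LIKE, which must cover the whole string. A rough plain-Python sketch of that semantics (not Spark code):

```python
import re

def regexp_like(value: str, pattern: str) -> bool:
    # RLIKE / REGEXP_LIKE semantics: the regex may match any substring.
    return re.search(pattern, value) is not None

print(regexp_like("abc", "(b|c)"))  # matches at 'b'
```

Note that in `'abc' RLIKE '(b|c)%'` the `%` is a literal regex character, which is why that example evaluates to false.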
[jira] [Created] (SPARK-33596) NPE when there is no EventTime
Genmao Yu created SPARK-33596: - Summary: NPE when there is no EventTime Key: SPARK-33596 URL: https://issues.apache.org/jira/browse/SPARK-33596 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.1.0 Reporter: Genmao Yu We parse the process timestamp at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatisticsPage.scala#L153, but it throws an NPE when there are no event-time metrics.
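A hedged sketch of the defensive lookup such a fix implies: reading an optional event-time section of a progress object without failing when it is absent. This is plain Python, not the Scala page code; the key names echo the streaming progress JSON but are illustrative here.

```python
def avg_event_time(progress: dict):
    # Guarded lookup: return None instead of blowing up when the
    # query produced no event-time metrics (the NPE scenario above).
    event_time = progress.get("eventTime") or {}
    return event_time.get("avg")

print(avg_event_time({"batchId": 3}))  # no event time present
print(avg_event_time({"eventTime": {"avg": "2020-11-30T06:00:00.000Z"}}))
```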
[jira] [Updated] (SPARK-28325) Support ANSI SQL:SIMILAR TO ... ESCAPE syntax
[ https://issues.apache.org/jira/browse/SPARK-28325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-28325: --- Description:
{code:java}
<similar predicate> ::= <row value predicand> <similar predicate part 2>
<similar predicate part 2> ::= [ NOT ] SIMILAR TO <similar pattern> [ ESCAPE <escape character> ]
<similar pattern> ::= <character value expression>
<regular expression> ::= <regular term> | <regular expression> <vertical bar> <regular term>
<regular term> ::= <regular factor> | <regular term> <regular factor>
<regular factor> ::= <regular primary> | <regular primary> <asterisk> | <regular primary> <plus sign> | <regular primary> <question mark> | <regular primary> <repeat factor>
<repeat factor> ::= <left brace> <low value> [ <upper limit> ] <right brace>
<upper limit> ::= <comma> [ <high value> ]
<low value> ::= <unsigned integer>
<high value> ::= <unsigned integer>
<regular primary> ::= <character specifier> | <percent> | <regular character set> | <left paren> <regular expression> <right paren>
<character specifier> ::= <non-escaped character> | <escaped character>
<non-escaped character> ::= !! See the Syntax Rules.
<escaped character> ::= !! See the Syntax Rules.
<regular character set> ::= <underscore> | <left bracket> <character enumeration>... <right bracket> | <left bracket> <circumflex> <character enumeration>... <right bracket> | <left bracket> <character enumeration include>... <circumflex> <character enumeration exclude>... <right bracket>
<character enumeration include> ::= <character enumeration>
<character enumeration exclude> ::= <character enumeration>
<character enumeration> ::= <character specifier> | <character specifier> <minus sign> <character specifier> | <left bracket> <colon> <regular character set identifier> <colon> <right bracket>
<regular character set identifier> ::= <identifier>
{code}
(Foundation (SQL/Foundation) CD 9075-2:201?(E), 8.6, p. 494)
Examples:
{code}
SELECT 'abc' RLIKE '%(b|d)%'; // false
SELECT 'abc' SIMILAR TO '%(b|d)%' // true
SELECT 'abc' RLIKE '(b|c)%'; // false
SELECT 'abc' SIMILAR TO '(b|c)%'; // false{code}
Currently, the following DBMSs support the syntax: * PostgreSQL:[https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-SIMILARTO-REGEXP] * Redshift: [https://docs.aws.amazon.com/redshift/latest/dg/pattern-matching-conditions-similar-to.html] * teradata:[https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/fwqgzZuhAvOXLKUu0kUfJQ] was: {code:java} ::= ::= [ NOT ] SIMILAR TO [ ESCAPE ] ::= ::= | ::= | ::= | | | | ::= [ ] ::= [ ] ::= ::= ::= | | | ::= | ::= !! See the Syntax Rules. 494 Foundation (SQL/Foundation) CD 9075-2:201?(E) 8.6 ::= !! See the Syntax Rules. ::= | ... | ... | ... ... ::= ::= ::= | | ::= {code} Examples: {code} SELECT 'abc' RLIKE '%(b|d)%'; // false SELECT 'abc' SIMILAR TO '%(b|d)%' // true SELECT 'abc' RLIKE '(b|c)%'; // false SELECT 'abc' SIMILAR TO '(b|c)%'; // false{code} Currently, the following DBMSs support the syntax: * PostgreSQL:[https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-SIMILARTO-REGEXP] * Redshift: [https://docs.aws.amazon.com/redshift/latest/dg/pattern-matching-conditions-similar-to.html]
> Support ANSI SQL: SIMILAR TO ... ESCAPE syntax
> -
>
> Key: SPARK-28325
> URL: https://issues.apache.org/jira/browse/SPARK-28325
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: jiaan.geng
> Priority: Major
>
> {code:java}
> <similar predicate> ::= <row value predicand> <similar predicate part 2>
> <similar predicate part 2> ::= [ NOT ] SIMILAR TO <similar pattern> [ ESCAPE <escape character> ]
> <similar pattern> ::= <character value expression>
> <regular expression> ::= <regular term> | <regular expression> <vertical bar> <regular term>
> <regular term> ::= <regular factor> | <regular term> <regular factor>
> <regular factor> ::= <regular primary> | <regular primary> <asterisk> | <regular primary> <plus sign> | <regular primary> <question mark> | <regular primary> <repeat factor>
> <repeat factor> ::= <left brace> <low value> [ <upper limit> ] <right brace>
> <upper limit> ::= <comma> [ <high value> ]
> <low value> ::= <unsigned integer>
> <high value> ::= <unsigned integer>
> <regular primary> ::= <character specifier> | <percent> | <regular character set> | <left paren> <regular expression> <right paren>
> <character specifier> ::= <non-escaped character> | <escaped character>
> <non-escaped character> ::= !! See the Syntax Rules.
> <escaped character> ::= !! See the Syntax Rules.
> <regular character set> ::= <underscore> | <left bracket> <character enumeration>... <right bracket> | <left bracket> <circumflex> <character enumeration>... <right bracket> | <left bracket> <character enumeration include>... <circumflex> <character enumeration exclude>... <right bracket>
> <character enumeration include> ::= <character enumeration>
> <character enumeration exclude> ::= <character enumeration>
> <character enumeration> ::= <character specifier> | <character specifier> <minus sign> <character specifier> | <left bracket> <colon> <regular character set identifier> <colon> <right bracket>
> <regular character set identifier> ::= <identifier>
> {code}
>
> Examples:
> {code}
> SELECT 'abc' RLIKE '%(b|d)%'; // false
> SELECT 'abc' SIMILAR TO '%(b|d)%' // true
> SELECT 'abc' RLIKE '(b|c)%'; // false
> SELECT 'abc' SIMILAR TO '(b|c)%'; // false{code}
>
> Currently, the following DBMSs support the syntax:
> * PostgreSQL:[https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-SIMILARTO-REGEXP]
> * Redshift: [https://docs.aws.amazon.com/redshift/latest/dg/pattern-matching-conditions-similar-to.html]
> * teradata:[https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/fwqgzZuhAvOXLKUu0kUfJQ]
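The practical difference from RLIKE visible in the examples above is that SIMILAR TO implicitly anchors the pattern to the whole string and borrows the `%` and `_` wildcards from LIKE. A simplified plain-Python sketch of that translation (it ignores ESCAPE, bracket expressions, and assumes `%`/`_` appear only as wildcards):

```python
import re

def similar_to(value: str, pattern: str) -> bool:
    # SIMILAR TO mixes LIKE wildcards with regex syntax and anchors
    # at both ends; RLIKE instead matches any substring.
    regex = pattern.replace("%", ".*").replace("_", ".")
    return re.fullmatch(regex, value) is not None

print(similar_to("abc", "%(b|d)%"))  # True, per the example above
print(similar_to("abc", "(b|c)%"))   # False: 'a' is not (b|c)
```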
[jira] [Commented] (SPARK-28506) not handling usage of group function and window function at some conditions
[ https://issues.apache.org/jira/browse/SPARK-28506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240505#comment-17240505 ] jiaan.geng commented on SPARK-28506: I ran a similar SQL query, shown below: {code:java} SELECT rank() OVER (ORDER BY salary), count(*) FROM basic_pays GROUP BY 1 > ERROR: window functions are not allowed in GROUP BY LINE 2: rank() OVER (ORDER BY salary), ^ > Time: 0.011s {code} This seems inconsistent with your description. > not handling usage of group function and window function at some conditions > --- > > Key: SPARK-28506 > URL: https://issues.apache.org/jira/browse/SPARK-28506 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Dylan Guedes >Priority: Major > > Hi, > looks like SparkSQL is not able to handle this query: > {code:sql}SELECT rank() OVER (ORDER BY 1), count(*) FROM empsalary GROUP BY > 1;{code} > PgSQL, on the other hand, does.
[jira] [Resolved] (SPARK-33595) Run PySpark coverage only in the master branch
[ https://issues.apache.org/jira/browse/SPARK-33595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33595. -- Resolution: Invalid Resolved by setting Jenkins environment variables. > Run PySpark coverage only in the master branch > -- > > Key: SPARK-33595 > URL: https://issues.apache.org/jira/browse/SPARK-33595 > Project: Spark > Issue Type: Test > Components: Project Infra, PySpark >Affects Versions: 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently PySpark test coverage runs in branch-3.0 > (https://github.com/apache/spark/pull/23117#issuecomment-735557536). We > should only run this in the master branch. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29638) Spark handles 'NaN' as 0 in sums
[ https://issues.apache.org/jira/browse/SPARK-29638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240498#comment-17240498 ] jiaan.geng edited comment on SPARK-29638 at 11/30/20, 6:18 AM: --- I ran the SQL below in PgSQL: {code:java} SELECT a, b, SUM(b) OVER(ORDER BY A ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) FROM (VALUES(1,1),(2,2),(3,(cast('NaN' as int))),(4,3),(5,4)) t(a,b) > ERROR: invalid input syntax for type integer: "NaN" LINE 3: FROM (VALUES(1,1),(2,2),(3,(cast('NaN' as int))),(4,3),(5,4)... ^ > Time: 0.011s {code} [~DylanGuedes] Could you tell me more? was (Author: beliefer): {code:java} SELECT a, b, SUM(b) OVER(ORDER BY A ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) FROM (VALUES(1,1),(2,2),(3,(cast('NaN' as int))),(4,3),(5,4)) t(a,b) > ERROR: invalid input syntax for type integer: "NaN" LINE 3: FROM (VALUES(1,1),(2,2),(3,(cast('NaN' as int))),(4,3),(5,4)... ^ > Time: 0.011s {code} [~DylanGuedes] Could you tell me more ? > Spark handles 'NaN' as 0 in sums > > > Key: SPARK-29638 > URL: https://issues.apache.org/jira/browse/SPARK-29638 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > > Currently, Spark handles 'NaN' as 0 in window functions, such that 3+'NaN'=3. > PgSQL, on the other hand, handles the entire result as 'NaN', as in 3+'NaN' = > 'NaN' > I experienced this with the query below: > {code:sql} > SELECT a, b, >SUM(b) OVER(ORDER BY A ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) > FROM (VALUES(1,1),(2,2),(3,(cast('nan' as int))),(4,3),(5,4)) t(a,b); > {code}
[jira] [Commented] (SPARK-29638) Spark handles 'NaN' as 0 in sums
[ https://issues.apache.org/jira/browse/SPARK-29638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240498#comment-17240498 ] jiaan.geng commented on SPARK-29638: {code:java} SELECT a, b, SUM(b) OVER(ORDER BY A ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) FROM (VALUES(1,1),(2,2),(3,(cast('NaN' as int))),(4,3),(5,4)) t(a,b) > ERROR: invalid input syntax for type integer: "NaN" LINE 3: FROM (VALUES(1,1),(2,2),(3,(cast('NaN' as int))),(4,3),(5,4)... ^ > Time: 0.011s {code} [~DylanGuedes] Could you tell me more ? > Spark handles 'NaN' as 0 in sums > > > Key: SPARK-29638 > URL: https://issues.apache.org/jira/browse/SPARK-29638 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > > Currently, Spark handles 'NaN' as 0 in window functions, such that 3+'NaN'=3. > PgSQL, on the other hand, handles the entire result as 'NaN', as in 3+'NaN' = > 'NaN' > I experienced this with the query below: > {code:sql} > SELECT a, b, >SUM(b) OVER(ORDER BY A ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) > FROM (VALUES(1,1),(2,2),(3,(cast('nan' as int))),(4,3),(5,4)) t(a,b); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
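For reference, IEEE-754 floating-point arithmetic propagates NaN through sums (the PgSQL-style result described in the issue, once the values are floats rather than ints), whereas treating NaN as absent reproduces the 3+'NaN'=3 behavior reported above. A plain-Python sketch of the two behaviors, independent of either engine:

```python
import math

values = [1.0, 2.0, float("nan"), 3.0]

# IEEE-754: any NaN operand makes the whole sum NaN.
ieee_sum = sum(values)

# Skipping NaN (effectively treating it as contributing nothing).
skip_nan_sum = sum(v for v in values if not math.isnan(v))

print(math.isnan(ieee_sum), skip_nan_sum)
```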
[jira] [Updated] (SPARK-33595) Run PySpark coverage only in the master branch
[ https://issues.apache.org/jira/browse/SPARK-33595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33595: - Issue Type: Test (was: Improvement) > Run PySpark coverage only in the master branch > -- > > Key: SPARK-33595 > URL: https://issues.apache.org/jira/browse/SPARK-33595 > Project: Spark > Issue Type: Test > Components: Project Infra, PySpark >Affects Versions: 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently PySpark test coverage runs in branch-3.0 > (https://github.com/apache/spark/pull/23117#issuecomment-735557536). We > should only run this in the master branch. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33595) Run PySpark coverage only in the master branch
Hyukjin Kwon created SPARK-33595: Summary: Run PySpark coverage only in the master branch Key: SPARK-33595 URL: https://issues.apache.org/jira/browse/SPARK-33595 Project: Spark Issue Type: Improvement Components: Project Infra, PySpark Affects Versions: 3.0.1, 3.1.0 Reporter: Hyukjin Kwon Currently PySpark test coverage runs in branch-3.0 (https://github.com/apache/spark/pull/23117#issuecomment-735557536). We should only run this in the master branch. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33448) Support CACHE/UNCACHE TABLE for v2 tables
[ https://issues.apache.org/jira/browse/SPARK-33448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33448. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30403 [https://github.com/apache/spark/pull/30403] > Support CACHE/UNCACHE TABLE for v2 tables > - > > Key: SPARK-33448 > URL: https://issues.apache.org/jira/browse/SPARK-33448 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Minor > Fix For: 3.1.0 > > > Migrate CACHE/UNCACHE TABLE to new resolution framework. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33448) Support CACHE/UNCACHE TABLE for v2 tables
[ https://issues.apache.org/jira/browse/SPARK-33448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33448: --- Assignee: Terry Kim > Support CACHE/UNCACHE TABLE for v2 tables > - > > Key: SPARK-33448 > URL: https://issues.apache.org/jira/browse/SPARK-33448 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Minor > > Migrate CACHE/UNCACHE TABLE to new resolution framework. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32976) Support column list in INSERT statement
[ https://issues.apache.org/jira/browse/SPARK-32976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32976. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29893 [https://github.com/apache/spark/pull/29893] > Support column list in INSERT statement > --- > > Key: SPARK-32976 > URL: https://issues.apache.org/jira/browse/SPARK-32976 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Xiao Li >Assignee: Kent Yao >Priority: Major > Fix For: 3.1.0 > > > INSERT currently does not support named column lists. > {{INSERT INTO (col1, col2,…) VALUES( 'val1', 'val2', … )}} > Note, we assume the column list contains all the column names. Issue an > exception if the list is not complete. The column order could be different > from the column order defined in the table definition. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
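The completeness-plus-reordering rule described above can be sketched in plain Python (an illustrative helper, not Spark's implementation): the supplied column list must name every table column exactly once, and the values are mapped back into table-definition order.

```python
def reorder_values(table_cols, insert_cols, values):
    # The user-supplied column list must cover all table columns;
    # its order may differ from the table definition.
    if sorted(insert_cols) != sorted(table_cols):
        raise ValueError("INSERT column list must name all table columns")
    by_name = dict(zip(insert_cols, values))
    # Emit values in the table's declared column order.
    return [by_name[c] for c in table_cols]

print(reorder_values(["col1", "col2"], ["col2", "col1"], ["val2", "val1"]))
```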
[jira] [Assigned] (SPARK-32976) Support column list in INSERT statement
[ https://issues.apache.org/jira/browse/SPARK-32976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32976: --- Assignee: Kent Yao > Support column list in INSERT statement > --- > > Key: SPARK-32976 > URL: https://issues.apache.org/jira/browse/SPARK-32976 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Xiao Li >Assignee: Kent Yao >Priority: Major > > INSERT currently does not support named column lists. > {{INSERT INTO (col1, col2,…) VALUES( 'val1', 'val2', … )}} > Note, we assume the column list contains all the column names. Issue an > exception if the list is not complete. The column order could be different > from the column order defined in the table definition. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33567) DSv2: Use callback instead of passing Spark session and v2 relation for refreshing cache
[ https://issues.apache.org/jira/browse/SPARK-33567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33567: --- Assignee: Chao Sun > DSv2: Use callback instead of passing Spark session and v2 relation for > refreshing cache > > > Key: SPARK-33567 > URL: https://issues.apache.org/jira/browse/SPARK-33567 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > > As discussed [https://github.com/apache/spark/pull/30429], it's better to not > pass Spark session and DataSourceV2Relation through Spark plans. Instead we > can use a callback which makes the interface cleaner. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33567) DSv2: Use callback instead of passing Spark session and v2 relation for refreshing cache
[ https://issues.apache.org/jira/browse/SPARK-33567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33567. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30491 [https://github.com/apache/spark/pull/30491] > DSv2: Use callback instead of passing Spark session and v2 relation for > refreshing cache > > > Key: SPARK-33567 > URL: https://issues.apache.org/jira/browse/SPARK-33567 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.1.0 > > > As discussed [https://github.com/apache/spark/pull/30429], it's better to not > pass Spark session and DataSourceV2Relation through Spark plans. Instead we > can use a callback which makes the interface cleaner. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
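The refactoring idea in this ticket — replace a (session, relation) pair threaded through physical plan nodes with a zero-argument callback that knows how to refresh the cache — can be illustrated with a minimal sketch. Plain Python with hypothetical names; these are not Spark's actual classes:

```python
from typing import Callable

class DropTableExec:
    """Sketch of a plan node that receives a refresh callback instead of
    holding a session and a DataSourceV2Relation (illustrative only)."""
    def __init__(self, table_name: str, invalidate_cache: Callable[[], None]):
        self.table_name = table_name
        self.invalidate_cache = invalidate_cache

    def run(self, catalog: dict) -> None:
        catalog.pop(self.table_name, None)
        self.invalidate_cache()  # the caller decides how the cache is refreshed

# The planner, which does have the session and relation, builds the closure:
cache = {"t": "cached-plan"}
catalog = {"t": "table-metadata"}
node = DropTableExec("t", invalidate_cache=lambda: cache.pop("t", None))
node.run(catalog)
# cache == {} and catalog == {}
```

The plan node stays free of session/relation dependencies, which is the "cleaner interface" the PR discussion refers to.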
[jira] [Updated] (SPARK-33592) Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading
[ https://issues.apache.org/jira/browse/SPARK-33592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-33592: --- Summary: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading (was: Pyspark ML Validator writer may lost params in estimatorParamMaps) > Pyspark ML Validator params in estimatorParamMaps may be lost after saving > and reloading > > > Key: SPARK-33592 > URL: https://issues.apache.org/jira/browse/SPARK-33592 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 3.0.0, 3.1.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > > Two typical cases to reproduce it: > (1) > {code:python} > tokenizer = Tokenizer(inputCol="text", outputCol="words") > hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") > lr = LogisticRegression() > pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) > paramGrid = ParamGridBuilder() \ > .addGrid(hashingTF.numFeatures, [10, 100]) \ > .addGrid(lr.maxIter, [100, 200]) \ > .build() > tvs = TrainValidationSplit(estimator=pipeline, >estimatorParamMaps=paramGrid, >evaluator=MulticlassClassificationEvaluator()) > tvs.save(tvsPath) > loadedTvs = TrainValidationSplit.load(tvsPath) > {code} > Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning params > `hashingTF.numFeatures` and `lr.maxIter` are lost. > (2) > {code:python} > lr = LogisticRegression() > ova = OneVsRest(classifier=lr) > grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build() > evaluator = MulticlassClassificationEvaluator() > tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, > evaluator=evaluator) > tvs.save(tvsPath) > loadedTvs = TrainValidationSplit.load(tvsPath) > {code} > Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning > params`lr.maxIter` are lost. > Both CrossValidator and TrainValidationSplit in Pyspark has this issue. 

[jira] [Assigned] (SPARK-33594) Forbid binary type as partition column
[ https://issues.apache.org/jira/browse/SPARK-33594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33594: Assignee: (was: Apache Spark) > Forbid binary type as partition column > -- > > Key: SPARK-33594 > URL: https://issues.apache.org/jira/browse/SPARK-33594 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Priority: Major > > Forbid binary type as partition column -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33594) Forbid binary type as partition column
[ https://issues.apache.org/jira/browse/SPARK-33594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33594: Assignee: Apache Spark > Forbid binary type as partition column > -- > > Key: SPARK-33594 > URL: https://issues.apache.org/jira/browse/SPARK-33594 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > Forbid binary type as partition column -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33594) Forbid binary type as partition column
[ https://issues.apache.org/jira/browse/SPARK-33594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240467#comment-17240467 ] Apache Spark commented on SPARK-33594: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/30542 > Forbid binary type as partition column > -- > > Key: SPARK-33594 > URL: https://issues.apache.org/jira/browse/SPARK-33594 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Priority: Major > > Forbid binary type as partition column -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33594) Forbid binary type as partition column
angerszhu created SPARK-33594: - Summary: Forbid binary type as partition column Key: SPARK-33594 URL: https://issues.apache.org/jira/browse/SPARK-33594 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: angerszhu Forbid binary type as partition column -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
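Conceptually, forbidding a type as a partition column amounts to rejecting any partition column whose data type is in a disallowed set. A minimal sketch (hypothetical names; the actual set of types Spark rejects, and where the check runs, may differ):

```python
# Assumption: binary joins whatever complex types are already rejected.
FORBIDDEN_PARTITION_TYPES = {"binary", "map", "array", "struct"}

def check_partition_columns(schema: dict, partition_cols: list) -> None:
    """Reject partition columns with unsupported data types (sketch only)."""
    for col in partition_cols:
        if schema[col] in FORBIDDEN_PARTITION_TYPES:
            raise ValueError(
                f"Cannot use {schema[col]} type column '{col}' as a partition column")

check_partition_columns({"name": "string", "part": "string"}, ["part"])  # ok
```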
[jira] [Assigned] (SPARK-28646) Allow usage of `count` only for parameterless aggregate function
[ https://issues.apache.org/jira/browse/SPARK-28646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28646: Assignee: Apache Spark > Allow usage of `count` only for parameterless aggregate function > > > Key: SPARK-28646 > URL: https://issues.apache.org/jira/browse/SPARK-28646 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Dylan Guedes >Assignee: Apache Spark >Priority: Major > > Currently, Spark allows calls to `count` even for non parameterless aggregate > function. For example, the following query actually works: > {code:sql}SELECT count() OVER () FROM tenk1;{code} > In PgSQL, on the other hand, the following error is thrown: > {code:sql}ERROR: count(*) must be used to call a parameterless aggregate > function{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28646) Allow usage of `count` only for parameterless aggregate function
[ https://issues.apache.org/jira/browse/SPARK-28646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240455#comment-17240455 ] Apache Spark commented on SPARK-28646: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/30541 > Allow usage of `count` only for parameterless aggregate function > > > Key: SPARK-28646 > URL: https://issues.apache.org/jira/browse/SPARK-28646 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Dylan Guedes >Priority: Major > > Currently, Spark allows calls to `count` even for non parameterless aggregate > function. For example, the following query actually works: > {code:sql}SELECT count() OVER () FROM tenk1;{code} > In PgSQL, on the other hand, the following error is thrown: > {code:sql}ERROR: count(*) must be used to call a parameterless aggregate > function{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240456#comment-17240456 ] Hyukjin Kwon commented on SPARK-33571: -- cc [~maxgekk] FYI > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Priority: Major > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. > From what I understand it should work like this: > * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before > 1900-01-01T00:00:00Z > * Only applies when reading or writing parquet files > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead` > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should > show the same values in Spark 3.0.1. with for example `df.show()` as they did > in Spark 2.4.5 > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps > should show different values in Spark 3.0.1. 
with for example `df.show()` as > they did in Spark 2.4.5 > * When writing parquet files with Spark > 3.0.0 which contain dates or > timestamps before the above mentioned moment in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite` > First of all I'm not 100% sure all of this is correct. I've been unable to > find any clear documentation on the expected behavior. The understanding I > have was pieced together from the mailing list > ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)] > the blog post linked there and looking at the Spark code. > From our testing we're seeing several issues: > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` which contain timestamps before > the above mentioned moments in time without `datetimeRebaseModeInRead` set > doesn't raise the `SparkUpgradeException`, it succeeds without any changes to > the resulting dataframe compared to that dataframe in Spark 2.4.5 > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` or `DateType` which contain > dates or timestamps before the above mentioned moments in time with > `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the > dataframe as when using `CORRECTED`, so it seems like no rebasing is > happening. > I've made some scripts to help with testing/show the behavior, it uses > pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here > [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the > outputs in a comment below as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
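For background on the report above: the LEGACY/CORRECTED distinction exists because the hybrid calendar (Julian before 1582-10-15) and the proleptic Gregorian calendar name physical days differently before the Gregorian switchover. The standard Julian Day Number formulas make the shift concrete. Plain Python, illustration only — this is not Spark's rebase code:

```python
def jdn_julian(y, m, d):
    """Julian Day Number of a Julian-calendar date (standard formula)."""
    a = (14 - m) // 12
    y2 = y + 4800 - a
    m2 = m + 12 * a - 3
    return d + (153 * m2 + 2) // 5 + 365 * y2 + y2 // 4 - 32083

def jdn_gregorian(y, m, d):
    """Julian Day Number of a (proleptic) Gregorian-calendar date."""
    a = (14 - m) // 12
    y2 = y + 4800 - a
    m2 = m + 12 * a - 3
    return (d + (153 * m2 + 2) // 5 + 365 * y2
            + y2 // 4 - y2 // 100 + y2 // 400 - 32045)

# The day after Julian 1582-10-04 was Gregorian 1582-10-15:
assert jdn_julian(1582, 10, 4) + 1 == jdn_gregorian(1582, 10, 15)

# The same nominal date names different physical days in the two calendars,
# which is why stored values shift when rebasing is (or is not) applied:
offset_days = jdn_julian(1582, 10, 4) - jdn_gregorian(1582, 10, 4)
# offset_days == 10
```

If a reader applies no rebase (effectively CORRECTED) to data written under the hybrid calendar, old dates drift by this offset — which matches the reporter's observation that LEGACY and CORRECTED produce identical values when rebasing silently does not happen.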
[jira] [Assigned] (SPARK-28646) Allow usage of `count` only for parameterless aggregate function
[ https://issues.apache.org/jira/browse/SPARK-28646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28646: Assignee: Apache Spark > Allow usage of `count` only for parameterless aggregate function > > > Key: SPARK-28646 > URL: https://issues.apache.org/jira/browse/SPARK-28646 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Dylan Guedes >Assignee: Apache Spark >Priority: Major > > Currently, Spark allows calls to `count` even for non parameterless aggregate > function. For example, the following query actually works: > {code:sql}SELECT count() OVER () FROM tenk1;{code} > In PgSQL, on the other hand, the following error is thrown: > {code:sql}ERROR: count(*) must be used to call a parameterless aggregate > function{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28646) Allow usage of `count` only for parameterless aggregate function
[ https://issues.apache.org/jira/browse/SPARK-28646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28646: Assignee: (was: Apache Spark) > Allow usage of `count` only for parameterless aggregate function > > > Key: SPARK-28646 > URL: https://issues.apache.org/jira/browse/SPARK-28646 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Dylan Guedes >Priority: Major > > Currently, Spark allows calls to `count` even for non parameterless aggregate > function. For example, the following query actually works: > {code:sql}SELECT count() OVER () FROM tenk1;{code} > In PgSQL, on the other hand, the following error is thrown: > {code:sql}ERROR: count(*) must be used to call a parameterless aggregate > function{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33593) Parquet vector reader incorrect with binary partition value
[ https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240454#comment-17240454 ] angerszhu commented on SPARK-33593: --- raise a pr soon > Parquet vector reader incorrect with binary partition value > --- > > Key: SPARK-33593 > URL: https://issues.apache.org/jira/browse/SPARK-33593 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Priority: Major > > {code:java} > test("Parquet vector reader incorrect with binary partition value") { > Seq(false, true).foreach(tag => { > withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) { > withTable("t1") { > sql( > """CREATE TABLE t1(name STRING, id BINARY, part BINARY) > | USING PARQUET PARTITIONED BY (part)""".stripMargin) > sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', > X'537061726B2053514C')") > if (tag) { > checkAnswer(sql("SELECT name, cast(id as string), cast(part as > string) FROM t1"), > Row("a", "Spark SQL", "")) > } else { > checkAnswer(sql("SELECT name, cast(id as string), cast(part as > string) FROM t1"), > Row("a", "Spark SQL", "Spark SQL")) > } > } > } > }) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33593) Parquet vector reader incorrect with binary partition value
angerszhu created SPARK-33593: - Summary: Parquet vector reader incorrect with binary partition value Key: SPARK-33593 URL: https://issues.apache.org/jira/browse/SPARK-33593 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: angerszhu {code:java} test("Parquet vector reader incorrect with binary partition value") { Seq(false, true).foreach(tag => { withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) { withTable("t1") { sql( """CREATE TABLE t1(name STRING, id BINARY, part BINARY) | USING PARQUET PARTITIONED BY (part)""".stripMargin) sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')") if (tag) { checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"), Row("a", "Spark SQL", "")) } else { checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"), Row("a", "Spark SQL", "Spark SQL")) } } } }) } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
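As a side note on the reproducer above, the binary literal X'537061726B2053514C' is simply the UTF-8 encoding of "Spark SQL" — the value the non-vectorized reader correctly returns for the partition column, while the vectorized reader yields an empty string:

```python
# Decode the hex literal from the test case; pure Python, no Spark needed.
payload = bytes.fromhex("537061726B2053514C")
assert payload.decode("utf-8") == "Spark SQL"
```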
[jira] [Commented] (SPARK-33572) Datetime building should fail if the year, month, ..., second combination is invalid
[ https://issues.apache.org/jira/browse/SPARK-33572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240453#comment-17240453 ] Apache Spark commented on SPARK-33572: -- User 'waitinfuture' has created a pull request for this issue: https://github.com/apache/spark/pull/30516 > Datetime building should fail if the year, month, ..., second combination is > invalid > > > Key: SPARK-33572 > URL: https://issues.apache.org/jira/browse/SPARK-33572 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: zhoukeyong >Priority: Major > > Datetime building should fail if the year, month, ..., second combination is > invalid, when ANSI mode is enabled. This patch should update MakeDate, > MakeTimestamp and MakeInterval. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33572) Datetime building should fail if the year, month, ..., second combination is invalid
[ https://issues.apache.org/jira/browse/SPARK-33572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33572: Assignee: (was: Apache Spark) > Datetime building should fail if the year, month, ..., second combination is > invalid > > > Key: SPARK-33572 > URL: https://issues.apache.org/jira/browse/SPARK-33572 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: zhoukeyong >Priority: Major > > Datetime building should fail if the year, month, ..., second combination is > invalid, when ANSI mode is enabled. This patch should update MakeDate, > MakeTimestamp and MakeInterval. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33572) Datetime building should fail if the year, month, ..., second combination is invalid
[ https://issues.apache.org/jira/browse/SPARK-33572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240452#comment-17240452 ] Apache Spark commented on SPARK-33572: -- User 'waitinfuture' has created a pull request for this issue: https://github.com/apache/spark/pull/30516 > Datetime building should fail if the year, month, ..., second combination is > invalid > > > Key: SPARK-33572 > URL: https://issues.apache.org/jira/browse/SPARK-33572 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: zhoukeyong >Priority: Major > > Datetime building should fail if the year, month, ..., second combination is > invalid, when ANSI mode is enabled. This patch should update MakeDate, > MakeTimestamp and MakeInterval. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33572) Datetime building should fail if the year, month, ..., second combination is invalid
[ https://issues.apache.org/jira/browse/SPARK-33572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33572: Assignee: Apache Spark > Datetime building should fail if the year, month, ..., second combination is > invalid > > > Key: SPARK-33572 > URL: https://issues.apache.org/jira/browse/SPARK-33572 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: zhoukeyong >Assignee: Apache Spark >Priority: Major > > Datetime building should fail if the year, month, ..., second combination is > invalid, when ANSI mode is enabled. This patch should update MakeDate, > MakeTimestamp and MakeInterval. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
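The ANSI behavior requested in SPARK-33572 mirrors what Python's own datetime constructors already do: an invalid year/month/.../second combination raises an error instead of being silently accepted or wrapped. This is only an analogy — Spark's MakeDate, MakeTimestamp, and MakeInterval are separate SQL expressions:

```python
from datetime import date, datetime

# There is no 13th month: constructing the date must fail.
try:
    date(2020, 13, 1)
    raise AssertionError("should have failed")
except ValueError:
    pass

# February has no 30th day: the timestamp must fail too.
try:
    datetime(2020, 2, 30, 0, 0, 0)
    raise AssertionError("should have failed")
except ValueError:
    pass
```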
[jira] [Issue Comment Deleted] (SPARK-33498) Datetime parsing should fail if the input string can't be parsed, or the pattern string is invalid
[ https://issues.apache.org/jira/browse/SPARK-33498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33498: - Comment: was deleted (was: User 'leanken' has created a pull request for this issue: https://github.com/apache/spark/pull/30540) > Datetime parsing should fail if the input string can't be parsed, or the > pattern string is invalid > -- > > Key: SPARK-33498 > URL: https://issues.apache.org/jira/browse/SPARK-33498 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Assignee: Leanken.Lin >Priority: Major > Fix For: 3.1.0 > > > Datetime parsing should fail if the input string can't be parsed, or the > pattern string is invalid, when ANSI mode is enabled. This patch should update > GetTimeStamp, UnixTimeStamp, ToUnixTimeStamp and Cast -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33576) PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC message: negative bodyLength'.
[ https://issues.apache.org/jira/browse/SPARK-33576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240451#comment-17240451 ] Hyukjin Kwon commented on SPARK-33576: -- - Are you able to provide the full reproducer with smaller data? - Does that happen consistently which ever you code run that uses pandas / Arrow? - If this is indeterministically reproduced, Is it dependent on the codes or data? > PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC > message: negative bodyLength'. > - > > Key: SPARK-33576 > URL: https://issues.apache.org/jira/browse/SPARK-33576 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.1 > Environment: Databricks runtime 7.3 > Spakr 3.0.1 > Scala 2.12 >Reporter: Darshat >Priority: Major > > Hello, > We are using Databricks on Azure to process large amount of ecommerce data. > Databricks runtime is 7.3 which includes Apache spark 3.0.1 and Scala 2.12. > During processing, there is a groupby operation on the DataFrame that > consistently gets an exception of this type: > > {color:#ff}PythonException: An exception was thrown from a UDF: 'OSError: > Invalid IPC message: negative bodyLength'. 
Full traceback below: Traceback > (most recent call last): File "/databricks/spark/python/pyspark/worker.py", > line 654, in main process() File > "/databricks/spark/python/pyspark/worker.py", line 646, in process > serializer.dump_stream(out_iter, outfile) File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 281, in > dump_stream timely_flush_timeout_ms=self.timely_flush_timeout_ms) File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 97, in > dump_stream for batch in iterator: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 271, in > init_stream_yield_batches for series in iterator: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 287, in > load_stream for batch in batches: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 228, in > load_stream for batch in batches: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 118, in > load_stream for batch in reader: File "pyarrow/ipc.pxi", line 412, in > __iter__ File "pyarrow/ipc.pxi", line 432, in > pyarrow.lib._CRecordBatchReader.read_next_batch File "pyarrow/error.pxi", > line 99, in pyarrow.lib.check_status OSError: Invalid IPC message: negative > bodyLength{color} > > Code that causes this: > {color:#ff}x = df.groupby('providerid').apply(domain_features){color} > {color:#ff}display(x.info()){color} > Dataframe size - 22 million rows, 31 columns > One of the columns is a string ('providerid') on which we do a groupby > followed by an apply operation. There are 3 distinct provider ids in this > set. While trying to enumerate/count the results, we get this exception. > We've put all possible checks in the code for null values, or corrupt data > and we are not able to track this to application level code. I hope we can > get some help troubleshooting this as this is a blocker for rolling out at > scale. > The cluster has 8 nodes + driver, all 28GB RAM. 
I can provide any other > settings that could be useful. > Hope to get some insights into the problem. > Thanks, > Darshat Shah -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
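One plausible (unconfirmed) explanation for the "negative bodyLength" error above: with 22 million rows split across only 3 distinct provider ids, a single Arrow record batch for one group can grow past 2 GiB, and a length that large read back through a signed 32-bit field shows up negative. The wraparound itself is easy to demonstrate in pure Python; the mitigation usually suggested for this class of failure is lowering spark.sql.execution.arrow.maxRecordsPerBatch or splitting the data into more, smaller groups:

```python
import struct

# Hypothetical oversized batch body: just past the signed 32-bit limit.
body_len = 2**31 + 512
# Pack as unsigned 32-bit, read back as signed - the classic overflow:
wrapped = struct.unpack("<i", struct.pack("<I", body_len & 0xFFFFFFFF))[0]
# wrapped == -2147483136, i.e. the "negative bodyLength" in the error message
assert wrapped < 0
```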
[jira] [Commented] (SPARK-33498) Datetime parsing should fail if the input string can't be parsed, or the pattern string is invalid
[ https://issues.apache.org/jira/browse/SPARK-33498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240449#comment-17240449 ] Apache Spark commented on SPARK-33498: -- User 'leanken' has created a pull request for this issue: https://github.com/apache/spark/pull/30540 > Datetime parsing should fail if the input string can't be parsed, or the > pattern string is invalid > -- > > Key: SPARK-33498 > URL: https://issues.apache.org/jira/browse/SPARK-33498 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Assignee: Leanken.Lin >Priority: Major > Fix For: 3.1.0 > > > Datetime parsing should fail if the input string can't be parsed, or the > pattern string is invalid, when ANSI mode is enabled. This patch should update > GetTimeStamp, UnixTimeStamp, ToUnixTimeStamp and Cast -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33592) Pyspark ML Validator writer may lost params in estimatorParamMaps
[ https://issues.apache.org/jira/browse/SPARK-33592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240435#comment-17240435 ] Apache Spark commented on SPARK-33592: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/30539 > Pyspark ML Validator writer may lost params in estimatorParamMaps > - > > Key: SPARK-33592 > URL: https://issues.apache.org/jira/browse/SPARK-33592 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 3.0.0, 3.1.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > > Two typical cases to reproduce it: > (1) > {code:python} > tokenizer = Tokenizer(inputCol="text", outputCol="words") > hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") > lr = LogisticRegression() > pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) > paramGrid = ParamGridBuilder() \ > .addGrid(hashingTF.numFeatures, [10, 100]) \ > .addGrid(lr.maxIter, [100, 200]) \ > .build() > tvs = TrainValidationSplit(estimator=pipeline, >estimatorParamMaps=paramGrid, >evaluator=MulticlassClassificationEvaluator()) > tvs.save(tvsPath) > loadedTvs = TrainValidationSplit.load(tvsPath) > {code} > Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning params > `hashingTF.numFeatures` and `lr.maxIter` are lost. > (2) > {code:python} > lr = LogisticRegression() > ova = OneVsRest(classifier=lr) > grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build() > evaluator = MulticlassClassificationEvaluator() > tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, > evaluator=evaluator) > tvs.save(tvsPath) > loadedTvs = TrainValidationSplit.load(tvsPath) > {code} > Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning > params`lr.maxIter` are lost. > Both CrossValidator and TrainValidationSplit in Pyspark has this issue. 
[jira] [Commented] (SPARK-33592) Pyspark ML Validator writer may lost params in estimatorParamMaps
[ https://issues.apache.org/jira/browse/SPARK-33592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240436#comment-17240436 ] Apache Spark commented on SPARK-33592: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/30539 > Pyspark ML Validator writer may lost params in estimatorParamMaps > - > > Key: SPARK-33592 > URL: https://issues.apache.org/jira/browse/SPARK-33592 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 3.0.0, 3.1.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > > Two typical cases to reproduce it: > (1) > {code:python} > tokenizer = Tokenizer(inputCol="text", outputCol="words") > hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") > lr = LogisticRegression() > pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) > paramGrid = ParamGridBuilder() \ > .addGrid(hashingTF.numFeatures, [10, 100]) \ > .addGrid(lr.maxIter, [100, 200]) \ > .build() > tvs = TrainValidationSplit(estimator=pipeline, >estimatorParamMaps=paramGrid, >evaluator=MulticlassClassificationEvaluator()) > tvs.save(tvsPath) > loadedTvs = TrainValidationSplit.load(tvsPath) > {code} > Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning params > `hashingTF.numFeatures` and `lr.maxIter` are lost. > (2) > {code:python} > lr = LogisticRegression() > ova = OneVsRest(classifier=lr) > grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build() > evaluator = MulticlassClassificationEvaluator() > tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, > evaluator=evaluator) > tvs.save(tvsPath) > loadedTvs = TrainValidationSplit.load(tvsPath) > {code} > Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning > params`lr.maxIter` are lost. > Both CrossValidator and TrainValidationSplit in Pyspark has this issue. 
[jira] [Assigned] (SPARK-33592) Pyspark ML Validator writer may lost params in estimatorParamMaps
[ https://issues.apache.org/jira/browse/SPARK-33592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33592: Assignee: Weichen Xu (was: Apache Spark) > Pyspark ML Validator writer may lost params in estimatorParamMaps > - > > Key: SPARK-33592 > URL: https://issues.apache.org/jira/browse/SPARK-33592 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 3.0.0, 3.1.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > > Two typical cases to reproduce it: > (1) > {code:python} > tokenizer = Tokenizer(inputCol="text", outputCol="words") > hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") > lr = LogisticRegression() > pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) > paramGrid = ParamGridBuilder() \ > .addGrid(hashingTF.numFeatures, [10, 100]) \ > .addGrid(lr.maxIter, [100, 200]) \ > .build() > tvs = TrainValidationSplit(estimator=pipeline, >estimatorParamMaps=paramGrid, >evaluator=MulticlassClassificationEvaluator()) > tvs.save(tvsPath) > loadedTvs = TrainValidationSplit.load(tvsPath) > {code} > Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning params > `hashingTF.numFeatures` and `lr.maxIter` are lost. > (2) > {code:python} > lr = LogisticRegression() > ova = OneVsRest(classifier=lr) > grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build() > evaluator = MulticlassClassificationEvaluator() > tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, > evaluator=evaluator) > tvs.save(tvsPath) > loadedTvs = TrainValidationSplit.load(tvsPath) > {code} > Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning > params`lr.maxIter` are lost. > Both CrossValidator and TrainValidationSplit in Pyspark has this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33592) Pyspark ML Validator writer may lost params in estimatorParamMaps
[ https://issues.apache.org/jira/browse/SPARK-33592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33592: Assignee: Apache Spark (was: Weichen Xu) > Pyspark ML Validator writer may lost params in estimatorParamMaps > - > > Key: SPARK-33592 > URL: https://issues.apache.org/jira/browse/SPARK-33592 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 3.0.0, 3.1.0 >Reporter: Weichen Xu >Assignee: Apache Spark >Priority: Major > > Two typical cases to reproduce it: > (1) > {code:python} > tokenizer = Tokenizer(inputCol="text", outputCol="words") > hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") > lr = LogisticRegression() > pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) > paramGrid = ParamGridBuilder() \ > .addGrid(hashingTF.numFeatures, [10, 100]) \ > .addGrid(lr.maxIter, [100, 200]) \ > .build() > tvs = TrainValidationSplit(estimator=pipeline, >estimatorParamMaps=paramGrid, >evaluator=MulticlassClassificationEvaluator()) > tvs.save(tvsPath) > loadedTvs = TrainValidationSplit.load(tvsPath) > {code} > Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning params > `hashingTF.numFeatures` and `lr.maxIter` are lost. > (2) > {code:python} > lr = LogisticRegression() > ova = OneVsRest(classifier=lr) > grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build() > evaluator = MulticlassClassificationEvaluator() > tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, > evaluator=evaluator) > tvs.save(tvsPath) > loadedTvs = TrainValidationSplit.load(tvsPath) > {code} > Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning > params`lr.maxIter` are lost. > Both CrossValidator and TrainValidationSplit in Pyspark has this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33582) Partition predicate pushdown into Hive metastore support not-equals
[ https://issues.apache.org/jira/browse/SPARK-33582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33582. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30534 [https://github.com/apache/spark/pull/30534] > Partition predicate pushdown into Hive metastore support not-equals > --- > > Key: SPARK-33582 > URL: https://issues.apache.org/jira/browse/SPARK-33582 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.1.0 > > > https://github.com/apache/hive/blob/b8bd4594bef718b1eeac9fceb437d7df7b480ed1/itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java#L2194-L2207 > https://issues.apache.org/jira/browse/HIVE-2702 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33589) Close opened session if the initialization fails
[ https://issues.apache.org/jira/browse/SPARK-33589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33589: Assignee: Yuming Wang > Close opened session if the initialization fails > > > Key: SPARK-33589 > URL: https://issues.apache.org/jira/browse/SPARK-33589 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33589) Close opened session if the initialization fails
[ https://issues.apache.org/jira/browse/SPARK-33589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33589. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30536 [https://github.com/apache/spark/pull/30536] > Close opened session if the initialization fails > > > Key: SPARK-33589 > URL: https://issues.apache.org/jira/browse/SPARK-33589 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33578) enableHiveSupport has no effect after a SparkContext without Hive support has been created
[ https://issues.apache.org/jira/browse/SPARK-33578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33578. -- Resolution: Won't Fix > enableHiveSupport has no effect after a SparkContext without Hive support has been created > -- > > Key: SPARK-33578 > URL: https://issues.apache.org/jira/browse/SPARK-33578 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: steven zhang >Priority: Minor > Fix For: 3.1.0 > > > Reproduce with the following code: > SparkConf sparkConf = new SparkConf().setAppName("hello"); > sparkConf.set("spark.master", "local"); > JavaSparkContext jssc = new JavaSparkContext(sparkConf); > SparkSession spark = SparkSession.builder() > .config("spark.serializer", > "org.apache.spark.serializer.KryoSerializer") > .config("hive.exec.dynamic.partition", > true).config("hive.exec.dynamic.partition.mode", "nonstrict") > .config("hive.metastore.uris", "thrift://hivemetastore:9083") > .enableHiveSupport() > .master("local") > .getOrCreate(); > spark.sql("select * from hudi_db.hudi_test_order").show(); > > It produces the following exception: > AssertionError: assertion failed: No plan for HiveTableRelation > [`hudi_db`.`hudi_test_order` … (at current master branch) > org.apache.spark.sql.AnalysisException: Table or view not found: > `hudi_db`.`hudi_test_order`; (at Spark v2.4.4) > > The reason is that SparkContext#getOrCreate(SparkConf) returns the active context, which keeps its original config, if one already exists, even though the input SparkConf is newer and carries the additional options. > enableHiveSupport sets the option ("spark.sql.catalogImplementation", "hive"), but by the time the SparkSession is created this conf is missed: > SharedState loads its conf from the existing SparkContext and therefore misses the Hive catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
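The "Won't Fix" outcome follows from getOrCreate semantics: if a SparkContext already exists, it is reused, and builder options set afterwards cannot retroactively change it. The stdlib-only sketch below imitates that reuse behavior with hypothetical stand-in classes (this is not Spark's real implementation, only an illustration of the reported mechanism):

```python
_active_context = None

class SparkContextSketch:
    """Stands in for SparkContext: the first creation fixes the config."""
    def __init__(self, conf):
        self.conf = dict(conf)

def get_or_create_context(conf):
    """Return the active context if one exists; new conf entries are dropped."""
    global _active_context
    if _active_context is None:
        _active_context = SparkContextSketch(conf)
    return _active_context

class SessionBuilderSketch:
    """Stands in for SparkSession.builder."""
    def __init__(self):
        self._options = {}

    def config(self, key, value):
        self._options[key] = value
        return self

    def enable_hive_support(self):
        # enableHiveSupport is just a builder option under the hood.
        return self.config("spark.sql.catalogImplementation", "hive")

    def get_or_create(self):
        ctx = get_or_create_context(self._options)
        # SharedState-style behavior: the catalog comes from the *context's*
        # conf, which predates enable_hive_support() here.
        catalog = ctx.conf.get("spark.sql.catalogImplementation", "in-memory")
        return {"context": ctx, "catalog": catalog}

# A context is created first, without Hive support...
get_or_create_context({})
# ...so enabling Hive support on a later builder has no effect:
session = SessionBuilderSketch().enable_hive_support().get_or_create()
```

The practical fix on the user side is to call enableHiveSupport() on the very first builder, before constructing any JavaSparkContext, so the context is created with the Hive catalog option already in place.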
[jira] [Updated] (SPARK-33589) Close opened session if the initialization fails
[ https://issues.apache.org/jira/browse/SPARK-33589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33589: Summary: Close opened session if the initialization fails (was: Add try catch when opening session) > Close opened session if the initialization fails > > > Key: SPARK-33589 > URL: https://issues.apache.org/jira/browse/SPARK-33589 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33592) Pyspark ML Validator writer may lost params in estimatorParamMaps
[ https://issues.apache.org/jira/browse/SPARK-33592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-33592: --- Description: Two typical cases to reproduce it: (1) {code:python} tokenizer = Tokenizer(inputCol="text", outputCol="words") hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") lr = LogisticRegression() pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) paramGrid = ParamGridBuilder() \ .addGrid(hashingTF.numFeatures, [10, 100]) \ .addGrid(lr.maxIter, [100, 200]) \ .build() tvs = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=MulticlassClassificationEvaluator()) tvs.save(tvsPath) loadedTvs = TrainValidationSplit.load(tvsPath) {code} Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning params `hashingTF.numFeatures` and `lr.maxIter` are lost. (2) {code:python} lr = LogisticRegression() ova = OneVsRest(classifier=lr) grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build() evaluator = MulticlassClassificationEvaluator() tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator) tvs.save(tvsPath) loadedTvs = TrainValidationSplit.load(tvsPath) {code} Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning params`lr.maxIter` are lost. Both CrossValidator and TrainValidationSplit in Pyspark has this issue. 
was: Two typical cases to reproduce it: (1) {code:python} tokenizer = Tokenizer(inputCol="text", outputCol="words") hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") lr = LogisticRegression() pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) paramGrid = ParamGridBuilder() \ .addGrid(hashingTF.numFeatures, [10, 100]) \ .addGrid(lr.maxIter, [100, 200]) \ .build() tvs = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=MulticlassClassificationEvaluator()) tvs.save(tvsPath) loadedTvs = TrainValidationSplit.load(tvsPath) {code} Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning params `hashingTF.numFeatures` and `lr.maxIter` are lost. (2) {code:python} lr = LogisticRegression() ova = OneVsRest(classifier=lr) grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build() evaluator = MulticlassClassificationEvaluator() tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator) tvs.save(tvsPath) loadedTvs = TrainValidationSplit.load(tvsPath) {code} Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning params`lr.maxIter` are lost. Both CrossValidator and TrainValidationSplit has this issue. 
> Pyspark ML Validator writer may lost params in estimatorParamMaps > - > > Key: SPARK-33592 > URL: https://issues.apache.org/jira/browse/SPARK-33592 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 3.0.0, 3.1.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > > Two typical cases to reproduce it: > (1) > {code:python} > tokenizer = Tokenizer(inputCol="text", outputCol="words") > hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") > lr = LogisticRegression() > pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) > paramGrid = ParamGridBuilder() \ > .addGrid(hashingTF.numFeatures, [10, 100]) \ > .addGrid(lr.maxIter, [100, 200]) \ > .build() > tvs = TrainValidationSplit(estimator=pipeline, >estimatorParamMaps=paramGrid, >evaluator=MulticlassClassificationEvaluator()) > tvs.save(tvsPath) > loadedTvs = TrainValidationSplit.load(tvsPath) > {code} > Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning params > `hashingTF.numFeatures` and `lr.maxIter` are lost. > (2) > {code:python} > lr = LogisticRegression() > ova = OneVsRest(classifier=lr) > grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build() > evaluator = MulticlassClassificationEvaluator() > tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, > evaluator=evaluator) > tvs.save(tvsPath) > loadedTvs = TrainValidationSplit.load(tvsPath) > {code} > Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning > params`lr.maxIter` are lost. > Both CrossValidator and TrainValidationSplit in Pyspark has this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33592) Pyspark ML Validator writer may lost params in estimatorParamMaps
[ https://issues.apache.org/jira/browse/SPARK-33592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu reassigned SPARK-33592: -- Assignee: Weichen Xu > Pyspark ML Validator writer may lost params in estimatorParamMaps > - > > Key: SPARK-33592 > URL: https://issues.apache.org/jira/browse/SPARK-33592 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 3.0.0, 3.1.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > > Two typical cases to reproduce it: > (1) > {code:python} > tokenizer = Tokenizer(inputCol="text", outputCol="words") > hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") > lr = LogisticRegression() > pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) > paramGrid = ParamGridBuilder() \ > .addGrid(hashingTF.numFeatures, [10, 100]) \ > .addGrid(lr.maxIter, [100, 200]) \ > .build() > tvs = TrainValidationSplit(estimator=pipeline, >estimatorParamMaps=paramGrid, >evaluator=MulticlassClassificationEvaluator()) > tvs.save(tvsPath) > loadedTvs = TrainValidationSplit.load(tvsPath) > {code} > Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning params > `hashingTF.numFeatures` and `lr.maxIter` are lost. > (2) > {code:python} > lr = LogisticRegression() > ova = OneVsRest(classifier=lr) > grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build() > evaluator = MulticlassClassificationEvaluator() > tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, > evaluator=evaluator) > tvs.save(tvsPath) > loadedTvs = TrainValidationSplit.load(tvsPath) > {code} > Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning > params`lr.maxIter` are lost. > Both CrossValidator and TrainValidationSplit has this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33592) Pyspark ML Validator writer may lost params in estimatorParamMaps
[ https://issues.apache.org/jira/browse/SPARK-33592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-33592: --- Description: Two typical cases to reproduce it: (1) {code:python} tokenizer = Tokenizer(inputCol="text", outputCol="words") hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") lr = LogisticRegression() pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) paramGrid = ParamGridBuilder() \ .addGrid(hashingTF.numFeatures, [10, 100]) \ .addGrid(lr.maxIter, [100, 200]) \ .build() tvs = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=MulticlassClassificationEvaluator()) tvs.save(tvsPath) loadedTvs = TrainValidationSplit.load(tvsPath) {code} Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning params `hashingTF.numFeatures` and `lr.maxIter` are lost. (2) {code:python} lr = LogisticRegression() ova = OneVsRest(classifier=lr) grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build() evaluator = MulticlassClassificationEvaluator() tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator) tvs.save(tvsPath) loadedTvs = TrainValidationSplit.load(tvsPath) {code} Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning params`lr.maxIter` are lost. Both CrossValidator and TrainValidationSplit has this issue. 
was: Two typical cases to reproduce it: (1) {code: python} tokenizer = Tokenizer(inputCol="text", outputCol="words") hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") lr = LogisticRegression() pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) paramGrid = ParamGridBuilder() \ .addGrid(hashingTF.numFeatures, [10, 100]) \ .addGrid(lr.maxIter, [100, 200]) \ .build() tvs = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=MulticlassClassificationEvaluator()) tvs.save(tvsPath) loadedTvs = TrainValidationSplit.load(tvsPath) {code} Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning params `hashingTF.numFeatures` and `lr.maxIter` are lost. (2) {code: python} lr = LogisticRegression() ova = OneVsRest(classifier=lr) grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build() evaluator = MulticlassClassificationEvaluator() tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator) tvs.save(tvsPath) loadedTvs = TrainValidationSplit.load(tvsPath) {code} Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning params`lr.maxIter` are lost. Both CrossValidator and TrainValidationSplit has this issue. 
> Pyspark ML Validator writer may lost params in estimatorParamMaps > - > > Key: SPARK-33592 > URL: https://issues.apache.org/jira/browse/SPARK-33592 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 3.0.0, 3.1.0 >Reporter: Weichen Xu >Priority: Major > > Two typical cases to reproduce it: > (1) > {code:python} > tokenizer = Tokenizer(inputCol="text", outputCol="words") > hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") > lr = LogisticRegression() > pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) > paramGrid = ParamGridBuilder() \ > .addGrid(hashingTF.numFeatures, [10, 100]) \ > .addGrid(lr.maxIter, [100, 200]) \ > .build() > tvs = TrainValidationSplit(estimator=pipeline, >estimatorParamMaps=paramGrid, >evaluator=MulticlassClassificationEvaluator()) > tvs.save(tvsPath) > loadedTvs = TrainValidationSplit.load(tvsPath) > {code} > Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning params > `hashingTF.numFeatures` and `lr.maxIter` are lost. > (2) > {code:python} > lr = LogisticRegression() > ova = OneVsRest(classifier=lr) > grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build() > evaluator = MulticlassClassificationEvaluator() > tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, > evaluator=evaluator) > tvs.save(tvsPath) > loadedTvs = TrainValidationSplit.load(tvsPath) > {code} > Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning > params`lr.maxIter` are lost. > Both CrossValidator and TrainValidationSplit has this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33592) Pyspark ML Validator writer may lost params in estimatorParamMaps
Weichen Xu created SPARK-33592: -- Summary: Pyspark ML Validator writer may lost params in estimatorParamMaps Key: SPARK-33592 URL: https://issues.apache.org/jira/browse/SPARK-33592 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 3.0.0, 3.1.0 Reporter: Weichen Xu Two typical cases to reproduce it: (1) {code: python} tokenizer = Tokenizer(inputCol="text", outputCol="words") hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") lr = LogisticRegression() pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) paramGrid = ParamGridBuilder() \ .addGrid(hashingTF.numFeatures, [10, 100]) \ .addGrid(lr.maxIter, [100, 200]) \ .build() tvs = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=MulticlassClassificationEvaluator()) tvs.save(tvsPath) loadedTvs = TrainValidationSplit.load(tvsPath) {code} Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning params `hashingTF.numFeatures` and `lr.maxIter` are lost. (2) {code: python} lr = LogisticRegression() ova = OneVsRest(classifier=lr) grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build() evaluator = MulticlassClassificationEvaluator() tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator) tvs.save(tvsPath) loadedTvs = TrainValidationSplit.load(tvsPath) {code} Then we can check `loadedTvs.getEstimatorParamMaps()`, the tuning params`lr.maxIter` are lost. Both CrossValidator and TrainValidationSplit has this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33517) Incorrect menu item display and link in PySpark Usage Guide for Pandas with Apache Arrow
[ https://issues.apache.org/jira/browse/SPARK-33517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33517: Assignee: liucht-inspur (was: Apache Spark) > Incorrect menu item display and link in PySpark Usage Guide for Pandas with > Apache Arrow > > > Key: SPARK-33517 > URL: https://issues.apache.org/jira/browse/SPARK-33517 > Project: Spark > Issue Type: Bug > Components: docs >Affects Versions: 3.0.0, 3.0.1 >Reporter: liucht-inspur >Assignee: liucht-inspur >Priority: Minor > Fix For: 3.1.0 > > Attachments: image-2020-11-23-18-47-01-591.png, > image-2020-11-27-09-43-58-141.png, spark-doc.jpg > > > Error setting menu item and link, change "Apache Arrow in Spark" to "Apache > Arrow in PySpark" > !image-2020-11-23-18-47-01-591.png! > > after: > !image-2020-11-27-09-43-58-141.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33517) Incorrect menu item display and link in PySpark Usage Guide for Pandas with Apache Arrow
[ https://issues.apache.org/jira/browse/SPARK-33517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33517: Assignee: Apache Spark > Incorrect menu item display and link in PySpark Usage Guide for Pandas with > Apache Arrow > > > Key: SPARK-33517 > URL: https://issues.apache.org/jira/browse/SPARK-33517 > Project: Spark > Issue Type: Bug > Components: docs >Affects Versions: 3.0.0, 3.0.1 >Reporter: liucht-inspur >Assignee: Apache Spark >Priority: Minor > Attachments: image-2020-11-23-18-47-01-591.png, > image-2020-11-27-09-43-58-141.png, spark-doc.jpg > > > Error setting menu item and link, change "Apache Arrow in Spark" to "Apache > Arrow in PySpark" > !image-2020-11-23-18-47-01-591.png! > > after: > !image-2020-11-27-09-43-58-141.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33517) Incorrect menu item display and link in PySpark Usage Guide for Pandas with Apache Arrow
[ https://issues.apache.org/jira/browse/SPARK-33517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33517. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30466 [https://github.com/apache/spark/pull/30466] > Incorrect menu item display and link in PySpark Usage Guide for Pandas with > Apache Arrow > > > Key: SPARK-33517 > URL: https://issues.apache.org/jira/browse/SPARK-33517 > Project: Spark > Issue Type: Bug > Components: docs >Affects Versions: 3.0.0, 3.0.1 >Reporter: liucht-inspur >Assignee: Apache Spark >Priority: Minor > Fix For: 3.1.0 > > Attachments: image-2020-11-23-18-47-01-591.png, > image-2020-11-27-09-43-58-141.png, spark-doc.jpg > > > Error setting menu item and link, change "Apache Arrow in Spark" to "Apache > Arrow in PySpark" > !image-2020-11-23-18-47-01-591.png! > > after: > !image-2020-11-27-09-43-58-141.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33585) The comment for SQLContext.tables() doesn't mention the `database` column
[ https://issues.apache.org/jira/browse/SPARK-33585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33585. --- Fix Version/s: 2.4.8 3.0.2 3.1.0 Resolution: Fixed Issue resolved by pull request 30526 [https://github.com/apache/spark/pull/30526] > The comment for SQLContext.tables() doesn't mention the `database` column > - > > Key: SPARK-33585 > URL: https://issues.apache.org/jira/browse/SPARK-33585 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.1.0, 3.0.2, 2.4.8 > > > The comment says: "The returned DataFrame has two columns, tableName and > isTemporary": > https://github.com/apache/spark/blob/b26ae98407c6c017a4061c0c420f48685ddd6163/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L664 > but actually the dataframe has 3 columns: > {code:scala} > scala> spark.range(10).createOrReplaceTempView("view1") > scala> val tables = spark.sqlContext.tables() > tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string > ... 1 more field] > scala> tables.printSchema > root > |-- database: string (nullable = false) > |-- tableName: string (nullable = false) > |-- isTemporary: boolean (nullable = false) > scala> tables.show > ++-+---+ > |database|tableName|isTemporary| > ++-+---+ > | default| t1| false| > | default| t2| false| > | default| ymd| false| > ||view1| true| > ++-+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33585) The comment for SQLContext.tables() doesn't mention the `database` column
[ https://issues.apache.org/jira/browse/SPARK-33585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33585: - Assignee: Maxim Gekk > The comment for SQLContext.tables() doesn't mention the `database` column > - > > Key: SPARK-33585 > URL: https://issues.apache.org/jira/browse/SPARK-33585 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > The comment says: "The returned DataFrame has two columns, tableName and > isTemporary": > https://github.com/apache/spark/blob/b26ae98407c6c017a4061c0c420f48685ddd6163/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L664 > but actually the dataframe has 3 columns: > {code:scala} > scala> spark.range(10).createOrReplaceTempView("view1") > scala> val tables = spark.sqlContext.tables() > tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string > ... 1 more field] > scala> tables.printSchema > root > |-- database: string (nullable = false) > |-- tableName: string (nullable = false) > |-- isTemporary: boolean (nullable = false) > scala> tables.show > ++-+---+ > |database|tableName|isTemporary| > ++-+---+ > | default| t1| false| > | default| t2| false| > | default| ymd| false| > ||view1| true| > ++-+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33588) Partition spec in SHOW TABLE EXTENDED doesn't respect `spark.sql.caseSensitive`
[ https://issues.apache.org/jira/browse/SPARK-33588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33588. --- Fix Version/s: 3.1.0 Assignee: Maxim Gekk Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/30529 > Partition spec in SHOW TABLE EXTENDED doesn't respect > `spark.sql.caseSensitive` > --- > > Key: SPARK-33588 > URL: https://issues.apache.org/jira/browse/SPARK-33588 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.1.0 > > > For example: > {code:sql} > spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int) > > USING parquet > > partitioned by (year, month); > spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1; > spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1); > Error in query: Partition spec is invalid. The spec (YEAR, Month) must match > the partition spec (year, month) defined in table '`default`.`tbl1`'; > {code} > The spark.sql.caseSensitive flag is false by default, so, the partition spec > is valid. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33588) Partition spec in SHOW TABLE EXTENDED doesn't respect `spark.sql.caseSensitive`
[ https://issues.apache.org/jira/browse/SPARK-33588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33588: -- Affects Version/s: (was: 3.0.2) (was: 2.4.8) 2.4.7 3.0.1 > Partition spec in SHOW TABLE EXTENDED doesn't respect > `spark.sql.caseSensitive` > --- > > Key: SPARK-33588 > URL: https://issues.apache.org/jira/browse/SPARK-33588 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7, 3.0.1, 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.1.0 > > > For example: > {code:sql} > spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int) > > USING parquet > > partitioned by (year, month); > spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1; > spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1); > Error in query: Partition spec is invalid. The spec (YEAR, Month) must match > the partition spec (year, month) defined in table '`default`.`tbl1`'; > {code} > The spark.sql.caseSensitive flag is false by default, so, the partition spec > is valid. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
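The SPARK-33588 fix amounts to resolving the user-supplied spec keys against the table's partition columns case-insensitively whenever spark.sql.caseSensitive is false. A small illustrative helper (not Spark's actual resolution code) showing that normalization:

```python
def resolve_partition_spec(spec, partition_cols, case_sensitive=False):
    """Map user-supplied partition keys onto the table's partition columns.

    With case_sensitive=False, (YEAR, Month) should resolve to (year, month)
    instead of being rejected as an invalid spec.
    """
    if case_sensitive:
        lookup = {c: c for c in partition_cols}
    else:
        lookup = {c.lower(): c for c in partition_cols}

    resolved = {}
    for key, value in spec.items():
        canonical = lookup.get(key if case_sensitive else key.lower())
        if canonical is None:
            raise ValueError(
                f"Partition spec key {key!r} does not match "
                f"partition columns {partition_cols}"
            )
        resolved[canonical] = value
    return resolved

# The failing example from the report, resolved case-insensitively:
resolved = resolve_partition_spec({"YEAR": 2015, "Month": 1}, ["year", "month"])
```

With the default (case-insensitive) setting, the reported `SHOW TABLE EXTENDED ... PARTITION(YEAR = 2015, Month = 1)` spec is valid and should match the `(year, month)` columns.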
[jira] [Assigned] (SPARK-33591) NULL is recognized as the "null" string in partition specs
[ https://issues.apache.org/jira/browse/SPARK-33591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33591: Assignee: Apache Spark > NULL is recognized as the "null" string in partition specs > -- > > Key: SPARK-33591 > URL: https://issues.apache.org/jira/browse/SPARK-33591 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > For example: > {code:sql} > spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED > BY (p1); > spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0; > spark-sql> SELECT isnull(p1) FROM tbl5; > false > {code} > The *p1 = null* is not recognized as a partition with NULL value. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33591) NULL is recognized as the "null" string in partition specs
[ https://issues.apache.org/jira/browse/SPARK-33591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240356#comment-17240356 ] Apache Spark commented on SPARK-33591: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30538 > NULL is recognized as the "null" string in partition specs > -- > > Key: SPARK-33591 > URL: https://issues.apache.org/jira/browse/SPARK-33591 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > For example: > {code:sql} > spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED > BY (p1); > spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0; > spark-sql> SELECT isnull(p1) FROM tbl5; > false > {code} > The *p1 = null* is not recognized as a partition with NULL value. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33591) NULL is recognized as the "null" string in partition specs
[ https://issues.apache.org/jira/browse/SPARK-33591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33591: Assignee: (was: Apache Spark) > NULL is recognized as the "null" string in partition specs > -- > > Key: SPARK-33591 > URL: https://issues.apache.org/jira/browse/SPARK-33591 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > For example: > {code:sql} > spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED > BY (p1); > spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0; > spark-sql> SELECT isnull(p1) FROM tbl5; > false > {code} > The *p1 = null* is not recognized as a partition with NULL value. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33587) Kill the executor on nested fatal errors
[ https://issues.apache.org/jira/browse/SPARK-33587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33587: - Assignee: Shixiong Zhu > Kill the executor on nested fatal errors > > > Key: SPARK-33587 > URL: https://issues.apache.org/jira/browse/SPARK-33587 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > > Currently we kill the executor when hitting a fatal error. However, if the > fatal error is wrapped by another exception, such as > - java.util.concurrent.ExecutionException, > com.google.common.util.concurrent.UncheckedExecutionException, > com.google.common.util.concurrent.ExecutionError when using Guava cache and > java thread pool. > - SparkException thrown from this line: > https://github.com/apache/spark/blob/cf98a761de677c733f3c33230e1c63ddb785d5c5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L231 > We will still keep the executor running. Fatal errors are usually > unrecoverable (such as OutOfMemoryError), some components may be in a broken > state when hitting a fatal error. Hence, it's better to detect the nested > fatal error as well and kill the executor. Then we can rely on Spark's fault > tolerance to recover. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33587) Kill the executor on nested fatal errors
[ https://issues.apache.org/jira/browse/SPARK-33587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33587. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30528 [https://github.com/apache/spark/pull/30528] > Kill the executor on nested fatal errors > > > Key: SPARK-33587 > URL: https://issues.apache.org/jira/browse/SPARK-33587 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Fix For: 3.1.0 > > > Currently we kill the executor when hitting a fatal error. However, if the > fatal error is wrapped by another exception, such as > - java.util.concurrent.ExecutionException, > com.google.common.util.concurrent.UncheckedExecutionException, > com.google.common.util.concurrent.ExecutionError when using Guava cache and > java thread pool. > - SparkException thrown from this line: > https://github.com/apache/spark/blob/cf98a761de677c733f3c33230e1c63ddb785d5c5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L231 > We will still keep the executor running. Fatal errors are usually > unrecoverable (such as OutOfMemoryError), some components may be in a broken > state when hitting a fatal error. Hence, it's better to detect the nested > fatal error as well and kill the executor. Then we can rely on Spark's fault > tolerance to recover. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
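The nested-fatal-error detection described in SPARK-33587 amounts to walking an exception's cause chain and checking each link for a fatal error. The sketch below illustrates the idea only; it is not the actual Spark patch, and the depth limit and the choice of `VirtualMachineError` as the fatal type are assumptions for this example.

```java
import java.util.concurrent.ExecutionException;

public class NestedFatalErrorCheck {
    // Cap on how far to walk the cause chain, guarding against cyclic causes.
    private static final int MAX_DEPTH = 5;

    /** Returns true if t, or any nested cause of t, is a fatal VM error. */
    static boolean containsFatalError(Throwable t, int depth) {
        if (t == null || depth > MAX_DEPTH) {
            return false;
        }
        // OutOfMemoryError and friends extend VirtualMachineError.
        if (t instanceof VirtualMachineError) {
            return true;
        }
        return containsFatalError(t.getCause(), depth + 1);
    }

    public static void main(String[] args) {
        // A fatal error wrapped the way a thread pool or Guava cache wraps it.
        Throwable wrapped = new ExecutionException(new OutOfMemoryError("heap"));
        System.out.println(containsFatalError(wrapped, 0));                 // true
        System.out.println(containsFatalError(new RuntimeException("x"), 0)); // false
    }
}
```

With a check like this, the executor can treat `ExecutionException(OutOfMemoryError)` the same as a bare `OutOfMemoryError` and terminate, letting Spark's fault tolerance recover the lost tasks.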
[jira] [Updated] (SPARK-33591) NULL is recognized as the "null" string in partition specs
[ https://issues.apache.org/jira/browse/SPARK-33591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33591: --- Issue Type: Bug (was: Improvement) > NULL is recognized as the "null" string in partition specs > -- > > Key: SPARK-33591 > URL: https://issues.apache.org/jira/browse/SPARK-33591 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > For example: > {code:sql} > spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED > BY (p1); > spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0; > spark-sql> SELECT isnull(p1) FROM tbl5; > false > {code} > The *p1 = null* is not recognized as a partition with NULL value. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33591) NULL is recognized as the "null" string in partition specs
[ https://issues.apache.org/jira/browse/SPARK-33591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33591: --- Issue Type: Improvement (was: Bug) > NULL is recognized as the "null" string in partition specs > -- > > Key: SPARK-33591 > URL: https://issues.apache.org/jira/browse/SPARK-33591 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > For example: > {code:sql} > spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED > BY (p1); > spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0; > spark-sql> SELECT isnull(p1) FROM tbl5; > false > {code} > The *p1 = null* is not recognized as a partition with NULL value. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33591) NULL is recognized as the "null" string in partition specs
Maxim Gekk created SPARK-33591: -- Summary: NULL is recognized as the "null" string in partition specs Key: SPARK-33591 URL: https://issues.apache.org/jira/browse/SPARK-33591 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk For example: {code:sql} spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1); spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0; spark-sql> SELECT isnull(p1) FROM tbl5; false {code} The *p1 = null* is not recognized as a partition with NULL value. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
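The bug above boils down to the parser treating the unquoted null literal in `PARTITION (p1 = null)` as the four-character string "null". A fix needs to carry the "this was a NULL literal" distinction through to the partition value. The sketch below is purely illustrative: the normalizer name and the boolean flag are hypothetical, and `__HIVE_DEFAULT_PARTITION__` is used here on the assumption that Spark follows the Hive convention for null partition values.

```java
public class PartitionSpecNormalizer {
    // Hive convention for a null partition value (an assumption in this sketch).
    static final String DEFAULT_PARTITION_NAME = "__HIVE_DEFAULT_PARTITION__";

    /**
     * Normalizes a raw partition-spec value. A SQL NULL literal should map to
     * the null-partition sentinel; a quoted string 'null' stays a string.
     */
    static String normalize(String rawValue, boolean isNullLiteral) {
        if (rawValue == null || isNullLiteral) {
            return DEFAULT_PARTITION_NAME;
        }
        return rawValue;
    }

    public static void main(String[] args) {
        // PARTITION (p1 = null): the literal should become the null sentinel...
        System.out.println(normalize("null", true));   // __HIVE_DEFAULT_PARTITION__
        // ...while PARTITION (p1 = 'null') keeps the literal string.
        System.out.println(normalize("null", false));  // null
    }
}
```

Under this scheme `SELECT isnull(p1)` would return true for the first insert, because the stored partition value is the null sentinel rather than a string.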
[jira] [Assigned] (SPARK-33590) Missing submenus for Performance Tuning in Spark SQL Guide
[ https://issues.apache.org/jira/browse/SPARK-33590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33590: - Assignee: Kazuaki Ishizaki > Missing submenus for Performance Tuning in Spark SQL Guide > --- > > Key: SPARK-33590 > URL: https://issues.apache.org/jira/browse/SPARK-33590 > Project: Spark > Issue Type: Bug > Components: docs >Affects Versions: 3.0.0, 3.0.1 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Attachments: image-2020-11-30-00-04-07-969.png > > > Sub-menus for {Coalesce Hints for SQL Queries} and {Adaptive Query > Execution} are missing > !image-2020-11-30-00-04-07-969.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33590) Missing submenus for Performance Tuning in Spark SQL Guide
[ https://issues.apache.org/jira/browse/SPARK-33590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33590. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30537 [https://github.com/apache/spark/pull/30537] > Missing submenus for Performance Tuning in Spark SQL Guide > --- > > Key: SPARK-33590 > URL: https://issues.apache.org/jira/browse/SPARK-33590 > Project: Spark > Issue Type: Bug > Components: docs >Affects Versions: 3.0.0, 3.0.1 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Fix For: 3.1.0 > > Attachments: image-2020-11-30-00-04-07-969.png > > > Sub-menus for {Coalesce Hints for SQL Queries} and {Adaptive Query > Execution} are missing > !image-2020-11-30-00-04-07-969.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33531) [SQL] Reduce shuffle task number when calling CollectLimitExec#executeToIterator
[ https://issues.apache.org/jira/browse/SPARK-33531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mori[A]rty updated SPARK-33531: --- Description: Currently, when invoking CollectLimitExec#executeToIterator, a single-partition ShuffledRowRDD containing all parent partitions is created. Spark will compute all these partitions to get the result. But in most cases, computing the first few partitions is enough to get the result, which takes much less time. When running a SparkThriftServer and spark.sql.thriftServer.incrementalCollect is enabled, too many shuffle tasks will lead to a significant performance issue for SQL queries terminated with LIMIT. A possible improvement may be as follows: # Create a ShuffledRowRDD containing the first parent partition. # Collect rows of this ShuffledRowRDD to the driver. # If the collected rows are fewer than the limit, create the next ShuffledRowRDD with several following parent partitions. The number of parent partitions is calculated the same way as in SparkPlan#executeTake. # Repeat 2~3 until the total collected rows reach the limit or all parent partitions have been computed. was: Using a new method SparkPlan#executeTakeToIterator to implement CollectLimitExec#executeToIterator to avoid shuffle caused by invoking parent method SparkPlan#executeToIterator. When running a SparkThriftServer and spark.sql.thriftServer.incrementalCollect is enabled, extra shuffle will lead to a significant performance issue for SQL queries terminated with LIMIT. > [SQL] Reduce shuffle task number when calling > CollectLimitExec#executeToIterator > > > Key: SPARK-33531 > URL: https://issues.apache.org/jira/browse/SPARK-33531 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.1 >Reporter: Mori[A]rty >Priority: Major > > Currently, when invoking CollectLimitExec#executeToIterator, a > single-partition ShuffledRowRDD containing all parent partitions is created. 
> Spark will compute all these partitions to get the result. > But in most cases, computing the first few partitions is enough to get the > result, which takes much less time. > When running a SparkThriftServer and > spark.sql.thriftServer.incrementalCollect is enabled, too many shuffle tasks > will lead to a significant performance issue for SQL queries terminated with LIMIT. > A possible improvement may be as follows: > # Create a ShuffledRowRDD containing the first parent partition. > # Collect rows of this ShuffledRowRDD to the driver. > # If the collected rows are fewer than the limit, create the next > ShuffledRowRDD with several following parent partitions. The number of > parent partitions is calculated the same way as in SparkPlan#executeTake. > # Repeat 2~3 until the total collected rows reach the limit or all parent > partitions have been computed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
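The four-step proposal above can be sketched with partitions modeled as in-memory lists. This is an illustrative sketch, not Spark code: the method names are invented, and the 4x scale-up between rounds is an assumption borrowed from the way SparkPlan#executeTake widens its scan.

```java
import java.util.ArrayList;
import java.util.List;

public class IncrementalLimitCollect {
    /**
     * Incremental limit collection: scan a small batch of partitions, and only
     * compute more partitions when the limit has not yet been reached.
     */
    static <T> List<T> collectWithLimit(List<List<T>> partitions, int limit) {
        List<T> collected = new ArrayList<>();
        int scanned = 0;
        int batch = 1; // step 1: start with just the first parent partition
        while (collected.size() < limit && scanned < partitions.size()) {
            int end = Math.min(scanned + batch, partitions.size());
            for (int i = scanned; i < end; i++) {       // steps 2-3: collect rows
                for (T row : partitions.get(i)) {
                    if (collected.size() == limit) {
                        return collected;               // step 4: limit reached
                    }
                    collected.add(row);
                }
            }
            scanned = end;
            batch *= 4; // widen the next round, as executeTake's scale-up does
        }
        return collected; // step 4: all parent partitions computed
    }

    public static void main(String[] args) {
        List<List<Integer>> parts =
            List.of(List.of(1, 2), List.of(3), List.of(4, 5, 6));
        // Only the first partitions needed to satisfy the limit are consumed.
        System.out.println(collectWithLimit(parts, 4)); // [1, 2, 3, 4]
    }
}
```

For a query ending in a small LIMIT, a loop of this shape usually stops after the first round, which is why it avoids launching shuffle tasks for every parent partition.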
[jira] [Updated] (SPARK-33531) [SQL] Reduce shuffle task number when calling CollectLimitExec#executeToIterator
[ https://issues.apache.org/jira/browse/SPARK-33531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mori[A]rty updated SPARK-33531: --- Summary: [SQL] Reduce shuffle task number when calling CollectLimitExec#executeToIterator (was: [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator) > [SQL] Reduce shuffle task number when calling > CollectLimitExec#executeToIterator > > > Key: SPARK-33531 > URL: https://issues.apache.org/jira/browse/SPARK-33531 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.1 >Reporter: Mori[A]rty >Priority: Major > > Using a new method SparkPlan#executeTakeToIterator to implement > CollectLimitExec#executeToIterator to avoid shuffle caused by invoking parent > method SparkPlan#executeToIterator. > When running a SparkThriftServer and > spark.sql.thriftServer.incrementalCollect is enabled, extra shuffle will lead > to a significant performance issue for SQLs terminated with LIMIT. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33590) Missing submenus for Performance Tuning in Spark SQL Guide
[ https://issues.apache.org/jira/browse/SPARK-33590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240284#comment-17240284 ] Apache Spark commented on SPARK-33590: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/30537 > Missing submenus for Performance Tuning in Spark SQL Guide > --- > > Key: SPARK-33590 > URL: https://issues.apache.org/jira/browse/SPARK-33590 > Project: Spark > Issue Type: Bug > Components: docs >Affects Versions: 3.0.0, 3.0.1 >Reporter: Kazuaki Ishizaki >Priority: Minor > Attachments: image-2020-11-30-00-04-07-969.png > > > Sub-menus for {Coalesce Hints for SQL Queries} and {Adaptive Query > Execution} are missing > !image-2020-11-30-00-04-07-969.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33590) Missing submenus for Performance Tuning in Spark SQL Guide
[ https://issues.apache.org/jira/browse/SPARK-33590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33590: Assignee: (was: Apache Spark) > Missing submenus for Performance Tuning in Spark SQL Guide > --- > > Key: SPARK-33590 > URL: https://issues.apache.org/jira/browse/SPARK-33590 > Project: Spark > Issue Type: Bug > Components: docs >Affects Versions: 3.0.0, 3.0.1 >Reporter: Kazuaki Ishizaki >Priority: Minor > Attachments: image-2020-11-30-00-04-07-969.png > > > Sub-menus for {Coalesce Hints for SQL Queries} and {Adaptive Query > Execution} are missing > !image-2020-11-30-00-04-07-969.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33590) Missing submenus for Performance Tuning in Spark SQL Guide
[ https://issues.apache.org/jira/browse/SPARK-33590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240282#comment-17240282 ] Apache Spark commented on SPARK-33590: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/30537 > Missing submenus for Performance Tuning in Spark SQL Guide > --- > > Key: SPARK-33590 > URL: https://issues.apache.org/jira/browse/SPARK-33590 > Project: Spark > Issue Type: Bug > Components: docs >Affects Versions: 3.0.0, 3.0.1 >Reporter: Kazuaki Ishizaki >Priority: Minor > Attachments: image-2020-11-30-00-04-07-969.png > > > Sub-menus for {Coalesce Hints for SQL Queries} and {Adaptive Query > Execution} are missing > !image-2020-11-30-00-04-07-969.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33590) Missing submenus for Performance Tuning in Spark SQL Guide
[ https://issues.apache.org/jira/browse/SPARK-33590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33590: Assignee: Apache Spark > Missing submenus for Performance Tuning in Spark SQL Guide > --- > > Key: SPARK-33590 > URL: https://issues.apache.org/jira/browse/SPARK-33590 > Project: Spark > Issue Type: Bug > Components: docs >Affects Versions: 3.0.0, 3.0.1 >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark >Priority: Minor > Attachments: image-2020-11-30-00-04-07-969.png > > > Sub-menus for {Coalesce Hints for SQL Queries} and {Adaptive Query > Execution} are missing > !image-2020-11-30-00-04-07-969.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33590) Missing submenus for Performance Tuning in Spark SQL Guide
Kazuaki Ishizaki created SPARK-33590: Summary: Missing submenus for Performance Tuning in Spark SQL Guide Key: SPARK-33590 URL: https://issues.apache.org/jira/browse/SPARK-33590 Project: Spark Issue Type: Bug Components: docs Affects Versions: 3.0.1, 3.0.0 Reporter: Kazuaki Ishizaki Attachments: image-2020-11-30-00-04-07-969.png Sub-menus for {Coalesce Hints for SQL Queries} and {Adaptive Query Execution} are missing !image-2020-11-30-00-03-04-814.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33590) Missing submenus for Performance Tuning in Spark SQL Guide
[ https://issues.apache.org/jira/browse/SPARK-33590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-33590: - Attachment: image-2020-11-30-00-04-07-969.png > Missing submenus for Performance Tuning in Spark SQL Guide > --- > > Key: SPARK-33590 > URL: https://issues.apache.org/jira/browse/SPARK-33590 > Project: Spark > Issue Type: Bug > Components: docs >Affects Versions: 3.0.0, 3.0.1 >Reporter: Kazuaki Ishizaki >Priority: Minor > Attachments: image-2020-11-30-00-04-07-969.png > > > Sub-menus for {Coalesce Hints for SQL Queries} and {Adaptive Query > Execution} are missing > !image-2020-11-30-00-03-04-814.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33590) Missing submenus for Performance Tuning in Spark SQL Guide
[ https://issues.apache.org/jira/browse/SPARK-33590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-33590: - Description: Sub-menus for {Coalesce Hints for SQL Queries} and {Adaptive Query Execution} are missing !image-2020-11-30-00-04-07-969.png! was: Sub-menus for {Coalesce Hints for SQL Queries} and {Adaptive Query Execution} are missing !image-2020-11-30-00-03-04-814.png! > Missing submenus for Performance Tuning in Spark SQL Guide > --- > > Key: SPARK-33590 > URL: https://issues.apache.org/jira/browse/SPARK-33590 > Project: Spark > Issue Type: Bug > Components: docs >Affects Versions: 3.0.0, 3.0.1 >Reporter: Kazuaki Ishizaki >Priority: Minor > Attachments: image-2020-11-30-00-04-07-969.png > > > Sub-menus for {Coalesce Hints for SQL Queries} and {Adaptive Query > Execution} are missing > !image-2020-11-30-00-04-07-969.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33589) Add try catch when opening session
[ https://issues.apache.org/jira/browse/SPARK-33589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240260#comment-17240260 ] Apache Spark commented on SPARK-33589: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/30536 > Add try catch when opening session > -- > > Key: SPARK-33589 > URL: https://issues.apache.org/jira/browse/SPARK-33589 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33589) Add try catch when opening session
[ https://issues.apache.org/jira/browse/SPARK-33589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240259#comment-17240259 ] Apache Spark commented on SPARK-33589: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/30536 > Add try catch when opening session > -- > > Key: SPARK-33589 > URL: https://issues.apache.org/jira/browse/SPARK-33589 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33589) Add try catch when opening session
[ https://issues.apache.org/jira/browse/SPARK-33589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33589: Assignee: (was: Apache Spark) > Add try catch when opening session > -- > > Key: SPARK-33589 > URL: https://issues.apache.org/jira/browse/SPARK-33589 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33589) Add try catch when opening session
[ https://issues.apache.org/jira/browse/SPARK-33589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33589: Assignee: Apache Spark > Add try catch when opening session > -- > > Key: SPARK-33589 > URL: https://issues.apache.org/jira/browse/SPARK-33589 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33589) Add try catch when opening session
Yuming Wang created SPARK-33589: --- Summary: Add try catch when opening session Key: SPARK-33589 URL: https://issues.apache.org/jira/browse/SPARK-33589 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org