[jira] [Resolved] (SPARK-30898) The behavior of MakeDecimal should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30898. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27656 [https://github.com/apache/spark/pull/27656] > The behavior of MakeDecimal should not depend on SQLConf.get > > > Key: SPARK-30898 > URL: https://issues.apache.org/jira/browse/SPARK-30898 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Peter Toth >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30898) The behavior of MakeDecimal should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30898: Assignee: Peter Toth > The behavior of MakeDecimal should not depend on SQLConf.get > > > Key: SPARK-30898 > URL: https://issues.apache.org/jira/browse/SPARK-30898 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Peter Toth >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30897) The behavior of ArrayExists should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30897. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27655 [https://github.com/apache/spark/pull/27655] > The behavior of ArrayExists should not depend on SQLConf.get > > > Key: SPARK-30897 > URL: https://issues.apache.org/jira/browse/SPARK-30897 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Peter Toth >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30897) The behavior of ArrayExists should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30897: Assignee: Peter Toth > The behavior of ArrayExists should not depend on SQLConf.get > > > Key: SPARK-30897 > URL: https://issues.apache.org/jira/browse/SPARK-30897 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Peter Toth >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30893) Expressions should not change their data type/behavior after they are created
[ https://issues.apache.org/jira/browse/SPARK-30893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043195#comment-17043195 ] Hyukjin Kwon commented on SPARK-30893: -- I am sure there are already multiple inconsistent instances out there. Some configurations would probably need more destructive fixes. Are they worthwhile? I am not sure. It seems a bit unlikely to me that users set different configurations that change behaviour between queries. These data-type-related instances look easy to fix, so they are probably fine for now. I am not so supportive of fixing the other instances. > Expressions should not change their data type/behavior after they are created > --- > > Key: SPARK-30893 > URL: https://issues.apache.org/jira/browse/SPARK-30893 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Critical > > This is a problem because the configuration can change between different > phases of planning, and this can silently break a query plan, which can lead > to crashes or data corruption if the data type/nullability gets changed.
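The failure mode described above — an expression whose data type or nullability tracks SQLConf.get lazily, so it can change between planning phases — can be sketched with toy Python stand-ins. These classes and the conf key are illustrative only, not Spark's actual MakeDecimal/SQLConf API:

```python
# Illustrative global conf, standing in for SQLConf.get.
CONF = {"decimalOperations.nullOnOverflow": True}

class LazyNullable:
    """Reads the conf on every access: nullability can silently change
    between planning phases if the conf changes in between."""
    @property
    def nullable(self):
        return CONF["decimalOperations.nullOnOverflow"]

class CapturedNullable:
    """Captures the conf once at construction: behavior is fixed for the
    lifetime of the expression, which is what the ticket asks for."""
    def __init__(self):
        self._nullable = CONF["decimalOperations.nullOnOverflow"]
    @property
    def nullable(self):
        return self._nullable

lazy, captured = LazyNullable(), CapturedNullable()
CONF["decimalOperations.nullOnOverflow"] = False  # conf flips mid-planning
print(lazy.nullable, captured.nullable)  # False True
```

The captured variant is the shape of the fix applied in the MakeDecimal/ArrayExists sub-tasks: resolve the conf once when the expression is created, never on access.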
[jira] [Assigned] (SPARK-30924) Add additional validation into Merge Into
[ https://issues.apache.org/jira/browse/SPARK-30924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30924: --- Assignee: Burak Yavuz > Add additional validation into Merge Into > - > > Key: SPARK-30924 > URL: https://issues.apache.org/jira/browse/SPARK-30924 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > > Merge Into is currently missing additional validation around: > 1. The lack of any WHEN statements > 2. Single use of UPDATE/DELETE > 3. The first WHEN MATCHED statement needs to have a condition if there are > two WHEN MATCHED statements. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30924) Add additional validation into Merge Into
[ https://issues.apache.org/jira/browse/SPARK-30924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30924. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27677 [https://github.com/apache/spark/pull/27677] > Add additional validation into Merge Into > - > > Key: SPARK-30924 > URL: https://issues.apache.org/jira/browse/SPARK-30924 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > Merge Into is currently missing additional validation around: > 1. The lack of any WHEN statements > 2. Single use of UPDATE/DELETE > 3. The first WHEN MATCHED statement needs to have a condition if there are > two WHEN MATCHED statements. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
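The three checks listed above can be sketched as a small validator. This is a hypothetical Python stand-in; the clause representation (dicts with "action" and "condition" keys) is an assumption for illustration, not Spark's analyzer API:

```python
def validate_merge(matched_clauses, not_matched_clauses):
    """Raise ValueError if the MERGE INTO clause list is invalid."""
    # 1. There must be at least one WHEN clause of either kind.
    if not matched_clauses and not not_matched_clauses:
        raise ValueError("MERGE INTO requires at least one WHEN clause")
    # 2. UPDATE and DELETE may each be used at most once.
    actions = [c["action"] for c in matched_clauses]
    for action in ("UPDATE", "DELETE"):
        if actions.count(action) > 1:
            raise ValueError(f"{action} may only be used once in WHEN MATCHED")
    # 3. With two WHEN MATCHED clauses, the first must carry a condition;
    #    an unconditional first clause would shadow the second entirely.
    if len(matched_clauses) == 2 and matched_clauses[0]["condition"] is None:
        raise ValueError("first WHEN MATCHED clause needs a condition")

# Valid: the first matched clause is conditional, each action used once.
validate_merge(
    [{"action": "UPDATE", "condition": "s.value > t.value"},
     {"action": "DELETE", "condition": None}],
    [],
)
```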
[jira] [Resolved] (SPARK-30922) Remove the max split config after changing the multi sub joins to multi sub partitions
[ https://issues.apache.org/jira/browse/SPARK-30922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30922. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27673 [https://github.com/apache/spark/pull/27673] > Remove the max split config after changing the multi sub joins to multi sub > partitions > -- > > Key: SPARK-30922 > URL: https://issues.apache.org/jira/browse/SPARK-30922 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > Fix For: 3.0.0 > > > After PR#27493 was merged, we no longer need the > "spark.sql.adaptive.skewedJoinOptimization.skewedPartitionMaxSplits" config > to resolve the UI issue when splitting into more sub joins.
[jira] [Assigned] (SPARK-30922) Remove the max split config after changing the multi sub joins to multi sub partitions
[ https://issues.apache.org/jira/browse/SPARK-30922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30922: --- Assignee: Ke Jia > Remove the max split config after changing the multi sub joins to multi sub > partitions > -- > > Key: SPARK-30922 > URL: https://issues.apache.org/jira/browse/SPARK-30922 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > > After PR#27493 was merged, we no longer need the > "spark.sql.adaptive.skewedJoinOptimization.skewedPartitionMaxSplits" config > to resolve the UI issue when splitting into more sub joins.
[jira] [Updated] (SPARK-30936) Forwards-compatibility in JsonProtocol is broken
[ https://issues.apache.org/jira/browse/SPARK-30936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-30936: - Summary: Forwards-compatibility in JsonProtocol is broken (was: Fix the broken forwards-compatibility in JsonProtocol) > Forwards-compatibility in JsonProtocol is broken > > > Key: SPARK-30936 > URL: https://issues.apache.org/jira/browse/SPARK-30936 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Shixiong Zhu >Priority: Major > > JsonProtocol is supposed to provide strong backwards-compatibility and > forwards-compatibility guarantees: any version of Spark should be able to > read JSON output written by any other version, including newer versions. > However, the forwards-compatibility guarantee is broken for events parsed by > "ObjectMapper". If a new field is added to an event parsed by "ObjectMapper" > (e.g., > https://github.com/apache/spark/commit/6dc5921e66d56885b95c07e56e687f9f6c1eaca7#diff-dc5c7a41fbb7479cef48b67eb41ad254R33), > this event cannot be parsed by an old version of Spark History Server.
[jira] [Created] (SPARK-30936) Fix the broken forwards-compatibility in JsonProtocol
Shixiong Zhu created SPARK-30936: Summary: Fix the broken forwards-compatibility in JsonProtocol Key: SPARK-30936 URL: https://issues.apache.org/jira/browse/SPARK-30936 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Shixiong Zhu JsonProtocol is supposed to provide strong backwards-compatibility and forwards-compatibility guarantees: any version of Spark should be able to read JSON output written by any other version, including newer versions. However, the forwards-compatibility guarantee is broken for events parsed by "ObjectMapper". If a new field is added to an event parsed by "ObjectMapper" (e.g., https://github.com/apache/spark/commit/6dc5921e66d56885b95c07e56e687f9f6c1eaca7#diff-dc5c7a41fbb7479cef48b67eb41ad254R33), this event cannot be parsed by an old version of Spark History Server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
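The breakage described above can be illustrated with a minimal stand-in: a strict reader that rejects unknown JSON fields fails on events written by a newer Spark, while a tolerant reader keeps working (with Jackson's ObjectMapper, the fix amounts to ignoring unknown properties, e.g. disabling FAIL_ON_UNKNOWN_PROPERTIES). Python sketch with illustrative field names, not Spark's actual JsonProtocol:

```python
import json

# Fields an old reader knows about; names are illustrative.
KNOWN_FIELDS = {"Event", "Stage ID"}

def strict_parse(payload):
    """Old-reader behavior that breaks forwards-compatibility:
    reject any field the reader does not know."""
    event = json.loads(payload)
    unknown = set(event) - KNOWN_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return event

def tolerant_parse(payload):
    """Forwards-compatible behavior: silently drop unknown fields,
    as a reader configured to ignore unknown properties would."""
    event = json.loads(payload)
    return {k: v for k, v in event.items() if k in KNOWN_FIELDS}

# A newer writer added "New Field"; the strict old reader chokes on it.
newer_event = '{"Event": "StageCompleted", "Stage ID": 1, "New Field": 42}'
print(tolerant_parse(newer_event))  # {'Event': 'StageCompleted', 'Stage ID': 1}
# strict_parse(newer_event) would raise ValueError
```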
[jira] [Assigned] (SPARK-30925) Overflow/round errors in conversions of milliseconds to/from microseconds
[ https://issues.apache.org/jira/browse/SPARK-30925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30925: --- Assignee: Maxim Gekk > Overflow/round errors in conversions of milliseconds to/from microseconds > - > > Key: SPARK-30925 > URL: https://issues.apache.org/jira/browse/SPARK-30925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > Spark has special methods in DateTimeUtils for converting microseconds > from/to milliseconds: `fromMillis()` and `toMillis()`. The methods handle > arithmetic overflow and round negative values. The ticket aims to review all > places in Spark SQL where microseconds are converted from/to milliseconds, > and replace them with the util methods from DateTimeUtils.
[jira] [Resolved] (SPARK-30925) Overflow/round errors in conversions of milliseconds to/from microseconds
[ https://issues.apache.org/jira/browse/SPARK-30925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30925. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27676 [https://github.com/apache/spark/pull/27676] > Overflow/round errors in conversions of milliseconds to/from microseconds > - > > Key: SPARK-30925 > URL: https://issues.apache.org/jira/browse/SPARK-30925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Spark has special methods in DateTimeUtils for converting microseconds > from/to milliseconds: `fromMillis()` and `toMillis()`. The methods handle > arithmetic overflow and round negative values. The ticket aims to review all > places in Spark SQL where microseconds are converted from/to milliseconds, > and replace them with the util methods from DateTimeUtils.
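The two hazards these util methods guard against — multiplication overflow and mis-rounding of negative values — can be sketched as follows. Python stand-ins with assumed names; in the Scala originals, Math.multiplyExact and Math.floorDiv would do this work on fixed-width longs:

```python
MICROS_PER_MILLIS = 1000

def millis_to_micros(millis):
    # Python ints are arbitrary precision, so this cannot overflow here;
    # on a 64-bit long this is the multiplication that must be checked
    # (e.g. with Math.multiplyExact) instead of silently wrapping.
    return millis * MICROS_PER_MILLIS

def micros_to_millis(micros):
    # Floor division rounds toward negative infinity, so -1 us maps to
    # -1 ms, which is the rounding the ticket wants.
    return micros // MICROS_PER_MILLIS

def micros_to_millis_truncating(micros):
    # Naive truncation toward zero (C/Java integer division): -1 us
    # would incorrectly map to 0 ms.
    return int(micros / MICROS_PER_MILLIS)

print(micros_to_millis(-1), micros_to_millis_truncating(-1))  # -1 0
```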
[jira] [Updated] (SPARK-30923) Spark MLlib, GraphX 3.0 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-30923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-30923: - Component/s: PySpark > Spark MLlib, GraphX 3.0 QA umbrella > --- > > Key: SPARK-30923 > URL: https://issues.apache.org/jira/browse/SPARK-30923 > Project: Spark > Issue Type: Umbrella > Components: Documentation, GraphX, ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Blocker > > Description > This JIRA lists tasks for the next Spark release's QA period for MLlib and > GraphX. *SparkR is separate. > The list below gives an overview of what is involved, and the corresponding > JIRA issues are linked below that. > h2. API > * Check binary API compatibility for Scala/Java > * Audit new public APIs (from the generated html doc) > ** Scala > ** Java compatibility > ** Python coverage > * Check Experimental, DeveloperApi tags > h2. Algorithms and performance > * Performance tests > h2. Documentation and example code > * For new algorithms, create JIRAs for updating the user guide sections & > examples > * Update Programming Guide > * Update website -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30923) Spark MLlib, GraphX 3.0 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-30923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043150#comment-17043150 ] zhengruifeng commented on SPARK-30923: -- Referring to the previous ticket: https://issues.apache.org/jira/browse/SPARK-25319 > Spark MLlib, GraphX 3.0 QA umbrella > --- > > Key: SPARK-30923 > URL: https://issues.apache.org/jira/browse/SPARK-30923 > Project: Spark > Issue Type: Umbrella > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Blocker > > Description > This JIRA lists tasks for the next Spark release's QA period for MLlib and > GraphX. *SparkR is separate. > The list below gives an overview of what is involved, and the corresponding > JIRA issues are linked below that. > h2. API > * Check binary API compatibility for Scala/Java > * Audit new public APIs (from the generated html doc) > ** Scala > ** Java compatibility > ** Python coverage > * Check Experimental, DeveloperApi tags > h2. Algorithms and performance > * Performance tests > h2. Documentation and example code > * For new algorithms, create JIRAs for updating the user guide sections & > examples > * Update Programming Guide > * Update website
[jira] [Updated] (SPARK-30923) Spark MLlib, GraphX 3.0 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-30923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-30923: - Issue Type: Umbrella (was: Task) > Spark MLlib, GraphX 3.0 QA umbrella > --- > > Key: SPARK-30923 > URL: https://issues.apache.org/jira/browse/SPARK-30923 > Project: Spark > Issue Type: Umbrella > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Blocker > > Description > This JIRA lists tasks for the next Spark release's QA period for MLlib and > GraphX. *SparkR is separate. > The list below gives an overview of what is involved, and the corresponding > JIRA issues are linked below that. > h2. API > * Check binary API compatibility for Scala/Java > * Audit new public APIs (from the generated html doc) > ** Scala > ** Java compatibility > ** Python coverage > * Check Experimental, DeveloperApi tags > h2. Algorithms and performance > * Performance tests > h2. Documentation and example code > * For new algorithms, create JIRAs for updating the user guide sections & > examples > * Update Programming Guide > * Update website -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30935) Update MLlib, GraphX websites for 3.0
zhengruifeng created SPARK-30935: Summary: Update MLlib, GraphX websites for 3.0 Key: SPARK-30935 URL: https://issues.apache.org/jira/browse/SPARK-30935 Project: Spark Issue Type: Sub-task Components: Documentation, GraphX, ML, MLlib Affects Versions: 3.0.0 Reporter: zhengruifeng Update the sub-projects' websites to include new features in this release. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30934) ML, GraphX 3.0 QA: Programming guide update and migration guide
zhengruifeng created SPARK-30934: Summary: ML, GraphX 3.0 QA: Programming guide update and migration guide Key: SPARK-30934 URL: https://issues.apache.org/jira/browse/SPARK-30934 Project: Spark Issue Type: Sub-task Components: Documentation, GraphX, ML, MLlib Affects Versions: 3.0.0 Reporter: zhengruifeng Before the release, we need to update the MLlib and GraphX Programming Guides. Updates will include: * Add migration guide subsection. ** Use the results of the QA audit JIRAs. * Check phrasing, especially in main sections (for outdated items such as "In this release, ...") -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30933) ML, GraphX 3.0 QA: Update user guide for new features & APIs
zhengruifeng created SPARK-30933: Summary: ML, GraphX 3.0 QA: Update user guide for new features & APIs Key: SPARK-30933 URL: https://issues.apache.org/jira/browse/SPARK-30933 Project: Spark Issue Type: Sub-task Components: Documentation, GraphX, ML, MLlib Affects Versions: 3.0.0 Reporter: zhengruifeng Check the user guide vs. a list of new APIs (classes, methods, data members) to see what items require updates to the user guide. For each feature missing user guide doc: * Create a JIRA for that feature, and assign it to the author of the feature * Link it to (a) the original JIRA which introduced that feature ("related to") and (b) to this JIRA ("requires"). For MLlib: * This task does not include major reorganizations for the programming guide. * We should now begin copying algorithm details from the spark.mllib guide to spark.ml as needed, rather than just linking back to the corresponding algorithms in the spark.mllib user guide. If you would like to work on this task, please comment, and we can create & link JIRAs for parts of this work. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30932) ML 3.0 QA: API: Java compatibility, docs
zhengruifeng created SPARK-30932: Summary: ML 3.0 QA: API: Java compatibility, docs Key: SPARK-30932 URL: https://issues.apache.org/jira/browse/SPARK-30932 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Affects Versions: 3.0.0 Reporter: zhengruifeng Check Java compatibility for this release: * APIs in {{spark.ml}} * New APIs in {{spark.mllib}} (There should be few, if any.) Checking compatibility means: * Checking for differences in how Scala and Java handle types. Some items to look out for are: ** Check for generic "Object" types where Java cannot understand complex Scala types. *** *Note*: The Java docs do not always match the bytecode. If you find a problem, please verify it using {{javap}}. ** Check Scala objects (especially with nesting!) carefully. These may not be understood in Java, or they may be accessible only via the weirdly named Java types (with "$" or "#") which are generated by the Scala compiler. ** Check for uses of Scala and Java enumerations, which can show up oddly in the other language's doc. (In {{spark.ml}}, we have largely tried to avoid using enumerations, and have instead favored plain strings.) * Check for differences in generated Scala vs Java docs. E.g., one past issue was that Javadocs did not respect Scala's package private modifier. If you find issues, please comment here, or for larger items, create separate JIRAs and link here as "requires". * Remember that we should not break APIs from previous releases. If you find a problem, check if it was introduced in this Spark release (in which case we can fix it) or in a previous one (in which case we can create a java-friendly version of the API). * If needed for complex issues, create small Java unit tests which execute each method. (Algorithmic correctness can be checked in Scala.) Recommendations for how to complete this task: * There are not great tools. 
In the past, this task has been done by: ** Generating API docs ** Building the JAR and outputting the Java class signatures for MLlib ** Manually inspecting and searching the docs and class signatures for issues * If you do have ideas for better tooling, please say so, so that we can make this task easier in the future!
[jira] [Created] (SPARK-30931) ML 3.0 QA: API: Python API coverage
zhengruifeng created SPARK-30931: Summary: ML 3.0 QA: API: Python API coverage Key: SPARK-30931 URL: https://issues.apache.org/jira/browse/SPARK-30931 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng For new public APIs added to MLlib ({{spark.ml}} only), we need to check the generated HTML doc and compare the Scala & Python versions. * *GOAL*: Audit and create JIRAs to fix in the next release. * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. We need to track: * Inconsistency: Do class/method/parameter names match? * Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. * API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. ** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well. * Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python, to be added in the next release cycle. *Please use a _separate_ JIRA (linked below as "requires") for this list of to-do items.* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30930) ML, GraphX 3.0 QA: API: Experimental, DeveloperApi, final, sealed audit
zhengruifeng created SPARK-30930: Summary: ML, GraphX 3.0 QA: API: Experimental, DeveloperApi, final, sealed audit Key: SPARK-30930 URL: https://issues.apache.org/jira/browse/SPARK-30930 Project: Spark Issue Type: Sub-task Components: Documentation, GraphX, ML, MLlib Affects Versions: 3.0.0 Reporter: zhengruifeng We should make a pass through the items marked as Experimental or DeveloperApi and see if any are stable enough to be unmarked. We should also check for items marked final or sealed to see if they are stable enough to be opened up as APIs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30929) ML, GraphX 3.0 QA: API: New Scala APIs, docs
zhengruifeng created SPARK-30929: Summary: ML, GraphX 3.0 QA: API: New Scala APIs, docs Key: SPARK-30929 URL: https://issues.apache.org/jira/browse/SPARK-30929 Project: Spark Issue Type: Sub-task Components: Documentation, GraphX, ML, MLlib Affects Versions: 3.0.0 Environment: Audit new public Scala APIs added to MLlib & GraphX. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please create JIRAs and link them to this issue. For *user guide issues* link the new JIRAs to the relevant user guide QA issue Reporter: zhengruifeng -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30928) ML, GraphX 3.0 QA: API: Binary incompatible changes
zhengruifeng created SPARK-30928: Summary: ML, GraphX 3.0 QA: API: Binary incompatible changes Key: SPARK-30928 URL: https://issues.apache.org/jira/browse/SPARK-30928 Project: Spark Issue Type: Sub-task Components: Documentation, GraphX, ML, MLlib Affects Versions: 3.0.0 Reporter: zhengruifeng Generate a list of binary incompatible changes using MiMa and create new JIRAs for issues found. Filter out false positives as needed. If you want to take this task, look at the analogous task from the previous release QA, and ping the Assignee for advice. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30923) Spark MLlib, GraphX 3.0 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-30923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043147#comment-17043147 ] zhengruifeng commented on SPARK-30923: -- [~smilegator] Sure! > Spark MLlib, GraphX 3.0 QA umbrella > --- > > Key: SPARK-30923 > URL: https://issues.apache.org/jira/browse/SPARK-30923 > Project: Spark > Issue Type: Task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Blocker > > Description > This JIRA lists tasks for the next Spark release's QA period for MLlib and > GraphX. *SparkR is separate. > The list below gives an overview of what is involved, and the corresponding > JIRA issues are linked below that. > h2. API > * Check binary API compatibility for Scala/Java > * Audit new public APIs (from the generated html doc) > ** Scala > ** Java compatibility > ** Python coverage > * Check Experimental, DeveloperApi tags > h2. Algorithms and performance > * Performance tests > h2. Documentation and example code > * For new algorithms, create JIRAs for updating the user guide sections & > examples > * Update Programming Guide > * Update website -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30867) add FValueRegressionTest
[ https://issues.apache.org/jira/browse/SPARK-30867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-30867: Assignee: Huaxin Gao > add FValueRegressionTest > > > Key: SPARK-30867 > URL: https://issues.apache.org/jira/browse/SPARK-30867 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > > Add FValueRegressionTest in ML.stat. This will be used for > FValueRegressionSelector. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30867) add FValueRegressionTest
[ https://issues.apache.org/jira/browse/SPARK-30867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-30867. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27623 [https://github.com/apache/spark/pull/27623] > add FValueRegressionTest > > > Key: SPARK-30867 > URL: https://issues.apache.org/jira/browse/SPARK-30867 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.1.0 > > > Add FValueRegressionTest in ML.stat. This will be used for > FValueRegressionSelector. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30923) Spark MLlib, GraphX 3.0 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-30923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-30923: - Summary: Spark MLlib, GraphX 3.0 QA umbrella (was: Spark MLlib, GraphX 2.4 QA umbrella) > Spark MLlib, GraphX 3.0 QA umbrella > --- > > Key: SPARK-30923 > URL: https://issues.apache.org/jira/browse/SPARK-30923 > Project: Spark > Issue Type: Task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Blocker > > Description > This JIRA lists tasks for the next Spark release's QA period for MLlib and > GraphX. *SparkR is separate. > The list below gives an overview of what is involved, and the corresponding > JIRA issues are linked below that. > h2. API > * Check binary API compatibility for Scala/Java > * Audit new public APIs (from the generated html doc) > ** Scala > ** Java compatibility > ** Python coverage > * Check Experimental, DeveloperApi tags > h2. Algorithms and performance > * Performance tests > h2. Documentation and example code > * For new algorithms, create JIRAs for updating the user guide sections & > examples > * Update Programming Guide > * Update website -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30927) StreamingQueryManager should avoid keeping reference to terminated StreamingQuery
Shixiong Zhu created SPARK-30927: Summary: StreamingQueryManager should avoid keeping reference to terminated StreamingQuery Key: SPARK-30927 URL: https://issues.apache.org/jira/browse/SPARK-30927 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Shixiong Zhu Right now StreamingQueryManager will keep the last terminated query until "resetTerminated" is called. When the last terminated query has lots of states (a large sql plan, cached RDDs, etc.), it will waste these memory unnecessarily. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
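A minimal sketch of the retention problem, using Python stand-ins rather than Spark's actual classes: the manager's strong reference keeps the terminated query's state reachable until resetTerminated() is called, even after the caller has dropped its own reference.

```python
import weakref

class TerminatedQuery:
    """Stand-in for a terminated StreamingQuery holding heavyweight state."""
    def __init__(self):
        self.state = bytearray(1 << 10)  # stand-in for plans, cached data

class QueryManager:
    """Stand-in manager that pins the last terminated query."""
    def __init__(self):
        self._last_terminated = None

    def on_terminated(self, query):
        self._last_terminated = query  # strong reference pins the state

    def reset_terminated(self):
        self._last_terminated = None   # only now can the query be freed

manager = QueryManager()
query = TerminatedQuery()
watcher = weakref.ref(query)       # observe liveness without pinning
manager.on_terminated(query)
del query                          # the caller drops its reference...
print(watcher() is not None)       # True: the manager still retains it
manager.reset_terminated()
print(watcher() is None)           # True in CPython: refcount hits zero
```

The improvement suggested by the ticket amounts to dropping (or weakening) that last strong reference once the query has terminated, so its plan and cached state become collectable without an explicit reset.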
[jira] [Updated] (SPARK-30926) Same SQL on CSV and on Parquet gives different result
[ https://issues.apache.org/jira/browse/SPARK-30926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bozhidar Karaargirov updated SPARK-30926: - Description:

So I played around with a data set from here: [https://www.kaggle.com/hmavrodiev/sofia-air-quality-dataset]

I ran the same query against the base CSVs and against a Parquet version of them:

{code:sql}
SELECT * FROM airQualityP WHERE P1 > 20
{code}

Here is the CSV code:

{code:scala}
import session.sqlContext.implicits._

val df = session.read.option("header", "true").csv(originalDataset)
df.createTempView("airQuality")

val result = session.sql("SELECT * FROM airQuality WHERE P1 > 20")
  .map(ParticleAirQuality.mappingFunction)
println(result.count())
{code}

Here is the Parquet code:

{code:scala}
import session.sqlContext.implicits._

val df = session.read.option("header", "true").parquet(bigParquetDataset)
df.createTempView("airQualityP")

val result = session
  .sql("SELECT * FROM airQualityP WHERE P1 > 20")
  .map(ParticleAirQuality.namedMappingFunction)
println(result.count())
{code}

And this is how I transform the CSVs into Parquet:

{code:scala}
import session.sqlContext.implicits._

val df = session.read.option("header", "true")
  .csv(originalDataset)
  .map(ParticleAirQuality.mappingFunction)
df.write.parquet(bigParquetDataset)
{code}

These are the two mapping functions:

{code:scala}
val mappingFunction = { r: Row =>
  ParticleAirQuality(
    r.getString(1), r.getString(2), r.getString(3), r.getString(4), r.getString(5),
    { val p1 = r.getString(6); if (p1 == null) Double.NaN else p1.toDouble },
    { val p2 = r.getString(7); if (p2 == null) Double.NaN else p2.toDouble }
  )
}

val namedMappingFunction = { r: Row =>
  ParticleAirQuality(
    r.getAs[String]("sensor_id"),
    r.getAs[String]("location"),
    r.getAs[String]("lat"),
    r.getAs[String]("lon"),
    r.getAs[String]("timestamp"),
    r.getAs[Double]("P1"),
    r.getAs[Double]("P2")
  )
}
{code}

In case it matters, these are the paths (note that I actually use backslashes instead of / since it is Windows, but that doesn't really matter):

{code:scala}
val originalDataset = "D:/source/datasets/sofia-air-quality-dataset/*sds*.csv"
val bigParquetDataset = "D:/source/datasets/air-tests/all-parquet"
{code}

The count I get from the CSVs is 33934609, while the count from the Parquet files is 35739394.
[jira] [Created] (SPARK-30926) Same SQL on CSV and on Parquet gives different result
Bozhidar Karaargirov created SPARK-30926: Summary: Same SQL on CSV and on Parquet gives different result Key: SPARK-30926 URL: https://issues.apache.org/jira/browse/SPARK-30926 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4 Environment: I run this locally on a Windows 10 machine. The Java runtime is: openjdk 11.0.5 2019-10-15, OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.5+10), OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.5+10, mixed mode) Reporter: Bozhidar Karaargirov

So I played around with a data set from here: [https://www.kaggle.com/hmavrodiev/sofia-air-quality-dataset]

I ran the same query against the base CSVs and against a Parquet version of them:

{code:sql}
SELECT * FROM airQualityP WHERE P1 > 20
{code}

Here is the CSV code:

{code:scala}
import session.sqlContext.implicits._

val df = session.read.option("header", "true").csv(originalDataset)
df.createTempView("airQuality")

val result = session.sql("SELECT * FROM airQuality WHERE P1 > 20")
  .map(ParticleAirQuality.mappingFunction)
println(result.count())
{code}

Here is the Parquet code:

{code:scala}
import session.sqlContext.implicits._

val df = session.read.option("header", "true").parquet(bigParquetDataset)
df.createTempView("airQualityP")

val result = session
  .sql("SELECT * FROM airQualityP WHERE P1 > 20")
  .map(ParticleAirQuality.namedMappingFunction)
println(result.count())
{code}

And this is how I transform the CSVs into Parquet:

{code:scala}
import session.sqlContext.implicits._

val df = session.read.option("header", "true")
  .csv(originalDataset)
  .map(ParticleAirQuality.mappingFunction)
df.write.parquet(bigParquetDataset)
{code}

These are the two mapping functions:

{code:scala}
val mappingFunction = { r: Row =>
  ParticleAirQuality(
    r.getString(1), r.getString(2), r.getString(3), r.getString(4), r.getString(5),
    { val p1 = r.getString(6); if (p1 == null) Double.NaN else p1.toDouble },
    { val p2 = r.getString(7); if (p2 == null) Double.NaN else p2.toDouble }
  )
}

val namedMappingFunction = { r: Row =>
  ParticleAirQuality(
    r.getAs[String]("sensor_id"),
    r.getAs[String]("location"),
    r.getAs[String]("lat"),
    r.getAs[String]("lon"),
    r.getAs[String]("timestamp"),
    r.getAs[Double]("P1"),
    r.getAs[Double]("P2")
  )
}
{code}

In case it matters, these are the paths:

{code:scala}
val originalDataset = "D:\\source\\datasets\\sofia-air-quality-dataset\\*sds*.csv"
val bigParquetDataset = "D:\\source\\datasets\\air-tests\\all-parquet"
{code}

The count I get from the CSVs is 33934609, while the count from the Parquet files is 35739394.
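One plausible explanation for the diverging counts (an assumption on my part, not a conclusion stated in the ticket): the mapping function writes missing values into the Parquet files as Double.NaN, and Spark SQL's documented NaN semantics order NaN above every other double, so NaN rows satisfy `P1 > 20` in the Parquet query while the corresponding null strings are simply dropped by the CSV query. A plain-Python simulation of that semantic difference (toy code, not PySpark):

```python
import math

# The P1 column as read from CSV: strings, with one missing value.
rows = ["25.0", None, "10.0"]

# CSV path: a null never satisfies P1 > 20, so it is filtered out.
csv_count = sum(1 for v in rows if v is not None and float(v) > 20)

# Parquet path: the mapping function stored null as NaN. Spark SQL
# treats NaN as greater than any other double value, modeled here:
def spark_greater(a, b):
    return math.isnan(a) or a > b

parquet_vals = [float(v) if v is not None else math.nan for v in rows]
parquet_count = sum(1 for p in parquet_vals if spark_greater(p, 20))

print(csv_count, parquet_count)  # 1 2: the counts diverge on the null row
```

Under this hypothesis the Parquet count exceeds the CSV count by exactly the number of rows with a missing P1, which matches the direction of the difference reported (35739394 > 33934609).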
[jira] [Reopened] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception
[ https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield reopened SPARK-30332: - Added code for reproduce > When running sql query with limit catalyst throw StackOverFlow exception > - > > Key: SPARK-30332 > URL: https://issues.apache.org/jira/browse/SPARK-30332 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: spark version 3.0.0-preview >Reporter: Izek Greenfield >Priority: Major > Attachments: AGGR_41380.csv, AGGR_41390.csv, AGGR_41406.csv, > AGGR_41406.csv, AGGR_41410.csv, AGGR_41418.csv, PORTFOLIO_41446.csv, > T_41233.csv > > > Running that SQL: > {code:sql} > SELECT BT_capital.asof_date, > BT_capital.run_id, > BT_capital.v, > BT_capital.id, > BT_capital.entity, > BT_capital.level_1, > BT_capital.level_2, > BT_capital.level_3, > BT_capital.level_4, > BT_capital.level_5, > BT_capital.level_6, > BT_capital.path_bt_capital, > BT_capital.line_item, > t0.target_line_item, > t0.line_description, > BT_capital.col_item, > BT_capital.rep_amount, > root.orgUnitId, > root.cptyId, > root.instId, > root.startDate, > root.maturityDate, > root.amount, > root.nominalAmount, > root.quantity, > root.lkupAssetLiability, > root.lkupCurrency, > root.lkupProdType, > root.interestResetDate, > root.interestResetTerm, > root.noticePeriod, > root.historicCostAmount, > root.dueDate, > root.lkupResidence, > root.lkupCountryOfUltimateRisk, > root.lkupSector, > root.lkupIndustry, > root.lkupAccountingPortfolioType, > root.lkupLoanDepositTerm, > root.lkupFixedFloating, > root.lkupCollateralType, > root.lkupRiskType, > root.lkupEligibleRefinancing, > root.lkupHedging, > root.lkupIsOwnIssued, > root.lkupIsSubordinated, > root.lkupIsQuoted, > root.lkupIsSecuritised, > root.lkupIsSecuritisedServiced, > root.lkupIsSyndicated, > root.lkupIsDeRecognised, > root.lkupIsRenegotiated, > root.lkupIsTransferable, > root.lkupIsNewBusiness, > root.lkupIsFiduciary, > 
root.lkupIsNonPerforming, > root.lkupIsInterGroup, > root.lkupIsIntraGroup, > root.lkupIsRediscounted, > root.lkupIsCollateral, > root.lkupIsExercised, > root.lkupIsImpaired, > root.facilityId, > root.lkupIsOTC, > root.lkupIsDefaulted, > root.lkupIsSavingsPosition, > root.lkupIsForborne, > root.lkupIsDebtRestructuringLoan, > root.interestRateAAR, > root.interestRateAPRC, > root.custom1, > root.custom2, > root.custom3, > root.lkupSecuritisationType, > root.lkupIsCashPooling, > root.lkupIsEquityParticipationGTE10, > root.lkupIsConvertible, > root.lkupEconomicHedge, > root.lkupIsNonCurrHeldForSale, > root.lkupIsEmbeddedDerivative, > root.lkupLoanPurpose, > root.lkupRegulated, > root.lkupRepaymentType, > root.glAccount, > root.lkupIsRecourse, > root.lkupIsNotFullyGuaranteed, > root.lkupImpairmentStage, > root.lkupIsEntireAmountWrittenOff, > root.lkupIsLowCreditRisk, > root.lkupIsOBSWithinIFRS9, > root.lkupIsUnderSpecialSurveillance, > root.lkupProtection, > root.lkupIsGeneralAllowance, > root.lkupSectorUltimateRisk, > root.cptyOrgUnitId, > root.name, > root.lkupNationality, > root.lkupSize, > root.lkupIsSPV, > root.lkupIsCentralCounterparty, > root.lkupIsMMRMFI, > root.lkupIsKeyManagement, > root.lkupIsOtherRelatedParty, > root.lkupResidenceProvince, > root.lkupIsTradingBook, > root.entityHierarchy_entityId, > root.entityHierarchy_Residence, > root.lkupLocalCurrency, > root.cpty_entityhierarchy_entityId, > root.lkupRelationship, > root.cpty_lkupRelationship, > root.entityNationality, > root.lkupRepCurrency, > root.startDateFinancialYear, > root.numEmployees, > root.numEmployeesTotal, > root.collateralAmount, > root.guaranteeAmount, > root.impairmentSpecificIndividual, > root.impairmentSpecificCollective, > root.impairmentGeneral, > root.creditRiskAmount, > root.provisionSpecificIndividual, > root.provisionSpecificCollective, > root.provisionGeneral, > root.writeOffAmount, > root.interest, > root.fairValueAmount, > root.grossCarryingAmount, > root.carryingAmount, > 
root.code, > root.lkupInstrumentType, > root.price, > root.amountAtIssue, > root.yield, > root.totalFacilityAmount, > root.facility_rate, > root.spec_indiv_est, > root.spec_coll_est, > root.coll_inc_loss, > root.impairment_amount, > root.provision_amount, > root.accumulated_impairment, > root.exclusionFlag, > root.lkupIsHoldingCompany, > root.instrument_startDate, > root.entityResidence, > fxRate.enumerator, > fxRate.lkupFromCurrency, > fxRate.rate, > fxRate.custom1, > fxRate.custom2, > fxRate.custom3, > GB_position.lkupIsECGDGuaranteed, > GB_position.lkupIsMultiAcctOffsetMortgage, > GB_position.lkupIsIndexLinked, > GB_position.lkupIsRetail, > GB_position.lkupCollateralLocation, > GB_position.percentAboveBBR, > GB_position.lkupIsMoreInArrears, > GB_position.lkupIsArrearsCapitalised,
[jira] [Commented] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception
[ https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042851#comment-17042851 ] Izek Greenfield commented on SPARK-30332: - Code to reproduce the problem: {code:scala} import java.nio.file.{Files, Paths} import org.apache.spark.sql.SparkSession object Test { def main(args: Array[String]): Unit = { val spark = { SparkSession .builder() .master("local[*]") .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .config("spark.sql.cbo.enabled", "true") .config("spark.scheduler.mode", "FAIR") .config("spark.sql.crossJoin.enabled", "true") .config("spark.sql.adaptive.enabled", "true") .config("spark.sql.parquet.filterPushdown", "true") .config("spark.sql.shuffle.partitions", "500") .config("spark.executor.heartbeatInterval", "600s") .config("spark.network.timeout", "1200s") .config("spark.sql.broadcastTimeout", "1200s") .config("spark.shuffle.file.buffer", "64k") .appName("error") .enableHiveSupport() .getOrCreate() } val pathToCsvFiles = "db" import scala.collection.JavaConverters._ Files.walk(Paths.get(pathToCsvFiles)).iterator().asScala.map(_.toFile).foreach{ file => if (!file.isDirectory){ val name = file.getName spark.read.format("csv") .option("inferSchema", "true") .option("header", "true") .option("mode", "DROPMALFORMED") .load(file.getAbsolutePath) .createOrReplaceGlobalTempView(name.split("\\.").head) } } spark.sql( """ |SELECT BT_capital.asof_date, |BT_capital.run_id, |BT_capital.v, |BT_capital.id, |BT_capital.entity, |BT_capital.level_1, |BT_capital.level_2, |BT_capital.level_3, |BT_capital.level_4, |BT_capital.level_5, |BT_capital.level_6, |BT_capital.path_bt_capital, |BT_capital.line_item, |t0.target_line_item, |t0.line_description, |BT_capital.col_item, |BT_capital.rep_amount, |root.orgUnitId, |root.cptyId, |root.instId, |root.startDate, |root.maturityDate, |root.amount, |root.nominalAmount, |root.quantity, |root.lkupAssetLiability, |root.lkupCurrency, 
|root.lkupProdType, |root.interestResetDate, |root.interestResetTerm, |root.noticePeriod, |root.historicCostAmount, |root.dueDate, |root.lkupResidence, |root.lkupCountryOfUltimateRisk, |root.lkupSector, |root.lkupIndustry, |root.lkupAccountingPortfolioType, |root.lkupLoanDepositTerm, |root.lkupFixedFloating, |root.lkupCollateralType, |root.lkupRiskType, |root.lkupEligibleRefinancing, |root.lkupHedging, |root.lkupIsOwnIssued, |root.lkupIsSubordinated, |root.lkupIsQuoted, |root.lkupIsSecuritised, |root.lkupIsSecuritisedServiced, |root.lkupIsSyndicated, |root.lkupIsDeRecognised, |root.lkupIsRenegotiated, |root.lkupIsTransferable, |root.lkupIsNewBusiness, |root.lkupIsFiduciary, |root.lkupIsNonPerforming, |root.lkupIsInterGroup, |root.lkupIsIntraGroup, |root.lkupIsRediscounted, |root.lkupIsCollateral, |root.lkupIsExercised, |root.lkupIsImpaired, |root.facilityId, |root.lkupIsOTC, |root.lkupIsDefaulted, |root.lkupIsSavingsPosition, |root.lkupIsForborne, |root.lkupIsDebtRestructuringLoan, |root.interestRateAAR, |root.interestRateAPRC, |root.custom1, |root.custom2, |root.custom3, |root.lkupSecuritisationType, |root.lkupIsCashPooling, |root.lkupIsEquityParticipationGTE10, |root.lkupIsConvertible, |root.lkupEconomicHedge, |root.lkupIsNonCurrHeldForSale, |root.lkupIsEmbeddedDerivative, |root.lkupLoanPurpose, |root.lkupRegulated, |root.lkupRepaymentType, |root.glAccount, |root.lkupIsRecourse, |root.lkupIsNotFullyGuaranteed, |root.lkupImpairmentStage, |root.lkupIsEntireAmountWrittenOff, |root.lkupIsLowCreditRisk, |root.lkupIsOBSWithinIFRS9, |root.lkupIsUnderSpecialSurveillance, |root.lkupProtection, |root.lkupIsGeneralAllowance, |root.lkupSectorUltimateRisk, |root.cptyOrgUnitId, |root.name, |root.lkupNationality, |root.lkupSize, |root.lkupIsSPV, |root.lkupIsCentralCounterparty, |root.lkupIsMMRMFI, |root.lkupIsKeyManagement, |root.lkupIsOtherRelatedParty,
[jira] [Created] (SPARK-30925) Overflow/round errors in conversions of milliseconds to/from microseconds
Maxim Gekk created SPARK-30925:
--
Summary: Overflow/round errors in conversions of milliseconds to/from microseconds
Key: SPARK-30925
URL: https://issues.apache.org/jira/browse/SPARK-30925
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk

Spark has special methods in DateTimeUtils for converting microseconds from/to milliseconds: `fromMillis()` and `toMillis()`. These methods handle arithmetic overflow and round negative values correctly. This ticket aims to review all places in Spark SQL where microseconds are converted from/to milliseconds, and to replace those conversions with the util methods from DateTimeUtils.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
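The two failure modes named in the ticket can be sketched in a few lines. This is a minimal illustration, assuming helpers that behave like the `fromMillis()`/`toMillis()` methods described above; the object name and the `MicrosPerMillis` constant are assumptions for the sketch, not Spark's actual implementation.

```scala
// Overflow-safe, floor-rounding conversions between milliseconds and
// microseconds, mirroring the behavior the ticket asks for.
object MillisMicrosSketch {
  val MicrosPerMillis: Long = 1000L

  // Math.multiplyExact throws ArithmeticException on Long overflow
  // instead of silently wrapping around.
  def fromMillis(millis: Long): Long = Math.multiplyExact(millis, MicrosPerMillis)

  // Math.floorDiv rounds toward negative infinity, so a negative
  // microsecond value maps to the earlier millisecond, unlike plain `/`,
  // which truncates toward zero.
  def toMillis(micros: Long): Long = Math.floorDiv(micros, MicrosPerMillis)
}
```

For example, `toMillis(-1L)` yields `-1L` under floor division, whereas a naive `-1L / 1000L` yields `0L`; that rounding discrepancy is exactly the kind of error the ticket targets.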
[jira] [Resolved] (SPARK-30844) Static partition should also follow StoreAssignmentPolicy when insert into table
[ https://issues.apache.org/jira/browse/SPARK-30844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-30844.
--
Fix Version/s: 3.0.0
Assignee: wuyi
Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/27597]

> Static partition should also follow StoreAssignmentPolicy when insert into
> table
>
> Key: SPARK-30844
> URL: https://issues.apache.org/jira/browse/SPARK-30844
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: wuyi
> Assignee: wuyi
> Priority: Major
> Fix For: 3.0.0
>
> Static partitions currently use a common cast regardless of the
> StoreAssignmentPolicy. We should make them follow the
> StoreAssignmentPolicy as well.
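The distinction the ticket draws can be illustrated with a toy model of the two policies for a static partition value such as `PARTITION (p='xyz')` targeting an INT partition column. Everything here (the type names, the function, the error type) is a hypothetical sketch of the behavior, not Spark internals.

```scala
// A lenient "common cast" turns an invalid value into NULL; an ANSI-style
// strict cast rejects it up front.
sealed trait StoreAssignmentPolicySketch
case object LegacyPolicy extends StoreAssignmentPolicySketch // permissive: invalid input becomes NULL
case object AnsiPolicy extends StoreAssignmentPolicySketch   // strict: invalid input is an error

def castStaticPartitionValue(raw: String, policy: StoreAssignmentPolicySketch): Option[Int] =
  try Some(raw.trim.toInt)
  catch {
    case _: NumberFormatException =>
      policy match {
        case LegacyPolicy => None // silently written as NULL
        case AnsiPolicy   => throw new IllegalArgumentException(s"invalid partition value: '$raw'")
      }
  }
```

The ticket's point is that the static-partition path should consult the configured policy like any other insert, instead of always taking the permissive branch.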
[jira] [Updated] (SPARK-30822) Pyspark queries fail if terminated with a semicolon
[ https://issues.apache.org/jira/browse/SPARK-30822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-30822:
-
Flags: (was: Patch)

> Pyspark queries fail if terminated with a semicolon
> ---
>
> Key: SPARK-30822
> URL: https://issues.apache.org/jira/browse/SPARK-30822
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 3.0.0
> Reporter: Samuel Setegne
> Priority: Minor
> Original Estimate: 10m
> Remaining Estimate: 10m
>
> When a user submits a directly executable SQL statement terminated with a
> semicolon, they receive an
> `org.apache.spark.sql.catalyst.parser.ParseException` of `mismatched input
> ";"`. SQL-92 describes a direct SQL statement as having the format of
> ` ` and the majority of SQL
> implementations either require the semicolon as a statement terminator or
> make it optional (i.e., not raising an exception when it is included,
> seemingly in recognition that it is common behavior).
[jira] [Updated] (SPARK-30822) Pyspark queries fail if terminated with a semicolon
[ https://issues.apache.org/jira/browse/SPARK-30822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-30822:
-
Labels: (was: easyfix patch pull-request-available)

> Pyspark queries fail if terminated with a semicolon
> ---
>
> Key: SPARK-30822
> URL: https://issues.apache.org/jira/browse/SPARK-30822
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 3.0.0
> Reporter: Samuel Setegne
> Priority: Minor
> Original Estimate: 10m
> Remaining Estimate: 10m
>
> When a user submits a directly executable SQL statement terminated with a
> semicolon, they receive an
> `org.apache.spark.sql.catalyst.parser.ParseException` of `mismatched input
> ";"`. SQL-92 describes a direct SQL statement as having the format of
> ` ` and the majority of SQL
> implementations either require the semicolon as a statement terminator or
> make it optional (i.e., not raising an exception when it is included,
> seemingly in recognition that it is common behavior).
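The fix the report suggests amounts to a small pre-parse normalization: treat one trailing semicolon as an optional statement terminator rather than letting the parser fail on it. The function below is a hypothetical sketch of that idea; the name and placement are assumptions, not the actual patch.

```scala
// Drop a single trailing ";" (plus surrounding whitespace) before handing
// the statement to the SQL parser, so "SELECT 1;" parses like "SELECT 1".
def stripTrailingTerminator(sql: String): String = {
  val trimmed = sql.trim
  if (trimmed.endsWith(";")) trimmed.dropRight(1).trim else trimmed
}
```

A statement without a terminator passes through unchanged, so the normalization is safe to apply unconditionally.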