[jira] [Resolved] (SPARK-30898) The behavior of MakeDecimal should not depend on SQLConf.get

2020-02-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30898.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27656
[https://github.com/apache/spark/pull/27656]

> The behavior of MakeDecimal should not depend on SQLConf.get
> 
>
> Key: SPARK-30898
> URL: https://issues.apache.org/jira/browse/SPARK-30898
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Assigned] (SPARK-30898) The behavior of MakeDecimal should not depend on SQLConf.get

2020-02-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30898:


Assignee: Peter Toth

> The behavior of MakeDecimal should not depend on SQLConf.get
> 
>
> Key: SPARK-30898
> URL: https://issues.apache.org/jira/browse/SPARK-30898
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Peter Toth
>Priority: Major
>







[jira] [Resolved] (SPARK-30897) The behavior of ArrayExists should not depend on SQLConf.get

2020-02-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30897.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27655
[https://github.com/apache/spark/pull/27655]

> The behavior of ArrayExists should not depend on SQLConf.get
> 
>
> Key: SPARK-30897
> URL: https://issues.apache.org/jira/browse/SPARK-30897
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Assigned] (SPARK-30897) The behavior of ArrayExists should not depend on SQLConf.get

2020-02-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30897:


Assignee: Peter Toth

> The behavior of ArrayExists should not depend on SQLConf.get
> 
>
> Key: SPARK-30897
> URL: https://issues.apache.org/jira/browse/SPARK-30897
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Peter Toth
>Priority: Major
>







[jira] [Commented] (SPARK-30893) Expressions should not change its data type/behavior after it's created

2020-02-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043195#comment-17043195
 ] 

Hyukjin Kwon commented on SPARK-30893:
--

I am sure there are already multiple inconsistent instances out there. Probably 
some configurations would need more destructive fixes. Are they worthwhile? I 
am not sure.
It seems a bit unlikely to me that users set different configurations that change 
behaviours between queries.
These data-type-related instances look easy to fix, so they are probably fine for 
now. I am not so supportive of fixing other instances.

> Expressions should not change its data type/behavior after it's created
> ---
>
> Key: SPARK-30893
> URL: https://issues.apache.org/jira/browse/SPARK-30893
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Critical
>
> This is a problem because the configuration can change between different 
> phases of planning, and this can silently break a query plan which can lead 
> to crashes or data corruption, if data type/nullability gets changed.
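
As a framework-free illustration of the problem (the names below are invented for 
this note, not Spark code): an expression that re-reads a mutable config every time 
it is asked for its type or behaviour can give different answers in different 
planning phases, while one that captures the config when it is created cannot.

{code:scala}
// Stand-in for a mutable session config such as SQLConf; not Spark code.
object AnsiModeConf { @volatile var enabled: Boolean = false }

// Fragile: nullable silently tracks whatever the config is *right now*, so it can
// differ between analysis, optimization and execution.
class DependsOnConfLazily {
  def nullable: Boolean = !AnsiModeConf.enabled
}

// Robust: the config value is captured once, when the expression is created, so the
// resolved data type / nullability cannot change after the plan is built.
class CapturesConfAtCreation {
  private val ansiEnabled: Boolean = AnsiModeConf.enabled
  def nullable: Boolean = !ansiEnabled
}
{code}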






[jira] [Assigned] (SPARK-30924) Add additional validation into Merge Into

2020-02-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30924:
---

Assignee: Burak Yavuz

> Add additional validation into Merge Into
> -
>
> Key: SPARK-30924
> URL: https://issues.apache.org/jira/browse/SPARK-30924
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
>
> Merge Into is currently missing additional validation around:
>  1. The lack of any WHEN statements
>  2. Single use of UPDATE/DELETE
>  3. The first WHEN MATCHED statement needs to have a condition if there are 
> two WHEN MATCHED statements.
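
For illustration, a sketch of a MERGE statement that passes these checks, assuming 
Spark 3.0's DataSource V2 MERGE INTO syntax, a SparkSession named spark, and two v2 
tables named target and source (none of which come from this ticket):

{code:scala}
// Hedged sketch only; requires a v2 catalog/table that supports MERGE.
spark.sql("""
  MERGE INTO target t
  USING source s
  ON t.id = s.id
  WHEN MATCHED AND s.deleted = true THEN DELETE   -- first WHEN MATCHED carries a condition
  WHEN MATCHED THEN UPDATE SET *                  -- only the last matched clause is unconditional
  WHEN NOT MATCHED THEN INSERT *
""")
{code}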






[jira] [Resolved] (SPARK-30924) Add additional validation into Merge Into

2020-02-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30924.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27677
[https://github.com/apache/spark/pull/27677]

> Add additional validation into Merge Into
> -
>
> Key: SPARK-30924
> URL: https://issues.apache.org/jira/browse/SPARK-30924
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 3.0.0
>
>
> Merge Into is currently missing additional validation around:
>  1. The lack of any WHEN statements
>  2. Single use of UPDATE/DELETE
>  3. The first WHEN MATCHED statement needs to have a condition if there are 
> two WHEN MATCHED statements.






[jira] [Resolved] (SPARK-30922) Remove the max split config after changing the multi sub joins to multi sub partitions

2020-02-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30922.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27673
[https://github.com/apache/spark/pull/27673]

> Remove the max split config after changing the multi sub joins to multi sub 
> partitions
> --
>
> Key: SPARK-30922
> URL: https://issues.apache.org/jira/browse/SPARK-30922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
>
> After PR #27493 was merged, we no longer need the 
> "spark.sql.adaptive.skewedJoinOptimization.skewedPartitionMaxSplits" config 
> to resolve the UI issue when splitting into more sub-joins.






[jira] [Assigned] (SPARK-30922) Remove the max split config after changing the multi sub joins to multi sub partitions

2020-02-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30922:
---

Assignee: Ke Jia

> Remove the max split config after changing the multi sub joins to multi sub 
> partitions
> --
>
> Key: SPARK-30922
> URL: https://issues.apache.org/jira/browse/SPARK-30922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
>
> After PR #27493 was merged, we no longer need the 
> "spark.sql.adaptive.skewedJoinOptimization.skewedPartitionMaxSplits" config 
> to resolve the UI issue when splitting into more sub-joins.






[jira] [Updated] (SPARK-30936) Forwards-compatibility in JsonProtocol is broken

2020-02-23 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-30936:
-
Summary: Forwards-compatibility in JsonProtocol is broken  (was: Fix the 
broken forwards-compatibility in JsonProtocol)

> Forwards-compatibility in JsonProtocol is broken
> 
>
> Key: SPARK-30936
> URL: https://issues.apache.org/jira/browse/SPARK-30936
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> JsonProtocol is supposed to provide strong backwards-compatibility and 
> forwards-compatibility guarantees: any version of Spark should be able to 
> read JSON output written by any other version, including newer versions.
> However, the forwards-compatibility guarantee is broken for events parsed by 
> "ObjectMapper". If a new field is added to an event parsed by "ObjectMapper" 
> (e.g., 
> https://github.com/apache/spark/commit/6dc5921e66d56885b95c07e56e687f9f6c1eaca7#diff-dc5c7a41fbb7479cef48b67eb41ad254R33),
>  this event cannot be parsed by an old version of Spark History Server.






[jira] [Created] (SPARK-30936) Fix the broken forwards-compatibility in JsonProtocol

2020-02-23 Thread Shixiong Zhu (Jira)
Shixiong Zhu created SPARK-30936:


 Summary: Fix the broken forwards-compatibility in JsonProtocol
 Key: SPARK-30936
 URL: https://issues.apache.org/jira/browse/SPARK-30936
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Shixiong Zhu


JsonProtocol is supposed to provide strong backwards-compatibility and 
forwards-compatibility guarantees: any version of Spark should be able to read 
JSON output written by any other version, including newer versions.

However, the forwards-compatibility guarantee is broken for events parsed by 
"ObjectMapper". If a new field is added to an event parsed by "ObjectMapper" 
(e.g., 
https://github.com/apache/spark/commit/6dc5921e66d56885b95c07e56e687f9f6c1eaca7#diff-dc5c7a41fbb7479cef48b67eb41ad254R33),
 this event cannot be parsed by an old version of Spark History Server.
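
For reference, a minimal sketch of the usual way a Jackson ObjectMapper is made 
tolerant of fields it does not know about; the event class and JSON payload below 
are invented for illustration, and this is not necessarily the fix adopted in Spark:

{code:scala}
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Made-up stand-in for an event class read by an older Spark History Server.
case class StageInfoLike(stageId: Int, name: String)

val mapper = new ObjectMapper()
  .registerModule(DefaultScalaModule)
  // Ignore fields added by newer writers instead of failing the whole event.
  .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)

// Payload written by a hypothetical newer Spark that carries an extra field.
val json = """{"stageId": 1, "name": "stage 1", "fieldAddedInNewerSpark": 42}"""
println(mapper.readValue(json, classOf[StageInfoLike]))  // parses despite the unknown field
{code}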






[jira] [Assigned] (SPARK-30925) Overflow/round errors in conversions of milliseconds to/from microseconds

2020-02-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30925:
---

Assignee: Maxim Gekk

> Overflow/round errors in conversions of milliseconds to/from microseconds
> -
>
> Key: SPARK-30925
> URL: https://issues.apache.org/jira/browse/SPARK-30925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Spark has special methods in DateTimeUtils for converting microseconds 
> from/to milliseconds - `fromMillis` and `toMillis`. The methods handle 
> arithmetic overflow and round negative values. This ticket aims to review all 
> places in Spark SQL where microseconds are converted from/to milliseconds 
> and replace them with the util methods from DateTimeUtils.
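
A minimal sketch of what overflow-safe, floor-rounding conversions look like; this 
is illustrative only and not the actual DateTimeUtils implementation:

{code:scala}
val MICROS_PER_MILLIS = 1000L

// Throws ArithmeticException on Long overflow instead of silently wrapping around.
def millisToMicros(millis: Long): Long = Math.multiplyExact(millis, MICROS_PER_MILLIS)

// Rounds toward negative infinity, so negative microsecond values stay consistent.
def microsToMillis(micros: Long): Long = Math.floorDiv(micros, MICROS_PER_MILLIS)

assert(microsToMillis(-1500L) == -2L)  // plain integer division would give -1
{code}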






[jira] [Resolved] (SPARK-30925) Overflow/round errors in conversions of milliseconds to/from microseconds

2020-02-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30925.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27676
[https://github.com/apache/spark/pull/27676]

> Overflow/round errors in conversions of milliseconds to/from microseconds
> -
>
> Key: SPARK-30925
> URL: https://issues.apache.org/jira/browse/SPARK-30925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Spark has special methods in DateTimeUtils for converting microseconds 
> from/to milliseconds - `fromMillis` and `toMillis`. The methods handle 
> arithmetic overflow and round negative values. This ticket aims to review all 
> places in Spark SQL where microseconds are converted from/to milliseconds 
> and replace them with the util methods from DateTimeUtils.






[jira] [Updated] (SPARK-30923) Spark MLlib, GraphX 3.0 QA umbrella

2020-02-23 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-30923:
-
Component/s: PySpark

> Spark MLlib, GraphX 3.0 QA umbrella
> ---
>
> Key: SPARK-30923
> URL: https://issues.apache.org/jira/browse/SPARK-30923
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> Description
>  This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX. *SparkR is separate.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
>  * Check binary API compatibility for Scala/Java
>  * Audit new public APIs (from the generated html doc)
>  ** Scala
>  ** Java compatibility
>  ** Python coverage
>  * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
>  * Performance tests
> h2. Documentation and example code
>  * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
>  * Update Programming Guide
>  * Update website






[jira] [Commented] (SPARK-30923) Spark MLlib, GraphX 3.0 QA umbrella

2020-02-23 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043150#comment-17043150
 ] 

zhengruifeng commented on SPARK-30923:
--

referring to the previous ticket https://issues.apache.org/jira/browse/SPARK-25319

> Spark MLlib, GraphX 3.0 QA umbrella
> ---
>
> Key: SPARK-30923
> URL: https://issues.apache.org/jira/browse/SPARK-30923
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> Description
>  This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX. *SparkR is separate.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
>  * Check binary API compatibility for Scala/Java
>  * Audit new public APIs (from the generated html doc)
>  ** Scala
>  ** Java compatibility
>  ** Python coverage
>  * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
>  * Performance tests
> h2. Documentation and example code
>  * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
>  * Update Programming Guide
>  * Update website






[jira] [Updated] (SPARK-30923) Spark MLlib, GraphX 3.0 QA umbrella

2020-02-23 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-30923:
-
Issue Type: Umbrella  (was: Task)

> Spark MLlib, GraphX 3.0 QA umbrella
> ---
>
> Key: SPARK-30923
> URL: https://issues.apache.org/jira/browse/SPARK-30923
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> Description
>  This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX. *SparkR is separate.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
>  * Check binary API compatibility for Scala/Java
>  * Audit new public APIs (from the generated html doc)
>  ** Scala
>  ** Java compatibility
>  ** Python coverage
>  * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
>  * Performance tests
> h2. Documentation and example code
>  * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
>  * Update Programming Guide
>  * Update website






[jira] [Created] (SPARK-30935) Update MLlib, GraphX websites for 3.0

2020-02-23 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30935:


 Summary: Update MLlib, GraphX websites for 3.0
 Key: SPARK-30935
 URL: https://issues.apache.org/jira/browse/SPARK-30935
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, GraphX, ML, MLlib
Affects Versions: 3.0.0
Reporter: zhengruifeng


Update the sub-projects' websites to include new features in this release.






[jira] [Created] (SPARK-30934) ML, GraphX 3.0 QA: Programming guide update and migration guide

2020-02-23 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30934:


 Summary: ML, GraphX 3.0 QA: Programming guide update and migration 
guide
 Key: SPARK-30934
 URL: https://issues.apache.org/jira/browse/SPARK-30934
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, GraphX, ML, MLlib
Affects Versions: 3.0.0
Reporter: zhengruifeng


Before the release, we need to update the MLlib and GraphX Programming Guides. 
Updates will include:
 * Add migration guide subsection.
 ** Use the results of the QA audit JIRAs.
 * Check phrasing, especially in main sections (for outdated items such as "In 
this release, ...")






[jira] [Created] (SPARK-30933) ML, GraphX 3.0 QA: Update user guide for new features & APIs

2020-02-23 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30933:


 Summary: ML, GraphX 3.0 QA: Update user guide for new features & 
APIs
 Key: SPARK-30933
 URL: https://issues.apache.org/jira/browse/SPARK-30933
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, GraphX, ML, MLlib
Affects Versions: 3.0.0
Reporter: zhengruifeng


Check the user guide vs. a list of new APIs (classes, methods, data members) to 
see what items require updates to the user guide.

For each feature missing user guide doc:
 * Create a JIRA for that feature, and assign it to the author of the feature
 * Link it to (a) the original JIRA which introduced that feature ("related 
to") and (b) to this JIRA ("requires").

For MLlib:
 * This task does not include major reorganizations for the programming guide.
 * We should now begin copying algorithm details from the spark.mllib guide to 
spark.ml as needed, rather than just linking back to the corresponding 
algorithms in the spark.mllib user guide.

If you would like to work on this task, please comment, and we can create & 
link JIRAs for parts of this work.






[jira] [Created] (SPARK-30932) ML 3.0 QA: API: Java compatibility, docs

2020-02-23 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30932:


 Summary: ML 3.0 QA: API: Java compatibility, docs
 Key: SPARK-30932
 URL: https://issues.apache.org/jira/browse/SPARK-30932
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Affects Versions: 3.0.0
Reporter: zhengruifeng


Check Java compatibility for this release:
 * APIs in {{spark.ml}}
 * New APIs in {{spark.mllib}} (There should be few, if any.)

Checking compatibility means:
 * Checking for differences in how Scala and Java handle types. Some items to 
look out for are:
 ** Check for generic "Object" types where Java cannot understand complex Scala 
types.
 *** *Note*: The Java docs do not always match the bytecode. If you find a 
problem, please verify it using {{javap}}.
 ** Check Scala objects (especially with nesting!) carefully. These may not be 
understood in Java, or they may be accessible only via the weirdly named Java 
types (with "$" or "#") which are generated by the Scala compiler.
 ** Check for uses of Scala and Java enumerations, which can show up oddly in 
the other language's doc. (In {{spark.ml}}, we have largely tried to avoid 
using enumerations, and have instead favored plain strings.)
 * Check for differences in generated Scala vs Java docs. E.g., one past issue 
was that Javadocs did not respect Scala's package private modifier.

If you find issues, please comment here, or for larger items, create separate 
JIRAs and link here as "requires".
 * Remember that we should not break APIs from previous releases. If you find a 
problem, check if it was introduced in this Spark release (in which case we can 
fix it) or in a previous one (in which case we can create a java-friendly 
version of the API).
 * If needed for complex issues, create small Java unit tests which execute 
each method. (Algorithmic correctness can be checked in Scala.)

Recommendations for how to complete this task:
 * There are not great tools. In the past, this task has been done by:
 ** Generating API docs
 ** Building JAR and outputting the Java class signatures for MLlib
 ** Manually inspecting and searching the docs and class signatures for issues
 * If you do have ideas for better tooling, please say so, so we can make this 
task easier in the future!






[jira] [Created] (SPARK-30931) ML 3.0 QA: API: Python API coverage

2020-02-23 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30931:


 Summary: ML 3.0 QA: API: Python API coverage
 Key: SPARK-30931
 URL: https://issues.apache.org/jira/browse/SPARK-30931
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
generated HTML doc and compare the Scala & Python versions.
 * *GOAL*: Audit and create JIRAs to fix in the next release.
 * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.

We need to track:
 * Inconsistency: Do class/method/parameter names match?
 * Docs: Is the Python doc missing or just a stub? We want the Python doc to be 
as complete as the Scala doc.
 * API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental. These must be recorded and added in the 
Migration Guide for this release.
 ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
component, please note that as well.
 * Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python, to be added in the next release cycle. 
*Please use a _separate_ JIRA (linked below as "requires") for this list of 
to-do items.*






[jira] [Created] (SPARK-30930) ML, GraphX 3.0 QA: API: Experimental, DeveloperApi, final, sealed audit

2020-02-23 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30930:


 Summary: ML, GraphX 3.0 QA: API: Experimental, DeveloperApi, 
final, sealed audit
 Key: SPARK-30930
 URL: https://issues.apache.org/jira/browse/SPARK-30930
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, GraphX, ML, MLlib
Affects Versions: 3.0.0
Reporter: zhengruifeng


We should make a pass through the items marked as Experimental or DeveloperApi 
and see if any are stable enough to be unmarked.

We should also check for items marked final or sealed to see if they are stable 
enough to be opened up as APIs.






[jira] [Created] (SPARK-30929) ML, GraphX 3.0 QA: API: New Scala APIs, docs

2020-02-23 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30929:


 Summary: ML, GraphX 3.0 QA: API: New Scala APIs, docs
 Key: SPARK-30929
 URL: https://issues.apache.org/jira/browse/SPARK-30929
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, GraphX, ML, MLlib
Affects Versions: 3.0.0
 Environment: Audit new public Scala APIs added to MLlib & GraphX. Take 
note of:
 * Protected/public classes or methods. If access can be more private, then it 
should be.
 * Also look for non-sealed traits.
 * Documentation: Missing? Bad links or formatting?

*Make sure to check the object doc!*

As you find issues, please create JIRAs and link them to this issue. 

For *user guide issues*, link the new JIRAs to the relevant user guide QA issue.
Reporter: zhengruifeng









[jira] [Created] (SPARK-30928) ML, GraphX 3.0 QA: API: Binary incompatible changes

2020-02-23 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30928:


 Summary: ML, GraphX 3.0 QA: API: Binary incompatible changes
 Key: SPARK-30928
 URL: https://issues.apache.org/jira/browse/SPARK-30928
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, GraphX, ML, MLlib
Affects Versions: 3.0.0
Reporter: zhengruifeng


Generate a list of binary incompatible changes using MiMa and create new JIRAs 
for issues found. Filter out false positives as needed.

If you want to take this task, look at the analogous task from the previous 
release QA, and ping the Assignee for advice.






[jira] [Commented] (SPARK-30923) Spark MLlib, GraphX 3.0 QA umbrella

2020-02-23 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043147#comment-17043147
 ] 

zhengruifeng commented on SPARK-30923:
--

[~smilegator] Sure!

> Spark MLlib, GraphX 3.0 QA umbrella
> ---
>
> Key: SPARK-30923
> URL: https://issues.apache.org/jira/browse/SPARK-30923
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> Description
>  This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX. *SparkR is separate.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
>  * Check binary API compatibility for Scala/Java
>  * Audit new public APIs (from the generated html doc)
>  ** Scala
>  ** Java compatibility
>  ** Python coverage
>  * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
>  * Performance tests
> h2. Documentation and example code
>  * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
>  * Update Programming Guide
>  * Update website






[jira] [Assigned] (SPARK-30867) add FValueRegressionTest

2020-02-23 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-30867:


Assignee: Huaxin Gao

> add FValueRegressionTest
> 
>
> Key: SPARK-30867
> URL: https://issues.apache.org/jira/browse/SPARK-30867
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> Add FValueRegressionTest in ML.stat. This will be used for 
> FValueRegressionSelector. 






[jira] [Resolved] (SPARK-30867) add FValueRegressionTest

2020-02-23 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-30867.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27623
[https://github.com/apache/spark/pull/27623]

> add FValueRegressionTest
> 
>
> Key: SPARK-30867
> URL: https://issues.apache.org/jira/browse/SPARK-30867
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.1.0
>
>
> Add FValueRegressionTest in ML.stat. This will be used for 
> FValueRegressionSelector. 






[jira] [Updated] (SPARK-30923) Spark MLlib, GraphX 3.0 QA umbrella

2020-02-23 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-30923:
-
Summary: Spark MLlib, GraphX 3.0 QA umbrella  (was: Spark MLlib, GraphX 2.4 
QA umbrella)

> Spark MLlib, GraphX 3.0 QA umbrella
> ---
>
> Key: SPARK-30923
> URL: https://issues.apache.org/jira/browse/SPARK-30923
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> Description
>  This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX. *SparkR is separate.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
>  * Check binary API compatibility for Scala/Java
>  * Audit new public APIs (from the generated html doc)
>  ** Scala
>  ** Java compatibility
>  ** Python coverage
>  * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
>  * Performance tests
> h2. Documentation and example code
>  * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
>  * Update Programming Guide
>  * Update website






[jira] [Created] (SPARK-30927) StreamingQueryManager should avoid keeping reference to terminated StreamingQuery

2020-02-23 Thread Shixiong Zhu (Jira)
Shixiong Zhu created SPARK-30927:


 Summary: StreamingQueryManager should avoid keeping reference to 
terminated StreamingQuery
 Key: SPARK-30927
 URL: https://issues.apache.org/jira/browse/SPARK-30927
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Shixiong Zhu


Right now StreamingQueryManager keeps a reference to the last terminated query until 
"resetTerminated" is called. When the last terminated query holds a lot of state 
(a large SQL plan, cached RDDs, etc.), this memory is wasted unnecessarily.
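
A minimal sketch of the current workaround, assuming a local SparkSession and 
placeholder rate/console sources; after a query terminates, resetTerminated() lets 
the manager drop its reference:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("reset-demo").getOrCreate()

val query = spark.readStream.format("rate").load()
  .writeStream.format("console").start()

query.stop()                      // the query is terminated but still referenced by the manager
spark.streams.resetTerminated()   // drops the reference so its plan/state can be garbage collected
{code}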






[jira] [Updated] (SPARK-30926) Same SQL on CSV and on Parquet gives different result

2020-02-23 Thread Bozhidar Karaargirov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bozhidar Karaargirov updated SPARK-30926:
-
Description: 
SO I played around with a data set from here: 
https://www.kaggle.com/hmavrodiev/sofia-air-quality-dataset

I ran the same query for the base CSVs and against a parquet version of them:

SELECT * FROM airQualityP WHERE P1 > 20

Here is the csv code:

import session.sqlContext.implicits._

val df = session.read.option("header", "true").csv(originalDataset)

df.createTempView("airQuality")

val result = session.sql("SELECT * FROM airQuality WHERE P1 > 20")
  .map(ParticleAirQuality.mappingFunction)

println(result.count())

Here is the parquet code:

import session.sqlContext.implicits._

val df = session.read.option("header", "true").parquet(bigParquetDataset)

df.createTempView("airQualityP")

val result = session
  .sql("SELECT * FROM airQualityP WHERE P1 > 20")
  .map(ParticleAirQuality.namedMappingFunction)

println(result.count())

And this is how I transform the csv into parquets:

import session.sqlContext.implicits._

val df = session.read.option("header", "true")
  .csv(originalDataset)
  .map(ParticleAirQuality.mappingFunction)

df.write.parquet(bigParquetDataset)

These are the two mapping functions:

val mappingFunction = { r: Row =>
  ParticleAirQuality(
    r.getString(1),
    r.getString(2),
    r.getString(3),
    r.getString(4),
    r.getString(5),
    { val p1 = r.getString(6); if (p1 == null) Double.NaN else p1.toDouble },
    { val p2 = r.getString(7); if (p2 == null) Double.NaN else p2.toDouble }
  )
}

val namedMappingFunction = { r: Row =>
  ParticleAirQuality(
    r.getAs[String]("sensor_id"),
    r.getAs[String]("location"),
    r.getAs[String]("lat"),
    r.getAs[String]("lon"),
    r.getAs[String]("timestamp"),
    r.getAs[Double]("P1"),
    r.getAs[Double]("P2")
  )
}

If it matters this is the paths (Note that I actually use double \ instead of / 
since it is windows - but that doesn't really matter):

val originalDataset = "D:/source/datasets/sofia-air-quality-dataset/*sds*.csv"

val bigParquetDataset = "D:/source/datasets/air-tests/all-parquet"

The count from the csvs I get is: 33934609

While the count from the parquets is: 35739394
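
For illustration, one way to narrow this down, reusing the report's session, 
originalDataset and bigParquetDataset values; this is a guess at the cause, not a 
confirmed diagnosis. The CSV view keeps P1 as a string (no schema inference is 
enabled), while the Parquet files were written with P1 as a double and nulls mapped 
to NaN, so the predicate P1 > 20 is evaluated over different types and values in 
the two runs:

{code:scala}
import org.apache.spark.sql.functions.{col, isnan}

val csvDf     = session.read.option("header", "true").csv(originalDataset)
val parquetDf = session.read.parquet(bigParquetDataset)

csvDf.printSchema()      // P1 is a string column here
parquetDf.printSchema()  // P1 is a double column here

// Rows with a null P1 never satisfy `P1 > 20` on the CSV side ...
println(csvDf.filter(col("P1").isNull).count())
// ... but on the Parquet side they were written as NaN, which Spark treats as
// larger than any other numeric value, so such rows can still pass the filter.
println(parquetDf.filter(isnan(col("P1"))).count())
{code}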

 

 


[jira] [Updated] (SPARK-30926) Same SQL on CSV and on Parquet gives different result

2020-02-23 Thread Bozhidar Karaargirov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bozhidar Karaargirov updated SPARK-30926:
-
Description: 
SO I played around with a data set from here: 
https://www.kaggle.com/hmavrodiev/sofia-air-quality-dataset

I ran the same query for the base CSVs and against a parquet version of them:

SELECT * FROM airQualityP WHERE P1 > 20

Here is the csv code:

import session.sqlContext.implicits._

val df = session.read.option("header", "true").csv(originalDataset)

df.createTempView("airQuality")

val result = session.sql("SELECT * FROM airQuality WHERE P1 > 20")
  .map(ParticleAirQuality.mappingFunction)

println(result.count())

Here is the parquet code:

import session.sqlContext.implicits._

val df = session.read.option("header", "true").parquet(bigParquetDataset)

df.createTempView("airQualityP")

val result = session
  .sql("SELECT * FROM airQualityP WHERE P1 > 20")
  .map(ParticleAirQuality.namedMappingFunction)

println(result.count())

And this is how I transform the csv into parquets:

import session.sqlContext.implicits._

val df = session.read.option("header", "true")
  .csv(originalDataset)
  .map(ParticleAirQuality.mappingFunction)

df.write.parquet(bigParquetDataset)

These are the two mapping functions:

val mappingFunction = { r: Row =>
  ParticleAirQuality(
    r.getString(1),
    r.getString(2),
    r.getString(3),
    r.getString(4),
    r.getString(5),
    { val p1 = r.getString(6); if (p1 == null) Double.NaN else p1.toDouble },
    { val p2 = r.getString(7); if (p2 == null) Double.NaN else p2.toDouble }
  )
}

val namedMappingFunction = { r: Row =>
  ParticleAirQuality(
    r.getAs[String]("sensor_id"),
    r.getAs[String]("location"),
    r.getAs[String]("lat"),
    r.getAs[String]("lon"),
    r.getAs[String]("timestamp"),
    r.getAs[Double]("P1"),
    r.getAs[Double]("P2")
  )
}

If it matters this is the paths (Note that I actually use \\ instead of / since 
it is windows - but that doesn't really matter):

val originalDataset = "D:/source/datasets/sofia-air-quality-dataset/*sds*.csv"

val bigParquetDataset = "D:/source/datasets/air-tests/all-parquet"

The count from the csvs I get is: 33934609

While the count from the parquets is: 35739394

 

 


[jira] [Updated] (SPARK-30926) Same SQL on CSV and on Parquet gives different result

2020-02-23 Thread Bozhidar Karaargirov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bozhidar Karaargirov updated SPARK-30926:
-
Description: 
SO I played around with a data set from here: 
https://www.kaggle.com/hmavrodiev/sofia-air-quality-dataset

I ran the same query for the base CSVs and against a parquet version of them:

SELECT * FROM airQualityP WHERE P1 > 20

Here is the csv code:

import session.sqlContext.implicits._

val df = session.read.option("header", "true").csv(originalDataset)

df.createTempView("airQuality")

val result = session.sql("SELECT * FROM airQuality WHERE P1 > 20")
  .map(ParticleAirQuality.mappingFunction)

println(result.count())

Here is the parquet code:

import session.sqlContext.implicits._

val df = session.read.option("header", "true").parquet(bigParquetDataset)

df.createTempView("airQualityP")

val result = session
  .sql("SELECT * FROM airQualityP WHERE P1 > 20")
  .map(ParticleAirQuality.namedMappingFunction)

println(result.count())

And this is how I transform the csv into parquets:

import session.sqlContext.implicits._

val df = session.read.option("header", "true")
  .csv(originalDataset)
  .map(ParticleAirQuality.mappingFunction)

df.write.parquet(bigParquetDataset)

These are the two mapping functions:

val mappingFunction = { r: Row =>
  ParticleAirQuality(
    r.getString(1),
    r.getString(2),
    r.getString(3),
    r.getString(4),
    r.getString(5),
    { val p1 = r.getString(6); if (p1 == null) Double.NaN else p1.toDouble },
    { val p2 = r.getString(7); if (p2 == null) Double.NaN else p2.toDouble }
  )
}

val namedMappingFunction = { r: Row =>
  ParticleAirQuality(
    r.getAs[String]("sensor_id"),
    r.getAs[String]("location"),
    r.getAs[String]("lat"),
    r.getAs[String]("lon"),
    r.getAs[String]("timestamp"),
    r.getAs[Double]("P1"),
    r.getAs[Double]("P2")
  )
}

If it matters this is the paths:

val originalDataset = "D:/source/datasets/sofia-air-quality-dataset/*sds*.csv"

val bigParquetDataset = "D:/source/datasets/air-tests/all-parquet"

The count from the csvs I get is: 33934609

While the count from the parquets is: 35739394

 

 


[jira] [Updated] (SPARK-30926) Same SQL on CSV and on Parquet gives different result

2020-02-23 Thread Bozhidar Karaargirov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bozhidar Karaargirov updated SPARK-30926:
-
Description: 
SO I played around with a data set from here: 
https://www.kaggle.com/hmavrodiev/sofia-air-quality-dataset

I ran the same query for the base CSVs and against a parquet version of them:

SELECT * FROM airQualityP WHERE P1 > 20

Here is the csv code:

import session.sqlContext.implicits._

val df = session.read.option("header", "true").csv(originalDataset)

df.createTempView("airQuality")

val result = session.sql("SELECT * FROM airQuality WHERE P1 > 20")
  .map(ParticleAirQuality.mappingFunction)

println(result.count())

Here is the parquet code:

import session.sqlContext.implicits._

val df = session.read.option("header", "true").parquet(bigParquetDataset)

df.createTempView("airQualityP")

val result = session
  .sql("SELECT * FROM airQualityP WHERE P1 > 20")
  .map(ParticleAirQuality.namedMappingFunction)

println(result.count())

And this is how I transform the csv into parquets:

import session.sqlContext.implicits._

val df = session.read.option("header", "true")
  .csv(originalDataset)
  .map(ParticleAirQuality.mappingFunction)

df.write.parquet(bigParquetDataset)

These are the two mapping functions:

val mappingFunction = { r: Row =>
  ParticleAirQuality(
    r.getString(1),
    r.getString(2),
    r.getString(3),
    r.getString(4),
    r.getString(5),
    { val p1 = r.getString(6); if (p1 == null) Double.NaN else p1.toDouble },
    { val p2 = r.getString(7); if (p2 == null) Double.NaN else p2.toDouble }
  )
}

val namedMappingFunction = { r: Row =>
  ParticleAirQuality(
    r.getAs[String]("sensor_id"),
    r.getAs[String]("location"),
    r.getAs[String]("lat"),
    r.getAs[String]("lon"),
    r.getAs[String]("timestamp"),
    r.getAs[Double]("P1"),
    r.getAs[Double]("P2")
  )
}

If it matters this is the paths:

val originalDataset = "D:\source\datasets\sofia-air-quality-dataset\*sds*.csv"

val bigParquetDataset = "D:\source\datasets\air-tests\all-parquet"

The count from the csvs I get is: 33934609

While the count from the parquets is: 35739394

 

 


[jira] [Created] (SPARK-30926) Same SQL on CSV and on Parquet gives different result

2020-02-23 Thread Bozhidar Karaargirov (Jira)
Bozhidar Karaargirov created SPARK-30926:


 Summary: Same SQL on CSV and on Parquet gives different result
 Key: SPARK-30926
 URL: https://issues.apache.org/jira/browse/SPARK-30926
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
 Environment: I run this locally on a windows 10 machine.

The java runtime is:


openjdk 11.0.5 2019-10-15
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.5+10)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.5+10, mixed mode)
Reporter: Bozhidar Karaargirov


SO I played around with a data set from here: 
[https://www.kaggle.com/hmavrodiev/sofia-air-quality-dataset]

I ran the same query for the base CSVs and against a parquet version of them:

{color:#008000}SELECT * FROM airQualityP WHERE P1 > 20{color}

Here is the csv code:

{color:#80}import 
{color}{color:#660e7a}session{color}.{color:#660e7a}sqlContext{color}.implicits._

{color:#80}val {color}df = 
{color:#660e7a}session{color}.read.option({color:#008000}"header"{color}, 
{color:#008000}"true"{color}).csv({color:#660e7a}originalDataset{color})

df.createTempView({color:#008000}"airQuality"{color})

{color:#80}val {color}result = 
{color:#660e7a}session{color}.sql({color:#008000}"SELECT * FROM airQuality 
WHERE P1 > 20"{color})
 .map(ParticleAirQuality.{color:#660e7a}mappingFunction{color})

println(result.count())

 

Here is the parquet code:

 

{color:#80}import 
{color}{color:#660e7a}session{color}.{color:#660e7a}sqlContext{color}.implicits._

{color:#80}val {color}df = 
{color:#660e7a}session{color}.read.option({color:#008000}"header"{color}, 
{color:#008000}"true"{color}).parquet({color:#660e7a}bigParquetDataset{color})

df.createTempView({color:#008000}"airQualityP"{color})

{color:#80}val {color}result = {color:#660e7a}session
{color} .sql({color:#008000}"SELECT * FROM airQualityP WHERE P1 > 20"{color})
 .map(ParticleAirQuality.{color:#660e7a}namedMappingFunction{color})

println(result.count())

 

And this is how I transform the csv into parquets:

{color:#80}import 
{color}{color:#660e7a}session{color}.{color:#660e7a}sqlContext{color}.implicits._

{color:#80}val {color}df = 
{color:#660e7a}session{color}.read.option({color:#008000}"header"{color}, 
{color:#008000}"true"{color})
 .csv({color:#660e7a}originalDataset{color})
 .map(ParticleAirQuality.{color:#660e7a}mappingFunction{color})

df.write.parquet({color:#660e7a}bigParquetDataset{color})

 

These are the two mapping functions:

{code:scala}
val mappingFunction = {
  r: Row => ParticleAirQuality(
    r.getString(1),
    r.getString(2),
    r.getString(3),
    r.getString(4),
    r.getString(5),
    {
      val p1 = r.getString(6)
      if (p1 == null) Double.NaN else p1.toDouble
    },
    {
      val p2 = r.getString(7)
      if (p2 == null) Double.NaN else p2.toDouble
    }
  )
}

val namedMappingFunction = {
  r: Row => ParticleAirQuality(
    r.getAs[String]("sensor_id"),
    r.getAs[String]("location"),
    r.getAs[String]("lat"),
    r.getAs[String]("lon"),
    r.getAs[String]("timestamp"),
    r.getAs[Double]("P1"),
    r.getAs[Double]("P2")
  )
}
{code}

 

If it matters, these are the paths:

{code:scala}
val originalDataset = "D:\\source\\datasets\\sofia-air-quality-dataset\\*sds*.csv"
val bigParquetDataset = "D:\\source\\datasets\\air-tests\\all-parquet"
{code}

 

The count I get from the CSVs is 33934609,

while the count from the Parquet files is 35739394.
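
One plausible explanation (just a guess on my side, not verified against the data): the CSV read has no "inferSchema" option, so P1 is read as a string and the predicate implicitly casts it, turning non-numeric or empty values into NULLs that never match, while the Parquet files written through mappingFunction store those values as Double.NaN, which Spark SQL treats as larger than any other double and therefore matches P1 > 20. A minimal sketch to check that hypothesis, reusing the views defined above:

{code:scala}
// Sketch only: count the rows that could account for the gap between the two results.
// Assumes the same `session`, `airQuality` (CSV) and `airQualityP` (Parquet) views as above.
val nanRowsInParquet = session
  .sql("SELECT * FROM airQualityP WHERE isnan(P1)")
  .count()

val nonNumericRowsInCsv = session
  .sql("SELECT * FROM airQuality WHERE P1 IS NULL OR CAST(P1 AS DOUBLE) IS NULL")
  .count()

println(s"NaN P1 rows in Parquet: $nanRowsInParquet, non-numeric P1 rows in CSV: $nonNumericRowsInCsv")
{code}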

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception

2020-02-23 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield reopened SPARK-30332:
-

Added code to reproduce the issue.

> When running sql query with limit catalyst throw StackOverFlow exception 
> -
>
> Key: SPARK-30332
> URL: https://issues.apache.org/jira/browse/SPARK-30332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark version 3.0.0-preview
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: AGGR_41380.csv, AGGR_41390.csv, AGGR_41406.csv, 
> AGGR_41406.csv, AGGR_41410.csv, AGGR_41418.csv, PORTFOLIO_41446.csv, 
> T_41233.csv
>
>
> Running that SQL:
> {code:sql}
> SELECT  BT_capital.asof_date,
> BT_capital.run_id,
> BT_capital.v,
> BT_capital.id,
> BT_capital.entity,
> BT_capital.level_1,
> BT_capital.level_2,
> BT_capital.level_3,
> BT_capital.level_4,
> BT_capital.level_5,
> BT_capital.level_6,
> BT_capital.path_bt_capital,
> BT_capital.line_item,
> t0.target_line_item,
> t0.line_description,
> BT_capital.col_item,
> BT_capital.rep_amount,
> root.orgUnitId,
> root.cptyId,
> root.instId,
> root.startDate,
> root.maturityDate,
> root.amount,
> root.nominalAmount,
> root.quantity,
> root.lkupAssetLiability,
> root.lkupCurrency,
> root.lkupProdType,
> root.interestResetDate,
> root.interestResetTerm,
> root.noticePeriod,
> root.historicCostAmount,
> root.dueDate,
> root.lkupResidence,
> root.lkupCountryOfUltimateRisk,
> root.lkupSector,
> root.lkupIndustry,
> root.lkupAccountingPortfolioType,
> root.lkupLoanDepositTerm,
> root.lkupFixedFloating,
> root.lkupCollateralType,
> root.lkupRiskType,
> root.lkupEligibleRefinancing,
> root.lkupHedging,
> root.lkupIsOwnIssued,
> root.lkupIsSubordinated,
> root.lkupIsQuoted,
> root.lkupIsSecuritised,
> root.lkupIsSecuritisedServiced,
> root.lkupIsSyndicated,
> root.lkupIsDeRecognised,
> root.lkupIsRenegotiated,
> root.lkupIsTransferable,
> root.lkupIsNewBusiness,
> root.lkupIsFiduciary,
> root.lkupIsNonPerforming,
> root.lkupIsInterGroup,
> root.lkupIsIntraGroup,
> root.lkupIsRediscounted,
> root.lkupIsCollateral,
> root.lkupIsExercised,
> root.lkupIsImpaired,
> root.facilityId,
> root.lkupIsOTC,
> root.lkupIsDefaulted,
> root.lkupIsSavingsPosition,
> root.lkupIsForborne,
> root.lkupIsDebtRestructuringLoan,
> root.interestRateAAR,
> root.interestRateAPRC,
> root.custom1,
> root.custom2,
> root.custom3,
> root.lkupSecuritisationType,
> root.lkupIsCashPooling,
> root.lkupIsEquityParticipationGTE10,
> root.lkupIsConvertible,
> root.lkupEconomicHedge,
> root.lkupIsNonCurrHeldForSale,
> root.lkupIsEmbeddedDerivative,
> root.lkupLoanPurpose,
> root.lkupRegulated,
> root.lkupRepaymentType,
> root.glAccount,
> root.lkupIsRecourse,
> root.lkupIsNotFullyGuaranteed,
> root.lkupImpairmentStage,
> root.lkupIsEntireAmountWrittenOff,
> root.lkupIsLowCreditRisk,
> root.lkupIsOBSWithinIFRS9,
> root.lkupIsUnderSpecialSurveillance,
> root.lkupProtection,
> root.lkupIsGeneralAllowance,
> root.lkupSectorUltimateRisk,
> root.cptyOrgUnitId,
> root.name,
> root.lkupNationality,
> root.lkupSize,
> root.lkupIsSPV,
> root.lkupIsCentralCounterparty,
> root.lkupIsMMRMFI,
> root.lkupIsKeyManagement,
> root.lkupIsOtherRelatedParty,
> root.lkupResidenceProvince,
> root.lkupIsTradingBook,
> root.entityHierarchy_entityId,
> root.entityHierarchy_Residence,
> root.lkupLocalCurrency,
> root.cpty_entityhierarchy_entityId,
> root.lkupRelationship,
> root.cpty_lkupRelationship,
> root.entityNationality,
> root.lkupRepCurrency,
> root.startDateFinancialYear,
> root.numEmployees,
> root.numEmployeesTotal,
> root.collateralAmount,
> root.guaranteeAmount,
> root.impairmentSpecificIndividual,
> root.impairmentSpecificCollective,
> root.impairmentGeneral,
> root.creditRiskAmount,
> root.provisionSpecificIndividual,
> root.provisionSpecificCollective,
> root.provisionGeneral,
> root.writeOffAmount,
> root.interest,
> root.fairValueAmount,
> root.grossCarryingAmount,
> root.carryingAmount,
> root.code,
> root.lkupInstrumentType,
> root.price,
> root.amountAtIssue,
> root.yield,
> root.totalFacilityAmount,
> root.facility_rate,
> root.spec_indiv_est,
> root.spec_coll_est,
> root.coll_inc_loss,
> root.impairment_amount,
> root.provision_amount,
> root.accumulated_impairment,
> root.exclusionFlag,
> root.lkupIsHoldingCompany,
> root.instrument_startDate,
> root.entityResidence,
> fxRate.enumerator,
> fxRate.lkupFromCurrency,
> fxRate.rate,
> fxRate.custom1,
> fxRate.custom2,
> fxRate.custom3,
> GB_position.lkupIsECGDGuaranteed,
> GB_position.lkupIsMultiAcctOffsetMortgage,
> GB_position.lkupIsIndexLinked,
> GB_position.lkupIsRetail,
> GB_position.lkupCollateralLocation,
> GB_position.percentAboveBBR,
> GB_position.lkupIsMoreInArrears,
> GB_position.lkupIsArrearsCapitalised,

[jira] [Commented] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception

2020-02-23 Thread Izek Greenfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042851#comment-17042851
 ] 

Izek Greenfield commented on SPARK-30332:
-

Code to reproduce the problem:

{code:scala}

import java.nio.file.{Files, Paths}

import org.apache.spark.sql.SparkSession

object Test {

  def main(args: Array[String]): Unit = {
val spark = {
  SparkSession
.builder()
.master("local[*]")
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.sql.cbo.enabled", "true")
.config("spark.scheduler.mode", "FAIR")
.config("spark.sql.crossJoin.enabled", "true")
.config("spark.sql.adaptive.enabled", "true")
.config("spark.sql.parquet.filterPushdown", "true")
.config("spark.sql.shuffle.partitions", "500")
.config("spark.executor.heartbeatInterval", "600s")
.config("spark.network.timeout", "1200s")
.config("spark.sql.broadcastTimeout", "1200s")
.config("spark.shuffle.file.buffer", "64k")
.appName("error")
.enableHiveSupport()
.getOrCreate()
}

val pathToCsvFiles = "db"
import scala.collection.JavaConverters._

Files.walk(Paths.get(pathToCsvFiles)).iterator().asScala.map(_.toFile).foreach { file =>
  if (!file.isDirectory){
val name = file.getName
spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load(file.getAbsolutePath)
  .createOrReplaceGlobalTempView(name.split("\\.").head)
  }
}

spark.sql(
  """
|SELECT  BT_capital.asof_date,
|BT_capital.run_id,
|BT_capital.v,
|BT_capital.id,
|BT_capital.entity,
|BT_capital.level_1,
|BT_capital.level_2,
|BT_capital.level_3,
|BT_capital.level_4,
|BT_capital.level_5,
|BT_capital.level_6,
|BT_capital.path_bt_capital,
|BT_capital.line_item,
|t0.target_line_item,
|t0.line_description,
|BT_capital.col_item,
|BT_capital.rep_amount,
|root.orgUnitId,
|root.cptyId,
|root.instId,
|root.startDate,
|root.maturityDate,
|root.amount,
|root.nominalAmount,
|root.quantity,
|root.lkupAssetLiability,
|root.lkupCurrency,
|root.lkupProdType,
|root.interestResetDate,
|root.interestResetTerm,
|root.noticePeriod,
|root.historicCostAmount,
|root.dueDate,
|root.lkupResidence,
|root.lkupCountryOfUltimateRisk,
|root.lkupSector,
|root.lkupIndustry,
|root.lkupAccountingPortfolioType,
|root.lkupLoanDepositTerm,
|root.lkupFixedFloating,
|root.lkupCollateralType,
|root.lkupRiskType,
|root.lkupEligibleRefinancing,
|root.lkupHedging,
|root.lkupIsOwnIssued,
|root.lkupIsSubordinated,
|root.lkupIsQuoted,
|root.lkupIsSecuritised,
|root.lkupIsSecuritisedServiced,
|root.lkupIsSyndicated,
|root.lkupIsDeRecognised,
|root.lkupIsRenegotiated,
|root.lkupIsTransferable,
|root.lkupIsNewBusiness,
|root.lkupIsFiduciary,
|root.lkupIsNonPerforming,
|root.lkupIsInterGroup,
|root.lkupIsIntraGroup,
|root.lkupIsRediscounted,
|root.lkupIsCollateral,
|root.lkupIsExercised,
|root.lkupIsImpaired,
|root.facilityId,
|root.lkupIsOTC,
|root.lkupIsDefaulted,
|root.lkupIsSavingsPosition,
|root.lkupIsForborne,
|root.lkupIsDebtRestructuringLoan,
|root.interestRateAAR,
|root.interestRateAPRC,
|root.custom1,
|root.custom2,
|root.custom3,
|root.lkupSecuritisationType,
|root.lkupIsCashPooling,
|root.lkupIsEquityParticipationGTE10,
|root.lkupIsConvertible,
|root.lkupEconomicHedge,
|root.lkupIsNonCurrHeldForSale,
|root.lkupIsEmbeddedDerivative,
|root.lkupLoanPurpose,
|root.lkupRegulated,
|root.lkupRepaymentType,
|root.glAccount,
|root.lkupIsRecourse,
|root.lkupIsNotFullyGuaranteed,
|root.lkupImpairmentStage,
|root.lkupIsEntireAmountWrittenOff,
|root.lkupIsLowCreditRisk,
|root.lkupIsOBSWithinIFRS9,
|root.lkupIsUnderSpecialSurveillance,
|root.lkupProtection,
|root.lkupIsGeneralAllowance,
|root.lkupSectorUltimateRisk,
|root.cptyOrgUnitId,
|root.name,
|root.lkupNationality,
|root.lkupSize,
|root.lkupIsSPV,
|root.lkupIsCentralCounterparty,
|root.lkupIsMMRMFI,
|root.lkupIsKeyManagement,
|root.lkupIsOtherRelatedParty,

[jira] [Created] (SPARK-30925) Overflow/round errors in conversions of milliseconds to/from microseconds

2020-02-23 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30925:
--

 Summary: Overflow/round errors in conversions of milliseconds 
to/from microseconds
 Key: SPARK-30925
 URL: https://issues.apache.org/jira/browse/SPARK-30925
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Spark has special methods in DateTimeUtils for converting microseconds from/to 
milliseconds - `fromMillis()` and `toMillis()`. The methods handle arithmetic 
overflow and round negative values. The ticket aims to review all places in 
Spark SQL where microseconds are converted from/to milliseconds, and replace 
them with the util methods from DateTimeUtils.
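
For reference, a minimal sketch (my own illustration; the actual DateTimeUtils method names and signatures may differ) of what an overflow-checked, floor-based conversion between milliseconds and microseconds looks like:

{code:scala}
// Sketch only, not the Spark implementation: convert between millis and micros
// without silent overflow and with negative values rounded toward negative infinity.
object MillisMicros {
  private val MicrosPerMillis = 1000L

  // Throws ArithmeticException instead of silently wrapping around on Long overflow.
  def millisToMicros(millis: Long): Long =
    Math.multiplyExact(millis, MicrosPerMillis)

  // floorDiv rounds toward negative infinity, so -1 microsecond maps to -1 millisecond
  // rather than 0, which is the "round negative values" concern mentioned above.
  def microsToMillis(micros: Long): Long =
    Math.floorDiv(micros, MicrosPerMillis)
}
{code}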



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30844) Static partition should also follow StoreAssignmentPolicy when insert into table

2020-02-23 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-30844.
--
Fix Version/s: 3.0.0
 Assignee: wuyi
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/27597]

> Static partition should also follow StoreAssignmentPolicy when insert into 
> table
> 
>
> Key: SPARK-30844
> URL: https://issues.apache.org/jira/browse/SPARK-30844
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Static partitions currently use a common cast regardless of the 
> StoreAssignmentPolicy. We should make them also follow the 
> StoreAssignmentPolicy.
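
A hypothetical illustration of the intended behavior (my own sketch, not taken from the ticket or the PR), assuming the ANSI store assignment policy:

{code:scala}
// With spark.sql.storeAssignmentPolicy=ANSI, the value supplied for a static
// partition should be validated the same way as values for ordinary columns.
spark.sql("SET spark.sql.storeAssignmentPolicy = ANSI")
spark.sql("CREATE TABLE t (i INT, p INT) USING parquet PARTITIONED BY (p)")

// An out-of-range static partition value should now fail the insert (as a regular
// column value would under ANSI) instead of being silently cast.
spark.sql("INSERT INTO t PARTITION (p = '10000000000') SELECT 1")
{code}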



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30822) Pyspark queries fail if terminated with a semicolon

2020-02-23 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-30822:
-
Flags:   (was: Patch)

> Pyspark queries fail if terminated with a semicolon
> ---
>
> Key: SPARK-30822
> URL: https://issues.apache.org/jira/browse/SPARK-30822
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Samuel Setegne
>Priority: Minor
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> When a user submits a directly executable SQL statement terminated with a 
> semicolon, they receive an 
> `org.apache.spark.sql.catalyst.parser.ParseException` of `mismatched input 
> ";"`. SQL-92 describes a direct SQL statement as having the format of 
> `<directly executable statement> <semicolon>`, and the majority of SQL 
> implementations either require the semicolon as a statement terminator or 
> make it optional (i.e. they do not raise an exception when it is included, 
> seemingly in recognition that it is a common behavior).
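
A hypothetical illustration of the behavior and a simple client-side workaround (my own sketch, not from the ticket; shown in Scala, but the PySpark call behaves the same way since both go through the same parser):

{code:scala}
// spark.sql("SELECT 1;")   // currently fails with ParseException: mismatched input ';'

// Workaround sketch: strip a single trailing semicolon before submitting the query.
val query = "SELECT 1;"
val result = spark.sql(query.trim.stripSuffix(";"))
result.show()
{code}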



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30822) Pyspark queries fail if terminated with a semicolon

2020-02-23 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-30822:
-
Labels:   (was: easyfix patch pull-request-available)

> Pyspark queries fail if terminated with a semicolon
> ---
>
> Key: SPARK-30822
> URL: https://issues.apache.org/jira/browse/SPARK-30822
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Samuel Setegne
>Priority: Minor
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> When a user submits a directly executable SQL statement terminated with a 
> semicolon, they receive an 
> `org.apache.spark.sql.catalyst.parser.ParseException` of `mismatched input 
> ";"`. SQL-92 describes a direct SQL statement as having the format of 
> `<directly executable statement> <semicolon>`, and the majority of SQL 
> implementations either require the semicolon as a statement terminator or 
> make it optional (i.e. they do not raise an exception when it is included, 
> seemingly in recognition that it is a common behavior).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org