[jira] [Created] (SPARK-23544) Remove repartition operation from join in the optimizer

2018-02-28 Thread caoxuewen (JIRA)
caoxuewen created SPARK-23544:
-

 Summary: Remove repartition operation from join in the optimizer
 Key: SPARK-23544
 URL: https://issues.apache.org/jira/browse/SPARK-23544
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: caoxuewen


Currently, when the children of a join are Repartition or 
RepartitionByExpression, the Repartition operation is not necessary. I think 
we can remove the Repartition operation in the Optimizer, and doing so is safe 
for the join operation. With this change, the explain output looks like:

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseRepartition ===

Input LogicalPlan:
Join Inner
:- Repartition 10, false
:  +- LocalRelation <empty>, [a#0, b#1]
+- Repartition 10, false
   +- LocalRelation <empty>, [c#2, d#3]

Output LogicalPlan:
Join Inner
:- LocalRelation <empty>, [a#0, b#1]
+- LocalRelation <empty>, [c#2, d#3]
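
For illustration, here is a minimal sketch of the kind of rewrite being proposed, written as a standalone Catalyst rule (the rule name is made up, the pattern shapes follow the Spark 2.4-era plan constructors, and in practice the change would more likely be folded into the existing CollapseRepartition rule):

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan, Repartition, RepartitionByExpression}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: drop a Repartition/RepartitionByExpression that sits directly
// under either side of a Join, on the assumption (made by this issue) that the
// extra shuffle is redundant for the join.
object RemoveRepartitionBeforeJoin extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case join: Join => join.mapChildren(stripRepartition)
  }

  private def stripRepartition(plan: LogicalPlan): LogicalPlan = plan match {
    case Repartition(_, _, child)             => child
    case RepartitionByExpression(_, child, _) => child
    case other                                => other
  }
}
{code}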

 
h3. And I have added a test case:


{code:scala}
val N = 2 << 20
runJoinBenchmark("sort merge join", N) {
  val df1 = sparkSession.range(N)
    .selectExpr(s"(id * 15485863) % ${N*10} as k1")
  val df2 = sparkSession.range(N)
    .selectExpr(s"(id * 15485867) % ${N*10} as k2")
  val df = df1.join(df2.repartition(20), col("k1") === col("k2"))
  assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined)
  df.count()
}
{code}

 

The performance test results are as follows:

{noformat}
Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1
Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
sort merge join:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------
sort merge join Repartition off        3520 / 4364          0.6       1678.5       1.0X
sort merge join Repartition on         1946 / 2203          1.1        927.9       1.8X
{noformat}






[jira] [Assigned] (SPARK-23543) Automatic Module creation fails in Java 9

2018-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23543:


Assignee: (was: Apache Spark)

> Automatic Module creation fails in Java 9
> -
>
> Key: SPARK-23543
> URL: https://issues.apache.org/jira/browse/SPARK-23543
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
> Environment: maven + jdk9 + project based on jdk9 module system
>Reporter: Brian D Chambers
>Priority: Major
>
> When adding Spark to a Java 9 project that is utilizing the new jdk9 module 
> system, Spark components cannot be used because the automatic module names 
> that are generated by the jdk9 fail if the artifact has digits in what would 
> become the beginning of an identifier.  The jdk cannot generate an automatic 
> name for the Spark module, resulting in Spark being unusable from within a 
> java module.
> This problem can also be validated/tested on the command line against any 
> Spark jar, e.g.
> {panel:title=jar --file=spark-graphx_2.11-2.3.0.jar --describe-module}
> Unable to derive module descriptor for: spark-graphx_2.11-2.3.0.jar 
> spark.graphx.2.11: Invalid module name: '2' is not a Java identifier
> {panel}
> Spark does not have to support jdk9 modules to fix this issue.  It just needs 
> to add a line of metadata to its manifest so the jdk can generate a valid 
> automatic name.
> The following would be sufficient to fix the issue in spark.graphx:
> {code:java}
> <plugin>
>   <groupId>org.apache.maven.plugins</groupId>
>   <artifactId>maven-jar-plugin</artifactId>
>   <configuration>
>     <archive>
>       <manifestEntries>
>         <Automatic-Module-Name>spark.graphx</Automatic-Module-Name>
>       </manifestEntries>
>     </archive>
>   </configuration>
> </plugin>
> {code}
>  






[jira] [Assigned] (SPARK-23543) Automatic Module creation fails in Java 9

2018-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23543:


Assignee: Apache Spark

> Automatic Module creation fails in Java 9
> -
>
> Key: SPARK-23543
> URL: https://issues.apache.org/jira/browse/SPARK-23543
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
> Environment: maven + jdk9 + project based on jdk9 module system
>Reporter: Brian D Chambers
>Assignee: Apache Spark
>Priority: Major
>
> When adding Spark to a Java 9 project that is utilizing the new jdk9 module 
> system, Spark components cannot be used because the automatic module names 
> that are generated by the jdk9 fail if the artifact has digits in what would 
> become the beginning of an identifier.  The jdk cannot generate an automatic 
> name for the Spark module, resulting in Spark being unusable from within a 
> java module.
> This problem can also be validated/tested on the command line against any 
> Spark jar, e.g.
> {panel:title=jar --file=spark-graphx_2.11-2.3.0.jar --describe-module}
> Unable to derive module descriptor for: spark-graphx_2.11-2.3.0.jar 
> spark.graphx.2.11: Invalid module name: '2' is not a Java identifier
> {panel}
> Spark does not have to support jdk9 modules to fix this issue.  It just needs 
> to add a line of metadata to its manifest so the jdk can generate a valid 
> automatic name.
> The following would be sufficient to fix the issue in spark.graphx:
> {code:java}
> <plugin>
>   <groupId>org.apache.maven.plugins</groupId>
>   <artifactId>maven-jar-plugin</artifactId>
>   <configuration>
>     <archive>
>       <manifestEntries>
>         <Automatic-Module-Name>spark.graphx</Automatic-Module-Name>
>       </manifestEntries>
>     </archive>
>   </configuration>
> </plugin>
> {code}
>  






[jira] [Assigned] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib

2018-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23437:


Assignee: Apache Spark

> [ML] Distributed Gaussian Process Regression for MLlib
> --
>
> Key: SPARK-23437
> URL: https://issues.apache.org/jira/browse/SPARK-23437
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.2.1
>Reporter: Valeriy Avanesov
>Assignee: Apache Spark
>Priority: Major
>
> Gaussian Process Regression (GP) is a well-known black-box non-linear 
> regression approach [1]. For years the approach remained inapplicable to 
> large samples due to its cubic computational complexity; however, more recent 
> techniques (sparse GP) bring the complexity down to linear. The field 
> continues to attract the interest of researchers – several papers devoted to 
> GP were presented at NIPS 2017. 
> Unfortunately, the non-parametric regression techniques shipped with MLlib 
> are restricted to tree-based approaches.
> I propose to create and include an implementation (which I am going to work 
> on) of the so-called robust Bayesian Committee Machine proposed and 
> investigated in [2].
> [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian 
> Processes for Machine Learning (Adaptive Computation and Machine Learning)_. 
> The MIT Press.
> [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian 
> processes. In _Proceedings of the 32nd International Conference on 
> International Conference on Machine Learning - Volume 37_ (ICML'15), Francis 
> Bach and David Blei (Eds.), Vol. 37. JMLR.org 1481-1490.
>  
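
For reviewers' reference, the robust Bayesian Committee Machine combines the per-expert GP posteriors roughly as follows (notation loosely follows [2]; sigma_{**}^2 is the prior variance of the latent function at a test point x_*, and expert k's posterior there is N(mu_k(x_*), sigma_k^2(x_*))):

{noformat}
(\sigma^{rbcm}_*)^{-2} = \sum_k \beta_k \, \sigma_k^{-2}(x_*) + \Big(1 - \sum_k \beta_k\Big)\, \sigma_{**}^{-2}

\mu^{rbcm}_* = (\sigma^{rbcm}_*)^{2} \, \sum_k \beta_k \, \sigma_k^{-2}(x_*)\, \mu_k(x_*)

\beta_k = \tfrac{1}{2}\,\big(\log \sigma_{**}^{2} - \log \sigma_k^{2}(x_*)\big)
{noformat}

Each expert is trained on its own shard of the data, which is what makes the approach attractive for a distributed implementation.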






[jira] [Commented] (SPARK-23266) Matrix Inversion on BlockMatrix

2018-02-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381604#comment-16381604
 ] 

Apache Spark commented on SPARK-23266:
--

Hello, which version will this issue be released in?

> Matrix Inversion on BlockMatrix
> ---
>
> Key: SPARK-23266
> URL: https://issues.apache.org/jira/browse/SPARK-23266
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Chandan Misra
>Priority: Minor
>
> Matrix inversion is a basic building block for many other algorithms such as 
> regression, classification, and geostatistical analysis using ordinary 
> kriging. An efficient distributed divide-and-conquer algorithm based on 
> Spark's BlockMatrix can be implemented using only *6* multiplications in 
> each recursion level of the algorithm. The reference paper can be found at
> [https://arxiv.org/abs/1801.04723]
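
For context, a sketch of the 2x2 blockwise (Schur complement) inversion that such a divide-and-conquer scheme builds on, written here with plain Breeze dense matrices rather than BlockMatrix (the function name and base-case cutoff are illustrative, not taken from the paper; a production version would also need to handle non-invertible leading blocks):

{code:scala}
import breeze.linalg.{DenseMatrix, inv}

// For M = [[A, B], [C, D]] with Schur complement S = D - C A^-1 B:
//   M^-1 = [[A^-1 + (A^-1 B) S^-1 (C A^-1),  -(A^-1 B) S^-1],
//           [-S^-1 (C A^-1),                  S^-1         ]]
// Each level needs only 6 block multiplications (numbered below) plus two recursive inversions.
def blockInverse(m: DenseMatrix[Double], cutoff: Int = 64): DenseMatrix[Double] = {
  val n = m.rows
  if (n <= cutoff) return inv(m)
  val h = n / 2
  val a = m(0 until h, 0 until h).copy
  val b = m(0 until h, h until n).copy
  val c = m(h until n, 0 until h).copy
  val d = m(h until n, h until n).copy

  val aInv      = blockInverse(a, cutoff)
  val aInvB     = aInv * b                    // 1
  val cAInv     = c * aInv                    // 2
  val s         = d - c * aInvB               // 3 (Schur complement)
  val sInv      = blockInverse(s, cutoff)
  val sInvCAInv = sInv * cAInv                // 4
  val aInvBSInv = aInvB * sInv                // 5
  val topLeft   = aInv + aInvB * sInvCAInv    // 6

  DenseMatrix.vertcat(
    DenseMatrix.horzcat(topLeft, -aInvBSInv),
    DenseMatrix.horzcat(-sInvCAInv, sInv))
}
{code}

A distributed version would perform the same recursion on BlockMatrix blocks, with the six multiplications executed as distributed BlockMatrix.multiply calls.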






[jira] [Assigned] (SPARK-23389) When the shuffle dependency specifies aggregation, and `dependency.mapSideCombine=false`, we should be able to use serialized sorting.

2018-02-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23389:
---

Assignee: liuxian

> When the shuffle dependency specifies aggregation, and 
> `dependency.mapSideCombine=false`, we should be able to use serialized 
> sorting.
> ---
>
> Key: SPARK-23389
> URL: https://issues.apache.org/jira/browse/SPARK-23389
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Major
> Fix For: 2.4.0
>
>
> When the shuffle dependency specifies aggregation and 
> `dependency.mapSideCombine=false`, there is no need for aggregation or 
> sorting on the map side, so we should be able to use serialized sorting.






[jira] [Resolved] (SPARK-23389) When the shuffle dependency specifies aggregation, and `dependency.mapSideCombine=false`, we should be able to use serialized sorting.

2018-02-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23389.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20576
[https://github.com/apache/spark/pull/20576]

> When the shuffle dependency specifies aggregation, and 
> `dependency.mapSideCombine=false`, we should be able to use serialized 
> sorting.
> ---
>
> Key: SPARK-23389
> URL: https://issues.apache.org/jira/browse/SPARK-23389
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Major
> Fix For: 2.4.0
>
>
> When the shuffle dependency specifies aggregation and 
> `dependency.mapSideCombine=false`, there is no need for aggregation or 
> sorting on the map side, so we should be able to use serialized sorting.
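
For context, the gate for the serialized (unsafe) shuffle path lives in SortShuffleManager.canUseSerializedShuffle. A hedged sketch of the kind of relaxation this issue asks for is below (shown as if inside Spark's own package, since the serializer flag is package-private; the exact shape of the fix merged in PR 20576 may differ):

{code:scala}
import org.apache.spark.ShuffleDependency

// Value assumed to match the packed-record-pointer partition-id limit (2^24) used by Spark.
val MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE: Int = 1 << 24

def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
  if (!dependency.serializer.supportsRelocationOfSerializedObjects) {
    false // the serializer must allow relocating serialized records
  } else if (dependency.mapSideCombine) {
    false // was: dependency.aggregator.isDefined, which this issue argues is too conservative
  } else if (dependency.partitioner.numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
    false // too many reduce partitions for the serialized path
  } else {
    true
  }
}
{code}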






[jira] [Updated] (SPARK-23542) The `where exists' action in optimized logical plan should be optimized

2018-02-28 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-23542:
--
Description: 
The optimized logical plan of query '*select * from tt1 where exists (select *  
from tt2  where tt1.i = tt2.i)*' is :
{code:java}
== Optimized Logical Plan ==
Join LeftSemi, (i#14 = i#16)
:- HiveTableRelation `default`.`tt1`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
+- Project [i#16]
+- HiveTableRelation `default`.`tt2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code}
 

But the query of `*select * from tt1 left semi join tt2 on tt2.i = tt1.i*` is :
{noformat}
== Optimized Logical Plan ==
Join LeftSemi, (i#22 = i#20)
:- Filter isnotnull(i#20)
: +- HiveTableRelation `default`.`tt1`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21]
+- Project [i#22]
+- HiveTableRelation `default`.`tt2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat}
 

 So I think the optimized logical plan of '*select * from tt1 where exists 
(select * from tt2 where tt1.i = tt2.i)*' should be further optimized.
{code:java}
== Optimized Logical Plan ==
Join LeftSemi, (i#14 = i#16)
:- Filter isnotnull(i#20)
: +- HiveTableRelation `default`.`tt1`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
+- Project [i#16]
+- HiveTableRelation `default`.`tt2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code}
 

  was:
The optimized logical plan of query 'select * from tt1 where exists (select *  
from tt2  where tt1.i = tt2.i)' is :

 
{code:java}
== Optimized Logical Plan ==
Join LeftSemi, (i#14 = i#16)
:- HiveTableRelation `default`.`tt1`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
+- Project [i#16]
+- HiveTableRelation `default`.`tt2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code}
But the query of `select * from tt1 left semi join tt2 on tt2.i = tt1.i` is :
{noformat}
== Optimized Logical Plan ==
Join LeftSemi, (i#22 = i#20)
:- Filter isnotnull(i#20)
: +- HiveTableRelation `default`.`tt1`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21]
+- Project [i#22]
+- HiveTableRelation `default`.`tt2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat}
 

 So I think the optimized logical plan of 'select * from tt1 where exists 
(select * from tt2 where tt1.i = tt2.i)' should be further optimized.

 
{code:java}
== Optimized Logical Plan ==
Join LeftSemi, (i#14 = i#16)
:- Filter isnotnull(i#20)
: +- HiveTableRelation `default`.`tt1`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
+- Project [i#16]
+- HiveTableRelation `default`.`tt2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code}
 


> The `where exists' action in optimized logical plan should be optimized 
> 
>
> Key: SPARK-23542
> URL: https://issues.apache.org/jira/browse/SPARK-23542
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: KaiXinXIaoLei
>Priority: Major
>
> The optimized logical plan of query '*select * from tt1 where exists (select 
> *  from tt2  where tt1.i = tt2.i)*' is :
> {code:java}
> == Optimized Logical Plan ==
> Join LeftSemi, (i#14 = i#16)
> :- HiveTableRelation `default`.`tt1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
> +- Project [i#16]
> +- HiveTableRelation `default`.`tt2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code}
>  
> But the query of `*select * from tt1 left semi join tt2 on tt2.i = tt1.i*` is 
> :
> {noformat}
> == Optimized Logical Plan ==
> Join LeftSemi, (i#22 = i#20)
> :- Filter isnotnull(i#20)
> : +- HiveTableRelation `default`.`tt1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21]
> +- Project [i#22]
> +- HiveTableRelation `default`.`tt2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat}
>  
>  So I think the optimized logical plan of '*select * from tt1 where exists 
> (select * from tt2 where tt1.i = tt2.i)*' should be further optimized.
> {code:java}
> == Optimized Logical Plan ==
> Join LeftSemi, (i#14 = i#16)
> :- Filter isnotnull(i#20)
> : +- HiveTableRelation `default`.`tt1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
> +- Project [i#16]
> +- HiveTableRelation `default`.`tt2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code}
>  






[jira] [Updated] (SPARK-23543) Automatic Module creation fails in Java 9

2018-02-28 Thread Brian D Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian D Chambers updated SPARK-23543:
-
Description: 
When adding Spark to a Java 9 project that is utilizing the new jdk9 module 
system, Spark components cannot be used because the automatic module names that 
are generated by the jdk9 fail if the artifact has digits in what would become 
the beginning of an identifier.  The jdk cannot generate an automatic name for 
the Spark module, resulting in Spark being unusable from within a java module.

This problem can also be validated/tested on the command line against any Spark 
jar, e.g.
{panel:title=jar --file=spark-graphx_2.11-2.3.0.jar --describe-module}
Unable to derive module descriptor for: spark-graphx_2.11-2.3.0.jar 
spark.graphx.2.11: Invalid module name: '2' is not a Java identifier
{panel}
Spark does not have to support jdk9 modules to fix this issue.  It just needs 
to add a line of metadata to its manifest so the jdk can generate a valid 
automatic name.

The following would be sufficient to fix the issue in spark.graphx:
{code:java}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <configuration>
    <archive>
      <manifestEntries>
        <Automatic-Module-Name>spark.graphx</Automatic-Module-Name>
      </manifestEntries>
    </archive>
  </configuration>
</plugin>
{code}
 

  was:
When adding Spark to a Java 9 project that is utilizing the new jdk9 module 
system, Spark components cannot be used because the automatic module names that 
are generated by the jdk9 are based on the JAR.  When the JAR has chars not 
recognized by jdk9 as valid for identifiers (in this case, digits) at the start 
of a part of the token, the jdk9 fails to autogenerate a module name.

This problem can also be validated/tested on the command line against any Spark 
jar, e.g.
{panel:title=jar --file=spark-graphx_2.11-2.3.0.jar --describe-module}
Unable to derive module descriptor for: spark-graphx_2.11-2.3.0.jar 
spark.graphx.2.11: Invalid module name: '2' is not a Java identifier
{panel}
Spark does not have to support jdk9 modules to fix this issue.  It just needs 
to add a line of metadata to its manifest so the jdk can generate a valid 
automatic name.

The following would be sufficient to fix the issue in spark.graphx:
{code:java}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <configuration>
    <archive>
      <manifestEntries>
        <Automatic-Module-Name>spark.graphx</Automatic-Module-Name>
      </manifestEntries>
    </archive>
  </configuration>
</plugin>
{code}
 


> Automatic Module creation fails in Java 9
> -
>
> Key: SPARK-23543
> URL: https://issues.apache.org/jira/browse/SPARK-23543
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
> Environment: maven + jdk9 + project based on jdk9 module system
>Reporter: Brian D Chambers
>Priority: Major
>
> When adding Spark to a Java 9 project that is utilizing the new jdk9 module 
> system, Spark components cannot be used because the automatic module names 
> that are generated by the jdk9 fail if the artifact has digits in what would 
> become the beginning of an identifier.  The jdk cannot generate an automatic 
> name for the Spark module, resulting in Spark being unusable from within a 
> java module.
> This problem can also be validated/tested on the command line against any 
> Spark jar, e.g.
> {panel:title=jar --file=spark-graphx_2.11-2.3.0.jar --describe-module}
> Unable to derive module descriptor for: spark-graphx_2.11-2.3.0.jar 
> spark.graphx.2.11: Invalid module name: '2' is not a Java identifier
> {panel}
> Spark does not have to support jdk9 modules to fix this issue.  It just needs 
> to add a line of metadata to its manifest so the jdk can generate a valid 
> automatic name.
> The following would be sufficient to fix the issue in spark.graphx:
> {code:java}
> <plugin>
>   <groupId>org.apache.maven.plugins</groupId>
>   <artifactId>maven-jar-plugin</artifactId>
>   <configuration>
>     <archive>
>       <manifestEntries>
>         <Automatic-Module-Name>spark.graphx</Automatic-Module-Name>
>       </manifestEntries>
>     </archive>
>   </configuration>
> </plugin>
> {code}
>  






[jira] [Created] (SPARK-23543) Automatic Module creation fails in Java 9

2018-02-28 Thread Brian D Chambers (JIRA)
Brian D Chambers created SPARK-23543:


 Summary: Automatic Module creation fails in Java 9
 Key: SPARK-23543
 URL: https://issues.apache.org/jira/browse/SPARK-23543
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.3.0
 Environment: maven + jdk9 + project based on jdk9 module system
Reporter: Brian D Chambers


When adding Spark to a Java 9 project that is utilizing the new jdk9 module 
system, Spark components cannot be used because the automatic module names that 
are generated by the jdk9 are based on the JAR.  When the JAR has chars not 
recognized by jdk9 as valid for identifiers (in this case, digits) at the start 
of a part of the token, the jdk9 fails to autogenerate a module name.

This problem can also be validated/tested on the command line against any Spark 
jar, e.g.
{panel:title=jar --file=spark-graphx_2.11-2.3.0.jar --describe-module}
Unable to derive module descriptor for: spark-graphx_2.11-2.3.0.jar 
spark.graphx.2.11: Invalid module name: '2' is not a Java identifier
{panel}
Spark does not have to support jdk9 modules to fix this issue.  It just needs 
to add a line of metadata to its manifest so the jdk can generate a valid 
automatic name.

The following would be sufficient to fix the issue in spark.graphx:
{code:java}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <configuration>
    <archive>
      <manifestEntries>
        <Automatic-Module-Name>spark.graphx</Automatic-Module-Name>
      </manifestEntries>
    </archive>
  </configuration>
</plugin>
{code}
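
For illustration, once such a manifest entry is present, a JDK 9 consumer can require the resulting automatic module by that stable name (the consumer module and package names below are hypothetical):

{code:java}
// module-info.java of a hypothetical downstream application
module com.example.analytics {
    // resolves against the Automatic-Module-Name declared in the Spark jar's manifest
    requires spark.graphx;
}
{code}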
 






[jira] [Resolved] (SPARK-23493) insert-into depends on columns order, otherwise incorrect data inserted

2018-02-28 Thread Xiaoju Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoju Wu resolved SPARK-23493.
---
Resolution: Not A Bug

> insert-into depends on columns order, otherwise incorrect data inserted
> ---
>
> Key: SPARK-23493
> URL: https://issues.apache.org/jira/browse/SPARK-23493
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Xiaoju Wu
>Priority: Minor
>
> insertInto only works when the partitionBy key columns are placed last:
> val data = Seq(
>  (7, "test1", 1.0),
>  (8, "test#test", 0.0),
>  (9, "test3", 0.0)
>  )
>  import spark.implicits._
> val table = "default.tbl"
>  spark
>  .createDataset(data)
>  .toDF("col1", "col2", "col3")
>  .write
>  .partitionBy("col1")
>  .saveAsTable(table)
> val data2 = Seq(
>  (7, "test2", 1.0),
>  (8, "test#test", 0.0),
>  (9, "test3", 0.0)
>  )
>  spark
>  .createDataset(data2)
>  .toDF("col1", "col2", "col3")
>   .write
>  .insertInto(table)
> sql("select * from " + table).show()
> +---------+----+----+
> |     col2|col3|col1|
> +---------+----+----+
> |test#test| 0.0|   8|
> |    test1| 1.0|   7|
> |    test3| 0.0|   9|
> |        8|null|   0|
> |        9|null|   0|
> |        7|null|   1|
> +---------+----+----+
>  
> If you try inserting with sql, the issue is the same.
> val data = Seq(
>  (7, "test1", 1.0),
>  (8, "test#test", 0.0),
>  (9, "test3", 0.0)
>  )
>  import spark.implicits._
> val table = "default.tbl"
>  spark
>  .createDataset(data)
>  .toDF("col1", "col2", "col3")
>  .write
>  .partitionBy("col1")
>  .saveAsTable(table)
> val data2 = Seq(
>  (7, "test2", 1.0),
>  (8, "test#test", 0.0),
>  (9, "test3", 0.0)
>  )
> sql("insert into " + table + " values(7,'test2',1.0)")
>  sql("select * from " + table).show()
> +---------+----+----+
> |     col2|col3|col1|
> +---------+----+----+
> |test#test| 0.0|   8|
> |    test1| 1.0|   7|
> |    test3| 0.0|   9|
> |        7|null|   1|
> +---------+----+----+
> No exception was thrown since I only ran insertInto, not together with 
> partitionBy. The data is inserted incorrectly. The issue is related to 
> column order. If I change to partitionBy col3, which is the last column, it 
> works.
> val data = Seq(
>  (7, "test1", 1.0),
>  (8, "test#test", 0.0),
>  (9, "test3", 0.0)
>  )
>  import spark.implicits._
> val table = "default.tbl"
>  spark
>  .createDataset(data)
>  .toDF("col1", "col2", "col3")
>  .write
>  .partitionBy("col3")
>  .saveAsTable(table)
> val data2 = Seq(
>  (7, "test2", 1.0),
>  (8, "test#test", 0.0),
>  (9, "test3", 0.0)
>  )
> spark
>  .createDataset(data2)
>  .toDF("col1", "col2", "col3")
>  .write
>  .insertInto(table)
> sql("select * from " + table).show()
> +----+---------+----+
> |col1|     col2|col3|
> +----+---------+----+
> |   8|test#test| 0.0|
> |   9|    test3| 0.0|
> |   8|test#test| 0.0|
> |   9|    test3| 0.0|
> |   7|    test1| 1.0|
> |   7|    test2| 1.0|
> +----+---------+----+
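
For reference, and in line with the Not A Bug resolution: DataFrameWriter.insertInto resolves columns by position, not by name. A minimal sketch of making the insert order-safe by reordering the DataFrame to the target table's column order first (spark, table, and the DataFrame df refer to the names used in the example above; the DataFrame's column names are assumed to match the table's):

{code:scala}
import org.apache.spark.sql.functions.col

// Reorder the DataFrame's columns to the table's schema so the position-based insert lines up.
val tableCols = spark.table(table).columns
df.select(tableCols.map(col): _*)
  .write
  .insertInto(table)
{code}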






[jira] [Resolved] (SPARK-23540) The `where exists' action in optimized logical plan should be optimized

2018-02-28 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei resolved SPARK-23540.
---
Resolution: Duplicate

> The `where exists' action in optimized logical plan should be optimized 
> 
>
> Key: SPARK-23540
> URL: https://issues.apache.org/jira/browse/SPARK-23540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: KaiXinXIaoLei
>Priority: Major
>
> The optimized logical plan of query 'select * from tt1 where exists (select * 
>  from tt2  where tt1.i = tt2.i);` is :
> >== Optimized Logical Plan ==
> >Join LeftSemi, (i#143 = i#145)
> >:- MetastoreRelation default, tt1
> >+- MetastoreRelation default, tt2
> But the query of `select * from tt1 left semi join tt2 on tt2.i = tt1.i` is :
> >== Optimized Logical Plan ==
>  Join LeftSemi, (i#152 = i#150)
>  :- Filter isnotnull(i#150)
>  : +- MetastoreRelation default, tt1
>  +- Project [i#152|#152]
>  +- MetastoreRelation default, tt2
>  
>  So I think the optimized logical plan of 'select * from tt1 where exists 
> (select * from tt2 where tt1.i = tt2.i)' should be further optimized.
>  
> == Optimized Logical Plan ==
>  Join LeftSemi, (i#143 = i#145)
>  :- MetastoreRelation default, tt1
>  +- MetastoreRelation default, tt2






[jira] [Updated] (SPARK-23542) The `where exists' action in optimized logical plan should be optimized

2018-02-28 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-23542:
--
Description: 
The optimized logical plan of query 'select * from tt1 where exists (select *  
from tt2  where tt1.i = tt2.i)' is :

 
{code:java}
== Optimized Logical Plan ==
Join LeftSemi, (i#14 = i#16)
:- HiveTableRelation `default`.`tt1`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
+- Project [i#16]
+- HiveTableRelation `default`.`tt2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code}
But the query of `select * from tt1 left semi join tt2 on tt2.i = tt1.i` is :
{noformat}
== Optimized Logical Plan ==
Join LeftSemi, (i#22 = i#20)
:- Filter isnotnull(i#20)
: +- HiveTableRelation `default`.`tt1`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21]
+- Project [i#22]
+- HiveTableRelation `default`.`tt2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat}
 

 So I think the optimized logical plan of 'select * from tt1 where exists 
(select * from tt2 where tt1.i = tt2.i)' should be further optimized.

 
{code:java}
== Optimized Logical Plan ==
Join LeftSemi, (i#14 = i#16)
:- Filter isnotnull(i#20)
: +- HiveTableRelation `default`.`tt1`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
+- Project [i#16]
+- HiveTableRelation `default`.`tt2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code}
 

  was:
The optimized logical plan of query 'select * from tt1 where exists (select *  
from tt2  where tt1.i = tt2.i)' is :

 
{noformat}
== Optimized Logical Plan ==
Join LeftSemi, (i#22 = i#20)
:- Filter isnotnull(i#20)
: +- HiveTableRelation `default`.`tt1`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21]
+- Project [i#22]
+- HiveTableRelation `default`.`tt2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat}
 

>== Optimized Logical Plan ==
 >Join LeftSemi, (i#143 = i#145)
 >:- MetastoreRelation default, tt1
 >+- MetastoreRelation default, tt2

But the query of `select * from tt1 left semi join tt2 on tt2.i = tt1.i` is :
{code:java}
== Optimized Logical Plan ==
Join LeftSemi, (i#22 = i#20)
:- Filter isnotnull(i#20)
: +- HiveTableRelation `default`.`tt1`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21]
+- Project [i#22]
+- HiveTableRelation `default`.`tt2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]
{code}
 

 So I think the optimized logical plan of 'select * from tt1 where exists 
(select * from tt2 where tt1.i = tt2.i)' should be further optimized.

 

 


> The `where exists' action in optimized logical plan should be optimized 
> 
>
> Key: SPARK-23542
> URL: https://issues.apache.org/jira/browse/SPARK-23542
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: KaiXinXIaoLei
>Priority: Major
>
> The optimized logical plan of query 'select * from tt1 where exists (select * 
>  from tt2  where tt1.i = tt2.i)' is :
>  
> {code:java}
> == Optimized Logical Plan ==
> Join LeftSemi, (i#14 = i#16)
> :- HiveTableRelation `default`.`tt1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
> +- Project [i#16]
> +- HiveTableRelation `default`.`tt2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code}
> But the query of `select * from tt1 left semi join tt2 on tt2.i = tt1.i` is :
> {noformat}
> == Optimized Logical Plan ==
> Join LeftSemi, (i#22 = i#20)
> :- Filter isnotnull(i#20)
> : +- HiveTableRelation `default`.`tt1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21]
> +- Project [i#22]
> +- HiveTableRelation `default`.`tt2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat}
>  
>  So I think the optimized logical plan of 'select * from tt1 where exists 
> (select * from tt2 where tt1.i = tt2.i)' should be further optimized.
>  
> {code:java}
> == Optimized Logical Plan ==
> Join LeftSemi, (i#14 = i#16)
> :- Filter isnotnull(i#20)
> : +- HiveTableRelation `default`.`tt1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
> +- Project [i#16]
> +- HiveTableRelation `default`.`tt2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code}
>  






[jira] [Updated] (SPARK-23542) The `where exists' action in optimized logical plan should be optimized

2018-02-28 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-23542:
--
Description: 
The optimized logical plan of query 'select * from tt1 where exists (select *  
from tt2  where tt1.i = tt2.i)' is :

 
{noformat}
== Optimized Logical Plan ==
Join LeftSemi, (i#22 = i#20)
:- Filter isnotnull(i#20)
: +- HiveTableRelation `default`.`tt1`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21]
+- Project [i#22]
+- HiveTableRelation `default`.`tt2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat}
 

>== Optimized Logical Plan ==
 >Join LeftSemi, (i#143 = i#145)
 >:- MetastoreRelation default, tt1
 >+- MetastoreRelation default, tt2

But the query of `select * from tt1 left semi join tt2 on tt2.i = tt1.i` is :
{code:java}
== Optimized Logical Plan ==
Join LeftSemi, (i#22 = i#20)
:- Filter isnotnull(i#20)
: +- HiveTableRelation `default`.`tt1`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21]
+- Project [i#22]
+- HiveTableRelation `default`.`tt2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]
{code}
 

 So I think the optimized logical plan of 'select * from tt1 where exists 
(select * from tt2 where tt1.i = tt2.i)' should be further optimized.

 

 

  was:
The optimized logical plan of query 'select * from tt1 where exists (select *  
from tt2  where tt1.i = tt2.i);` is :

>== Optimized Logical Plan ==
 >Join LeftSemi, (i#143 = i#145)
 >:- MetastoreRelation default, tt1
 >+- MetastoreRelation default, tt2

But the query of `select * from tt1 left semi join tt2 on tt2.i = tt1.i` is :

>== Optimized Logical Plan ==
 Join LeftSemi, (i#152 = i#150)
 :- Filter isnotnull(i#150)
 : +- MetastoreRelation default, tt1
 +- Project [i#152|#152]
 +- MetastoreRelation default, tt2

 

 So I think the optimized logical plan of 'select * from tt1 where exists 
(select * from tt2 where tt1.i = tt2.i)' should be further optimized.

 

 


> The `where exists' action in optimized logical plan should be optimized 
> 
>
> Key: SPARK-23542
> URL: https://issues.apache.org/jira/browse/SPARK-23542
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: KaiXinXIaoLei
>Priority: Major
>
> The optimized logical plan of query 'select * from tt1 where exists (select * 
>  from tt2  where tt1.i = tt2.i)' is :
>  
> {noformat}
> == Optimized Logical Plan ==
> Join LeftSemi, (i#22 = i#20)
> :- Filter isnotnull(i#20)
> : +- HiveTableRelation `default`.`tt1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21]
> +- Project [i#22]
> +- HiveTableRelation `default`.`tt2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat}
>  
> >== Optimized Logical Plan ==
>  >Join LeftSemi, (i#143 = i#145)
>  >:- MetastoreRelation default, tt1
>  >+- MetastoreRelation default, tt2
> But the query of `select * from tt1 left semi join tt2 on tt2.i = tt1.i` is :
> {code:java}
> == Optimized Logical Plan ==
> Join LeftSemi, (i#22 = i#20)
> :- Filter isnotnull(i#20)
> : +- HiveTableRelation `default`.`tt1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21]
> +- Project [i#22]
> +- HiveTableRelation `default`.`tt2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]
> {code}
>  
>  So I think the optimized logical plan of 'select * from tt1 where exists 
> (select * from tt2 where tt1.i = tt2.i)' should be further optimized.
>  
>  






[jira] [Commented] (SPARK-23526) KafkaMicroBatchV2SourceSuite.ensure stream-stream self-join generates only one offset in offset log

2018-02-28 Thread Gabor Somogyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381408#comment-16381408
 ] 

Gabor Somogyi commented on SPARK-23526:
---

I've checked the code and the problem is similar to the one in SPARK-19185.
 I would like to add a workaround for the mentioned ticket, similar to what 
[~mgrover_impala_3a38] previously provided, and close this one.
 [~cloud_fan] [~zsxwing] [~vanzin] what do you think?

 

> KafkaMicroBatchV2SourceSuite.ensure stream-stream self-join generates only 
> one offset in offset log
> ---
>
> Key: SPARK-23526
> URL: https://issues.apache.org/jira/browse/SPARK-23526
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>  Labels: flaky-test
>
> See it failed in PR builder with error message:
> {code:java}
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 676a8b08-c89b-450b-8cd8-fbf9868cd240, runId = 
> 46bb7aae-138b-420d-9b4f-44f42a2a4a0f] terminated with exception: Job aborted 
> due to stage failure: Task 0 in stage 163.0 failed 1 times, most recent 
> failure: Lost task 0.0 in stage 163.0 (TID 799, localhost, executor driver): 
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) 
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:305)
>  at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:216)
>  at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:122)
>  at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106)
>  at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>  at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68)
>  at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106)
>  at 
> org.apache.spark.sql.kafka010.KafkaMicroBatchDataReader.next(KafkaMicroBatchReader.scala:353)
>  at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:49)
>  at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:616)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>  at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:109) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745){code}






[jira] [Updated] (SPARK-23542) The `where exists' action in optimized logical plan should be optimized

2018-02-28 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-23542:
--
Description: 
The optimized logical plan of query 'select * from tt1 where exists (select *  
from tt2  where tt1.i = tt2.i);` is :

>== Optimized Logical Plan ==
 >Join LeftSemi, (i#143 = i#145)
 >:- MetastoreRelation default, tt1
 >+- MetastoreRelation default, tt2

But the query of `select * from tt1 left semi join tt2 on tt2.i = tt1.i` is :

>== Optimized Logical Plan ==
 Join LeftSemi, (i#152 = i#150)
 :- Filter isnotnull(i#150)
 : +- MetastoreRelation default, tt1
 +- Project [i#152|#152]
 +- MetastoreRelation default, tt2

 

 So I think the optimized logical plan of 'select * from tt1 where exists 
(select * from tt2 where tt1.i = tt2.i)' should be further optimized.

 

 

  was:
The optimized logical plan of query 'select * from tt1 where exists (select *  
from tt2  where tt1.i = tt2.i);` is :

>== Optimized Logical Plan ==
>Join LeftSemi, (i#143 = i#145)
>:- MetastoreRelation default, tt1
>+- MetastoreRelation default, tt2

But the query of `select * from tt1 left semi join tt2 on tt2.i = tt1.i` is :

>== Optimized Logical Plan ==
 Join LeftSemi, (i#152 = i#150)
 :- Filter isnotnull(i#150)
 : +- MetastoreRelation default, tt1
 +- Project [i#152|#152]
 +- MetastoreRelation default, tt2

 

 So I think the optimized logical plan of 'select * from tt1 where exists 
(select * from tt2 where tt1.i = tt2.i)' should be further optimized.

 

== Optimized Logical Plan ==
 Join LeftSemi, (i#143 = i#145)
 :- MetastoreRelation default, tt1
 +- MetastoreRelation default, tt2


> The `where exists' action in optimized logical plan should be optimized 
> 
>
> Key: SPARK-23542
> URL: https://issues.apache.org/jira/browse/SPARK-23542
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: KaiXinXIaoLei
>Priority: Major
>
> The optimized logical plan of query 'select * from tt1 where exists (select * 
>  from tt2  where tt1.i = tt2.i);` is :
> >== Optimized Logical Plan ==
>  >Join LeftSemi, (i#143 = i#145)
>  >:- MetastoreRelation default, tt1
>  >+- MetastoreRelation default, tt2
> But the query of `select * from tt1 left semi join tt2 on tt2.i = tt1.i` is :
> >== Optimized Logical Plan ==
>  Join LeftSemi, (i#152 = i#150)
>  :- Filter isnotnull(i#150)
>  : +- MetastoreRelation default, tt1
>  +- Project [i#152|#152]
>  +- MetastoreRelation default, tt2
>  
>  So I think the optimized logical plan of 'select * from tt1 where exists 
> (select * from tt2 where tt1.i = tt2.i)' should be further optimized.
>  
>  






[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2018-02-28 Thread Thilak Raj Balasubramanian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381396#comment-16381396
 ] 

Thilak Raj Balasubramanian commented on SPARK-21918:


[~huLiu] This is a very important feature, and we are waiting for it.

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>Priority: Major
>
> I'm testing the Spark Thrift Server and found that all the DDL statements are 
> run by user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in 
> HiveClientImpl:
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should only share the Hive object within a 
> thread so that the metastore client in Hive can be associated with the right 
> user.
> We can pass the Hive object of the parent thread to the child thread when 
> running the SQL to fix it.
> I already have an initial patch for review and would be glad to work on it if 
> anyone could assign it to me.
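
For illustration only (a sketch, not the actual patch): one shape such a fix could take is to cache the Hive handle per thread, e.g. with an inheritable thread-local, so threads spawned for a session inherit their parent's handle instead of a single JVM-wide one:

{code:scala}
import org.apache.hadoop.hive.ql.metadata.Hive

// Sketch: a per-thread Hive handle instead of clientLoader.cachedHive shared across threads.
private val threadLocalHive = new InheritableThreadLocal[Hive]()

private def client: Hive = Option(threadLocalHive.get()).getOrElse {
  val c = Hive.get(conf) // ties the metastore client to the current thread's user/UGI
  threadLocalHive.set(c)
  c
}
{code}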






[jira] [Created] (SPARK-23542) The `where exists' action in optimized logical plan should be optimized

2018-02-28 Thread KaiXinXIaoLei (JIRA)
KaiXinXIaoLei created SPARK-23542:
-

 Summary: The `where exists' action in optimized logical plan 
should be optimized 
 Key: SPARK-23542
 URL: https://issues.apache.org/jira/browse/SPARK-23542
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: KaiXinXIaoLei


The optimized logical plan of query 'select * from tt1 where exists (select *  
from tt2  where tt1.i = tt2.i);` is :

>== Optimized Logical Plan ==
>Join LeftSemi, (i#143 = i#145)
>:- MetastoreRelation default, tt1
>+- MetastoreRelation default, tt2

But the query of `select * from tt1 left semi join tt2 on tt2.i = tt1.i` is :

>== Optimized Logical Plan ==
 Join LeftSemi, (i#152 = i#150)
 :- Filter isnotnull(i#150)
 : +- MetastoreRelation default, tt1
 +- Project [i#152|#152]
 +- MetastoreRelation default, tt2

 

 So I think the optimized logical plan of 'select * from tt1 where exists 
(select * from tt2 where tt1.i = tt2.i)' should be further optimized.

 

== Optimized Logical Plan ==
 Join LeftSemi, (i#143 = i#145)
 :- MetastoreRelation default, tt1
 +- MetastoreRelation default, tt2
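
A minimal way to reproduce and compare the two plans (assumes a Hive-enabled SparkSession and that the two test tables from the report already exist):

{code:scala}
// Compare the optimized plan of the EXISTS form with the explicit LEFT SEMI JOIN.
spark.sql("SELECT * FROM tt1 WHERE EXISTS (SELECT * FROM tt2 WHERE tt1.i = tt2.i)").explain(true)
spark.sql("SELECT * FROM tt1 LEFT SEMI JOIN tt2 ON tt2.i = tt1.i").explain(true)
{code}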






[jira] [Assigned] (SPARK-23541) Allow Kafka source to read data with greater parallelism than the number of topic-partitions

2018-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23541:


Assignee: Tathagata Das  (was: Apache Spark)

> Allow Kafka source to read data with greater parallelism than the number of 
> topic-partitions
> 
>
> Key: SPARK-23541
> URL: https://issues.apache.org/jira/browse/SPARK-23541
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>
> Currently, when the Kafka source reads from Kafka, it generates as many tasks 
> as the number of partitions in the topic(s) to be read. In some cases, it may 
> be beneficial to read the data with greater parallelism, that is, with a 
> larger number of partitions/tasks. That means offset ranges must be divided 
> up into smaller ranges such that the number of records per partition ~= total 
> records in batch / desired partitions. This would also balance out any data 
> skews between topic-partitions.






[jira] [Commented] (SPARK-23541) Allow Kafka source to read data with greater parallelism than the number of topic-partitions

2018-02-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381363#comment-16381363
 ] 

Apache Spark commented on SPARK-23541:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/20698

> Allow Kafka source to read data with greater parallelism than the number of 
> topic-partitions
> 
>
> Key: SPARK-23541
> URL: https://issues.apache.org/jira/browse/SPARK-23541
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>
> Currently, when the Kafka source reads from Kafka, it generates as many tasks 
> as the number of partitions in the topic(s) to be read. In some cases, it may 
> be beneficial to read the data with greater parallelism, that is, with a 
> larger number of partitions/tasks. That means offset ranges must be divided 
> up into smaller ranges such that the number of records per partition ~= total 
> records in batch / desired partitions. This would also balance out any data 
> skews between topic-partitions.






[jira] [Assigned] (SPARK-23541) Allow Kafka source to read data with greater parallelism than the number of topic-partitions

2018-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23541:


Assignee: Apache Spark  (was: Tathagata Das)

> Allow Kafka source to read data with greater parallelism than the number of 
> topic-partitions
> 
>
> Key: SPARK-23541
> URL: https://issues.apache.org/jira/browse/SPARK-23541
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Major
>
> Currently, when the Kafka source reads from Kafka, it generates as many tasks 
> as the number of partitions in the topic(s) to be read. In some cases, it may 
> be beneficial to read the data with greater parallelism, that is, with a 
> larger number of partitions/tasks. That means offset ranges must be divided 
> up into smaller ranges such that the number of records per partition ~= total 
> records in batch / desired partitions. This would also balance out any data 
> skews between topic-partitions.






[jira] [Created] (SPARK-23541) Allow Kafka source to read data with greater parallelism than the number of topic-partitions

2018-02-28 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-23541:
-

 Summary: Allow Kafka source to read data with greater parallelism 
than the number of topic-partitions
 Key: SPARK-23541
 URL: https://issues.apache.org/jira/browse/SPARK-23541
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Tathagata Das
Assignee: Tathagata Das


Currently, when the Kafka source reads from Kafka, it generates as many tasks 
as the number of partitions in the topic(s) to be read. In some cases, it may be 
beneficial to read the data with greater parallelism, that is, with a larger 
number of partitions/tasks. That means offset ranges must be divided up into 
smaller ranges such that the number of records per partition ~= total records in 
batch / desired partitions. This would also balance out any data skews between 
topic-partitions.
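
For illustration only (not the actual Kafka source implementation), splitting one topic-partition's offset range into roughly equal sub-ranges could look like the following; the OffsetRange case class and split helper are hypothetical names:

{code:scala}
case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

// Split [fromOffset, untilOffset) into at most `desiredParts` contiguous, non-overlapping sub-ranges.
def split(range: OffsetRange, desiredParts: Int): Seq[OffsetRange] = {
  val total = range.untilOffset - range.fromOffset
  val parts = math.max(1L, math.min(desiredParts.toLong, total)).toInt
  (0 until parts).map { i =>
    val start = range.fromOffset + total * i / parts
    val end   = range.fromOffset + total * (i + 1) / parts
    range.copy(fromOffset = start, untilOffset = end)
  }
}
{code}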






[jira] [Created] (SPARK-23540) The `where exists' action in optimized logical plan should be optimized

2018-02-28 Thread KaiXinXIaoLei (JIRA)
KaiXinXIaoLei created SPARK-23540:
-

 Summary: The `where exists' action in optimized logical plan 
should be optimized 
 Key: SPARK-23540
 URL: https://issues.apache.org/jira/browse/SPARK-23540
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: KaiXinXIaoLei


The optimized logical plan of query 'select * from tt1 where exists (select *  
from tt2  where tt1.i = tt2.i);` is :

>== Optimized Logical Plan ==
>Join LeftSemi, (i#143 = i#145)
>:- MetastoreRelation default, tt1
>+- MetastoreRelation default, tt2

But the query of `select * from tt1 left semi join tt2 on tt2.i = tt1.i` is :

>== Optimized Logical Plan ==
 Join LeftSemi, (i#152 = i#150)
 :- Filter isnotnull(i#150)
 : +- MetastoreRelation default, tt1
 +- Project [i#152|#152]
 +- MetastoreRelation default, tt2

 

 So I think the optimized logical plan of 'select * from tt1 where exists 
(select * from tt2 where tt1.i = tt2.i)' should be further optimized.

 

== Optimized Logical Plan ==
 Join LeftSemi, (i#143 = i#145)
 :- MetastoreRelation default, tt1
 +- MetastoreRelation default, tt2






[jira] [Commented] (SPARK-23526) KafkaMicroBatchV2SourceSuite.ensure stream-stream self-join generates only one offset in offset log

2018-02-28 Thread Gabor Somogyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381349#comment-16381349
 ] 

Gabor Somogyi commented on SPARK-23526:
---

This reminds me of [this PR|https://github.com/apache/spark/pull/18234].
If nobody has started on it yet, I would like to work on it.


> KafkaMicroBatchV2SourceSuite.ensure stream-stream self-join generates only 
> one offset in offset log
> ---
>
> Key: SPARK-23526
> URL: https://issues.apache.org/jira/browse/SPARK-23526
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>  Labels: flaky-test
>
> See it failed in PR builder with error message:
> {code:java}
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 676a8b08-c89b-450b-8cd8-fbf9868cd240, runId = 
> 46bb7aae-138b-420d-9b4f-44f42a2a4a0f] terminated with exception: Job aborted 
> due to stage failure: Task 0 in stage 163.0 failed 1 times, most recent 
> failure: Lost task 0.0 in stage 163.0 (TID 799, localhost, executor driver): 
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) 
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:305)
>  at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:216)
>  at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:122)
>  at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106)
>  at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>  at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68)
>  at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106)
>  at 
> org.apache.spark.sql.kafka010.KafkaMicroBatchDataReader.next(KafkaMicroBatchReader.scala:353)
>  at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:49)
>  at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:616)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>  at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:109) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745){code}






[jira] [Commented] (SPARK-23527) Error with spark-submit and kerberos with TLS-enabled Hadoop cluster

2018-02-28 Thread Gabor Somogyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381258#comment-16381258
 ] 

Gabor Somogyi commented on SPARK-23527:
---

Yeah, I agree with Yuming. In the first case the host was not found. In the 
second case the server certificate is most probably self-signed.

All you need to do is add the server certificate to your trusted Java 
keystore.

 

> Error with spark-submit and kerberos with TLS-enabled Hadoop cluster
> 
>
> Key: SPARK-23527
> URL: https://issues.apache.org/jira/browse/SPARK-23527
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.1
> Environment: core-site.xml
> 
>     hadoop.security.key.provider.path
>     kms://ht...@host1.domain.com;host2.domain.com:16000/kms
> 
> hdfs-site.xml
> 
>     dfs.encryption.key.provider.uri
>     kms://ht...@host1.domain.com;host2.domain.com:16000/kms
> 
>Reporter: Ron Gonzalez
>Priority: Critical
>
> For current configuration of our enterprise cluster, I submit using 
> spark-submit:
> ./spark-submit --master yarn --deploy-mode cluster --class 
> org.apache.spark.examples.SparkPi --conf 
> spark.yarn.jars=hdfs:/user/user1/spark/lib/*.jar 
> ../examples/jars/spark-examples_2.11-2.2.1.jar 10
> I am getting the following problem:
>  
> 18/02/27 21:03:48 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 
> 3351181 for svchdc236d on ha-hdfs:nameservice1
> Exception in thread "main" java.lang.IllegalArgumentException: 
> java.net.UnknownHostException: host1.domain.com;host2.domain.com
>  at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
>  at 
> org.apache.hadoop.crypto.key.kms.KMSClientProvider.getDelegationTokenService(KMSClientProvider.java:825)
>  at 
> org.apache.hadoop.crypto.key.kms.KMSClientProvider.addDelegationTokens(KMSClientProvider.java:781)
>  at 
> org.apache.hadoop.crypto.key.KeyProviderDelegationTokenExtension.addDelegationTokens(KeyProviderDelegationTokenExtension.java:86)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2046)
>  at 
> org.apache.spark.deploy.yarn.security.HadoopFSCredentialProvider$$anonfun$obtainCredentials$1.apply(HadoopFSCredentialProvider.scala:52)
>  
> If I get rid of the other host for the properties so instead of 
> kms://ht...@host1.domain.com;host2.domain.com:16000/kms, I convert it to:
> kms://ht...@host1.domain.com:16000/kms
> it fails with a different error:
> java.io.IOException: javax.net.ssl.SSLHandshakeException: 
> sun.security.validator.ValidatorException: PKIX path building failed: 
> sun.security.provider.certpath.SunCertPathBuilderException: unable to find 
> valid certification path to requested target
> If I do the same thing using spark 1.6, it works so it seems like a 
> regression...
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23427) spark.sql.autoBroadcastJoinThreshold causing OOM exception in the driver

2018-02-28 Thread Pratik Dhumal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381251#comment-16381251
 ] 

Pratik Dhumal commented on SPARK-23427:
---

Hello,

For the purpose of development planning:
 # Can we expect a solution for this in the near future?
 # Do you have any suggestion, patch, or workaround to deal with the issue?

I appreciate your help in this regard; thanks for your time.

 

 

> spark.sql.autoBroadcastJoinThreshold causing OOM exception in the driver 
> -
>
> Key: SPARK-23427
> URL: https://issues.apache.org/jira/browse/SPARK-23427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: SPARK 2.0 version
>Reporter: Dhiraj
>Priority: Critical
>
> We are facing an issue around the value of spark.sql.autoBroadcastJoinThreshold.
> With spark.sql.autoBroadcastJoinThreshold set to -1 (disabled), we see driver 
> memory usage stay flat.
> With any other value (10MB, 5MB, 2MB, 1MB, 10K, 1K), we see driver memory usage 
> grow at a rate that depends on the size of the autoBroadcastThreshold, and we 
> eventually get an OOM exception. The problem is that the memory used by 
> autoBroadcast is not being freed in the driver.
> The application imports Oracle tables as master dataframes, which are persisted. 
> Each job applies filters to these tables and then registers them as temp view 
> tables. SQL queries are then used to process the data further. At the end, all 
> the intermediate dataframes are unpersisted.
>  
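
For reference, the disabled setting mentioned above can be applied per session 
while the leak is investigated; a minimal sketch ({{spark}} is the active 
SparkSession, e.g. in spark-shell):

{code:java}
// Workaround sketch only: disable automatic broadcast joins for this session,
// which the reporter observed keeps driver memory flat.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
{code}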



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18630) PySpark ML memory leak

2018-02-28 Thread yogesh garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381244#comment-16381244
 ] 

yogesh garg commented on SPARK-18630:
-

I would like to take this. If I understand correctly, moving the `__del__` and 
(deep) `copy` methods to `JavaWrapper` should address this potential issue. Is 
there a reason why we might not want to do a deep copy of the `JavaWrapper` class?

> PySpark ML memory leak
> --
>
> Key: SPARK-18630
> URL: https://issues.apache.org/jira/browse/SPARK-18630
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> After SPARK-18274 is fixed by https://github.com/apache/spark/pull/15843, it 
> would be good to follow up and address the potential memory leak for all 
> items handled by the `JavaWrapper`, not just `JavaParams`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23083) Adding Kubernetes as an option to https://spark.apache.org/

2018-02-28 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan resolved SPARK-23083.

Resolution: Fixed

This has been merged, closing.

> Adding Kubernetes as an option to https://spark.apache.org/
> ---
>
> Key: SPARK-23083
> URL: https://issues.apache.org/jira/browse/SPARK-23083
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Minor
>
> [https://spark.apache.org/] can now include a reference to Kubernetes, and the k8s logo.
> I think this is not tied to the docs.
> cc/ [~rxin] [~sameer]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF

2018-02-28 Thread Ravneet Popli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381077#comment-16381077
 ] 

Ravneet Popli commented on SPARK-22942:
---

Matthew - Were you able to resolve this? We are also running into a similar 
problem and we know for sure that input arguments to our UDF cannot be null. If 
it helps, we are using Spark 2.0.2.
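
One workaround that has helped with similar reports (not an official fix): make 
the UDF tolerate a null argument, since the optimizer's EliminateOuterJoin rule 
can probe the predicate with an all-null row, as the stack trace quoted below 
shows. A minimal sketch based on the hasBlack UDF from the report:

{code:java}
// Workaround sketch: null-tolerant variant of the reported hasBlack UDF.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val hasBlackSafe = udf { (s: Seq[Row]) =>
  s != null && s.exists { case Row(_: Int, color: String) => color == "black" }
}
{code}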

> Spark Sql UDF throwing NullPointer when adding a filter on a columns that 
> uses that UDF
> ---
>
> Key: SPARK-22942
> URL: https://issues.apache.org/jira/browse/SPARK-22942
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0
>Reporter: Matthew Fishkin
>Priority: Major
>
> I ran into an interesting issue when trying to do a `filter` on a dataframe 
> that has columns that were added using a UDF. I am able to replicate the 
> problem with a smaller set of data.
> Given the dummy case classes:
> {code:java}
> case class Info(number: Int, color: String)
> case class Record(name: String, infos: Seq[Info])
> {code}
> And the following data:
> {code:java}
> val blue = Info(1, "blue")
> val black = Info(2, "black")
> val yellow = Info(3, "yellow")
> val orange = Info(4, "orange")
> val white = Info(5, "white")
> val a  = Record("a", Seq(blue, black, white))
> val a2 = Record("a", Seq(yellow, white, orange))
> val b = Record("b", Seq(blue, black))
> val c = Record("c", Seq(white, orange))
>  val d = Record("d", Seq(orange, black))
> {code}
> Create two dataframes (we will call them left and right)
> {code:java}
> val left = Seq(a, b).toDF
> val right = Seq(a2, c, d).toDF
> {code}
> Join those two dataframes with an outer join (so two of our columns are 
> nullable now).
> {code:java}
> val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer")
> joined.show(false)
> res0:
> +----+--------------------------------+-----------------------------------+
> |name|infos                           |infos                              |
> +----+--------------------------------+-----------------------------------+
> |b   |[[1,blue], [2,black]]           |null                               |
> |a   |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]|
> |c   |null                            |[[5,white], [4,orange]]            |
> |d   |null                            |[[4,orange], [2,black]]            |
> +----+--------------------------------+-----------------------------------+
> {code}
> Then, take only the `name`s that exist in the right Dataframe
> {code:java}
> val rightOnly = joined.filter("l.infos is null").select($"name", 
> $"r.infos".as("r_infos"))
> rightOnly.show(false)
> res1:
> ++---+
> |name|r_infos|
> ++---+
> |c   |[[5,white], [4,orange]]|
> |d   |[[4,orange], [2,black]]|
> ++---+
> {code}
> Now, add a new column called `has_black` which will be true if the `r_infos` 
> contains _black_ as a color
> {code:java}
> def hasBlack = (s: Seq[Row]) => {
>   s.exists{ case Row(num: Int, color: String) =>
> color == "black"
>   }
> }
> val rightBreakdown = rightOnly.withColumn("has_black", 
> udf(hasBlack).apply($"r_infos"))
> rightBreakdown.show(false)
> res2:
> ++---+-+
> |name|r_infos|has_black|
> ++---+-+
> |c   |[[5,white], [4,orange]]|false|
> |d   |[[4,orange], [2,black]]|true |
> ++---+-+
> {code}
> So far, _exactly_ what we expected. 
> *However*, when I try to filter `rightBreakdown`, it fails.
> {code:java}
> rightBreakdown.filter("has_black == true").show(false)
> org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$hasBlack$1: (array<struct<number:int,color:string>>) => 
> boolean)
>   at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:411)
>   at 
> org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$canFilterOutNull(joins.scala:127)
>   at 
> org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138)
>   at 
> org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138)
>   at 
> scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
>   at scala.collection.immutable.List.exists(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$lzycompute$1(joins.scala:138)
>   at 
> 

[jira] [Created] (SPARK-23539) Add support for Kafka headers in Structured Streaming

2018-02-28 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-23539:
-

 Summary: Add support for Kafka headers in Structured Streaming
 Key: SPARK-23539
 URL: https://issues.apache.org/jira/browse/SPARK-23539
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Tathagata Das


Kafka headers were added in Kafka 0.11. We should expose them through our Kafka 
data source in both batch and streaming queries.

This is currently blocked on upgrading the version of Kafka used by Spark from 
0.10.1 to 1.0+ (SPARK-18057).
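
For context, this is the client-level metadata the data source would be 
surfacing; a minimal sketch against the Kafka 0.11+ producer API (the topic and 
header names below are made up):

{code:java}
// Kafka 0.11+ records carry headers next to the key/value pair; exposing them
// would mean surfacing this metadata as an extra column in the kafka source.
import java.nio.charset.StandardCharsets
import org.apache.kafka.clients.producer.ProducerRecord

val record = new ProducerRecord[String, String]("events", "key-1", "value-1")
record.headers().add("traceId", "abc123".getBytes(StandardCharsets.UTF_8))
{code}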



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-02-28 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18057:
-
Summary: Update structured streaming kafka from 0.10.0.1 to 1.1.0  (was: 
Update structured streaming kafka from 10.0.1 to 10.2.0)

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23502) Support async init of spark context during spark-shell startup

2018-02-28 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380942#comment-16380942
 ] 

Sital Kedia commented on SPARK-23502:
-

>> what happens when you operate on {{sc}} before it's initialized?

We will wait for the future to complete before triggering any action based on 
user input, so that should be fine.

>> Is it surprising if Spark shell starts but errors out 20 seconds later? 

Yes, that might be one of the side effects. Another major side effect, as I 
mentioned, is not being able to print messages like the UI link and the app id 
when the spark-shell starts.

 

I just wanted to get some opinions on whether this is something useful for the 
community. If we do not think so, we can close this.
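
For illustration, a rough sketch of the idea (not the actual patch; the names 
below are placeholders):

{code:java}
// Rough sketch only: build the session in the background, return the prompt,
// and block on the Future the first time user code actually touches it.
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

val sessionFuture: Future[SparkSession] = Future {
  SparkSession.builder().appName("spark-shell").getOrCreate()
}

// Anything that needs spark/sc resolves the Future first.
lazy val spark: SparkSession = Await.result(sessionFuture, Duration.Inf)
lazy val sc = spark.sparkContext
{code}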

 

 

 

> Support async init of spark context during spark-shell startup
> --
>
> Key: SPARK-23502
> URL: https://issues.apache.org/jira/browse/SPARK-23502
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Priority: Minor
>
> Currently, whenever a user starts the spark shell, we initialize the spark 
> context before returning the prompt to the user. In environments, where spark 
> context initialization takes several seconds, it is not a very good user 
> experience for the user to wait for the prompt. Instead of waiting for the 
> initialization of spark context, we can initialize it in the background while 
> we return the prompt to the user as soon as possible. Please note that even 
> if we return the prompt to the user soon, we still need to make sure to wait 
> for the spark context initialization to complete before any query is 
> executed. 
> Please note that the scala interpreter already does very similar async 
> initialization in order to return the prompt to the user faster - 
> https://github.com/scala/scala/blob/v2.12.2/src/repl/scala/tools/nsc/interpreter/ILoop.scala#L414.
>  We will be emulating the behavior for Spark. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23538) Simplify SSL configuration for https client

2018-02-28 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-23538:
--

 Summary: Simplify SSL configuration for https client
 Key: SPARK-23538
 URL: https://issues.apache.org/jira/browse/SPARK-23538
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Marcelo Vanzin


There's code in {{SecurityManager}} that is used to configure SSL for the code 
that downloads dependencies from https servers:

{code}
  // SSL configuration for the file server. This is used by 
Utils.setupSecureURLConnection().
  val fileServerSSLOptions = getSSLOptions("fs")
  val (sslSocketFactory, hostnameVerifier) = if (fileServerSSLOptions.enabled) {
...
{code}

It was added for an old feature that doesn't exist anymore (the "file server" 
referenced in the comment), but can still be used to configure the built-in JRE 
SSL code with a custom trust store, for example.

We should instead:

- move this code out of SecurityManager, and place it where it's actually used 
({{Utils.setupSecureURLConnection}}); a rough sketch of the idea is included below.
- remove the dummy trust manager / host verifier since they don't make a lot of 
sense for the client code (and only made slightly more sense for the file 
server case).
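
A rough sketch of what that could look like (illustrative only, not the actual 
change; the parameter shapes are assumptions):

{code:java}
// Illustrative sketch: once the setup lives next to
// Utils.setupSecureURLConnection, all it really needs to do is point the JRE's
// default HTTPS machinery at the configured trust store (no dummy trust
// manager and no dummy hostname verifier).
def configureHttpsClient(trustStore: Option[java.io.File],
                         trustStorePassword: Option[String]): Unit = {
  trustStore.foreach { ts =>
    System.setProperty("javax.net.ssl.trustStore", ts.getAbsolutePath)
  }
  trustStorePassword.foreach { pw =>
    System.setProperty("javax.net.ssl.trustStorePassword", pw)
  }
}
{code}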




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23537) Logistic Regression without standardization

2018-02-28 Thread Jordi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordi updated SPARK-23537:
--
Affects Version/s: 2.0.2

> Logistic Regression without standardization
> ---
>
> Key: SPARK-23537
> URL: https://issues.apache.org/jira/browse/SPARK-23537
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Optimizer
>Affects Versions: 2.0.2, 2.2.1
>Reporter: Jordi
>Priority: Major
> Attachments: non-standardization.log, standardization.log
>
>
> I'm trying to train a Logistic Regression model, using Spark 2.2.1. I prefer 
> to not use standardization since all my features are binary, using the 
> hashing trick (2^20 sparse vector).
> I trained two models to compare results, I've been expecting to end with two 
> similar models since it seems that internally the optimizer performs 
> standardization and "de-standardization" (when it's deactivated) in order to 
> improve the convergence.
> Here you have the code I used:
> {code:java}
> val lr = new org.apache.spark.ml.classification.LogisticRegression()
> .setRegParam(0.05)
> .setElasticNetParam(0.0)
> .setFitIntercept(true)
> .setMaxIter(5000)
> .setStandardization(false)
> val model = lr.fit(data)
> {code}
> The results are disturbing me, I end with two significantly different models.
> *Standardization:*
> Training time: 8min.
> Iterations: 37
> Intercept: -4.386090107224499
> Max weight: 4.724752299455218
> Min weight: -3.560570478164854
> Mean weight: -0.049325201841722795
> l1 norm: 116710.39522171849
> l2 norm: 402.2581552373957
> Non zero weights: 128084
> Non zero ratio: 0.12215042114257812
> Last 10 LBFGS Val and Grand Norms:
> {code:java}
> 18/02/27 17:14:45 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 8.00e-07) 
> 0.000559057
> 18/02/27 17:14:50 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 3.94e-07) 
> 0.000267527
> 18/02/27 17:14:54 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.62e-07) 
> 0.000205888
> 18/02/27 17:14:59 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.36e-07) 
> 0.000144173
> 18/02/27 17:15:04 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 7.74e-08) 
> 0.000140296
> 18/02/27 17:15:09 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.52e-08) 
> 0.000122709
> 18/02/27 17:15:13 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.78e-08) 
> 3.08789e-05
> 18/02/27 17:15:18 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.66e-09) 
> 2.23806e-05
> 18/02/27 17:15:23 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 4.31e-09) 
> 1.47422e-05
> 18/02/27 17:15:28 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 9.17e-10) 
> 2.37442e-05
> {code}
> *No standardization:*
> Training time: 7h 14 min.
> Iterations: 4992
> Intercept: -4.216690468849263
> Max weight: 0.41930559767624725
> Min weight: -0.5949182537565524
> Mean weight: -1.2659769019012E-6
> l1 norm: 14.262025330648694
> l2 norm: 1.2508777025612263
> Non zero weights: 128955
> Non zero ratio: 0.12298107147216797
> Last 10 LBFGS Val and Grand Norms:
> {code:java}
> 18/02/28 00:28:56 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 2.17e-07) 
> 0.217581
> 18/02/28 00:29:01 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.88e-07) 
> 0.185812
> 18/02/28 00:29:06 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.33e-07) 
> 0.214570
> 18/02/28 00:29:11 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 8.62e-08) 
> 0.489464
> 18/02/28 00:29:16 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.90e-07) 
> 0.178448
> 18/02/28 00:29:21 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 7.91e-08) 
> 0.172527
> 18/02/28 00:29:26 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.38e-07) 
> 0.189389
> 18/02/28 00:29:31 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.13e-07) 
> 0.480678
> 18/02/28 00:29:36 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.75e-07) 
> 0.184529
> 18/02/28 00:29:41 INFO LBFGS: Val and Grad Norm: 0.559319 (rel: 8.90e-08) 
> 0.154329
> {code}
> Am I missing something?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23537) Logistic Regression without standardization

2018-02-28 Thread Jordi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordi updated SPARK-23537:
--
Priority: Major  (was: Minor)

> Logistic Regression without standardization
> ---
>
> Key: SPARK-23537
> URL: https://issues.apache.org/jira/browse/SPARK-23537
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Optimizer
>Affects Versions: 2.2.1
>Reporter: Jordi
>Priority: Major
> Attachments: non-standardization.log, standardization.log
>
>
> I'm trying to train a Logistic Regression model, using Spark 2.2.1. I prefer 
> to not use standardization since all my features are binary, using the 
> hashing trick (2^20 sparse vector).
> I trained two models to compare results, I've been expecting to end with two 
> similar models since it seems that internally the optimizer performs 
> standardization and "de-standardization" (when it's deactivated) in order to 
> improve the convergence.
> Here you have the code I used:
> {code:java}
> val lr = new org.apache.spark.ml.classification.LogisticRegression()
> .setRegParam(0.05)
> .setElasticNetParam(0.0)
> .setFitIntercept(true)
> .setMaxIter(5000)
> .setStandardization(false)
> val model = lr.fit(data)
> {code}
> The results are disturbing me, I end with two significantly different models.
> *Standardization:*
> Training time: 8min.
> Iterations: 37
> Intercept: -4.386090107224499
> Max weight: 4.724752299455218
> Min weight: -3.560570478164854
> Mean weight: -0.049325201841722795
> l1 norm: 116710.39522171849
> l2 norm: 402.2581552373957
> Non zero weights: 128084
> Non zero ratio: 0.12215042114257812
> Last 10 LBFGS Val and Grand Norms:
> {code:java}
> 18/02/27 17:14:45 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 8.00e-07) 
> 0.000559057
> 18/02/27 17:14:50 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 3.94e-07) 
> 0.000267527
> 18/02/27 17:14:54 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.62e-07) 
> 0.000205888
> 18/02/27 17:14:59 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.36e-07) 
> 0.000144173
> 18/02/27 17:15:04 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 7.74e-08) 
> 0.000140296
> 18/02/27 17:15:09 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.52e-08) 
> 0.000122709
> 18/02/27 17:15:13 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.78e-08) 
> 3.08789e-05
> 18/02/27 17:15:18 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.66e-09) 
> 2.23806e-05
> 18/02/27 17:15:23 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 4.31e-09) 
> 1.47422e-05
> 18/02/27 17:15:28 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 9.17e-10) 
> 2.37442e-05
> {code}
> *No standardization:*
> Training time: 7h 14 min.
> Iterations: 4992
> Intercept: -4.216690468849263
> Max weight: 0.41930559767624725
> Min weight: -0.5949182537565524
> Mean weight: -1.2659769019012E-6
> l1 norm: 14.262025330648694
> l2 norm: 1.2508777025612263
> Non zero weights: 128955
> Non zero ratio: 0.12298107147216797
> Last 10 LBFGS Val and Grand Norms:
> {code:java}
> 18/02/28 00:28:56 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 2.17e-07) 
> 0.217581
> 18/02/28 00:29:01 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.88e-07) 
> 0.185812
> 18/02/28 00:29:06 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.33e-07) 
> 0.214570
> 18/02/28 00:29:11 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 8.62e-08) 
> 0.489464
> 18/02/28 00:29:16 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.90e-07) 
> 0.178448
> 18/02/28 00:29:21 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 7.91e-08) 
> 0.172527
> 18/02/28 00:29:26 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.38e-07) 
> 0.189389
> 18/02/28 00:29:31 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.13e-07) 
> 0.480678
> 18/02/28 00:29:36 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.75e-07) 
> 0.184529
> 18/02/28 00:29:41 INFO LBFGS: Val and Grad Norm: 0.559319 (rel: 8.90e-08) 
> 0.154329
> {code}
> Am I missing something?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23537) Logistic Regression without standardization

2018-02-28 Thread Jordi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordi updated SPARK-23537:
--
Description: 
I'm trying to train a Logistic Regression model, using Spark 2.2.1. I prefer to 
not use standardization since all my features are binary, using the hashing 
trick (2^20 sparse vector).

I trained two models to compare results, I've been expecting to end with two 
similar models since it seems that internally the optimizer performs 
standardization and "de-standardization" (when it's deactivated) in order to 
improve the convergence.

Here you have the code I used:
{code:java}
val lr = new org.apache.spark.ml.classification.LogisticRegression()
.setRegParam(0.05)
.setElasticNetParam(0.0)
.setFitIntercept(true)
.setMaxIter(5000)
.setStandardization(false)

val model = lr.fit(data)
{code}
The results are disturbing me, I end with two significantly different models.

*Standardization:*

Training time: 8min.
Iterations: 37
Intercept: -4.386090107224499
Max weight: 4.724752299455218
Min weight: -3.560570478164854
Mean weight: -0.049325201841722795
l1 norm: 116710.39522171849
l2 norm: 402.2581552373957
Non zero weights: 128084
Non zero ratio: 0.12215042114257812

Last 10 LBFGS Val and Grand Norms:
{code:java}
18/02/27 17:14:45 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 8.00e-07) 
0.000559057
18/02/27 17:14:50 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 3.94e-07) 
0.000267527
18/02/27 17:14:54 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.62e-07) 
0.000205888
18/02/27 17:14:59 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.36e-07) 
0.000144173
18/02/27 17:15:04 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 7.74e-08) 
0.000140296
18/02/27 17:15:09 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.52e-08) 
0.000122709
18/02/27 17:15:13 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.78e-08) 
3.08789e-05
18/02/27 17:15:18 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.66e-09) 
2.23806e-05
18/02/27 17:15:23 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 4.31e-09) 
1.47422e-05
18/02/27 17:15:28 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 9.17e-10) 
2.37442e-05
{code}
*No standardization:*

Training time: 7h 14 min.
Iterations: 4992
Intercept: -4.216690468849263
Max weight: 0.41930559767624725
Min weight: -0.5949182537565524
Mean weight: -1.2659769019012E-6
l1 norm: 14.262025330648694
l2 norm: 1.2508777025612263
Non zero weights: 128955
Non zero ratio: 0.12298107147216797

Last 10 LBFGS Val and Grand Norms:
{code:java}
18/02/28 00:28:56 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 2.17e-07) 
0.217581
18/02/28 00:29:01 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.88e-07) 
0.185812
18/02/28 00:29:06 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.33e-07) 
0.214570
18/02/28 00:29:11 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 8.62e-08) 
0.489464
18/02/28 00:29:16 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.90e-07) 
0.178448
18/02/28 00:29:21 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 7.91e-08) 
0.172527
18/02/28 00:29:26 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.38e-07) 
0.189389
18/02/28 00:29:31 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.13e-07) 
0.480678
18/02/28 00:29:36 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.75e-07) 
0.184529
18/02/28 00:29:41 INFO LBFGS: Val and Grad Norm: 0.559319 (rel: 8.90e-08) 
0.154329
{code}
Am I missing something?

  was:
I'm trying to train a Logistic Regression model, using Spark 2.2.1. I prefer to 
not use standardization since all my features are binary, using the hashing 
trick (2^20 sparse vector).

I trained two models to compare results, I've been expecting to end with two 
similar models since it seems that internally the optimizer performs 
standardization and "de-standardization" (when it's deactivated) in order to 
improve the convergence.

Here you have the code I used:
{code:java}
val lr = new org.apache.spark.ml.classification.LogisticRegression()
.setRegParam(0.05)
.setElasticNetParam(0.0)
.setFitIntercept(true)
.setMaxIter(5000)
.setStandardization(false)

val model = lr.fit(data)
{code}
The results are disturbing me, I end with two quite different models.

*Standardization:*

Training time: 8min.
Iterations: 37
Intercept: -4.386090107224499
Max weight: 4.724752299455218
Min weight: -3.560570478164854
Mean weight: -0.049325201841722795
l1 norm: 116710.39522171849
l2 norm: 402.2581552373957
Non zero weights: 128084
Non zero ratio: 0.12215042114257812

Last 10 LBFGS Val and Grand Norms:
{code:java}
18/02/27 17:14:45 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 8.00e-07) 
0.000559057
18/02/27 17:14:50 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 3.94e-07) 
0.000267527
18/02/27 17:14:54 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.62e-07) 
0.000205888
18/02/27 17:14:59 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.36e-07) 
0.000144173
18/02/27 17:15:04 

[jira] [Updated] (SPARK-23537) Logistic Regression without standardization

2018-02-28 Thread Jordi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordi updated SPARK-23537:
--
Description: 
I'm trying to train a Logistic Regression model, using Spark 2.2.1. I prefer to 
not use standardization since all my features are binary, using the hashing 
trick (2^20 sparse vector).

I trained two models to compare results, I've been expecting to end with two 
similar models since it seems that internally the optimizer performs 
standardization and "de-standardization" (when it's deactivated) in order to 
improve the convergence.

Here you have the code I used:
{code:java}
val lr = new org.apache.spark.ml.classification.LogisticRegression()
.setRegParam(0.05)
.setElasticNetParam(0.0)
.setFitIntercept(true)
.setMaxIter(5000)
.setStandardization(false)

val model = lr.fit(data)
{code}
The results are disturbing me, I end with two quite different models.

*Standardization:*

Training time: 8min.
Iterations: 37
Intercept: -4.386090107224499
Max weight: 4.724752299455218
Min weight: -3.560570478164854
Mean weight: -0.049325201841722795
l1 norm: 116710.39522171849
l2 norm: 402.2581552373957
Non zero weights: 128084
Non zero ratio: 0.12215042114257812

Last 10 LBFGS Val and Grand Norms:
{code:java}
18/02/27 17:14:45 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 8.00e-07) 
0.000559057
18/02/27 17:14:50 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 3.94e-07) 
0.000267527
18/02/27 17:14:54 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.62e-07) 
0.000205888
18/02/27 17:14:59 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.36e-07) 
0.000144173
18/02/27 17:15:04 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 7.74e-08) 
0.000140296
18/02/27 17:15:09 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.52e-08) 
0.000122709
18/02/27 17:15:13 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.78e-08) 
3.08789e-05
18/02/27 17:15:18 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.66e-09) 
2.23806e-05
18/02/27 17:15:23 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 4.31e-09) 
1.47422e-05
18/02/27 17:15:28 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 9.17e-10) 
2.37442e-05
{code}
*No standardization:*

Training time: 7h 14 min.
Iterations: 4992
Intercept: -4.216690468849263
Max weight: 0.41930559767624725
Min weight: -0.5949182537565524
Mean weight: -1.2659769019012E-6
l1 norm: 14.262025330648694
l2 norm: 1.2508777025612263
Non zero weights: 128955
Non zero ratio: 0.12298107147216797

Last 10 LBFGS Val and Grand Norms:
{code:java}
18/02/28 00:28:56 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 2.17e-07) 
0.217581
18/02/28 00:29:01 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.88e-07) 
0.185812
18/02/28 00:29:06 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.33e-07) 
0.214570
18/02/28 00:29:11 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 8.62e-08) 
0.489464
18/02/28 00:29:16 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.90e-07) 
0.178448
18/02/28 00:29:21 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 7.91e-08) 
0.172527
18/02/28 00:29:26 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.38e-07) 
0.189389
18/02/28 00:29:31 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.13e-07) 
0.480678
18/02/28 00:29:36 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.75e-07) 
0.184529
18/02/28 00:29:41 INFO LBFGS: Val and Grad Norm: 0.559319 (rel: 8.90e-08) 
0.154329
{code}
Am I missing something?

  was:
I'm trying to train a Logistic Regression model, using Spark 2.2.1. I prefer to 
not use standardization since all my features are binary.

I trained two models to compare results, I've been expecting to end with two 
similar models since it seems that internally the optimizer performs 
standardization and "de-standardization" (when it's deactivated) in order to 
improve the convergence.

Here you have the code I used:
{code}
val lr = new org.apache.spark.ml.classification.LogisticRegression()
.setRegParam(0.05)
.setElasticNetParam(0.0)
.setFitIntercept(true)
.setMaxIter(5000)
.setStandardization(false)

val model = lr.fit(data)
{code}
The results are disturbing me, I end with two quite different models.

Standardization:

Training time: 8min.
Iterations: 37
Intercept: -4.386090107224499
Max weight: 4.724752299455218
Min weight: -3.560570478164854
Mean weight: -0.049325201841722795
l1 norm: 116710.39522171849
l2 norm: 402.2581552373957
Non zero weights: 128084
Non zero ratio: 0.12215042114257812

Last 10 LBFGS Val and Grand Norms:
{code}
18/02/27 17:14:45 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 8.00e-07) 
0.000559057
18/02/27 17:14:50 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 3.94e-07) 
0.000267527
18/02/27 17:14:54 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.62e-07) 
0.000205888
18/02/27 17:14:59 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.36e-07) 
0.000144173
18/02/27 17:15:04 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 7.74e-08) 
0.000140296

[jira] [Updated] (SPARK-23537) Logistic Regression without standardization

2018-02-28 Thread Jordi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordi updated SPARK-23537:
--
Attachment: standardization.log

> Logistic Regression without standardization
> ---
>
> Key: SPARK-23537
> URL: https://issues.apache.org/jira/browse/SPARK-23537
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Optimizer
>Affects Versions: 2.2.1
>Reporter: Jordi
>Priority: Minor
> Attachments: non-standardization.log, standardization.log
>
>
> I'm trying to train a Logistic Regression model, using Spark 2.2.1. I prefer 
> to not use standardization since all my features are binary.
> I trained two models to compare results, I've been expecting to end with two 
> similar models since it seems that internally the optimizer performs 
> standardization and "de-standardization" (when it's deactivated) in order to 
> improve the convergence.
> Here you have the code I used:
> {code}
> val lr = new org.apache.spark.ml.classification.LogisticRegression()
> .setRegParam(0.05)
> .setElasticNetParam(0.0)
> .setFitIntercept(true)
> .setMaxIter(5000)
> .setStandardization(false)
> val model = lr.fit(data)
> {code}
> The results are disturbing me, I end with two quite different models.
> Standardization:
> Training time: 8min.
> Iterations: 37
> Intercept: -4.386090107224499
> Max weight: 4.724752299455218
> Min weight: -3.560570478164854
> Mean weight: -0.049325201841722795
> l1 norm: 116710.39522171849
> l2 norm: 402.2581552373957
> Non zero weights: 128084
> Non zero ratio: 0.12215042114257812
> Last 10 LBFGS Val and Grand Norms:
> {code}
> 18/02/27 17:14:45 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 8.00e-07) 
> 0.000559057
> 18/02/27 17:14:50 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 3.94e-07) 
> 0.000267527
> 18/02/27 17:14:54 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.62e-07) 
> 0.000205888
> 18/02/27 17:14:59 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.36e-07) 
> 0.000144173
> 18/02/27 17:15:04 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 7.74e-08) 
> 0.000140296
> 18/02/27 17:15:09 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.52e-08) 
> 0.000122709
> 18/02/27 17:15:13 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.78e-08) 
> 3.08789e-05
> 18/02/27 17:15:18 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.66e-09) 
> 2.23806e-05
> 18/02/27 17:15:23 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 4.31e-09) 
> 1.47422e-05
> 18/02/27 17:15:28 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 9.17e-10) 
> 2.37442e-05
> {code}
> No standardization:
> Training time: 7h 14 min.
> Iterations: 4992
> Intercept: -4.216690468849263
> Max weight: 0.41930559767624725
> Min weight: -0.5949182537565524
> Mean weight: -1.2659769019012E-6
> l1 norm: 14.262025330648694
> l2 norm: 1.2508777025612263
> Non zero weights: 128955
> Non zero ratio: 0.12298107147216797
> Last 10 LBFGS Val and Grand Norms:
> {code}
> 18/02/28 00:28:56 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 2.17e-07) 
> 0.217581
> 18/02/28 00:29:01 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.88e-07) 
> 0.185812
> 18/02/28 00:29:06 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.33e-07) 
> 0.214570
> 18/02/28 00:29:11 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 8.62e-08) 
> 0.489464
> 18/02/28 00:29:16 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.90e-07) 
> 0.178448
> 18/02/28 00:29:21 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 7.91e-08) 
> 0.172527
> 18/02/28 00:29:26 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.38e-07) 
> 0.189389
> 18/02/28 00:29:31 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.13e-07) 
> 0.480678
> 18/02/28 00:29:36 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.75e-07) 
> 0.184529
> 18/02/28 00:29:41 INFO LBFGS: Val and Grad Norm: 0.559319 (rel: 8.90e-08) 
> 0.154329
> {code}
> Am I missing something?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23537) Logistic Regression without standardization

2018-02-28 Thread Jordi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordi updated SPARK-23537:
--
Attachment: non-standardization.log

> Logistic Regression without standardization
> ---
>
> Key: SPARK-23537
> URL: https://issues.apache.org/jira/browse/SPARK-23537
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Optimizer
>Affects Versions: 2.2.1
>Reporter: Jordi
>Priority: Minor
> Attachments: non-standardization.log
>
>
> I'm trying to train a Logistic Regression model, using Spark 2.2.1. I prefer 
> to not use standardization since all my features are binary.
> I trained two models to compare results, I've been expecting to end with two 
> similar models since it seems that internally the optimizer performs 
> standardization and "de-standardization" (when it's deactivated) in order to 
> improve the convergence.
> Here you have the code I used:
> {code}
> val lr = new org.apache.spark.ml.classification.LogisticRegression()
> .setRegParam(0.05)
> .setElasticNetParam(0.0)
> .setFitIntercept(true)
> .setMaxIter(5000)
> .setStandardization(false)
> val model = lr.fit(data)
> {code}
> The results are disturbing me, I end with two quite different models.
> Standardization:
> Training time: 8min.
> Iterations: 37
> Intercept: -4.386090107224499
> Max weight: 4.724752299455218
> Min weight: -3.560570478164854
> Mean weight: -0.049325201841722795
> l1 norm: 116710.39522171849
> l2 norm: 402.2581552373957
> Non zero weights: 128084
> Non zero ratio: 0.12215042114257812
> Last 10 LBFGS Val and Grand Norms:
> {code}
> 18/02/27 17:14:45 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 8.00e-07) 
> 0.000559057
> 18/02/27 17:14:50 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 3.94e-07) 
> 0.000267527
> 18/02/27 17:14:54 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.62e-07) 
> 0.000205888
> 18/02/27 17:14:59 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.36e-07) 
> 0.000144173
> 18/02/27 17:15:04 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 7.74e-08) 
> 0.000140296
> 18/02/27 17:15:09 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.52e-08) 
> 0.000122709
> 18/02/27 17:15:13 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.78e-08) 
> 3.08789e-05
> 18/02/27 17:15:18 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.66e-09) 
> 2.23806e-05
> 18/02/27 17:15:23 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 4.31e-09) 
> 1.47422e-05
> 18/02/27 17:15:28 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 9.17e-10) 
> 2.37442e-05
> {code}
> No standardization:
> Training time: 7h 14 min.
> Iterations: 4992
> Intercept: -4.216690468849263
> Max weight: 0.41930559767624725
> Min weight: -0.5949182537565524
> Mean weight: -1.2659769019012E-6
> l1 norm: 14.262025330648694
> l2 norm: 1.2508777025612263
> Non zero weights: 128955
> Non zero ratio: 0.12298107147216797
> Last 10 LBFGS Val and Grand Norms:
> {code}
> 18/02/28 00:28:56 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 2.17e-07) 
> 0.217581
> 18/02/28 00:29:01 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.88e-07) 
> 0.185812
> 18/02/28 00:29:06 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.33e-07) 
> 0.214570
> 18/02/28 00:29:11 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 8.62e-08) 
> 0.489464
> 18/02/28 00:29:16 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.90e-07) 
> 0.178448
> 18/02/28 00:29:21 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 7.91e-08) 
> 0.172527
> 18/02/28 00:29:26 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.38e-07) 
> 0.189389
> 18/02/28 00:29:31 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.13e-07) 
> 0.480678
> 18/02/28 00:29:36 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.75e-07) 
> 0.184529
> 18/02/28 00:29:41 INFO LBFGS: Val and Grad Norm: 0.559319 (rel: 8.90e-08) 
> 0.154329
> {code}
> Am I missing something?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23498) Accuracy problem in comparison with string and integer

2018-02-28 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380722#comment-16380722
 ] 

Kazuaki Ishizaki commented on SPARK-23498:
--

Is SPARK-21774 related to this issue, too?

> Accuracy problem in comparison with string and integer
> --
>
> Key: SPARK-23498
> URL: https://issues.apache.org/jira/browse/SPARK-23498
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Kevin Zhang
>Priority: Major
>
> While comparing a string column with an integer value, Spark SQL will 
> automatically cast the string operand to int; the following SQL will return 
> true in Hive but false in Spark
>  
> {code:java}
> select '1000.1'>1000
> {code}
>  
>  From the physical plan we can see that the string operand was cast to int, 
> which caused the loss of accuracy
> {code:java}
> *Project [false AS (CAST(1000.1 AS INT) > 1000)#4]
> +- Scan OneRowRelation[]
> {code}
> To solve it, using a wider common type like double to cast both sides of the 
> binary operator's operands may be safe.
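
For anyone hitting this before a fix lands, an explicit cast to a wider type 
sidesteps the lossy string-to-int cast; a minimal sketch ({{spark}} is the 
active SparkSession):

{code:java}
// Workaround sketch: force the comparison to happen in double so no precision
// is lost; this returns true, matching the Hive result.
spark.sql("SELECT CAST('1000.1' AS DOUBLE) > 1000").show()
{code}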



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23537) Logistic Regression without standardization

2018-02-28 Thread Jordi (JIRA)
Jordi created SPARK-23537:
-

 Summary: Logistic Regression without standardization
 Key: SPARK-23537
 URL: https://issues.apache.org/jira/browse/SPARK-23537
 Project: Spark
  Issue Type: Bug
  Components: ML, Optimizer
Affects Versions: 2.2.1
Reporter: Jordi


I'm trying to train a Logistic Regression model, using Spark 2.2.1. I prefer to 
not use standardization since all my features are binary.

I trained two models to compare results, I've been expecting to end with two 
similar models since it seems that internally the optimizer performs 
standardization and "de-standardization" (when it's deactivated) in order to 
improve the convergence.

Here you have the code I used:
{code}
val lr = new org.apache.spark.ml.classification.LogisticRegression()
.setRegParam(0.05)
.setElasticNetParam(0.0)
.setFitIntercept(true)
.setMaxIter(5000)
.setStandardization(false)

val model = lr.fit(data)
{code}
The results are disturbing me, I end with two quite different models.

Standardization:

Training time: 8min.
Iterations: 37
Intercept: -4.386090107224499
Max weight: 4.724752299455218
Min weight: -3.560570478164854
Mean weight: -0.049325201841722795
l1 norm: 116710.39522171849
l2 norm: 402.2581552373957
Non zero weights: 128084
Non zero ratio: 0.12215042114257812

Last 10 LBFGS Val and Grand Norms:
{code}
18/02/27 17:14:45 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 8.00e-07) 
0.000559057
18/02/27 17:14:50 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 3.94e-07) 
0.000267527
18/02/27 17:14:54 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.62e-07) 
0.000205888
18/02/27 17:14:59 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.36e-07) 
0.000144173
18/02/27 17:15:04 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 7.74e-08) 
0.000140296
18/02/27 17:15:09 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.52e-08) 
0.000122709
18/02/27 17:15:13 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.78e-08) 
3.08789e-05
18/02/27 17:15:18 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.66e-09) 
2.23806e-05
18/02/27 17:15:23 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 4.31e-09) 
1.47422e-05
18/02/27 17:15:28 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 9.17e-10) 
2.37442e-05
{code}
No standardization:

Training time: 7h 14 min.
Iterations: 4992
Intercept: -4.216690468849263
Max weight: 0.41930559767624725
Min weight: -0.5949182537565524
Mean weight: -1.2659769019012E-6
l1 norm: 14.262025330648694
l2 norm: 1.2508777025612263
Non zero weights: 128955
Non zero ratio: 0.12298107147216797

Last 10 LBFGS Val and Grand Norms:
{code}
18/02/28 00:28:56 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 2.17e-07) 
0.217581
18/02/28 00:29:01 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.88e-07) 
0.185812
18/02/28 00:29:06 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.33e-07) 
0.214570
18/02/28 00:29:11 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 8.62e-08) 
0.489464
18/02/28 00:29:16 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.90e-07) 
0.178448
18/02/28 00:29:21 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 7.91e-08) 
0.172527
18/02/28 00:29:26 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.38e-07) 
0.189389
18/02/28 00:29:31 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.13e-07) 
0.480678
18/02/28 00:29:36 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.75e-07) 
0.184529
18/02/28 00:29:41 INFO LBFGS: Val and Grad Norm: 0.559319 (rel: 8.90e-08) 
0.154329
{code}

Am I missing something?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23536) Update each Data frame row with a random value

2018-02-28 Thread Deenadayal (JIRA)
Deenadayal created SPARK-23536:
--

 Summary: Update each Data frame row with a random value
 Key: SPARK-23536
 URL: https://issues.apache.org/jira/browse/SPARK-23536
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.3.0
Reporter: Deenadayal


In a Spark data frame, the functionality to update/change one or more columns 
with random values (where each row of a column is updated with its own random 
value) is missing. The Scala random approach ends up choosing a single random 
value and updating all rows of the column of the DF with that same value.
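
For reference, the per-row behavior being asked for can usually be obtained with 
the built-in {{rand}}/{{randn}} column functions, which are evaluated once per 
row; a minimal sketch (the app and column names below are made up):

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

val spark = SparkSession.builder().appName("rand-per-row").master("local[*]").getOrCreate()
import spark.implicits._

// rand() is evaluated per row, so every row gets its own value, unlike a
// scala.util.Random number captured once on the driver and reused for all rows.
val df = Seq("a", "b", "c").toDF("id").withColumn("randomValue", rand())
df.show()
{code}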



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23499) Mesos Cluster Dispatcher should support priority queues to submit drivers

2018-02-28 Thread Pascal GILLET (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380652#comment-16380652
 ] 

Pascal GILLET edited comment on SPARK-23499 at 2/28/18 4:54 PM:


Below a screenshot of the MesosClusterDispatcher UI showing Spark jobs along 
with the queue to which they are submitted:

 

!Screenshot from 2018-02-28 17-22-47.png!

 

 


was (Author: pgillet):
Below a screenshot of the MesosClusterDispatcher UI showing the Spark jobs 
along with the queue to which they are submitted:

 

!Screenshot from 2018-02-28 17-22-47.png!

 

 

> Mesos Cluster Dispatcher should support priority queues to submit drivers
> -
>
> Key: SPARK-23499
> URL: https://issues.apache.org/jira/browse/SPARK-23499
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.4.0
>Reporter: Pascal GILLET
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: Screenshot from 2018-02-28 17-22-47.png
>
>
> As for Yarn, Mesos users should be able to specify priority queues to define 
> a workload management policy for queued drivers in the Mesos Cluster 
> Dispatcher.
> Submitted drivers are *currently* kept in order of their submission: the 
> first driver added to the queue will be the first one to be executed (FIFO).
> Each driver could have a "priority" associated with it. A driver with high 
> priority is served (Mesos resources) before a driver with low priority. If 
> two drivers have the same priority, they are served according to their submit 
> date in the queue.
> To set up such priority queues, the following changes are proposed:
>  * The Mesos Cluster Dispatcher can optionally be configured with the 
> _spark.mesos.dispatcher.queue.[QueueName]_ property. This property takes a 
> float as value. This adds a new queue named _QueueName_ for submitted drivers 
> with the specified priority.
>  Higher numbers indicate higher priority.
>  The user can then specify multiple queues.
>  * A driver can be submitted to a specific queue with 
> _spark.mesos.dispatcher.queue_. This property takes the name of a queue 
> previously declared in the dispatcher as value.
> By default, the dispatcher has a single "default" queue with 0.0 priority 
> (cannot be overridden). If none of the properties above are specified, the 
> behavior is the same as the current one (i.e. simple FIFO).
> Additionally, it is possible to implement a consistent and overall workload 
> management policy throughout the lifecycle of drivers by mapping these 
> priority queues to weighted Mesos roles if any (i.e. from the QUEUED state in 
> the dispatcher to the final states in the Mesos cluster), and by specifying a 
> _spark.mesos.role_ along with a _spark.mesos.dispatcher.queue_ when 
> submitting an application.
> For example, with the URGENT Mesos role:
>  # Conf on the dispatcher side
>  spark.mesos.dispatcher.queue.URGENT=1.0
>  # Conf on the driver side
>  spark.mesos.dispatcher.queue=URGENT
>  spark.mesos.role=URGENT



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23513) java.io.IOException: Expected 12 fields, but got 5 for row :Spark submit error

2018-02-28 Thread Anuroopa George (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380650#comment-16380650
 ] 

Anuroopa George edited comment on SPARK-23513 at 2/28/18 4:49 PM:
--

Could you please post the complete spark-submit command you are using (along 
with the parameters)?


was (Author: anuroopa george):
Could you please post the complete spark-submit command you are using?

> java.io.IOException: Expected 12 fields, but got 5 for row :Spark submit 
> error 
> ---
>
> Key: SPARK-23513
> URL: https://issues.apache.org/jira/browse/SPARK-23513
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Examples, Input/Output, Java API
>Affects Versions: 1.4.0, 2.2.0
>Reporter: Rawia 
>Priority: Blocker
>
> Hello
> I'm trying to run a spark application (distributedWekaSpark), but when I'm 
> using the spark-submit command I get this error
> {quote}{quote}ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) 
> java.io.IOException: Expected 12 fields, but got 5 for row: 
> outlook,temperature,humidity,windy,play
> {quote}{quote}
> I tried with other datasets but the same error always appeared (always 12 
> fields expected)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-23499) Mesos Cluster Dispatcher should support priority queues to submit drivers

2018-02-28 Thread Pascal GILLET (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pascal GILLET updated SPARK-23499:
--
Comment: was deleted

(was: I attached a screenshot of the MesosClusterDispatcher UI showing the 
Spark jobs along with the queue to which they are submitted)

> Mesos Cluster Dispatcher should support priority queues to submit drivers
> -
>
> Key: SPARK-23499
> URL: https://issues.apache.org/jira/browse/SPARK-23499
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.4.0
>Reporter: Pascal GILLET
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: Screenshot from 2018-02-28 17-22-47.png
>
>
> As for Yarn, Mesos users should be able to specify priority queues to define 
> a workload management policy for queued drivers in the Mesos Cluster 
> Dispatcher.
> Submitted drivers are *currently* kept in order of their submission: the 
> first driver added to the queue will be the first one to be executed (FIFO).
> Each driver could have a "priority" associated with it. A driver with high 
> priority is served (Mesos resources) before a driver with low priority. If 
> two drivers have the same priority, they are served according to their submit 
> date in the queue.
> To set up such priority queues, the following changes are proposed:
>  * The Mesos Cluster Dispatcher can optionally be configured with the 
> _spark.mesos.dispatcher.queue.[QueueName]_ property. This property takes a 
> float as value. This adds a new queue named _QueueName_ for submitted drivers 
> with the specified priority.
>  Higher numbers indicate higher priority.
>  The user can then specify multiple queues.
>  * A driver can be submitted to a specific queue with 
> _spark.mesos.dispatcher.queue_. This property takes the name of a queue 
> previously declared in the dispatcher as value.
> By default, the dispatcher has a single "default" queue with 0.0 priority 
> (cannot be overridden). If none of the properties above are specified, the 
> behavior is the same as the current one (i.e. simple FIFO).
> Additionally, it is possible to implement a consistent and overall workload 
> management policy throughout the lifecycle of drivers by mapping these 
> priority queues to weighted Mesos roles if any (i.e. from the QUEUED state in 
> the dispatcher to the final states in the Mesos cluster), and by specifying a 
> _spark.mesos.role_ along with a _spark.mesos.dispatcher.queue_ when 
> submitting an application.
> For example, with the URGENT Mesos role:
>  # Conf on the dispatcher side
>  spark.mesos.dispatcher.queue.URGENT=1.0
>  # Conf on the driver side
>  spark.mesos.dispatcher.queue=URGENT
>  spark.mesos.role=URGENT



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23499) Mesos Cluster Dispatcher should support priority queues to submit drivers

2018-02-28 Thread Pascal GILLET (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380652#comment-16380652
 ] 

Pascal GILLET commented on SPARK-23499:
---

Below is a screenshot of the MesosClusterDispatcher UI showing the Spark jobs 
along with the queue to which they are submitted:

 

!Screenshot from 2018-02-28 17-22-47.png!

 

 

> Mesos Cluster Dispatcher should support priority queues to submit drivers
> -
>
> Key: SPARK-23499
> URL: https://issues.apache.org/jira/browse/SPARK-23499
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.4.0
>Reporter: Pascal GILLET
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: Screenshot from 2018-02-28 17-22-47.png
>
>
> As for Yarn, Mesos users should be able to specify priority queues to define 
> a workload management policy for queued drivers in the Mesos Cluster 
> Dispatcher.
> Submitted drivers are *currently* kept in order of their submission: the 
> first driver added to the queue will be the first one to be executed (FIFO).
> Each driver could have a "priority" associated with it. A driver with high 
> priority is served (Mesos resources) before a driver with low priority. If 
> two drivers have the same priority, they are served according to their submit 
> date in the queue.
> To set up such priority queues, the following changes are proposed:
>  * The Mesos Cluster Dispatcher can optionally be configured with the 
> _spark.mesos.dispatcher.queue.[QueueName]_ property. This property takes a 
> float as value. This adds a new queue named _QueueName_ for submitted drivers 
> with the specified priority.
>  Higher numbers indicate higher priority.
>  The user can then specify multiple queues.
>  * A driver can be submitted to a specific queue with 
> _spark.mesos.dispatcher.queue_. This property takes the name of a queue 
> previously declared in the dispatcher as value.
> By default, the dispatcher has a single "default" queue with 0.0 priority 
> (cannot be overridden). If none of the properties above are specified, the 
> behavior is the same as the current one (i.e. simple FIFO).
> Additionally, it is possible to implement a consistent and overall workload 
> management policy throughout the lifecycle of drivers by mapping these 
> priority queues to weighted Mesos roles if any (i.e. from the QUEUED state in 
> the dispatcher to the final states in the Mesos cluster), and by specifying a 
> _spark.mesos.role_ along with a _spark.mesos.dispatcher.queue_ when 
> submitting an application.
> For example, with the URGENT Mesos role:
>  # Conf on the dispatcher side
>  spark.mesos.dispatcher.queue.URGENT=1.0
>  # Conf on the driver side
>  spark.mesos.dispatcher.queue=URGENT
>  spark.mesos.role=URGENT



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23499) Mesos Cluster Dispatcher should support priority queues to submit drivers

2018-02-28 Thread Pascal GILLET (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pascal GILLET updated SPARK-23499:
--
Attachment: Screenshot from 2018-02-28 17-22-47.png

> Mesos Cluster Dispatcher should support priority queues to submit drivers
> -
>
> Key: SPARK-23499
> URL: https://issues.apache.org/jira/browse/SPARK-23499
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.4.0
>Reporter: Pascal GILLET
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: Screenshot from 2018-02-28 17-22-47.png
>
>
> As for Yarn, Mesos users should be able to specify priority queues to define 
> a workload management policy for queued drivers in the Mesos Cluster 
> Dispatcher.
> Submitted drivers are *currently* kept in order of their submission: the 
> first driver added to the queue will be the first one to be executed (FIFO).
> Each driver could have a "priority" associated with it. A driver with high 
> priority is served (Mesos resources) before a driver with low priority. If 
> two drivers have the same priority, they are served according to their submit 
> date in the queue.
> To set up such priority queues, the following changes are proposed:
>  * The Mesos Cluster Dispatcher can optionally be configured with the 
> _spark.mesos.dispatcher.queue.[QueueName]_ property. This property takes a 
> float as value. This adds a new queue named _QueueName_ for submitted drivers 
> with the specified priority.
>  Higher numbers indicate higher priority.
>  The user can then specify multiple queues.
>  * A driver can be submitted to a specific queue with 
> _spark.mesos.dispatcher.queue_. This property takes the name of a queue 
> previously declared in the dispatcher as value.
> By default, the dispatcher has a single "default" queue with 0.0 priority 
> (cannot be overridden). If none of the properties above are specified, the 
> behavior is the same as the current one (i.e. simple FIFO).
> Additionally, it is possible to implement a consistent and overall workload 
> management policy throughout the lifecycle of drivers by mapping these 
> priority queues to weighted Mesos roles if any (i.e. from the QUEUED state in 
> the dispatcher to the final states in the Mesos cluster), and by specifying a 
> _spark.mesos.role_ along with a _spark.mesos.dispatcher.queue_ when 
> submitting an application.
> For example, with the URGENT Mesos role:
>  # Conf on the dispatcher side
>  spark.mesos.dispatcher.queue.URGENT=1.0
>  # Conf on the driver side
>  spark.mesos.dispatcher.queue=URGENT
>  spark.mesos.role=URGENT



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23514) Replace spark.sparkContext.hadoopConfiguration by spark.sessionState.newHadoopConf()

2018-02-28 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23514:

Fix Version/s: (was: 2.3.1)
   2.4.0

> Replace spark.sparkContext.hadoopConfiguration by 
> spark.sessionState.newHadoopConf()
> 
>
> Key: SPARK-23514
> URL: https://issues.apache.org/jira/browse/SPARK-23514
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 2.4.0
>
>
> Check all the places where we directly use 
> {{spark.sparkContext.hadoopConfiguration}}. Instead, in some scenarios, it 
> makes more sense to call {{spark.sessionState.newHadoopConf()}} which blends 
> in settings from SQLConf.
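
As a rough illustration of the difference (a hedged sketch of Spark-internal code only, since {{sessionState}} is private[sql]; {{spark}} is assumed to be an existing SparkSession):

{code:scala}
// Global Hadoop configuration attached to the SparkContext: it does not see
// per-session SQLConf overrides (e.g. options set via spark.conf.set).
val scHadoopConf = spark.sparkContext.hadoopConfiguration

// Session-scoped Hadoop configuration: starts from the SparkContext one and
// then blends in the current session's SQLConf entries.
val sessionHadoopConf = spark.sessionState.newHadoopConf()
{code}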



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23514) Replace spark.sparkContext.hadoopConfiguration by spark.sessionState.newHadoopConf()

2018-02-28 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-23514:
---

Assignee: Juliusz Sompolski

> Replace spark.sparkContext.hadoopConfiguration by 
> spark.sessionState.newHadoopConf()
> 
>
> Key: SPARK-23514
> URL: https://issues.apache.org/jira/browse/SPARK-23514
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 2.4.0
>
>
> Check all the places where we directly use 
> {{spark.sparkContext.hadoopConfiguration}}. Instead, in some scenarios, it 
> makes more sense to call {{spark.sessionState.newHadoopConf()}} which blends 
> in settings from SQLConf.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23514) Replace spark.sparkContext.hadoopConfiguration by spark.sessionState.newHadoopConf()

2018-02-28 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23514.
-
   Resolution: Fixed
Fix Version/s: 2.3.1

> Replace spark.sparkContext.hadoopConfiguration by 
> spark.sessionState.newHadoopConf()
> 
>
> Key: SPARK-23514
> URL: https://issues.apache.org/jira/browse/SPARK-23514
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 2.3.1
>
>
> Check all the places where we directly use 
> {{spark.sparkContext.hadoopConfiguration}}. Instead, in some scenarios, it 
> makes more sense to call {{spark.sessionState.newHadoopConf()}} which blends 
> in settings from SQLConf.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23513) java.io.IOException: Expected 12 fields, but got 5 for row :Spark submit error

2018-02-28 Thread Anuroopa George (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380650#comment-16380650
 ] 

Anuroopa George commented on SPARK-23513:
-

Could you please post the complete spark-submit command you are using?

> java.io.IOException: Expected 12 fields, but got 5 for row :Spark submit 
> error 
> ---
>
> Key: SPARK-23513
> URL: https://issues.apache.org/jira/browse/SPARK-23513
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Examples, Input/Output, Java API
>Affects Versions: 1.4.0, 2.2.0
>Reporter: Rawia 
>Priority: Blocker
>
> Hello
> I'm trying to run a Spark application (distributedWekaSpark), but when I'm
> using the spark-submit command I get this error:
> {quote}ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.io.IOException: Expected 12 fields, but got 5 for row:
> outlook,temperature,humidity,windy,play{quote}
> I tried with other datasets, but the same error always appeared (always 12
> fields expected).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23499) Mesos Cluster Dispatcher should support priority queues to submit drivers

2018-02-28 Thread Pascal GILLET (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380641#comment-16380641
 ] 

Pascal GILLET commented on SPARK-23499:
---

I attached a screenshot of the MesosClusterDispatcher UI showing the Spark jobs 
along with the queue to which they are submitted

> Mesos Cluster Dispatcher should support priority queues to submit drivers
> -
>
> Key: SPARK-23499
> URL: https://issues.apache.org/jira/browse/SPARK-23499
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.4.0
>Reporter: Pascal GILLET
>Priority: Major
> Fix For: 2.4.0
>
>
> As for Yarn, Mesos users should be able to specify priority queues to define 
> a workload management policy for queued drivers in the Mesos Cluster 
> Dispatcher.
> Submitted drivers are *currently* kept in order of their submission: the 
> first driver added to the queue will be the first one to be executed (FIFO).
> Each driver could have a "priority" associated with it. A driver with high 
> priority is served (Mesos resources) before a driver with low priority. If 
> two drivers have the same priority, they are served according to their submit 
> date in the queue.
> To set up such priority queues, the following changes are proposed:
>  * The Mesos Cluster Dispatcher can optionally be configured with the 
> _spark.mesos.dispatcher.queue.[QueueName]_ property. This property takes a 
> float as value. This adds a new queue named _QueueName_ for submitted drivers 
> with the specified priority.
>  Higher numbers indicate higher priority.
>  The user can then specify multiple queues.
>  * A driver can be submitted to a specific queue with 
> _spark.mesos.dispatcher.queue_. This property takes the name of a queue 
> previously declared in the dispatcher as value.
> By default, the dispatcher has a single "default" queue with 0.0 priority 
> (cannot be overridden). If none of the properties above are specified, the 
> behavior is the same as the current one (i.e. simple FIFO).
> Additionally, it is possible to implement a consistent and overall workload 
> management policy throughout the lifecycle of drivers by mapping these 
> priority queues to weighted Mesos roles if any (i.e. from the QUEUED state in 
> the dispatcher to the final states in the Mesos cluster), and by specifying a 
> _spark.mesos.role_ along with a _spark.mesos.dispatcher.queue_ when 
> submitting an application.
> For example, with the URGENT Mesos role:
>  # Conf on the dispatcher side
>  spark.mesos.dispatcher.queue.URGENT=1.0
>  # Conf on the driver side
>  spark.mesos.dispatcher.queue=URGENT
>  spark.mesos.role=URGENT



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23499) Mesos Cluster Dispatcher should support priority queues to submit drivers

2018-02-28 Thread Pascal GILLET (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pascal GILLET updated SPARK-23499:
--
Attachment: (was: Screenshot from 2018-02-28 17-22-47.png)

> Mesos Cluster Dispatcher should support priority queues to submit drivers
> -
>
> Key: SPARK-23499
> URL: https://issues.apache.org/jira/browse/SPARK-23499
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.4.0
>Reporter: Pascal GILLET
>Priority: Major
> Fix For: 2.4.0
>
>
> As for Yarn, Mesos users should be able to specify priority queues to define 
> a workload management policy for queued drivers in the Mesos Cluster 
> Dispatcher.
> Submitted drivers are *currently* kept in order of their submission: the 
> first driver added to the queue will be the first one to be executed (FIFO).
> Each driver could have a "priority" associated with it. A driver with high 
> priority is served (Mesos resources) before a driver with low priority. If 
> two drivers have the same priority, they are served according to their submit 
> date in the queue.
> To set up such priority queues, the following changes are proposed:
>  * The Mesos Cluster Dispatcher can optionally be configured with the 
> _spark.mesos.dispatcher.queue.[QueueName]_ property. This property takes a 
> float as value. This adds a new queue named _QueueName_ for submitted drivers 
> with the specified priority.
>  Higher numbers indicate higher priority.
>  The user can then specify multiple queues.
>  * A driver can be submitted to a specific queue with 
> _spark.mesos.dispatcher.queue_. This property takes the name of a queue 
> previously declared in the dispatcher as value.
> By default, the dispatcher has a single "default" queue with 0.0 priority 
> (cannot be overridden). If none of the properties above are specified, the 
> behavior is the same as the current one (i.e. simple FIFO).
> Additionally, it is possible to implement a consistent and overall workload 
> management policy throughout the lifecycle of drivers by mapping these 
> priority queues to weighted Mesos roles if any (i.e. from the QUEUED state in 
> the dispatcher to the final states in the Mesos cluster), and by specifying a 
> _spark.mesos.role_ along with a _spark.mesos.dispatcher.queue_ when 
> submitting an application.
> For example, with the URGENT Mesos role:
>  # Conf on the dispatcher side
>  spark.mesos.dispatcher.queue.URGENT=1.0
>  # Conf on the driver side
>  spark.mesos.dispatcher.queue=URGENT
>  spark.mesos.role=URGENT



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23499) Mesos Cluster Dispatcher should support priority queues to submit drivers

2018-02-28 Thread Pascal GILLET (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pascal GILLET updated SPARK-23499:
--
Attachment: Screenshot from 2018-02-28 17-22-47.png

> Mesos Cluster Dispatcher should support priority queues to submit drivers
> -
>
> Key: SPARK-23499
> URL: https://issues.apache.org/jira/browse/SPARK-23499
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.4.0
>Reporter: Pascal GILLET
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: Screenshot from 2018-02-28 17-22-47.png
>
>
> As for Yarn, Mesos users should be able to specify priority queues to define 
> a workload management policy for queued drivers in the Mesos Cluster 
> Dispatcher.
> Submitted drivers are *currently* kept in order of their submission: the 
> first driver added to the queue will be the first one to be executed (FIFO).
> Each driver could have a "priority" associated with it. A driver with high 
> priority is served (Mesos resources) before a driver with low priority. If 
> two drivers have the same priority, they are served according to their submit 
> date in the queue.
> To set up such priority queues, the following changes are proposed:
>  * The Mesos Cluster Dispatcher can optionally be configured with the 
> _spark.mesos.dispatcher.queue.[QueueName]_ property. This property takes a 
> float as value. This adds a new queue named _QueueName_ for submitted drivers 
> with the specified priority.
>  Higher numbers indicate higher priority.
>  The user can then specify multiple queues.
>  * A driver can be submitted to a specific queue with 
> _spark.mesos.dispatcher.queue_. This property takes the name of a queue 
> previously declared in the dispatcher as value.
> By default, the dispatcher has a single "default" queue with 0.0 priority 
> (cannot be overridden). If none of the properties above are specified, the 
> behavior is the same as the current one (i.e. simple FIFO).
> Additionally, it is possible to implement a consistent and overall workload 
> management policy throughout the lifecycle of drivers by mapping these 
> priority queues to weighted Mesos roles if any (i.e. from the QUEUED state in 
> the dispatcher to the final states in the Mesos cluster), and by specifying a 
> _spark.mesos.role_ along with a _spark.mesos.dispatcher.queue_ when 
> submitting an application.
> For example, with the URGENT Mesos role:
>  # Conf on the dispatcher side
>  spark.mesos.dispatcher.queue.URGENT=1.0
>  # Conf on the driver side
>  spark.mesos.dispatcher.queue=URGENT
>  spark.mesos.role=URGENT



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20368) Support Sentry on PySpark workers

2018-02-28 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20368.
--
Resolution: Duplicate

Let me leave this resolved as a duplicate of SPARK-22959.

> Support Sentry on PySpark workers
> -
>
> Key: SPARK-20368
> URL: https://issues.apache.org/jira/browse/SPARK-20368
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Alexander Shorin
>Priority: Major
>
> [Sentry|https://sentry.io] is a system, well known among Python developers, for 
> capturing, classifying, tracking and explaining tracebacks, helping people better 
> understand what went wrong, how to reproduce the issue, and how to fix it.
> Any Python Spark application is effectively divided into two parts:
> 1. The part that runs on the "driver side". The user controls that part in every 
> way, and reporting to Sentry from there is easy to do.
> 2. The part that runs on executors, i.e. Python UDFs and the other transformation 
> functions. Unfortunately, we cannot currently provide such a feature there, and 
> that is the part this ticket is about.
> To simplify the development experience, it would be nice to have optional 
> Sentry support at the PySpark worker level.
> What could this feature look like?
> 1. PySpark would have a new extra named {{sentry}} which installs the Sentry client 
> and any other required dependencies. This is an optional 
> install-time dependency.
> 2. The PySpark worker would detect the presence of Sentry support and send 
> error reports there.
> 3. All Sentry configuration could and would be done via Sentry's standard 
> environment variables.
> What would this feature give users?
> 1. Better exceptions in Sentry. From the driver-side application, all of them 
> currently get recorded as a generic `Py4JJavaError` where the real executor 
> exception is buried in the traceback body.
> 2. A much easier understanding of the context in which things went wrong, and why.
> 3. Simpler debugging of Python UDFs and easier reproduction of issues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23517) Make pyspark.util._exception_message produce the trace from Java side for Py4JJavaError

2018-02-28 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-23517:


Assignee: Hyukjin Kwon

> Make pyspark.util._exception_message produce the trace from Java side for 
> Py4JJavaError
> ---
>
> Key: SPARK-23517
> URL: https://issues.apache.org/jira/browse/SPARK-23517
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.1
>
>
> Currently {{pyspark.util._exception_message}} doesn't show the trace and 
> message from a Py4JJavaError, as shown below:
> {code}
> >>> from pyspark.util import _exception_message
> >>> try:
> ... sc._jvm.java.lang.String(None)
> ... except Exception as e:
> ... pass
> ...
> >>> e.message
> ''
> {code}
> This is actually a problem in some code paths where we can expect this error. For 
> example, see:
> {code}
> from pyspark.sql.functions import udf
> spark.conf.set("spark.sql.execution.arrow.enabled", True)
> spark.range(1).select(udf(lambda x: [[]])()).toPandas()
> {code}
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark/python/pyspark/sql/dataframe.py", line 2009, in toPandas
> raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
> RuntimeError:
> Note: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to 
> disable this.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23517) Make pyspark.util._exception_message produce the trace from Java side for Py4JJavaError

2018-02-28 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23517.
--
   Resolution: Fixed
Fix Version/s: 2.3.1

Issue resolved by pull request 20680
[https://github.com/apache/spark/pull/20680]

> Make pyspark.util._exception_message produce the trace from Java side for 
> Py4JJavaError
> ---
>
> Key: SPARK-23517
> URL: https://issues.apache.org/jira/browse/SPARK-23517
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.1
>
>
> Currently {{pyspark.util._exception_message}} doesn't show the trace and 
> message from a Py4JJavaError, as shown below:
> {code}
> >>> from pyspark.util import _exception_message
> >>> try:
> ... sc._jvm.java.lang.String(None)
> ... except Exception as e:
> ... pass
> ...
> >>> e.message
> ''
> {code}
> This is actually a problem in some code paths where we can expect this error. For 
> example, see:
> {code}
> from pyspark.sql.functions import udf
> spark.conf.set("spark.sql.execution.arrow.enabled", True)
> spark.range(1).select(udf(lambda x: [[]])()).toPandas()
> {code}
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark/python/pyspark/sql/dataframe.py", line 2009, in toPandas
> raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
> RuntimeError:
> Note: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to 
> disable this.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23525) ALTER TABLE CHANGE COLUMN doesn't work for external hive table

2018-02-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380486#comment-16380486
 ] 

Apache Spark commented on SPARK-23525:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/20696

> ALTER TABLE CHANGE COLUMN doesn't work for external hive table
> --
>
> Key: SPARK-23525
> URL: https://issues.apache.org/jira/browse/SPARK-23525
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Pavlo Skliar
>Priority: Major
>
> {code:java}
> print(spark.sql("""
> SHOW CREATE TABLE test.trends
> """).collect()[0].createtab_stmt)
> /// OUTPUT
> CREATE EXTERNAL TABLE `test`.`trends`(`id` string COMMENT '', `metric` string 
> COMMENT '', `amount` bigint COMMENT '')
> COMMENT ''
> PARTITIONED BY (`date` string COMMENT '')
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
>   'serialization.format' = '1'
> )
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>   OUTPUTFORMAT 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION 's3://x/x/'
> TBLPROPERTIES (
>   'transient_lastDdlTime' = '1519729384',
>   'last_modified_time' = '1519645652',
>   'last_modified_by' = 'pavlo',
>   'last_castor_run_ts' = '1513561658.0'
> )
> spark.sql("""
> DESCRIBE test.trends
> """).collect()
> // OUTPUT
> [Row(col_name='id', data_type='string', comment=''),
>  Row(col_name='metric', data_type='string', comment=''),
>  Row(col_name='amount', data_type='bigint', comment=''),
>  Row(col_name='date', data_type='string', comment=''),
>  Row(col_name='# Partition Information', data_type='', comment=''),
>  Row(col_name='# col_name', data_type='data_type', comment='comment'),
>  Row(col_name='date', data_type='string', comment='')]
> spark.sql("""alter table test.trends change column id id string comment 
> 'unique identifier'""")
> spark.sql("""
> DESCRIBE test.trends
> """).collect()
> // OUTPUT
> [Row(col_name='id', data_type='string', comment=''), Row(col_name='metric', 
> data_type='string', comment=''), Row(col_name='amount', data_type='bigint', 
> comment=''), Row(col_name='date', data_type='string', comment=''), 
> Row(col_name='# Partition Information', data_type='', comment=''), 
> Row(col_name='# col_name', data_type='data_type', comment='comment'), 
> Row(col_name='date', data_type='string', comment='')]
> {code}
> The strange thing is that I've assigned a comment to the id field from Hive 
> successfully, and it's visible in the Hue UI, but it's still not visible from 
> Spark, and Spark requests don't have any effect on the comments.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23525) ALTER TABLE CHANGE COLUMN doesn't work for external hive table

2018-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23525:


Assignee: (was: Apache Spark)

> ALTER TABLE CHANGE COLUMN doesn't work for external hive table
> --
>
> Key: SPARK-23525
> URL: https://issues.apache.org/jira/browse/SPARK-23525
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Pavlo Skliar
>Priority: Major
>
> {code:java}
> print(spark.sql("""
> SHOW CREATE TABLE test.trends
> """).collect()[0].createtab_stmt)
> /// OUTPUT
> CREATE EXTERNAL TABLE `test`.`trends`(`id` string COMMENT '', `metric` string 
> COMMENT '', `amount` bigint COMMENT '')
> COMMENT ''
> PARTITIONED BY (`date` string COMMENT '')
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
>   'serialization.format' = '1'
> )
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>   OUTPUTFORMAT 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION 's3://x/x/'
> TBLPROPERTIES (
>   'transient_lastDdlTime' = '1519729384',
>   'last_modified_time' = '1519645652',
>   'last_modified_by' = 'pavlo',
>   'last_castor_run_ts' = '1513561658.0'
> )
> spark.sql("""
> DESCRIBE test.trends
> """).collect()
> // OUTPUT
> [Row(col_name='id', data_type='string', comment=''),
>  Row(col_name='metric', data_type='string', comment=''),
>  Row(col_name='amount', data_type='bigint', comment=''),
>  Row(col_name='date', data_type='string', comment=''),
>  Row(col_name='# Partition Information', data_type='', comment=''),
>  Row(col_name='# col_name', data_type='data_type', comment='comment'),
>  Row(col_name='date', data_type='string', comment='')]
> spark.sql("""alter table test.trends change column id id string comment 
> 'unique identifier'""")
> spark.sql("""
> DESCRIBE test.trends
> """).collect()
> // OUTPUT
> [Row(col_name='id', data_type='string', comment=''), Row(col_name='metric', 
> data_type='string', comment=''), Row(col_name='amount', data_type='bigint', 
> comment=''), Row(col_name='date', data_type='string', comment=''), 
> Row(col_name='# Partition Information', data_type='', comment=''), 
> Row(col_name='# col_name', data_type='data_type', comment='comment'), 
> Row(col_name='date', data_type='string', comment='')]
> {code}
> The strange thing is that I've assigned a comment to the id field from Hive 
> successfully, and it's visible in the Hue UI, but it's still not visible from 
> Spark, and Spark requests don't have any effect on the comments.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23525) ALTER TABLE CHANGE COLUMN doesn't work for external hive table

2018-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23525:


Assignee: Apache Spark

> ALTER TABLE CHANGE COLUMN doesn't work for external hive table
> --
>
> Key: SPARK-23525
> URL: https://issues.apache.org/jira/browse/SPARK-23525
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Pavlo Skliar
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> print(spark.sql("""
> SHOW CREATE TABLE test.trends
> """).collect()[0].createtab_stmt)
> /// OUTPUT
> CREATE EXTERNAL TABLE `test`.`trends`(`id` string COMMENT '', `metric` string 
> COMMENT '', `amount` bigint COMMENT '')
> COMMENT ''
> PARTITIONED BY (`date` string COMMENT '')
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
>   'serialization.format' = '1'
> )
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>   OUTPUTFORMAT 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION 's3://x/x/'
> TBLPROPERTIES (
>   'transient_lastDdlTime' = '1519729384',
>   'last_modified_time' = '1519645652',
>   'last_modified_by' = 'pavlo',
>   'last_castor_run_ts' = '1513561658.0'
> )
> spark.sql("""
> DESCRIBE test.trends
> """).collect()
> // OUTPUT
> [Row(col_name='id', data_type='string', comment=''),
>  Row(col_name='metric', data_type='string', comment=''),
>  Row(col_name='amount', data_type='bigint', comment=''),
>  Row(col_name='date', data_type='string', comment=''),
>  Row(col_name='# Partition Information', data_type='', comment=''),
>  Row(col_name='# col_name', data_type='data_type', comment='comment'),
>  Row(col_name='date', data_type='string', comment='')]
> spark.sql("""alter table test.trends change column id id string comment 
> 'unique identifier'""")
> spark.sql("""
> DESCRIBE test.trends
> """).collect()
> // OUTPUT
> [Row(col_name='id', data_type='string', comment=''), Row(col_name='metric', 
> data_type='string', comment=''), Row(col_name='amount', data_type='bigint', 
> comment=''), Row(col_name='date', data_type='string', comment=''), 
> Row(col_name='# Partition Information', data_type='', comment=''), 
> Row(col_name='# col_name', data_type='data_type', comment='comment'), 
> Row(col_name='date', data_type='string', comment='')]
> {code}
> The strange thing is that I've assigned a comment to the id field from Hive 
> successfully, and it's visible in the Hue UI, but it's still not visible from 
> Spark, and Spark requests don't have any effect on the comments.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23508) blockManagerIdCache in BlockManagerId may cause oom

2018-02-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23508:
---

Assignee: zhoukang

> blockManagerIdCache in BlockManagerId may cause oom
> ---
>
> Key: SPARK-23508
> URL: https://issues.apache.org/jira/browse/SPARK-23508
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 2.1.1, 2.2.1
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Fix For: 2.2.2, 2.3.1, 2.4.0
>
> Attachments: elepahnt-oom1.png, elephant-oom.png
>
>
> blockManagerIdCache in BlockManagerId will not remove old values, which may 
> cause an OOM:
> {code:java}
> val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, 
> BlockManagerId]()
> {code}
> Whenever we request a new BlockManagerId (via the apply method), it is put into this map.
> Below is a jmap snapshot:
> !elepahnt-oom1.png!
> !elephant-oom.png!
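
One way to bound the cache (a hypothetical sketch only, not necessarily what the linked pull request does) is to swap the unbounded ConcurrentHashMap for a size-limited Guava cache so stale entries get evicted instead of accumulating for the lifetime of the JVM:

{code:scala}
import com.google.common.cache.CacheBuilder
import org.apache.spark.storage.BlockManagerId

// Size-bounded cache: once more than 10000 distinct BlockManagerIds have been
// seen, least-recently-used entries are evicted rather than kept forever.
val blockManagerIdCache = CacheBuilder.newBuilder()
  .maximumSize(10000L)
  .build[BlockManagerId, BlockManagerId]()

// Lookup-or-insert, replacing the old putIfAbsent-style use of the map:
// blockManagerIdCache.get(id, () => id)  // Scala 2.12+ SAM syntax for Callable
{code}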



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23508) blockManagerIdCache in BlockManagerId may cause oom

2018-02-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23508.
-
   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0
   2.2.2

Issue resolved by pull request 20667
[https://github.com/apache/spark/pull/20667]

> blockManagerIdCache in BlockManagerId may cause oom
> ---
>
> Key: SPARK-23508
> URL: https://issues.apache.org/jira/browse/SPARK-23508
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 2.1.1, 2.2.1
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Fix For: 2.2.2, 2.4.0, 2.3.1
>
> Attachments: elepahnt-oom1.png, elephant-oom.png
>
>
> blockManagerIdCache in BlockManagerId will not remove old values, which may 
> cause an OOM:
> {code:java}
> val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, 
> BlockManagerId]()
> {code}
> Whenever we request a new BlockManagerId (via the apply method), it is put into this map.
> Below is a jmap snapshot:
> !elepahnt-oom1.png!
> !elephant-oom.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23527) Error with spark-submit and kerberos with TLS-enabled Hadoop cluster

2018-02-28 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380440#comment-16380440
 ] 

Yuming Wang commented on SPARK-23527:
-

It isn't a bug; please check your hostname: {{host1.domain.com;host2.domain.com}}

> Error with spark-submit and kerberos with TLS-enabled Hadoop cluster
> 
>
> Key: SPARK-23527
> URL: https://issues.apache.org/jira/browse/SPARK-23527
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.1
> Environment: core-site.xml
> <property>
>     <name>hadoop.security.key.provider.path</name>
>     <value>kms://ht...@host1.domain.com;host2.domain.com:16000/kms</value>
> </property>
> hdfs-site.xml
> <property>
>     <name>dfs.encryption.key.provider.uri</name>
>     <value>kms://ht...@host1.domain.com;host2.domain.com:16000/kms</value>
> </property>
>Reporter: Ron Gonzalez
>Priority: Critical
>
> For current configuration of our enterprise cluster, I submit using 
> spark-submit:
> ./spark-submit --master yarn --deploy-mode cluster --class 
> org.apache.spark.examples.SparkPi --conf 
> spark.yarn.jars=hdfs:/user/user1/spark/lib/*.jar 
> ../examples/jars/spark-examples_2.11-2.2.1.jar 10
> I am getting the following problem:
>  
> 18/02/27 21:03:48 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 
> 3351181 for svchdc236d on ha-hdfs:nameservice1
> Exception in thread "main" java.lang.IllegalArgumentException: 
> java.net.UnknownHostException: host1.domain.com;host2.domain.com
>  at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
>  at 
> org.apache.hadoop.crypto.key.kms.KMSClientProvider.getDelegationTokenService(KMSClientProvider.java:825)
>  at 
> org.apache.hadoop.crypto.key.kms.KMSClientProvider.addDelegationTokens(KMSClientProvider.java:781)
>  at 
> org.apache.hadoop.crypto.key.KeyProviderDelegationTokenExtension.addDelegationTokens(KeyProviderDelegationTokenExtension.java:86)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2046)
>  at 
> org.apache.spark.deploy.yarn.security.HadoopFSCredentialProvider$$anonfun$obtainCredentials$1.apply(HadoopFSCredentialProvider.scala:52)
>  
> If I get rid of the other host in the properties, so that instead of 
> kms://ht...@host1.domain.com;host2.domain.com:16000/kms I use:
> kms://ht...@host1.domain.com:16000/kms
> it fails with a different error:
> java.io.IOException: javax.net.ssl.SSLHandshakeException: 
> sun.security.validator.ValidatorException: PKIX path building failed: 
> sun.security.provider.certpath.SunCertPathBuilderException: unable to find 
> valid certification path to requested target
> If I do the same thing using Spark 1.6, it works, so it seems like a 
> regression...
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23528) Expose vital statistics of GaussianMixtureModel

2018-02-28 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380426#comment-16380426
 ] 

Marco Gaido commented on SPARK-23528:
-

The log likelihood is already available in the summary (e.g. 
{{model.summary.logLikelihood}}). I will soon submit a PR adding the number of 
iterations. Thanks for reporting this.
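
For reference, a minimal example of reading the log likelihood from the training summary (a hedged sketch; assumes Spark 2.2+ and an existing DataFrame named {{dataset}} with a vector column "features"):

{code:scala}
import org.apache.spark.ml.clustering.GaussianMixture

val gmm = new GaussianMixture()
  .setK(2)
  .setFeaturesCol("features")

// `dataset` is assumed to be a DataFrame with a Vector column named "features".
val model = gmm.fit(dataset)

// Final log likelihood of the fitted model; the actual iteration count is the
// statistic this ticket asks to expose in addition.
println(model.summary.logLikelihood)
{code}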

> Expose vital statistics of GaussianMixtureModel
> ---
>
> Key: SPARK-23528
> URL: https://issues.apache.org/jira/browse/SPARK-23528
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Erich Schubert
>Priority: Minor
>
> Spark ML should expose vital statistics of the GMM model:
>  * *Number of iterations* (actual, not max) until the tolerance threshold was 
> hit: we can set a maximum, but how do we know the limit was large enough, and 
> how many iterations it really took?
>  * Final *log likelihood* of the model: if we run multiple times with 
> different starting conditions, how do we know which run converged to the 
> better fit?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16996) Hive ACID delta files not seen

2018-02-28 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380370#comment-16380370
 ] 

Steve Loughran commented on SPARK-16996:


Like I said, Spark is trouble; we've just been including the custom one used in 
spark itself because it is not standard at all

> Hive ACID delta files not seen
> --
>
> Key: SPARK-16996
> URL: https://issues.apache.org/jira/browse/SPARK-16996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.3, 2.1.2, 2.2.0
> Environment: Hive 1.2.1, Spark 1.5.2
>Reporter: Benjamin BONNET
>Priority: Critical
>
> spark-sql seems not to see data stored as delta files in an ACID Hive table.
> Actually I encountered the same problem as described here: 
> http://stackoverflow.com/questions/35955666/spark-sql-is-not-returning-records-for-hive-transactional-tables-on-hdp
> For example, create an ACID table with HiveCLI and insert a row :
> {code}
> set hive.support.concurrency=true;
> set hive.enforce.bucketing=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> set hive.compactor.initiator.on=true;
> set hive.compactor.worker.threads=1;
>  CREATE TABLE deltas(cle string,valeur string) CLUSTERED BY (cle) INTO 1 
> BUCKETS
> ROW FORMAT SERDE  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> STORED AS 
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> TBLPROPERTIES ('transactional'='true');
> INSERT INTO deltas VALUES("a","a");
> {code}
> Then make a query with spark-sql CLI :
> {code}
> SELECT * FROM deltas;
> {code}
> That query gets no result and there are no errors in logs.
> If you go to HDFS to inspect table files, you find only deltas
> {code}
> ~>hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 1 items
> drwxr-x---   - me hdfs  0 2016-08-10 14:03 
> /apps/hive/warehouse/deltas/delta_0020943_0020943
> {code}
> Then if you run compaction on that table (in HiveCLI) :
> {code}
> ALTER TABLE deltas COMPACT 'MAJOR';
> {code}
> As a result, the delta will be compacted into a base file:
> {code}
> ~>hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 1 items
> drwxrwxrwx   - me hdfs  0 2016-08-10 15:25 
> /apps/hive/warehouse/deltas/base_0020943
> {code}
> Go back to spark-sql and the same query gets a result :
> {code}
> SELECT * FROM deltas;
> a   a
> Time taken: 0.477 seconds, Fetched 1 row(s)
> {code}
> But next time you make an insert into Hive table : 
> {code}
> INSERT INTO deltas VALUES("b","b");
> {code}
> spark-sql will immediately see changes : 
> {code}
> SELECT * FROM deltas;
> a   a
> b   b
> Time taken: 0.122 seconds, Fetched 2 row(s)
> {code}
> Yet there was no other compaction, but spark-sql "sees" the base AND the 
> delta file:
> {code}
> ~> hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 2 items
> drwxrwxrwx   - valdata hdfs  0 2016-08-10 15:25 
> /apps/hive/warehouse/deltas/base_0020943
> drwxr-x---   - valdata hdfs  0 2016-08-10 15:31 
> /apps/hive/warehouse/deltas/delta_0020956_0020956
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21741) Python API for DataFrame-based multivariate summarizer

2018-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21741:


Assignee: Apache Spark

> Python API for DataFrame-based multivariate summarizer
> --
>
> Key: SPARK-21741
> URL: https://issues.apache.org/jira/browse/SPARK-21741
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Major
>
> We added support for a multivariate summarizer in the DataFrame API in SPARK-19634; 
> we should also make PySpark support it. 
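
For context, a hedged sketch of the Scala/DataFrame API from SPARK-19634 that a PySpark wrapper would mirror (assumes Spark 2.3+, an existing SparkSession {{spark}}, and a DataFrame {{df}} with a vector column "features"):

{code:scala}
import org.apache.spark.ml.stat.Summarizer
import spark.implicits._

// Compute several summary statistics over the vector column in one pass.
val stats = df.select(
  Summarizer.metrics("mean", "variance", "count")
    .summary($"features")
    .alias("summary"))

stats.show(truncate = false)
{code}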



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21741) Python API for DataFrame-based multivariate summarizer

2018-02-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380368#comment-16380368
 ] 

Apache Spark commented on SPARK-21741:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/20695

> Python API for DataFrame-based multivariate summarizer
> --
>
> Key: SPARK-21741
> URL: https://issues.apache.org/jira/browse/SPARK-21741
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Priority: Major
>
> We added support for a multivariate summarizer in the DataFrame API in SPARK-19634; 
> we should also make PySpark support it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21741) Python API for DataFrame-based multivariate summarizer

2018-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21741:


Assignee: (was: Apache Spark)

> Python API for DataFrame-based multivariate summarizer
> --
>
> Key: SPARK-21741
> URL: https://issues.apache.org/jira/browse/SPARK-21741
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Priority: Major
>
> We added support for a multivariate summarizer in the DataFrame API in SPARK-19634; 
> we should also make PySpark support it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23535) MinMaxScaler return 0.5 for an all zero column

2018-02-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23535:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

(Not a bug -- any value in the range 0 to 1 is coherent in this case.)

The first example shows the third column has values -1, 1, 3 -- is that just a 
typo?

I don't know that 0 or 1 are better answers here. Changing it changes behavior, 
too. Unless there's a modestly compelling case for matching behavior of another 
tool, I'd leave this.
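
For reference, a hedged sketch (not the actual MLlib code) of the element-wise rule being discussed, where [minOut, maxOut] is the configured output range and colMin/colMax are the column's observed min and max:

{code:scala}
// For a constant column (colMax == colMin) the ratio is undefined, so the
// current behavior emits the midpoint of the output range -- 0.5 for the
// default range [0, 1].
def rescale(v: Double, colMin: Double, colMax: Double,
            minOut: Double = 0.0, maxOut: Double = 1.0): Double = {
  val ratio = if (colMax == colMin) 0.5 else (v - colMin) / (colMax - colMin)
  ratio * (maxOut - minOut) + minOut
}
{code}

Changing the constant-column ratio from 0.5 to 0.0 is what would make the output match sklearn's.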

> MinMaxScaler return 0.5 for an all zero column
> --
>
> Key: SPARK-23535
> URL: https://issues.apache.org/jira/browse/SPARK-23535
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Yigal Weinberger
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When applying MinMaxScaler to a column that contains only zeros, the output is 0.5 
> for the whole column. 
> This is inconsistent with the sklearn implementation.
>  
> Steps to reproduce:
>  
>  
> {code:java}
> from pyspark.ml.feature import MinMaxScaler
> from pyspark.ml.linalg import Vectors
> dataFrame = spark.createDataFrame([
> (0, Vectors.dense([1.0, 0.1, -1.0]),),
> (1, Vectors.dense([2.0, 1.1, 1.0]),),
> (2, Vectors.dense([3.0, 10.1, 3.0]),)
> ], ["id", "features"])
> scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")
> # Compute summary statistics and generate MinMaxScalerModel
> scalerModel = scaler.fit(dataFrame)
> # rescale each feature to range [min, max].
> scaledData = scalerModel.transform(dataFrame)
> print("Features scaled to range: [%f, %f]" % (scaler.getMin(), 
> scaler.getMax()))
> scaledData.select("features", "scaledFeatures").show()
> {code}
> Features scaled to range: [0.00, 1.00]
> +--+--+
> |features|scaledFeatures|
> +--+--+
> | [1.0,0.1,0.0]| [0.0,0.0,*0.5*]|
> | [2.0,1.1,0.0]| [0.5,0.1,*0.5*]|
> | [3.0,10.1,0.0]| [1.0,1.0,*0.5*]|
> +--+--+
>  
> VS.
> {code:java}
> from sklearn.preprocessing import MinMaxScaler
> mms = MinMaxScaler(copy=False)
> test = np.array([[1.0, 0.1, 0],[2.0, 1.1, 0],[3.0, 10.1, 0]])
> print (mms.fit_transform(test))
> {code}
>  
> Output:
> [[ 0. 0. *0.* ]
> [ 0.5 0.1 *0.* ]
> [ 1. 1. *0.* ]]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23529) Specify hostpath volume and mount the volume in Spark driver and executor pods in Kubernetes

2018-02-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23529:
--
Target Version/s:   (was: 2.3.0)
   Fix Version/s: (was: 2.3.0)

> Specify hostpath volume and mount the volume in Spark driver and executor 
> pods in Kubernetes
> 
>
> Key: SPARK-23529
> URL: https://issues.apache.org/jira/browse/SPARK-23529
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Suman Somasundar
>Assignee: Anirudh Ramanathan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23535) MinMaxScaler return 0.5 for an all zero column

2018-02-28 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380334#comment-16380334
 ] 

Marco Gaido commented on SPARK-23535:
-

I checked and each tool behaves in its own way when this case happens. sklearn 
behaves as you described. Rapidminer returns the min value if the value is less 
than the max value and the max value otherwise (ie. 0 if v < 1 else 1). Matlab 
assumes that this is not the case, otherwise it doesn't perform any 
transformation.

I am not sure if Spark has to strictly mirror sklearn's behavior since this 
case is not handled in a standard way across the tools. What do you think 
[~mlnick] [~srowen] [~josephkb]?

> MinMaxScaler return 0.5 for an all zero column
> --
>
> Key: SPARK-23535
> URL: https://issues.apache.org/jira/browse/SPARK-23535
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Yigal Weinberger
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When applying MinMaxScaler to a column that contains only zeros, the output is 0.5 
> for the whole column. 
> This is inconsistent with the sklearn implementation.
>  
> Steps to reproduce:
>  
>  
> {code:java}
> from pyspark.ml.feature import MinMaxScaler
> from pyspark.ml.linalg import Vectors
> dataFrame = spark.createDataFrame([
> (0, Vectors.dense([1.0, 0.1, -1.0]),),
> (1, Vectors.dense([2.0, 1.1, 1.0]),),
> (2, Vectors.dense([3.0, 10.1, 3.0]),)
> ], ["id", "features"])
> scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")
> # Compute summary statistics and generate MinMaxScalerModel
> scalerModel = scaler.fit(dataFrame)
> # rescale each feature to range [min, max].
> scaledData = scalerModel.transform(dataFrame)
> print("Features scaled to range: [%f, %f]" % (scaler.getMin(), 
> scaler.getMax()))
> scaledData.select("features", "scaledFeatures").show()
> {code}
> Features scaled to range: [0.00, 1.00]
> +--+--+
> |features|scaledFeatures|
> +--+--+
> | [1.0,0.1,0.0]| [0.0,0.0,*0.5*]|
> | [2.0,1.1,0.0]| [0.5,0.1,*0.5*]|
> | [3.0,10.1,0.0]| [1.0,1.0,*0.5*]|
> +--+--+
>  
> VS.
> {code:java}
> from sklearn.preprocessing import MinMaxScaler
> mms = MinMaxScaler(copy=False)
> test = np.array([[1.0, 0.1, 0],[2.0, 1.1, 0],[3.0, 10.1, 0]])
> print (mms.fit_transform(test))
> {code}
>  
> Output:
> [[ 0. 0. *0.* ]
> [ 0.5 0.1 *0.* ]
> [ 1. 1. *0.* ]]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable

2018-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23173:


Assignee: Apache Spark

> from_json can produce nulls for fields which are marked as non-nullable
> ---
>
> Key: SPARK-23173
> URL: https://issues.apache.org/jira/browse/SPARK-23173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Major
>  Labels: release-notes
>
> The {{from_json}} function uses a schema to convert a string into a Spark SQL 
> struct. This schema can contain non-nullable fields. The underlying 
> {{JsonToStructs}} expression does not check if a resulting struct respects 
> the nullability of the schema. This leads to very weird problems in consuming 
> expressions. In our case parquet writing would produce an illegal parquet 
> file.
> There are roughly two solutions here:
>  # Assume that each field in schema passed to {{from_json}} is nullable, and 
> ignore the nullability information set in the passed schema.
>  # Validate the object during runtime, and fail execution if the data is null 
> where we are not expecting this.
> I currently am slightly in favor of option 1, since this is the more 
> performant option and a lot easier to do.
> WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz]
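
A minimal reproduction sketch (hedged; assumes an existing SparkSession {{spark}} and spark-shell style implicits):

{code:scala}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._

// The schema declares 'a' as non-nullable...
val schema = new StructType().add(StructField("a", IntegerType, nullable = false))

// ...but parsing a record with a null (or missing) value for that field still
// yields a struct whose 'a' is null.
val df = Seq("""{"a": null}""").toDF("json")
df.select(from_json($"json", schema).alias("parsed")).show(false)
// The parsed struct has a = null even though the schema marks 'a' as non-nullable.
{code}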



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable

2018-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23173:


Assignee: (was: Apache Spark)

> from_json can produce nulls for fields which are marked as non-nullable
> ---
>
> Key: SPARK-23173
> URL: https://issues.apache.org/jira/browse/SPARK-23173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Herman van Hovell
>Priority: Major
>  Labels: release-notes
>
> The {{from_json}} function uses a schema to convert a string into a Spark SQL 
> struct. This schema can contain non-nullable fields. The underlying 
> {{JsonToStructs}} expression does not check if a resulting struct respects 
> the nullability of the schema. This leads to very weird problems in consuming 
> expressions; in our case, Parquet writing would produce an illegal Parquet file.
> There are roughly two solutions here:
>  # Assume that each field in the schema passed to {{from_json}} is nullable, and 
> ignore the nullability information set in the passed schema.
>  # Validate the object at runtime, and fail execution if the data is null 
> where we are not expecting this.
> I am currently slightly in favor of option 1, since it is the more 
> performant option and a lot easier to implement.
> WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable

2018-02-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380250#comment-16380250
 ] 

Apache Spark commented on SPARK-23173:
--

User 'mswit-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/20694

> from_json can produce nulls for fields which are marked as non-nullable
> ---
>
> Key: SPARK-23173
> URL: https://issues.apache.org/jira/browse/SPARK-23173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Herman van Hovell
>Priority: Major
>  Labels: release-notes
>
> The {{from_json}} function uses a schema to convert a string into a Spark SQL 
> struct. This schema can contain non-nullable fields. The underlying 
> {{JsonToStructs}} expression does not check if a resulting struct respects 
> the nullability of the schema. This leads to very weird problems in consuming 
> expressions; in our case, Parquet writing would produce an illegal Parquet file.
> There are roughly two solutions here:
>  # Assume that each field in the schema passed to {{from_json}} is nullable, and 
> ignore the nullability information set in the passed schema.
>  # Validate the object at runtime, and fail execution if the data is null 
> where we are not expecting this.
> I am currently slightly in favor of option 1, since it is the more 
> performant option and a lot easier to implement.
> WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23523) Incorrect result caused by the rule OptimizeMetadataOnlyQuery

2018-02-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380247#comment-16380247
 ] 

Apache Spark commented on SPARK-23523:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/20693

> Incorrect result caused by the rule OptimizeMetadataOnlyQuery
> -
>
> Key: SPARK-23523
> URL: https://issues.apache.org/jira/browse/SPARK-23523
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.4.0
>
>
> {code:scala}
>  val tablePath = new File(s"${path.getCanonicalPath}/cOl3=c/cOl1=a/cOl5=e")
>  Seq(("a", "b", "c", "d", "e")).toDF("cOl1", "cOl2", "cOl3", "cOl4", "cOl5")
>  .write.json(tablePath.getCanonicalPath)
>  val df = spark.read.json(path.getCanonicalPath).select("CoL1", "CoL5", 
> "CoL3").distinct()
>  df.show()
> {code}
> This returns the wrong result {{[c,e,a]}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23531) When explain, plan's output should include attribute type info

2018-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23531:


Assignee: (was: Apache Spark)

> When explain, plan's output should include attribute type info
> --
>
> Key: SPARK-23531
> URL: https://issues.apache.org/jira/browse/SPARK-23531
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23531) When explain, plan's output should include attribute type info

2018-02-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380235#comment-16380235
 ] 

Apache Spark commented on SPARK-23531:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/20692

> When explain, plan's output should include attribute type info
> --
>
> Key: SPARK-23531
> URL: https://issues.apache.org/jira/browse/SPARK-23531
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23531) When explain, plan's output should include attribute type info

2018-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23531:


Assignee: Apache Spark

> When explain, plan's output should include attribute type info
> --
>
> Key: SPARK-23531
> URL: https://issues.apache.org/jira/browse/SPARK-23531
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14974) spark sql job create too many files in HDFS when doing insert overwrite hive table

2018-02-28 Thread Kevin Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380126#comment-16380126
 ] 

Kevin Zhang commented on SPARK-14974:
-

I encountered the same problem as [~ussraf] in Spark 2.2 and 2.3, and I'm not 
quite sure how to fix it. Is there any plan to reopen this issue?

> spark sql job create too many files in HDFS when doing insert overwrite hive 
> table
> --
>
> Key: SPARK-14974
> URL: https://issues.apache.org/jira/browse/SPARK-14974
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: zenglinxi
>Priority: Minor
>
> Recently, we have often encountered problems when using Spark SQL to insert data 
> into a partitioned table (e.g.: insert overwrite table $output_table partition(dt) 
> select xxx from tmp_table).
> After the Spark job starts running on YARN, the application can create far too many 
> files (e.g. 2,000,000, or even 10,000,000), which puts HDFS under enormous 
> pressure.
> We found that the number of files created by the Spark job depends on the number of 
> partitions of the Hive table being inserted into and on the number of Spark SQL 
> shuffle partitions:
> files_num = hive_table_partitions_num * spark_sql_partitions_num.
> We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 1000, and 
> hive_table_partitions_num is usually small under normal circumstances, but it can 
> turn out to be more than 2000 when we accidentally use a wrong field as the 
> partition field, which makes files_num >= 1000 * 2000 = 2,000,000.
> There is a configuration parameter in Hive, hive.exec.max.dynamic.partitions.pernode, 
> that limits the maximum number of dynamic partitions allowed to be created in each 
> mapper/reducer, but this parameter does not take effect when we use HiveContext.
> Reducing spark_sql_partitions_num (spark.sql.shuffle.partitions) makes files_num 
> smaller, but it also reduces the concurrency.
> Can we add configuration parameters to limit the maximum number of files allowed to 
> be created by each task, or to limit spark_sql_partitions_num without affecting the 
> concurrency?
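
As a hedged illustration of a commonly used mitigation (this is not a fix proposed in 
this ticket, and the table and column names below are made up): repartitioning by the 
dynamic partition column before writing keeps each Hive partition's data within a small 
number of tasks, so files_num stays close to hive_table_partitions_num instead of 
hive_table_partitions_num * spark_sql_partitions_num.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.table("tmp_table")

# Shuffle the rows so that all rows for a given dt value land in the same task,
# which bounds the number of output files per Hive partition.
(df.repartition("dt")
   .write
   .insertInto("output_table", overwrite=True))
{code}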



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23535) MinMaxScaler return 0.5 for an all zero column

2018-02-28 Thread Yigal Weinberger (JIRA)
Yigal Weinberger created SPARK-23535:


 Summary: MinMaxScaler return 0.5 for an all zero column
 Key: SPARK-23535
 URL: https://issues.apache.org/jira/browse/SPARK-23535
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.0.0
Reporter: Yigal Weinberger


When applying MinMaxScaler to a column that contains only 0, the output is 0.5 
for the entire column.

This is inconsistent with the sklearn implementation.

 

Steps to reproduce:

 

 
{code:java}
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors

dataFrame = spark.createDataFrame([
(0, Vectors.dense([1.0, 0.1, -1.0]),),
(1, Vectors.dense([2.0, 1.1, 1.0]),),
(2, Vectors.dense([3.0, 10.1, 3.0]),)
], ["id", "features"])

scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")

# Compute summary statistics and generate MinMaxScalerModel
scalerModel = scaler.fit(dataFrame)

# rescale each feature to range [min, max].
scaledData = scalerModel.transform(dataFrame)
print("Features scaled to range: [%f, %f]" % (scaler.getMin(), scaler.getMax()))
scaledData.select("features", "scaledFeatures").show()
{code}
Features scaled to range: [0.00, 1.00]

+--------------+----------------+
|      features|  scaledFeatures|
+--------------+----------------+
| [1.0,0.1,0.0]| [0.0,0.0,*0.5*]|
| [2.0,1.1,0.0]| [0.5,0.1,*0.5*]|
|[3.0,10.1,0.0]| [1.0,1.0,*0.5*]|
+--------------+----------------+

 

VS.
{code:java}
import numpy as np
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler(copy=False)
test = np.array([[1.0, 0.1, 0], [2.0, 1.1, 0], [3.0, 10.1, 0]])
print (mms.fit_transform(test))
{code}
 

Output:

[[ 0.   0.   *0.* ]
 [ 0.5  0.1  *0.* ]
 [ 1.   1.   *0.* ]]
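
For context, a hedged note on where the 0.5 comes from (based on my reading of the 
Spark ML documentation, not on anything stated in this report): Spark's MinMaxScaler 
rescales each feature as (e - E_min) / (E_max - E_min) * (max - min) + min and, when 
E_max == E_min, falls back to 0.5 * (max + min), which is 0.5 for the default 
[0.0, 1.0] range. scikit-learn instead treats a zero data range as 1, so a constant 
column maps to the lower bound of the feature range. The toy functions below only 
mirror that arithmetic:
{code:python}
def spark_like_rescale(e, e_min, e_max, lo=0.0, hi=1.0):
    # Mirrors the constant-column fallback described in the Spark ML docs.
    if e_max == e_min:
        return 0.5 * (hi + lo)
    return (e - e_min) / (e_max - e_min) * (hi - lo) + lo


def sklearn_like_rescale(e, e_min, e_max, lo=0.0, hi=1.0):
    # Mirrors scikit-learn, which replaces a zero data range with 1.
    scale = (e_max - e_min) or 1.0
    return (e - e_min) / scale * (hi - lo) + lo


print(spark_like_rescale(0.0, 0.0, 0.0))    # 0.5
print(sklearn_like_rescale(0.0, 0.0, 0.0))  # 0.0
{code}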

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-02-28 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-23534:

Description: 
Major Hadoop vendors have already moved, or will soon move, to Hadoop 3.0, so we should 
also make sure Spark can run with Hadoop 3.0. This JIRA tracks the work to make Spark 
run on Hadoop 3.0.

The work includes:
 # Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0.
 # Test whether there are dependency issues with Hadoop 3.0.
 # Investigate the feasibility of using the shaded client jars (HADOOP-11804).

  was:
This Jira tracks the work to make Spark run on Hadoop 3.0.

The work includes:
 # Add a Hadoop 3.0.0 new profile to make Spark build-able with Hadoop 3.0.
 # Test to see if there's dependency issues with Hadoop 3.0.
 # Investigating the feasibility to use shaded client jars (HADOOP-11804).


> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors have already moved, or will soon move, to Hadoop 3.0, so we 
> should also make sure Spark can run with Hadoop 3.0. This JIRA tracks the work to 
> make Spark run on Hadoop 3.0.
> The work includes:
>  # Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0.
>  # Test whether there are dependency issues with Hadoop 3.0.
>  # Investigate the feasibility of using the shaded client jars (HADOOP-11804).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18161) Default PickleSerializer pickle protocol doesn't handle > 4GB objects

2018-02-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379971#comment-16379971
 ] 

Apache Spark commented on SPARK-18161:
--

User 'inpefess' has created a pull request for this issue:
https://github.com/apache/spark/pull/20691

> Default PickleSerializer pickle protocol doesn't handle > 4GB objects
> -
>
> Key: SPARK-18161
> URL: https://issues.apache.org/jira/browse/SPARK-18161
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Sloane Simmons
>Priority: Major
>
> When broadcasting a fairly large numpy matrix in a Spark 2.0.1 program, there 
> is an error serializing the object with:
> {{OverflowError: cannot serialize a bytes object larger than 4 GiB}}
> in the stack trace.
> This is because Python's pickle serialization (with protocol <= 3) uses a 
> 32-bit integer for the object size, and so cannot handle objects larger than 
> 4 gigabytes.  This was changed in Protocol 4 of pickle 
> (https://www.python.org/dev/peps/pep-3154/#bit-opcodes-for-large-objects) and 
> is available in Python 3.4+.  
> I would like to use this protocol for broadcasting, and in the default 
> PickleSerializer where available, to make PySpark more robust when broadcasting 
> large variables.
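
A standalone sketch of the underlying Python behaviour (hedged: this is plain pickle 
rather than PySpark's serializer, and actually running it needs well over 4 GB of free 
memory):
{code:python}
import pickle

big = bytes(5 * 1024 ** 3)  # a 5 GiB bytes object filled with zeros

try:
    # Protocols <= 3 frame a bytes object with a 32-bit length and fail here.
    pickle.dumps(big, protocol=3)
except OverflowError as e:
    print("protocol 3 failed:", e)

# Protocol 4 (Python 3.4+) uses 8-byte framing and handles the same object.
data = pickle.dumps(big, protocol=4)
print("protocol 4 produced", len(data), "bytes")
{code}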



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-02-28 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-23534:
---

 Summary: Spark run on Hadoop 3.0.0
 Key: SPARK-23534
 URL: https://issues.apache.org/jira/browse/SPARK-23534
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.3.0
Reporter: Saisai Shao


This JIRA tracks the work to make Spark run on Hadoop 3.0.

The work includes:
 # Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0.
 # Test whether there are dependency issues with Hadoop 3.0.
 # Investigate the feasibility of using the shaded client jars (HADOOP-11804).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23531) When explain, plan's output should include attribute type info

2018-02-28 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379967#comment-16379967
 ] 

Marco Gaido commented on SPARK-23531:
-

I am working on this. I will submit a PR soon.

> When explain, plan's output should include attribute type info
> --
>
> Key: SPARK-23531
> URL: https://issues.apache.org/jira/browse/SPARK-23531
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org