[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278148#comment-16278148 ] zhengruifeng commented on SPARK-19634: -- I think we can now use the new summarizer in the algs. > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > Fix For: 2.3.0 > > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22674) PySpark breaks serialization of namedtuple subclasses
[ https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278142#comment-16278142 ] Hyukjin Kwon commented on SPARK-22674: -- If that deduplication brings a performance regression, or is difficult to port, we should consider a separate fix as you did. Sure. Sorry, I overlooked your comments. > PySpark breaks serialization of namedtuple subclasses > - > > Key: SPARK-22674 > URL: https://issues.apache.org/jira/browse/SPARK-22674 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Jonas Amrich > > PySpark monkey-patches the namedtuple class to make it serializable; however, > this breaks serialization of its subclasses. With the current implementation, any > subclass will be serialized (and deserialized) as its parent namedtuple. > Consider this code, which will fail with {{AttributeError: 'Point' object has > no attribute 'sum'}}: > {code} > from collections import namedtuple > Point = namedtuple("Point", "x y") > class PointSubclass(Point): > def sum(self): > return self.x + self.y > rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]]) > rdd.collect()[0][0].sum() > {code} > Moreover, as PySpark hijacks all namedtuples in the main module, importing > pyspark breaks serialization of namedtuple subclasses even in code which is > not related to Spark / distributed execution. I don't see any clean solution > to this; a possible workaround may be to limit the serialization hack to > direct namedtuple subclasses like in > https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204
[jira] [Assigned] (SPARK-22690) Imputer inherit HasOutputCols
[ https://issues.apache.org/jira/browse/SPARK-22690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22690: Assignee: (was: Apache Spark) > Imputer inherit HasOutputCols > - > > Key: SPARK-22690 > URL: https://issues.apache.org/jira/browse/SPARK-22690 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng >Priority: Trivial > > trait {{HasOutputCols}} was added in SPARK-20542; {{Imputer}} should also > inherit it.
[jira] [Assigned] (SPARK-22690) Imputer inherit HasOutputCols
[ https://issues.apache.org/jira/browse/SPARK-22690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22690: Assignee: Apache Spark > Imputer inherit HasOutputCols > - > > Key: SPARK-22690 > URL: https://issues.apache.org/jira/browse/SPARK-22690 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Trivial > > trait {{HasOutputCols}} was added in SPARK-20542; {{Imputer}} should also > inherit it.
[jira] [Commented] (SPARK-22690) Imputer inherit HasOutputCols
[ https://issues.apache.org/jira/browse/SPARK-22690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278128#comment-16278128 ] Apache Spark commented on SPARK-22690: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/19889 > Imputer inherit HasOutputCols > - > > Key: SPARK-22690 > URL: https://issues.apache.org/jira/browse/SPARK-22690 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng >Priority: Trivial > > trait {{HasOutputCols}} was added in SPARK-20542; {{Imputer}} should also > inherit it.
[jira] [Updated] (SPARK-22689) Could not resolve dependencies for project
[ https://issues.apache.org/jira/browse/SPARK-22689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Puja Mudaliar updated SPARK-22689: -- Description: Hello team, Spark code compilation fails on a few machines whereas the same source code compiles on other machines. Please check the error on CentOS (4.10.12-1.el7.elrepo.x86_64): ./build/mvn -X -DskipTests -Dscala.lib.directory=/usr/share/scala -pl core compile [INFO] Building Spark Project Core 2.2.2-SNAPSHOT [INFO] [WARNING] The POM for org.apache.spark:spark-launcher_2.11:jar:2.2.2-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-network-common_2.11:jar:2.2.2-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-network-shuffle_2.11:jar:2.2.2-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-unsafe_2.11:jar:2.2.2-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-tags_2.11:jar:2.2.2-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-tags_2.11:jar:tests:2.2.2-SNAPSHOT is missing, no dependency information available [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 0.804 s [INFO] Finished at: 2017-12-04T23:18:58-08:00 [INFO] Final Memory: 43M/1963M [INFO] [ERROR] Failed to execute goal on project spark-core_2.11: Could not resolve dependencies for project org.apache.spark:spark-core_2.11:jar:2.2.2-SNAPSHOT: The following artifacts could not be resolved: org.apache.spark:spark-launcher_2.11:jar:2.2.2-SNAPSHOT, org.apache.spark:spark-network-common_2.11:jar:2.2.2-SNAPSHOT, org.apache.spark:spark-network-shuffle_2.11:jar:2.2.2-SNAPSHOT, org.apache.spark:spark-unsafe_2.11:jar:2.2.2-SNAPSHOT, org.apache.spark:spark-tags_2.11:jar:2.2.2-SNAPSHOT, org.apache.spark:spark-tags_2.11:jar:tests:2.2.2-SNAPSHOT: Failure to find org.apache.spark:spark-launcher_2.11:jar:2.2.2-SNAPSHOT in http://artifact.eng.stellus.in:8081/artifactory/libs-snapshot was cached in the local repository, resolution will not be reattempted until the update interval of snapshots has elapsed or updates are forced -> [Help 1] Note: The same source code compiles on another CentOS machine (3.10.0-514.el7.x86_64): ./build/mvn -DskipTests -Dscala.lib.directory=/usr/share/scala -pl core compile [INFO] --- maven-compiler-plugin:3.7.0:compile (default-compile) @ spark-core_2.11 --- [INFO] Not compiling main sources [INFO] [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ spark-core_2.11 --- [INFO] Using zinc server for incremental compilation [info] Compile success at Dec 4, 2017 11:17:34 PM [0.331s] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 5.663 s [INFO] Finished at: 2017-12-04T23:17:34-08:00 [INFO] Final Memory: 52M/1297M [INFO] was: Hello team, Spark code compilation fails on a few machines whereas the same source code compiles on other machines. The issue is not related to the kernel version, as I tried different kernel versions.
[jira] [Created] (SPARK-22689) Could not resolve dependencies for project
Puja Mudaliar created SPARK-22689: - Summary: Could not resolve dependencies for project Key: SPARK-22689 URL: https://issues.apache.org/jira/browse/SPARK-22689 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.2.0 Reporter: Puja Mudaliar Priority: Blocker Hello team, Spark code compilation fails on a few machines whereas the same source code compiles on other machines. The issue is not related to the kernel version, as I tried different kernel versions. Please check the error on CentOS (4.10.12-1.el7.elrepo.x86_64): ./build/mvn -X -DskipTests -Dscala.lib.directory=/usr/share/scala -pl core compile [INFO] Building Spark Project Core 2.2.2-SNAPSHOT [INFO] [WARNING] The POM for org.apache.spark:spark-launcher_2.11:jar:2.2.2-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-network-common_2.11:jar:2.2.2-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-network-shuffle_2.11:jar:2.2.2-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-unsafe_2.11:jar:2.2.2-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-tags_2.11:jar:2.2.2-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-tags_2.11:jar:tests:2.2.2-SNAPSHOT is missing, no dependency information available [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 0.804 s [INFO] Finished at: 2017-12-04T23:18:58-08:00 [INFO] Final Memory: 43M/1963M [INFO] [ERROR] Failed to execute goal on project spark-core_2.11: Could not resolve dependencies for project org.apache.spark:spark-core_2.11:jar:2.2.2-SNAPSHOT: The following artifacts could not be resolved: org.apache.spark:spark-launcher_2.11:jar:2.2.2-SNAPSHOT, org.apache.spark:spark-network-common_2.11:jar:2.2.2-SNAPSHOT, org.apache.spark:spark-network-shuffle_2.11:jar:2.2.2-SNAPSHOT,
org.apache.spark:spark-unsafe_2.11:jar:2.2.2-SNAPSHOT, org.apache.spark:spark-tags_2.11:jar:2.2.2-SNAPSHOT, org.apache.spark:spark-tags_2.11:jar:tests:2.2.2-SNAPSHOT: Failure to find org.apache.spark:spark-launcher_2.11:jar:2.2.2-SNAPSHOT in http://artifact.eng.stellus.in:8081/artifactory/libs-snapshot was cached in the local repository, resolution will not be reattempted until the update interval of snapshots has elapsed or updates are forced -> [Help 1] Note: The same source code passes on another CentOS machine(3.10.0-514.el7.x86_64) ./build/mvn -DskipTests -Dscala.lib.directory=/usr/share/scala -pl core compile [INFO] --- maven-compiler-plugin:3.7.0:compile (default-compile) @ spark-core_2.11 --- [INFO] Not compiling main sources [INFO] [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ spark-core_2.11 --- [INFO] Using zinc server for incremental compilation [info] Compile success at Dec 4, 2017 11:17:34 PM [0.331s] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 5.663 s [INFO] Finished at: 2017-12-04T23:17:34-08:00 [INFO] Final Memory: 52M/1297M [INFO] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22690) Imputer inherit HasOutputCols
zhengruifeng created SPARK-22690: Summary: Imputer inherit HasOutputCols Key: SPARK-22690 URL: https://issues.apache.org/jira/browse/SPARK-22690 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.3.0 Reporter: zhengruifeng Priority: Trivial trait {{HasOutputCols}} was added in SPARK-20542; {{Imputer}} should also inherit it.
[jira] [Commented] (SPARK-22674) PySpark breaks serialization of namedtuple subclasses
[ https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278121#comment-16278121 ] Hyukjin Kwon commented on SPARK-22674: -- Oh, sorry, I overlooked {{that regular pickle won't be able to unpickle namedtuples anymore}}. I didn't mean to completely remove support for regular pickle, but to deduplicate the serializer logic if possible and pin PySpark's copy to a specific version of cloudpickle, if possible. I'd like to avoid a separate fix within PySpark if we can. > PySpark breaks serialization of namedtuple subclasses > - > > Key: SPARK-22674 > URL: https://issues.apache.org/jira/browse/SPARK-22674 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Jonas Amrich > > PySpark monkey-patches the namedtuple class to make it serializable; however, > this breaks serialization of its subclasses. With the current implementation, any > subclass will be serialized (and deserialized) as its parent namedtuple. > Consider this code, which will fail with {{AttributeError: 'Point' object has > no attribute 'sum'}}: > {code} > from collections import namedtuple > Point = namedtuple("Point", "x y") > class PointSubclass(Point): > def sum(self): > return self.x + self.y > rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]]) > rdd.collect()[0][0].sum() > {code} > Moreover, as PySpark hijacks all namedtuples in the main module, importing > pyspark breaks serialization of namedtuple subclasses even in code which is > not related to Spark / distributed execution.
I don't see any clean solution > to this; a possible workaround may be to limit the serialization hack to > direct namedtuple subclasses, as in > https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204
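The subclass demotion described above can be reproduced without Spark. The snippet below is a minimal, self-contained sketch that simulates the monkey patch; the {{_restore}} and {{_hijacked_reduce}} helpers are illustrative simplifications, not PySpark's actual code. Forcing instances to pickle as the parent namedtuple silently discards subclass behavior:

```python
from collections import namedtuple
import pickle

Point = namedtuple("Point", "x y")

# Illustrative stand-in for PySpark's monkey patch: reconstruct every
# instance as a freshly created *parent* namedtuple on unpickling.
def _restore(name, fields, values):
    return namedtuple(name, fields)(*values)

def _hijacked_reduce(self):
    return (_restore, (Point.__name__, Point._fields, tuple(self)))

Point.__reduce__ = _hijacked_reduce

class PointSubclass(Point):
    def sum(self):
        return self.x + self.y

p = pickle.loads(pickle.dumps(PointSubclass(1, 2)))
print(type(p).__name__)   # prints 'Point' -- the subclass identity is gone
print(hasattr(p, "sum"))  # prints 'False' -- exactly the AttributeError scenario
```

The workaround linked in the comment goes the other way: it narrows the hack so that only classes pickle cannot already handle get the special reduce, leaving subclasses to pickle normally.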
[jira] [Updated] (SPARK-22688) Upgrade Janino version 3.0.8
[ https://issues.apache.org/jira/browse/SPARK-22688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-22688: - Description: [Janino 3.0.8|https://janino-compiler.github.io/janino/changelog.html] includes an important fix to reduce the number of constant pool entries by using {{sipush}} java bytecode. * SIPUSH bytecode is not used for short integer constant [#33|https://github.com/janino-compiler/janino/issues/33] was: [Janino 0.3.8|https://janino-compiler.github.io/janino/changelog.html] includes an important fix to reduce the number of constant pool entries by using {{sipush}} java bytecode. * SIPUSH bytecode is not used for short integer constant [#33|https://github.com/janino-compiler/janino/issues/33] > Upgrade Janino version 3.0.8 > > > Key: SPARK-22688 > URL: https://issues.apache.org/jira/browse/SPARK-22688 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > > [Janino 3.0.8|https://janino-compiler.github.io/janino/changelog.html] > includes an important fix to reduce the number of constant pool entries by > using {{sipush}} java bytecode. > * SIPUSH bytecode is not used for short integer constant > [#33|https://github.com/janino-compiler/janino/issues/33] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22688) Upgrade Janino version 3.0.8
[ https://issues.apache.org/jira/browse/SPARK-22688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-22688: - Summary: Upgrade Janino version 3.0.8 (was: Upgrade Janino version 0.3.8) > Upgrade Janino version 3.0.8 > > > Key: SPARK-22688 > URL: https://issues.apache.org/jira/browse/SPARK-22688 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > > [Janino 0.3.8|https://janino-compiler.github.io/janino/changelog.html] > includes an important fix to reduce the number of constant pool entries by > using {{sipush}} java bytecode. > * SIPUSH bytecode is not used for short integer constant > [#33|https://github.com/janino-compiler/janino/issues/33] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22688) Upgrade Janino version 0.3.8
Kazuaki Ishizaki created SPARK-22688: Summary: Upgrade Janino version 0.3.8 Key: SPARK-22688 URL: https://issues.apache.org/jira/browse/SPARK-22688 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Kazuaki Ishizaki [Janino 0.3.8|https://janino-compiler.github.io/janino/changelog.html] includes an important fix to reduce the number of constant pool entries by using {{sipush}} java bytecode. * SIPUSH bytecode is not used for short integer constant [#33|https://github.com/janino-compiler/janino/issues/33] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22660) Compile with scala-2.12 and JDK9
[ https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278064#comment-16278064 ] liyunzhang commented on SPARK-22660: OK, I created SPARK-22687 to record the runtime problem. {quote}But here you are already Hadoop 2 won't work with Java 9.{quote} Sorry for not describing it clearly: the Hadoop here is hadoop-3.0.0, which is enabled for JDK9 (HADOOP-14984, HADOOP-14978). > Compile with scala-2.12 and JDK9 > > > Key: SPARK-22660 > URL: https://issues.apache.org/jira/browse/SPARK-22660 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: liyunzhang >Priority: Minor > > build with scala-2.12 with the following steps > 1. change the pom.xml with scala-2.12 > ./dev/change-scala-version.sh 2.12 > 2. build with -Pscala-2.12 > for hive on spark > {code} > ./dev/make-distribution.sh --tgz -Pscala-2.12 -Phadoop-2.7 -Pyarn > -Pparquet-provided -Dhadoop.version=2.7.3 > {code} > for spark sql > {code} > ./dev/make-distribution.sh --tgz -Pscala-2.12 -Phadoop-2.7 -Pyarn -Phive > -Dhadoop.version=2.7.3>log.sparksql 2>&1 > {code} > get the following errors > #Error1 > {code} > /common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java:172: > error: cannot find symbol > Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory)); > {code} > This is because sun.misc.Cleaner has been moved to a new location in JDK9. > HADOOP-12760 will be the long-term fix > #Error2 > {code} > spark_source/core/src/main/scala/org/apache/spark/executor/Executor.scala:455: > ambiguous reference to overloaded definition, method limit in class > ByteBuffer of type (x$1: Int)java.nio.ByteBuffer > method limit in class Buffer of type ()Int > match expected type ? > val resultSize = serializedDirectResult.limit > error > {code} > The limit method was moved from ByteBuffer to the superclass Buffer and it > can no longer be called without (). The same applies to the position method.
> #Error3 > {code} > home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:415: > ambiguous reference to overloaded definition, [error] both method putAll in > class Properties of type (x$1: java.util.Map[_, _])Unit [error] and method > putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: > Object])Unit [error] match argument types (java.util.Map[String,String]) > [error] properties.putAll(propsMap.asJava) > [error]^ > [error] > /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:427: > ambiguous reference to overloaded definition, [error] both method putAll in > class Properties of type (x$1: java.util.Map[_, _])Unit [error] and method > putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: > Object])Unit [error] match argument types (java.util.Map[String,String]) > [error] props.putAll(outputSerdeProps.toMap.asJava) > [error] ^ > {code} > This is because the key type is Object instead of String which is unsafe. > After solving these 3 errors, compile successfully. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22687) Run spark-sql in scala-2.12 and JDK9
liyunzhang created SPARK-22687: -- Summary: Run spark-sql in scala-2.12 and JDK9 Key: SPARK-22687 URL: https://issues.apache.org/jira/browse/SPARK-22687 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.2.0 Reporter: liyunzhang Based on SPARK-22660, running spark sql in scala-2.12 and JDK9 env. Here the hadoop used is enabled by JDK9(See HADOOP-14984, HADOOP-14978) Here exception is {code} [root@bdpe41 spark-2.3.0-SNAPSHOT-bin-2.7.3]# ./bin/spark-shell spark-2.3.0-SNAPSHOT-bin-2.7. ^C[root@bdpe41 spark-2.3.0-SNAPSHOT-bin-2.7.3]# ./bin/spark-shell --driver-memory 1G WARNING: An illegal reflective access operation has occurred WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/home/zly/spark-2.3.0-SNAPSHOT-bin-2.7.3/jars/hadoop-auth-2.7.3.jar) to method sun.security.krb5.Config.getInstance() WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations WARNING: All illegal access operations will be denied in a future release 2017-12-05 03:03:23,511 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT /_/ Using Scala version 2.12.4 (Java HotSpot(TM) 64-Bit Server VM, Java 9.0.1) Type in expressions to have them evaluated. Type :help for more information. scala> Spark context Web UI available at http://bdpe41:4040 Spark context available as 'sc' (master = local[*], app id = local-1512414208378). Spark session available as 'spark'. 
val sqlContext = new org.apache.spark.sql.SQLContext(sc) warning: there was one deprecation warning (since 2.0.0); for details, enable `:setting -deprecation' or `:replay -deprecation' sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@8da0e54 scala> import sqlContext.implicits._ import sqlContext.implicits._ scala> case class Customer(customer_id: Int, name: String, city: String, state: String, zip_code: String) defined class Customer scala> val dfCustomers = sc.textFile("/home/zly/spark-2.3.0-SNAPSHOT-bin-2.7.3/customers.txt").map(_.split(",")).map(p => Customer(p(0).trim.toInt, p(1), p(2), p(3), p(4))).toDF() 2017-12-05 03:04:02,647 WARN util.ClosureCleaner: Expected a closure; got org.apache.spark.SparkContext$$Lambda$2237/371823738 2017-12-05 03:04:02,649 WARN util.ClosureCleaner: Expected a closure; got org.apache.spark.SparkContext$$Lambda$2242/539107678 2017-12-05 03:04:02,651 WARN util.ClosureCleaner: Expected a closure; got $line20.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$2245/345086812 2017-12-05 03:04:02,654 WARN util.ClosureCleaner: Expected a closure; got $line20.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$2246/1829622584 2017-12-05 03:04:03,861 WARN metadata.Hive: Failed to access metastore. This class should not accessed in runtime. 
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174) at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503) at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180) at org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:114) at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:488) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:383) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) at
[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets
[ https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278060#comment-16278060 ] Ashish Chopra commented on SPARK-8971: -- When can we expect this in the DataFrame API? > Support balanced class labels when splitting train/cross validation sets > > > Key: SPARK-8971 > URL: https://issues.apache.org/jira/browse/SPARK-8971 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Feynman Liang >Assignee: Seth Hendrickson > > {{CrossValidator}} and the proposed {{TrainValidatorSplit}} (SPARK-8484) are > Spark classes which partition data into training and evaluation sets for > performing hyperparameter selection via cross validation. > Both methods currently perform the split by randomly sampling the datasets. > However, when class probabilities are highly imbalanced (e.g. detection of > extremely low-frequency events), random sampling may result in cross > validation sets not representative of actual out-of-training performance > (e.g. no positive training examples could be included). > Mainstream R packages like > [caret|http://topepo.github.io/caret/splitting.html] already support splitting > the data based upon the class labels.
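For intuition, the caret-style split described in the issue can be sketched in a few lines of plain Python; this is an illustrative helper, not the proposed Spark API. The idea is to sample the test fraction within each class, so every label keeps its proportion in both sets:

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, test_frac=0.25, seed=42):
    """Split rows into (train, test), preserving each class's proportion."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        k = int(round(len(group) * test_frac))  # per-class test count
        test.extend(group[:k])
        train.extend(group[k:])
    return train, test

# With 90 negatives and 10 positives, a plain random 25% sample can miss the
# positives entirely; the stratified split always keeps some in both sets.
rows = [("neg", 0)] * 90 + [("pos", 1)] * 10
train, test = stratified_split(rows, label_of=lambda r: r[1])
```

A distributed version would do the same per-class sampling with something like `sampleByKey`, but the invariant is the one above: each label's frequency survives the split.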
[jira] [Assigned] (SPARK-22686) DROP TABLE IF EXISTS should not throw AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-22686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22686: Assignee: Apache Spark > DROP TABLE IF EXISTS should not throw AnalysisException > --- > > Key: SPARK-22686 > URL: https://issues.apache.org/jira/browse/SPARK-22686 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > During SPARK-22488 to Fix the view resolution issue, there occurs a > regression at 2.2.1 and master branch like the following. > {code} > scala> spark.version > res2: String = 2.2.1 > scala> sql("DROP TABLE IF EXISTS t").show > 17/12/04 21:01:06 WARN DropTableCommand: > org.apache.spark.sql.AnalysisException: Table or view not found: t; > org.apache.spark.sql.AnalysisException: Table or view not found: t; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22686) DROP TABLE IF EXISTS should not throw AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-22686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22686: Assignee: (was: Apache Spark) > DROP TABLE IF EXISTS should not throw AnalysisException > --- > > Key: SPARK-22686 > URL: https://issues.apache.org/jira/browse/SPARK-22686 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Dongjoon Hyun > > During SPARK-22488 to Fix the view resolution issue, there occurs a > regression at 2.2.1 and master branch like the following. > {code} > scala> spark.version > res2: String = 2.2.1 > scala> sql("DROP TABLE IF EXISTS t").show > 17/12/04 21:01:06 WARN DropTableCommand: > org.apache.spark.sql.AnalysisException: Table or view not found: t; > org.apache.spark.sql.AnalysisException: Table or view not found: t; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22686) DROP TABLE IF EXISTS should not throw AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-22686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278022#comment-16278022 ] Apache Spark commented on SPARK-22686: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/19888 > DROP TABLE IF EXISTS should not throw AnalysisException > --- > > Key: SPARK-22686 > URL: https://issues.apache.org/jira/browse/SPARK-22686 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Dongjoon Hyun > > During SPARK-22488, which fixed the view resolution issue, a regression was > introduced in 2.2.1 and the master branch, as shown below. > {code} > scala> spark.version > res2: String = 2.2.1 > scala> sql("DROP TABLE IF EXISTS t").show > 17/12/04 21:01:06 WARN DropTableCommand: > org.apache.spark.sql.AnalysisException: Table or view not found: t; > org.apache.spark.sql.AnalysisException: Table or view not found: t; > {code}
[jira] [Updated] (SPARK-22686) DROP TABLE IF EXISTS should not throw AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-22686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22686: -- Summary: DROP TABLE IF EXISTS should not throw AnalysisException (was: DROP TABLE IF NOT EXISTS should not throw AnalysisException) > DROP TABLE IF EXISTS should not throw AnalysisException > --- > > Key: SPARK-22686 > URL: https://issues.apache.org/jira/browse/SPARK-22686 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Dongjoon Hyun > > While fixing the view resolution issue in SPARK-22488, a > regression was introduced in the 2.2.1 and master branches, as shown below. > {code} > scala> spark.version > res2: String = 2.2.1 > scala> sql("DROP TABLE IF EXISTS t").show > 17/12/04 21:01:06 WARN DropTableCommand: > org.apache.spark.sql.AnalysisException: Table or view not found: t; > org.apache.spark.sql.AnalysisException: Table or view not found: t; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22686) DROP TABLE IF NOT EXISTS should not throw AnalysisException
Dongjoon Hyun created SPARK-22686: - Summary: DROP TABLE IF NOT EXISTS should not throw AnalysisException Key: SPARK-22686 URL: https://issues.apache.org/jira/browse/SPARK-22686 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.1 Reporter: Dongjoon Hyun While fixing the view resolution issue in SPARK-22488, a regression was introduced in the 2.2.1 and master branches, as shown below. {code} scala> spark.version res2: String = 2.2.1 scala> sql("DROP TABLE IF EXISTS t").show 17/12/04 21:01:06 WARN DropTableCommand: org.apache.spark.sql.AnalysisException: Table or view not found: t; org.apache.spark.sql.AnalysisException: Table or view not found: t; {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
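For reference, the IF EXISTS contract the summary asks for can be sketched generically. This is a plain-Python illustration with hypothetical names (`Catalog`, `TableNotFound`), not Spark's actual DropTableCommand: dropping a missing table must be a silent no-op when IF EXISTS is given, and an error only when it is not.

```python
# Illustrative sketch of DROP TABLE IF EXISTS semantics (hypothetical
# Catalog class, not Spark internals).
class TableNotFound(Exception):
    pass

class Catalog:
    def __init__(self):
        self.tables = {"existing"}

    def drop_table(self, name, if_exists=False):
        if name not in self.tables:
            if if_exists:
                return  # silent no-op: nothing raised, nothing logged
            raise TableNotFound(f"Table or view not found: {name};")
        self.tables.remove(name)

cat = Catalog()
cat.drop_table("t", if_exists=True)  # "t" is absent, but no error is raised
cat.drop_table("existing")           # a normal drop still works
```

The regression above is that the IF EXISTS path still surfaced the AnalysisException (as a WARN) instead of taking the silent branch.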
[jira] [Resolved] (SPARK-22682) HashExpression does not need to create global variables
[ https://issues.apache.org/jira/browse/SPARK-22682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22682. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19878 [https://github.com/apache/spark/pull/19878] > HashExpression does not need to create global variables > --- > > Key: SPARK-22682 > URL: https://issues.apache.org/jira/browse/SPARK-22682 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22677) cleanup whole stage codegen for hash aggregate
[ https://issues.apache.org/jira/browse/SPARK-22677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22677. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19869 [https://github.com/apache/spark/pull/19869] > cleanup whole stage codegen for hash aggregate > -- > > Key: SPARK-22677 > URL: https://issues.apache.org/jira/browse/SPARK-22677 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22365) Spark UI executors empty list with 500 error
[ https://issues.apache.org/jira/browse/SPARK-22365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276618#comment-16276618 ] bruce xu edited comment on SPARK-22365 at 12/5/17 4:31 AM: --- Hi [~dubovsky]. Glad to have your response. I hit this issue using Spark ThriftServer as a JDBC service, and the Spark version is 2.2.1-rc1. I will also try to find the reason; maybe it's a bug anyway. UPDATE: [~dubovsky] I solved the problem by deleting jsr311-api-1.1.1.jar from $SPARK_HOME/jars. The reason can be found in [NoSuchMethodError on startup in Java Jersey app|https://stackoverflow.com/questions/28509370/nosuchmethoderror-on-startup-in-java-jersey-app]. [~sowen] Deleting jsr311-api-1.1.1.jar solves the problem, but I wonder if this is the root cause. was (Author: xwc3504): Hi [~dubovsky]. Glad to have your response. I hit this issue using Spark ThriftServer as a JDBC service, and the Spark version is 2.2.1-rc1. I will also try to find the reason; maybe it's a bug anyway. UPDATE: [~dubovsky] I solved the problem by deleting jsr311-api-1.1.1.jar from $SPARK_HOME/jars. The reason can be found in [NoSuchMethodError on startup in Java Jersey app|https://stackoverflow.com/questions/28509370/nosuchmethoderror-on-startup-in-java-jersey-app] > Spark UI executors empty list with 500 error > > > Key: SPARK-22365 > URL: https://issues.apache.org/jira/browse/SPARK-22365 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 >Reporter: Jakub Dubovsky > Attachments: spark-executor-500error.png > > > No data loaded on the "executors" tab in the Spark UI, with the stack trace below. Apart > from the exception I have nothing more. But if I can test something to make this > easier to resolve I am happy to help. 
> {code} > java.lang.NullPointerException > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1689) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:164) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:524) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) > at > org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > 
org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18801) Support resolve a nested view
[ https://issues.apache.org/jira/browse/SPARK-18801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-18801: -- Fix Version/s: 2.2.0 > Support resolve a nested view > - > > Key: SPARK-18801 > URL: https://issues.apache.org/jira/browse/SPARK-18801 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo > Fix For: 2.2.0 > > > We should be able to resolve a nested view. The main advantage is that if you > update an underlying view, the current view also gets updated. > The new approach should be compatible with older versions of SPARK/HIVE, that > means: > 1. The new approach should be able to resolve the views that created by > older versions of SPARK/HIVE; > 2. The new approach should be able to resolve the views that are > currently supported by SPARK SQL. > The new approach mainly brings in the following changes: > 1. Add a new operator called `View` to keep track of the CatalogTable > that describes the view, and the output attributes as well as the child of > the view; > 2. Update the `ResolveRelations` rule to resolve the relations and > views, note that a nested view should be resolved correctly; > 3. Add `AnalysisContext` to enable us to still support a view created > with CTE/Windows query. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21168) KafkaRDD should always set kafka clientId.
[ https://issues.apache.org/jira/browse/SPARK-21168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277939#comment-16277939 ] Apache Spark commented on SPARK-21168: -- User 'liu-zhaokun' has created a pull request for this issue: https://github.com/apache/spark/pull/19887 > KafkaRDD should always set kafka clientId. > -- > > Key: SPARK-21168 > URL: https://issues.apache.org/jira/browse/SPARK-21168 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Xingxing Di >Priority: Trivial > > I found that KafkaRDD does not set the Kafka client.id in the "fetchBatch" method > (FetchRequestBuilder sets clientId to empty by default). Normally this > affects nothing, but in our case we use the clientId on the Kafka server side, > so we have to rebuild spark-streaming-kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
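The fix direction can be sketched with a hypothetical builder in plain Python (this mirrors, but is not, Kafka's actual FetchRequestBuilder API): the caller's clientId is always propagated instead of leaving the builder's empty default in place.

```python
# Hypothetical builder sketch: a caller-supplied clientId overrides the
# empty default the ticket complains about.
class FetchRequest:
    def __init__(self, client_id):
        self.client_id = client_id

class FetchRequestBuilder:
    def __init__(self):
        self._client_id = ""  # empty by default, as the ticket notes

    def client_id(self, cid):
        self._client_id = cid
        return self

    def build(self):
        return FetchRequest(self._client_id)

# A fetchBatch-style caller would always set the id before building:
req = FetchRequestBuilder().client_id("spark-executor-myapp").build()
```

With this pattern, server-side tooling that keys on clientId sees a meaningful value rather than the empty string.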
[jira] [Created] (SPARK-22685) Spark Streaming using Kinesis doesn't work if shard checkpoints exist in DynamoDB
Grega Kespret created SPARK-22685: - Summary: Spark Streaming using Kinesis doesn't work if shard checkpoints exist in DynamoDB Key: SPARK-22685 URL: https://issues.apache.org/jira/browse/SPARK-22685 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.2.0 Reporter: Grega Kespret Apologize if this is not the best place to post this / if the description is lacking some needed info. Please let me know and I will update. This was cross-posted on [StackOverflow|https://stackoverflow.com/questions/47644984/spark-streaming-using-kinesis-doesnt-work-if-shard-checkpoints-exist-in-dynamod]. **TL;DR** – If shard checkpoints don't exist in DynamoDB (== completely fresh), Spark Streaming application reading from Kinesis works flawlessly. However, if the checkpoints exist (e.g. due to app restart), it fails most of the times. The app uses **Spark Streaming 2.2.0** and **spark-streaming-kinesis-asl_2.11**. When starting the app with checkpointed shard data (written by KCL to DynamoDB), after a few successful batches (number varies), this is what I can see in the logs: First, **Leases are lost**: {code} 17/12/01 05:16:50 INFO LeaseRenewer: Worker 10.0.182.119:9781acd5-6cb3-4a39-a235-46f1254eb885 lost lease with key shardId-0515 {code} Then in random order: **Can't update checkpoint - instance doesn't hold the lease for this shard** and **com.amazonaws.SdkClientException: Unable to execute HTTP request: The target server failed to respond** follow, bringing down the whole app in a few batches: {code} 17/12/01 05:17:10 ERROR ProcessTask: ShardId shardId-0394: Caught exception: com.amazonaws.SdkClientException: Unable to execute HTTP request: The target server failed to respond at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1069) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1035) at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:742) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:716) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667) at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649) at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513) at com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:1948) at com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:1924) at com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetRecords(AmazonKinesisClient.java:969) at com.amazonaws.services.kinesis.AmazonKinesisClient.getRecords(AmazonKinesisClient.java:945) at com.amazonaws.services.kinesis.clientlibrary.proxies.KinesisProxy.get(KinesisProxy.java:156) at com.amazonaws.services.kinesis.clientlibrary.proxies.MetricsCollectingKinesisProxyDecorator.get(MetricsCollectingKinesisProxyDecorator.java:74) at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisDataFetcher.getRecords(KinesisDataFetcher.java:68) at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.getRecordsResultAndRecordMillisBehindLatest(ProcessTask.java:291) at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.getRecordsResult(ProcessTask.java:256) at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.call(ProcessTask.java:127) at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49) at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.http.NoHttpResponseException: The target server failed to respond at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143) at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57) at
[jira] [Commented] (SPARK-21168) KafkaRDD should always set kafka clientId.
[ https://issues.apache.org/jira/browse/SPARK-21168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277875#comment-16277875 ] liuzhaokun commented on SPARK-21168: [~dixingx...@yeah.net] Hi, as your PR is not in progress, can I create a new PR to fix this problem? > KafkaRDD should always set kafka clientId. > -- > > Key: SPARK-21168 > URL: https://issues.apache.org/jira/browse/SPARK-21168 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Xingxing Di >Priority: Trivial > > I found that KafkaRDD does not set the Kafka client.id in the "fetchBatch" method > (FetchRequestBuilder sets clientId to empty by default). Normally this > affects nothing, but in our case we use the clientId on the Kafka server side, > so we have to rebuild spark-streaming-kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22656) Upgrade Arrow to 0.8.0
[ https://issues.apache.org/jira/browse/SPARK-22656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-22656. -- Resolution: Duplicate > Upgrade Arrow to 0.8.0 > -- > > Key: SPARK-22656 > URL: https://issues.apache.org/jira/browse/SPARK-22656 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu > > Arrow 0.8.0 will upgrade Netty to 4.1.x and unblock SPARK-19552 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22665) Dataset API: .repartition() inconsistency / issue
[ https://issues.apache.org/jira/browse/SPARK-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-22665: --- Assignee: Marco Gaido > Dataset API: .repartition() inconsistency / issue > - > > Key: SPARK-22665 > URL: https://issues.apache.org/jira/browse/SPARK-22665 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Adrian Ionescu >Assignee: Marco Gaido > Fix For: 2.3.0 > > > We currently have two functions for explicitly repartitioning a Dataset: > {code} > def repartition(numPartitions: Int) > {code} > and > {code} > def repartition(numPartitions: Int, partitionExprs: Column*) > {code} > The second function's signature allows it to be called with an empty list of > expressions as well. > However: > * {{df.repartition(numPartitions)}} does RoundRobin partitioning > * {{df.repartition(numPartitions, Seq.empty: _*)}} does HashPartitioning on a > constant, effectively moving all tuples to a single partition > Not only is this inconsistent, but the latter behavior is very undesirable: > it may hide problems in small-scale prototype code, but will inevitably fail > (or have terrible performance) in production. > I suggest we should make it: > - either throw an {{IllegalArgumentException}} > - or do RoundRobin partitioning, just like {{df.repartition(numPartitions)}} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22665) Dataset API: .repartition() inconsistency / issue
[ https://issues.apache.org/jira/browse/SPARK-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22665. - Resolution: Fixed Fix Version/s: 2.3.0 > Dataset API: .repartition() inconsistency / issue > - > > Key: SPARK-22665 > URL: https://issues.apache.org/jira/browse/SPARK-22665 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Adrian Ionescu >Assignee: Marco Gaido > Fix For: 2.3.0 > > > We currently have two functions for explicitly repartitioning a Dataset: > {code} > def repartition(numPartitions: Int) > {code} > and > {code} > def repartition(numPartitions: Int, partitionExprs: Column*) > {code} > The second function's signature allows it to be called with an empty list of > expressions as well. > However: > * {{df.repartition(numPartitions)}} does RoundRobin partitioning > * {{df.repartition(numPartitions, Seq.empty: _*)}} does HashPartitioning on a > constant, effectively moving all tuples to a single partition > Not only is this inconsistent, but the latter behavior is very undesirable: > it may hide problems in small-scale prototype code, but will inevitably fail > (or have terrible performance) in production. > I suggest we should make it: > - either throw an {{IllegalArgumentException}} > - or do RoundRobin partitioning, just like {{df.repartition(numPartitions)}} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
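The inconsistency described in the ticket can be illustrated outside Spark. The functions below are plain-Python stand-ins (not Spark code) for the two code paths: round-robin partitioning spreads rows evenly, while hash-partitioning on a constant sends every row to the same partition.

```python
# Sketch of the two partitioning behaviors the ticket contrasts.
def round_robin(rows, n):
    # df.repartition(numPartitions): row i goes to partition i % n
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def hash_on_constant(rows, n, const=0):
    # df.repartition(n, <no exprs>): hashing a constant picks the same
    # bucket for every row, collapsing the data onto one partition
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(const) % n].append(row)
    return parts

rows = list(range(8))
balanced = round_robin(rows, 4)     # 2 rows in each of the 4 partitions
skewed = hash_on_constant(rows, 4)  # all 8 rows land in a single partition
```

The `skewed` result is why the ticket calls the empty-expression behavior undesirable: it silently serializes all work onto one partition.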
[jira] [Commented] (SPARK-22162) Executors and the driver use inconsistent Job IDs during the new RDD commit protocol
[ https://issues.apache.org/jira/browse/SPARK-22162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277795#comment-16277795 ] Apache Spark commented on SPARK-22162: -- User 'rezasafi' has created a pull request for this issue: https://github.com/apache/spark/pull/19886 > Executors and the driver use inconsistent Job IDs during the new RDD commit > protocol > > > Key: SPARK-22162 > URL: https://issues.apache.org/jira/browse/SPARK-22162 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Reza Safi >Assignee: Reza Safi > Fix For: 2.3.0 > > > After the SPARK-18191 commit in pull request 15769, using the new commit protocol > it is possible that the driver and executors use different jobIds during an RDD > commit. > In the old code, the variable stageId is part of the closure used to define > the task, as you can see here: > > [https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1098] > As a result, a TaskAttemptId is constructed in executors using the same > "stageId" as the driver, since it is a value that is serialized in the > driver. Also, the value of stageID is actually the rdd.id, which is assigned > here: > [https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1084] > However, after the change in pull request 15769, the value is no longer part > of the task closure, which gets serialized by the driver. Instead, it is > pulled from the taskContext, as you can see > here: [https://github.com/apache/spark/pull/15769/files#diff-dff185cb90c666bce445e3212a21d765R103] > and then that value is used to construct the TaskAttemptId on the executors: > [https://github.com/apache/spark/pull/15769/files#diff-dff185cb90c666bce445e3212a21d765R134] > taskContext has a stageID value, which is set in the DAGScheduler. So after > the change, unlike the old code where rdd.id was used, an actual stage.id is > used, which can differ between executors and the driver since it is no longer > serialized. > In summary, the old code consistently used rddId and just incorrectly named > it "stageId". > The new code uses a mix of rddId and stageId. There should be a consistent ID > between executors and the driver. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
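The core mechanism in the description can be illustrated in plain Python (this is a sketch, not Spark's serialization machinery): a value captured in the task closure on the driver is identical on every executor, whereas a value each executor pulls from its own task context can diverge.

```python
# Sketch: closure-captured id vs. context-derived id.
rdd_id = 42  # assigned once on the "driver"

def make_task(captured_id):
    def task(task_context):
        # captured_id travels inside the serialized closure -> consistent
        # everywhere; task_context["stage_id"] is looked up locally on
        # each executor -> may differ between driver and executors
        return captured_id, task_context["stage_id"]
    return task

task = make_task(rdd_id)
on_executor_1 = task({"stage_id": 7})
on_executor_2 = task({"stage_id": 9})
# The captured id agrees on both; the context-derived id does not.
```

This is exactly the inconsistency the ticket describes: the old code shipped the id inside the closure, the new code reads it from per-task context.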
[jira] [Commented] (SPARK-22587) Spark job fails if fs.defaultFS and application jar are different url
[ https://issues.apache.org/jira/browse/SPARK-22587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277787#comment-16277787 ] Apache Spark commented on SPARK-22587: -- User 'merlintang' has created a pull request for this issue: https://github.com/apache/spark/pull/19885 > Spark job fails if fs.defaultFS and application jar are different url > - > > Key: SPARK-22587 > URL: https://issues.apache.org/jira/browse/SPARK-22587 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph > > Spark Job fails if the fs.defaultFs and url where application jar resides are > different and having same scheme, > spark-submit --conf spark.master=yarn-cluster wasb://XXX/tmp/test.py > core-site.xml fs.defaultFS is set to wasb:///YYY. Hadoop list works (hadoop > fs -ls) works for both the url XXX and YYY. > {code} > Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: > wasb://XXX/tmp/test.py, expected: wasb://YYY > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665) > at > org.apache.hadoop.fs.azure.NativeAzureFileSystem.checkPath(NativeAzureFileSystem.java:1251) > > at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:485) > at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:396) > at > org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:507) > > at > org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:660) > at > org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:912) > > at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:172) > at org.apache.spark.deploy.yarn.Client.run(Client.scala:1248) > at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1307) > at org.apache.spark.deploy.yarn.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:751) > > at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > The code Client.copyFileToRemote tries to resolve the path of application jar > (XXX) from the FileSystem object created using fs.defaultFS url (YYY) instead > of the actual url of application jar. > val destFs = destDir.getFileSystem(hadoopConf) > val srcFs = srcPath.getFileSystem(hadoopConf) > getFileSystem will create the filesystem based on the url of the path and so > this is fine. But the below lines of code tries to get the srcPath (XXX url) > from the destFs (YYY url) and so it fails. > var destPath = srcPath > val qualifiedDestPath = destFs.makeQualified(destPath) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
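The failure mode above, qualifying a source path against a FileSystem built from fs.defaultFS, comes down to comparing URI schemes and authorities. Here is a minimal Python sketch of that comparison; `same_filesystem` is a hypothetical helper, not Spark's Scala `compareFs`:

```python
from urllib.parse import urlparse

def same_filesystem(src_uri, default_fs_uri):
    """Two URIs refer to the same filesystem only when both the scheme
    and the authority (host part) match, case-insensitively. A scheme
    match alone (wasb vs wasb) is not enough."""
    src, dst = urlparse(src_uri), urlparse(default_fs_uri)
    if (src.scheme or "").lower() != (dst.scheme or "").lower():
        return False
    return (src.netloc or "").lower() == (dst.netloc or "").lower()

# Same scheme (wasb) but different authorities -> different filesystems,
# so the jar must not be qualified against the fs.defaultFS filesystem.
print(same_filesystem("wasb://XXX/tmp/test.py", "wasb://YYY"))  # False
print(same_filesystem("wasb://YYY/tmp/app.jar", "wasb://YYY"))  # True
```

Under this check, the XXX jar would be treated as remote to the YYY default filesystem and copied, instead of triggering the "Wrong FS" error.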
[jira] [Comment Edited] (SPARK-22587) Spark job fails if fs.defaultFS and application jar are different url
[ https://issues.apache.org/jira/browse/SPARK-22587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273745#comment-16273745 ] Mingjie Tang edited comment on SPARK-22587 at 12/5/17 12:01 AM: we can update the compareFS by considering the authority. https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1442 The PR is sent out. https://github.com/apache/spark/pull/19885 was (Author: merlin): we can update the compareFS by considering the authority. https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1442 I would send out a PR soon. > Spark job fails if fs.defaultFS and application jar are different url > - > > Key: SPARK-22587 > URL: https://issues.apache.org/jira/browse/SPARK-22587 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph > > Spark Job fails if the fs.defaultFs and url where application jar resides are > different and having same scheme, > spark-submit --conf spark.master=yarn-cluster wasb://XXX/tmp/test.py > core-site.xml fs.defaultFS is set to wasb:///YYY. Hadoop list works (hadoop > fs -ls) works for both the url XXX and YYY. 
> {code} > Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: > wasb://XXX/tmp/test.py, expected: wasb://YYY > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665) > at > org.apache.hadoop.fs.azure.NativeAzureFileSystem.checkPath(NativeAzureFileSystem.java:1251) > > at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:485) > at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:396) > at > org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:507) > > at > org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:660) > at > org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:912) > > at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:172) > at org.apache.spark.deploy.yarn.Client.run(Client.scala:1248) > at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1307) > at org.apache.spark.deploy.yarn.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:751) > > at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > The code Client.copyFileToRemote tries to resolve the path of application jar > (XXX) from the FileSystem object created using fs.defaultFS url (YYY) instead > of the actual url of application jar. 
> val destFs = destDir.getFileSystem(hadoopConf) > val srcFs = srcPath.getFileSystem(hadoopConf) > getFileSystem will create the filesystem based on the url of the path, and so > this is fine. But the lines of code below try to resolve the srcPath (XXX url) > against the destFs (YYY url), and so it fails. > var destPath = srcPath > val qualifiedDestPath = destFs.makeQualified(destPath) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22587) Spark job fails if fs.defaultFS and application jar are different url
[ https://issues.apache.org/jira/browse/SPARK-22587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22587: Assignee: (was: Apache Spark) > Spark job fails if fs.defaultFS and application jar are different url > - > > Key: SPARK-22587 > URL: https://issues.apache.org/jira/browse/SPARK-22587 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph > > Spark Job fails if the fs.defaultFs and url where application jar resides are > different and having same scheme, > spark-submit --conf spark.master=yarn-cluster wasb://XXX/tmp/test.py > core-site.xml fs.defaultFS is set to wasb:///YYY. Hadoop list works (hadoop > fs -ls) works for both the url XXX and YYY. > {code} > Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: > wasb://XXX/tmp/test.py, expected: wasb://YYY > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665) > at > org.apache.hadoop.fs.azure.NativeAzureFileSystem.checkPath(NativeAzureFileSystem.java:1251) > > at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:485) > at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:396) > at > org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:507) > > at > org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:660) > at > org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:912) > > at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:172) > at org.apache.spark.deploy.yarn.Client.run(Client.scala:1248) > at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1307) > at org.apache.spark.deploy.yarn.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:751) > > at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > The code Client.copyFileToRemote tries to resolve the path of application jar > (XXX) from the FileSystem object created using fs.defaultFS url (YYY) instead > of the actual url of application jar. > val destFs = destDir.getFileSystem(hadoopConf) > val srcFs = srcPath.getFileSystem(hadoopConf) > getFileSystem will create the filesystem based on the url of the path and so > this is fine. But the below lines of code tries to get the srcPath (XXX url) > from the destFs (YYY url) and so it fails. > var destPath = srcPath > val qualifiedDestPath = destFs.makeQualified(destPath) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22587) Spark job fails if fs.defaultFS and application jar are different url
[ https://issues.apache.org/jira/browse/SPARK-22587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22587: Assignee: Apache Spark > Spark job fails if fs.defaultFS and application jar are different url > - > > Key: SPARK-22587 > URL: https://issues.apache.org/jira/browse/SPARK-22587 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph >Assignee: Apache Spark > > Spark Job fails if the fs.defaultFs and url where application jar resides are > different and having same scheme, > spark-submit --conf spark.master=yarn-cluster wasb://XXX/tmp/test.py > core-site.xml fs.defaultFS is set to wasb:///YYY. Hadoop list works (hadoop > fs -ls) works for both the url XXX and YYY. > {code} > Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: > wasb://XXX/tmp/test.py, expected: wasb://YYY > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665) > at > org.apache.hadoop.fs.azure.NativeAzureFileSystem.checkPath(NativeAzureFileSystem.java:1251) > > at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:485) > at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:396) > at > org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:507) > > at > org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:660) > at > org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:912) > > at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:172) > at org.apache.spark.deploy.yarn.Client.run(Client.scala:1248) > at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1307) > at org.apache.spark.deploy.yarn.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:751) > > at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > The code Client.copyFileToRemote tries to resolve the path of application jar > (XXX) from the FileSystem object created using fs.defaultFS url (YYY) instead > of the actual url of application jar. > val destFs = destDir.getFileSystem(hadoopConf) > val srcFs = srcPath.getFileSystem(hadoopConf) > getFileSystem will create the filesystem based on the url of the path and so > this is fine. But the below lines of code tries to get the srcPath (XXX url) > from the destFs (YYY url) and so it fails. > var destPath = srcPath > val qualifiedDestPath = destFs.makeQualified(destPath) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22324) Upgrade Arrow to version 0.8.0
[ https://issues.apache.org/jira/browse/SPARK-22324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22324: Assignee: Apache Spark > Upgrade Arrow to version 0.8.0 > -- > > Key: SPARK-22324 > URL: https://issues.apache.org/jira/browse/SPARK-22324 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Apache Spark > > Arrow version 0.8.0 is slated for release in early November, but I'd like to > start discussing to help get all the work that's being done synced up. > Along with upgrading the Arrow Java artifacts, pyarrow on our Jenkins test > envs will need to be upgraded as well that will take a fair amount of work > and planning. > One topic I'd like to discuss is if pyarrow should be an installation > requirement for pyspark, i.e. when a user pip installs pyspark, it will also > install pyarrow. If not, then is there a minimum version that needs to be > supported? We currently have 0.4.1 installed on Jenkins. > There are a number of improvements and cleanups in the current code that can > happen depending on what we decide (I'll link them all here later, but off > the top of my head): > * Decimal bug fix and improved support > * Improved internal casting between pyarrow and pandas (can clean up some > workarounds), this will also verify data bounds if the user specifies a type > and data overflows. 
see > https://github.com/apache/spark/pull/19459#discussion_r146421804 > * Better type checking when converting Spark types to Arrow > * Timestamp conversion to microseconds (for Spark internal format) > * Full support for using validity mask with 'object' types > https://github.com/apache/spark/pull/18664#discussion_r146567335 > * VectorSchemaRoot can call close more than once to simplify listener > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L90 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22324) Upgrade Arrow to version 0.8.0
[ https://issues.apache.org/jira/browse/SPARK-22324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277755#comment-16277755 ] Apache Spark commented on SPARK-22324: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/19884 > Upgrade Arrow to version 0.8.0 > -- > > Key: SPARK-22324 > URL: https://issues.apache.org/jira/browse/SPARK-22324 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler > > Arrow version 0.8.0 is slated for release in early November, but I'd like to > start discussing to help get all the work that's being done synced up. > Along with upgrading the Arrow Java artifacts, pyarrow on our Jenkins test > envs will need to be upgraded as well that will take a fair amount of work > and planning. > One topic I'd like to discuss is if pyarrow should be an installation > requirement for pyspark, i.e. when a user pip installs pyspark, it will also > install pyarrow. If not, then is there a minimum version that needs to be > supported? We currently have 0.4.1 installed on Jenkins. > There are a number of improvements and cleanups in the current code that can > happen depending on what we decide (I'll link them all here later, but off > the top of my head): > * Decimal bug fix and improved support > * Improved internal casting between pyarrow and pandas (can clean up some > workarounds), this will also verify data bounds if the user specifies a type > and data overflows. 
see > https://github.com/apache/spark/pull/19459#discussion_r146421804 > * Better type checking when converting Spark types to Arrow > * Timestamp conversion to microseconds (for Spark internal format) > * Full support for using validity mask with 'object' types > https://github.com/apache/spark/pull/18664#discussion_r146567335 > * VectorSchemaRoot can call close more than once to simplify listener > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L90 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22324) Upgrade Arrow to version 0.8.0
[ https://issues.apache.org/jira/browse/SPARK-22324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22324: Assignee: (was: Apache Spark) > Upgrade Arrow to version 0.8.0 > -- > > Key: SPARK-22324 > URL: https://issues.apache.org/jira/browse/SPARK-22324 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler > > Arrow version 0.8.0 is slated for release in early November, but I'd like to > start discussing to help get all the work that's being done synced up. > Along with upgrading the Arrow Java artifacts, pyarrow on our Jenkins test > envs will need to be upgraded as well that will take a fair amount of work > and planning. > One topic I'd like to discuss is if pyarrow should be an installation > requirement for pyspark, i.e. when a user pip installs pyspark, it will also > install pyarrow. If not, then is there a minimum version that needs to be > supported? We currently have 0.4.1 installed on Jenkins. > There are a number of improvements and cleanups in the current code that can > happen depending on what we decide (I'll link them all here later, but off > the top of my head): > * Decimal bug fix and improved support > * Improved internal casting between pyarrow and pandas (can clean up some > workarounds), this will also verify data bounds if the user specifies a type > and data overflows. 
see > https://github.com/apache/spark/pull/19459#discussion_r146421804 > * Better type checking when converting Spark types to Arrow > * Timestamp conversion to microseconds (for Spark internal format) > * Full support for using validity mask with 'object' types > https://github.com/apache/spark/pull/18664#discussion_r146567335 > * VectorSchemaRoot can call close more than once to simplify listener > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L90 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22599) Avoid extra reading for cached table
[ https://issues.apache.org/jira/browse/SPARK-22599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277754#comment-16277754 ] Rajesh Balamohan commented on SPARK-22599: -- [~CodingCat] - Thanks for sharing the results. The results mention "SPARK-22599, master branch, parquet". Does that mean "SPARK-22599, master branch" was run with text data? > Avoid extra reading for cached table > > > Key: SPARK-22599 > URL: https://issues.apache.org/jira/browse/SPARK-22599 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Nan Zhu > > In the current implementation of Spark, InMemoryTableExec reads all data in a > cached table, filters CachedBatches according to stats, and passes data to the > downstream operators. This implementation makes it inefficient to keep the > whole table in memory to serve various queries against different partitions > of the table, which covers a certain portion of our users' scenarios. > The following is an example of such a use case: > store_sales is a 1TB-sized table in cloud storage, which is partitioned by > 'location'. The first query, Q1, wants to output several metrics A, B, C for > all stores in all locations. After that, a small team of 3 data scientists > wants to do some causal analysis of the sales in different locations. To > avoid unnecessary I/O and parquet/orc parsing overhead, they want to cache > the whole table in memory in Q1. > With the current implementation, even if any one of the data scientists is only > interested in one out of three locations, the queries they submit to the Spark > cluster still read the 1TB of data completely. 
> The reason behind the extra reading operation is that we implement > CachedBatch as > {code} > case class CachedBatch(numRows: Int, buffers: Array[Array[Byte]], stats: > InternalRow) > {code} > where "stats" is a part of every CachedBatch, so we can only filter batches > for output of InMemoryTableExec operator by reading all data in in-memory > table as input. The extra reading would be even more unacceptable when some > of the table's data is evicted to disks. > We propose to introduce a new type of block, metadata block, for the > partitions of RDD representing data in the cached table. Every metadata block > contains stats info for all columns in a partition and is saved to > BlockManager when executing compute() method for the partition. To minimize > the number of bytes to read, > More details can be found in design > doc:https://docs.google.com/document/d/1DSiP3ej7Wd2cWUPVrgqAtvxbSlu5_1ZZB6m_2t8_95Q/edit?usp=sharing > performance test results: > Environment: 6 Executors, each of which has 16 cores 90G memory > dataset: 1T TPCDS data > queries: tested 4 queries (Q19, Q46, Q34, Q27) in > https://github.com/databricks/spark-sql-perf/blob/c2224f37e50628c5c8691be69414ec7f5a3d919a/src/main/scala/com/databricks/spark/sql/perf/tpcds/ImpalaKitQueries.scala > results: > https://docs.google.com/spreadsheets/d/1A20LxqZzAxMjW7ptAJZF4hMBaHxKGk3TBEQoAJXfzCI/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
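The metadata-block idea above can be sketched in miniature: keep per-batch column stats separate from the data, and consult only the stats to decide which batches to read. All names here are hypothetical simplifications of the proposal, not Spark's implementation:

```python
from dataclasses import dataclass

@dataclass
class CachedBatchMeta:
    """Per-batch column stats, the part the proposal keeps in a small
    metadata block so pruning never touches the data buffers."""
    min_vals: dict
    max_vals: dict

@dataclass
class CachedBatch:
    meta: CachedBatchMeta
    rows: list  # stand-in for the serialized column buffers

def might_match(meta, column, value):
    # An equality predicate can only match if value lies inside the
    # batch's [min, max] range for that column.
    return meta.min_vals[column] <= value <= meta.max_vals[column]

def scan(batches, column, value):
    reads, out = 0, []
    for b in batches:
        if not might_match(b.meta, column, value):
            continue  # pruned: the (possibly disk-resident) data is never read
        reads += 1
        out.extend(r for r in b.rows if r[column] == value)
    return out, reads

batches = [
    CachedBatch(CachedBatchMeta({"loc": 1}, {"loc": 1}), [{"loc": 1, "sales": 10}]),
    CachedBatch(CachedBatchMeta({"loc": 2}, {"loc": 3}), [{"loc": 2, "sales": 20}]),
]
rows, reads = scan(batches, "loc", 1)
print(rows, reads)  # only the first batch is actually read
```

The point of the design is the `reads` counter: with stats in a separate metadata block, batches for the other locations are skipped without deserializing or paging in their data.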
[jira] [Commented] (SPARK-20368) Support Sentry on PySpark workers
[ https://issues.apache.org/jira/browse/SPARK-20368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277710#comment-16277710 ] Taylor Edmiston commented on SPARK-20368: - I also posted this on the PR linked in the comment above, but I'd like to inquire about the status of this PR. Is it something that could be merged? Exception aggregation with Sentry in Python is such a common feature, and it's something I really need as well. I'd be happy to jump in and help push this over the finish line if possible. > Support Sentry on PySpark workers > - > > Key: SPARK-20368 > URL: https://issues.apache.org/jira/browse/SPARK-20368 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Alexander Shorin > > [Sentry|https://sentry.io] is a system, well known among Python developers, for > capturing, classifying, tracking and explaining tracebacks, helping people better > understand what went wrong, how to reproduce the issue, and how to fix it. > Any Spark application in Python is actually divided into two parts: > 1. The part that runs on the "driver side". The user controls this part fully, and > providing reports to Sentry from it is easy. > 2. The part that runs on executors. That's Python UDFs and the rest of the > transformation functions. Unfortunately, we cannot provide such a > feature here, and that is the part this feature is about. > To simplify the development experience, it would be nice to have optional > Sentry support at the PySpark worker level. > What could this feature look like? > 1. PySpark will have a new extra named {{sentry}} which installs the Sentry client > and any other required dependencies. This is an optional > install-time dependency. > 2. The PySpark worker will be able to detect the presence of Sentry support and send > error reports there. > 3. All configuration of Sentry could and will be done via standard Sentry > environment variables. > What will this feature give users? > 1. 
Better exceptions in Sentry. From the driver-side application, all of them > currently get recorded as a `Py4JJavaError` where the real executor exception is > buried in the traceback body. > 2. A much clearer picture of the context when things went wrong, and > why. > 3. Simpler debugging and reproduction of Python UDF issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
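The executor-side reporting described in the ticket can be sketched as a generic wrapper around a UDF. This is a hypothetical illustration: `with_error_reporting` and the report payload are made up, and a real integration would hand the exception to the Sentry client rather than a plain callable.

```python
import functools
import traceback

def with_error_reporting(reporter):
    """Wrap a UDF so any exception is handed to a reporter (a Sentry
    client, in the feature proposed here) before being re-raised for
    Spark's normal error handling."""
    def decorate(udf):
        @functools.wraps(udf)
        def wrapped(*args, **kwargs):
            try:
                return udf(*args, **kwargs)
            except Exception as exc:
                reporter({
                    "error": repr(exc),
                    "traceback": traceback.format_exc(),
                    "udf": udf.__name__,
                })
                raise  # keep Spark's failure semantics unchanged
        return wrapped
    return decorate

reports = []  # stand-in for a Sentry client's capture queue

@with_error_reporting(reports.append)
def divide(a, b):
    return a / b

try:
    divide(1, 0)
except ZeroDivisionError:
    pass
print(reports[0]["udf"], reports[0]["error"])
```

The worker-level hook the ticket asks for would apply this wrapping automatically when the optional Sentry dependency is detected, so users get the real executor exception instead of only the wrapped `Py4JJavaError` on the driver.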
[jira] [Commented] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters
[ https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277615#comment-16277615 ] Li Jin commented on SPARK-21187: Gotcha. Thanks! > Complete support for remaining Spark data types in Arrow Converters > --- > > Key: SPARK-21187 > URL: https://issues.apache.org/jira/browse/SPARK-21187 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler > > This is to track adding the remaining type support in Arrow Converters. > Currently, only primitive data types are supported. > Remaining types: > * -*Date*- > * -*Timestamp*- > * *Complex*: Struct, Array, Map > * *Decimal* > Some things to do before closing this out: > * Look into upgrading to Arrow 0.7 for better Decimal support (can now write > values as BigDecimal) > * Need to add some user docs > * Make sure Python tests are thorough > * Check into complex type support mentioned in comments by [~leif], should we > support multi-indexing? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22674) PySpark breaks serialization of namedtuple subclasses
[ https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277577#comment-16277577 ] Hyukjin Kwon commented on SPARK-22674: -- Basically yes, for now. I think we should avoid having a PySpark-only change anymore, to reduce overhead in general, for example, maintaining and reviewing costs. Performance measurement should also be a good step before we decide to go ahead. > PySpark breaks serialization of namedtuple subclasses > - > > Key: SPARK-22674 > URL: https://issues.apache.org/jira/browse/SPARK-22674 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Jonas Amrich > > PySpark monkey-patches the namedtuple class to make it serializable; however, > this breaks serialization of its subclasses. With the current implementation, any > subclass will be serialized (and deserialized) as its parent namedtuple. > Consider this code, which will fail with {{AttributeError: 'Point' object has > no attribute 'sum'}}: > {code} > from collections import namedtuple > Point = namedtuple("Point", "x y") > class PointSubclass(Point): > def sum(self): > return self.x + self.y > rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]]) > rdd.collect()[0][0].sum() > {code} > Moreover, as PySpark hijacks all namedtuples in the main module, importing > pyspark breaks serialization of namedtuple subclasses even in code which is > not related to spark / distributed execution. I don't see any clean solution > to this; a possible workaround may be to limit the serialization hack only to > direct namedtuple subclasses like in > https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
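One possible user-side workaround, distinct from the linked commit's approach of limiting the hack to direct subclasses, is to give the subclass an explicit {{__reduce__}} so a serializer that special-cases namedtuples still rebuilds the subclass. A sketch (plain pickle shown; that this also survives PySpark's monkey patch is an assumption, since the subclass-level {{__reduce__}} should shadow the inherited patched one):

```python
import pickle
from collections import namedtuple

Point = namedtuple("Point", "x y")

class PointSubclass(Point):
    def sum(self):
        return self.x + self.y

    def __reduce__(self):
        # Spell out the reconstruction recipe: rebuild PointSubclass
        # from its field values, instead of letting a namedtuple-aware
        # serializer collapse the instance to the parent Point.
        return (PointSubclass, tuple(self))

# Round-trip through pickle keeps the subclass and its methods.
p = pickle.loads(pickle.dumps(PointSubclass(1, 1)))
print(type(p).__name__, p.sum())  # PointSubclass 2
```

Without the explicit recipe, a serializer that rewrites namedtuple classes would deserialize the value as a plain `Point`, which is exactly the `AttributeError` shown in the report.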
[jira] [Assigned] (SPARK-22684) Avoid the generation of useless mutable states by datetime functions
[ https://issues.apache.org/jira/browse/SPARK-22684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22684: Assignee: Apache Spark > Avoid the generation of useless mutable states by datetime functions > > > Key: SPARK-22684 > URL: https://issues.apache.org/jira/browse/SPARK-22684 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Apache Spark > > Some datetime functions are defining mutable states which are not needed at > all. This is bad for the well known issues related to constant pool limits. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22684) Avoid the generation of useless mutable states by datetime functions
[ https://issues.apache.org/jira/browse/SPARK-22684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22684: Assignee: (was: Apache Spark) > Avoid the generation of useless mutable states by datetime functions > > > Key: SPARK-22684 > URL: https://issues.apache.org/jira/browse/SPARK-22684 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido > > Some datetime functions are defining mutable states which are not needed at > all. This is bad for the well known issues related to constant pool limits. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22684) Avoid the generation of useless mutable states by datetime functions
[ https://issues.apache.org/jira/browse/SPARK-22684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277559#comment-16277559 ] Apache Spark commented on SPARK-22684: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/19883 > Avoid the generation of useless mutable states by datetime functions > > > Key: SPARK-22684 > URL: https://issues.apache.org/jira/browse/SPARK-22684 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido > > Some datetime functions are defining mutable states which are not needed at > all. This is bad for the well known issues related to constant pool limits. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22672) Refactor ORC Tests
[ https://issues.apache.org/jira/browse/SPARK-22672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277553#comment-16277553 ] Apache Spark commented on SPARK-22672: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/19882 > Refactor ORC Tests > -- > > Key: SPARK-22672 > URL: https://issues.apache.org/jira/browse/SPARK-22672 > Project: Spark > Issue Type: Task > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun > > Since SPARK-20682, we have two `OrcFileFormat`s. This issue refactors the ORC > tests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters
[ https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277549#comment-16277549 ] Bryan Cutler commented on SPARK-21187: -- Hi [~icexelloss], StructType has been added on the Java side, but it still needs some work before it can be used in PySpark. It needs some of the same functions used for ArrayType, which I can submit a PR for soon, but we will need to upgrade Arrow to 0.8 before it can be merged. > Complete support for remaining Spark data types in Arrow Converters > --- > > Key: SPARK-21187 > URL: https://issues.apache.org/jira/browse/SPARK-21187 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler > > This is to track adding the remaining type support in Arrow Converters. > Currently, only primitive data types are supported. > Remaining types: > * -*Date*- > * -*Timestamp*- > * *Complex*: Struct, Array, Map > * *Decimal* > Some things to do before closing this out: > * Look into upgrading to Arrow 0.7 for better Decimal support (can now write > values as BigDecimal) > * Need to add some user docs > * Make sure Python tests are thorough > * Check into complex type support mentioned in comments by [~leif]; should we > support multi-indexing? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22684) Avoid the generation of useless mutable states by datetime functions
Marco Gaido created SPARK-22684: --- Summary: Avoid the generation of useless mutable states by datetime functions Key: SPARK-22684 URL: https://issues.apache.org/jira/browse/SPARK-22684 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Marco Gaido Some datetime functions define mutable states that are not needed at all. This is bad because of the well-known issues related to constant pool limits. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22672) Refactor ORC Tests
[ https://issues.apache.org/jira/browse/SPARK-22672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22672: -- Description: Since SPARK-20682, we have two `OrcFileFormat`s. This issue refactors the ORC tests. (was: To support ORC tests without Hive, we had better have `OrcTest` in `sql/core` instead of `sql/hive`.) > Refactor ORC Tests > -- > > Key: SPARK-22672 > URL: https://issues.apache.org/jira/browse/SPARK-22672 > Project: Spark > Issue Type: Task > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun > > Since SPARK-20682, we have two `OrcFileFormat`s. This issue refactors the ORC > tests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22672) Refactor ORC Tests
[ https://issues.apache.org/jira/browse/SPARK-22672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22672: -- Summary: Refactor ORC Tests (was: Move OrcTest to `sql/core`) > Refactor ORC Tests > -- > > Key: SPARK-22672 > URL: https://issues.apache.org/jira/browse/SPARK-22672 > Project: Spark > Issue Type: Task > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Trivial > > To support ORC tests without Hive, we had better have `OrcTest` in `sql/core` > instead of `sql/hive`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22672) Refactor ORC Tests
[ https://issues.apache.org/jira/browse/SPARK-22672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22672: -- Priority: Major (was: Trivial) > Refactor ORC Tests > -- > > Key: SPARK-22672 > URL: https://issues.apache.org/jira/browse/SPARK-22672 > Project: Spark > Issue Type: Task > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun > > To support ORC tests without Hive, we had better have `OrcTest` in `sql/core` > instead of `sql/hive`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22674) PySpark breaks serialization of namedtuple subclasses
[ https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277424#comment-16277424 ] Jonas Amrich commented on SPARK-22674: -- Sure, you're right that pickle won't unpickle it without the class definition. As far as I know, PySpark uses the pickle serializer by default, and the hijack is there to enable namedtuple pickling and unpickling with regular pickle. Do you propose removing the hijack? Removing it would mean that regular pickle won't be able to unpickle namedtuples anymore, and therefore cloudpickle would have to be used as the default, which is quite a big change (and IMHO not very good for performance). > PySpark breaks serialization of namedtuple subclasses > - > > Key: SPARK-22674 > URL: https://issues.apache.org/jira/browse/SPARK-22674 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Jonas Amrich > > PySpark monkey-patches the namedtuple class to make it serializable; however, > this breaks serialization of its subclasses. With the current implementation, any > subclass will be serialized (and deserialized) as its parent namedtuple. > Consider this code, which will fail with {{AttributeError: 'Point' object has > no attribute 'sum'}}: > {code} > from collections import namedtuple > Point = namedtuple("Point", "x y") > class PointSubclass(Point): > def sum(self): > return self.x + self.y > rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]]) > rdd.collect()[0][0].sum() > {code} > Moreover, as PySpark hijacks all namedtuples in the main module, importing > pyspark breaks serialization of namedtuple subclasses even in code which is > not related to Spark / distributed execution. I don't see any clean solution > to this; a possible workaround may be to limit the serialization hack only to > direct namedtuple subclasses, as in > https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
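The failure mode discussed in this thread can be reproduced without Spark at all. The sketch below imitates the monkey-patch with hypothetical helpers (`_restore` and `_hijack` are illustrative names, not PySpark's actual internals) and shows why a subclass round-trips as a plain namedtuple:

```python
import collections
import pickle

def _restore(name, fields, values):
    # Rebuild a plain namedtuple instance from its (name, fields, values) triple.
    return collections.namedtuple(name, fields)(*values)

def _hijack(cls):
    # Make instances pickle as (name, fields, values), roughly what the
    # PySpark patch does to every namedtuple class it finds.
    def __reduce__(self):
        return (_restore, (self.__class__.__name__, self._fields, tuple(self)))
    cls.__reduce__ = __reduce__
    return cls

Point = _hijack(collections.namedtuple("Point", "x y"))

class PointSubclass(Point):
    def sum(self):
        return self.x + self.y

p = pickle.loads(pickle.dumps(PointSubclass(1, 2)))
print(hasattr(p, "sum"))  # False: the subclass came back as a plain namedtuple
print(p.x + p.y)          # 3: the data itself survives
```

Because `__reduce__` records only the type name, the fields, and the values, the object rebuilt on the other side is a freshly created namedtuple class with none of `PointSubclass`'s methods, which is exactly the `AttributeError` from the issue description.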
[jira] [Resolved] (SPARK-22372) Make YARN client extend SparkApplication
[ https://issues.apache.org/jira/browse/SPARK-22372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-22372. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19631 [https://github.com/apache/spark/pull/19631] > Make YARN client extend SparkApplication > > > Key: SPARK-22372 > URL: https://issues.apache.org/jira/browse/SPARK-22372 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > Fix For: 2.3.0 > > > For SPARK-11035 to work well, at least in cluster mode, YARN needs to > implement {{SparkApplication}} so that it doesn't use system properties to > propagate Spark configuration from spark-submit. > There is a second complication: YARN uses system properties to propagate > {{SPARK_YARN_MODE}} on top of other Spark configs. We should look at > either changing that to a configuration or removing {{SPARK_YARN_MODE}} > altogether if possible. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters
[ https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277271#comment-16277271 ] Li Jin commented on SPARK-21187: [~bryanc] Thanks for the update! Is there anything in particular that needs to be done for StructType? It seems it has already been handled: https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java#L318 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala#L63 > Complete support for remaining Spark data types in Arrow Converters > --- > > Key: SPARK-21187 > URL: https://issues.apache.org/jira/browse/SPARK-21187 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler > > This is to track adding the remaining type support in Arrow Converters. > Currently, only primitive data types are supported. > Remaining types: > * -*Date*- > * -*Timestamp*- > * *Complex*: Struct, Array, Map > * *Decimal* > Some things to do before closing this out: > * Look into upgrading to Arrow 0.7 for better Decimal support (can now write > values as BigDecimal) > * Need to add some user docs > * Make sure Python tests are thorough > * Check into complex type support mentioned in comments by [~leif]; should we > support multi-indexing? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Cuquemelle updated SPARK-22683: -- Labels: pull-request-available (was: ) Description: let's say an executor has spark.executor.cores / spark.task.cpus taskSlots The current dynamic allocation policy allocates enough executors to have each taskSlot execute a single task, which minimizes latency, but wastes resources when tasks are small regarding executor allocation overhead. By adding the tasksPerExecutorSlot, it is made possible to specify how many tasks a single slot should ideally execute to mitigate the overhead of executor allocation. PR: https://github.com/apache/spark/pull/19881 was: let's say an executor has spark.executor.cores / spark.task.cpus taskSlots The current dynamic allocation policy allocates enough executors to have each taskSlot execute a single task, which minimizes latency, but wastes resources when tasks are small regarding executor allocation overhead. By adding the tasksPerExecutorSlot, it is made possible to specify how many tasks a single slot should ideally execute to mitigate the overhead of executor allocation. > Allow tuning the number of dynamically allocated executors wrt task number > -- > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle > Labels: pull-request-available > > let's say an executor has spark.executor.cores / spark.task.cpus taskSlots > The current dynamic allocation policy allocates enough executors > to have each taskSlot execute a single task, which minimizes latency, > but wastes resources when tasks are small regarding executor allocation > overhead. > By adding the tasksPerExecutorSlot, it is made possible to specify how many > tasks > a single slot should ideally execute to mitigate the overhead of executor > allocation. 
> PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
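The effect of the proposed setting on the number of requested executors can be sketched with a little arithmetic. This is an illustration of the idea, not Spark's actual `ExecutorAllocationManager` code; `tasks_per_slot` stands in for the proposed `tasksPerExecutorSlot` knob:

```python
import math

def target_executors(pending_tasks, executor_cores, task_cpus, tasks_per_slot=1):
    """Number of executors to request for a given task backlog.

    tasks_per_slot=1 reproduces the current policy (one executor slot per
    pending task); larger values trade some latency for fewer, more fully
    utilized executors.
    """
    slots_per_executor = executor_cores // task_cpus
    return math.ceil(pending_tasks / (slots_per_executor * tasks_per_slot))

# 4000 pending tasks on 4-core executors with 1 CPU per task:
print(target_executors(4000, 4, 1))                     # 1000 under the current policy
print(target_executors(4000, 4, 1, tasks_per_slot=10))  # 100 with the proposed knob
```

With many short tasks, each slot running ten tasks back to back amortizes the executor allocation overhead over ten times as much work, which is the trade-off the ticket describes.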
[jira] [Commented] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277187#comment-16277187 ] Apache Spark commented on SPARK-22683: -- User 'jcuquemelle' has created a pull request for this issue: https://github.com/apache/spark/pull/19881 > Allow tuning the number of dynamically allocated executors wrt task number > -- > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle > > let's say an executor has spark.executor.cores / spark.task.cpus taskSlots > The current dynamic allocation policy allocates enough executors > to have each taskSlot execute a single task, which minimizes latency, > but wastes resources when tasks are small regarding executor allocation > overhead. > By adding the tasksPerExecutorSlot, it is made possible to specify how many > tasks > a single slot should ideally execute to mitigate the overhead of executor > allocation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22683: Assignee: (was: Apache Spark) > Allow tuning the number of dynamically allocated executors wrt task number > -- > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle > > let's say an executor has spark.executor.cores / spark.task.cpus taskSlots > The current dynamic allocation policy allocates enough executors > to have each taskSlot execute a single task, which minimizes latency, > but wastes resources when tasks are small regarding executor allocation > overhead. > By adding the tasksPerExecutorSlot, it is made possible to specify how many > tasks > a single slot should ideally execute to mitigate the overhead of executor > allocation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22683: Assignee: Apache Spark > Allow tuning the number of dynamically allocated executors wrt task number > -- > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Assignee: Apache Spark > > let's say an executor has spark.executor.cores / spark.task.cpus taskSlots > The current dynamic allocation policy allocates enough executors > to have each taskSlot execute a single task, which minimizes latency, > but wastes resources when tasks are small regarding executor allocation > overhead. > By adding the tasksPerExecutorSlot, it is made possible to specify how many > tasks > a single slot should ideally execute to mitigate the overhead of executor > allocation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Cuquemelle updated SPARK-22683: -- Priority: Major (was: Minor) > Allow tuning the number of dynamically allocated executors wrt task number > -- > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle > > let's say an executor has spark.executor.cores / spark.task.cpus taskSlots > The current dynamic allocation policy allocates enough executors > to have each taskSlot execute a single task, which minimizes latency, > but wastes resources when tasks are small regarding executor allocation > overhead. > By adding the tasksPerExecutorSlot, it is made possible to specify how many > tasks > a single slot should ideally execute to mitigate the overhead of executor > allocation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22683: -- Target Version/s: (was: 2.1.1, 2.2.0) The overhead of small tasks doesn't change if you over-commit tasks with respect to task slots. I think this isn't really a solution, and the app needs to look at ways to make fewer, larger tasks. There's overhead to adding yet another knob to turn here, and its interaction with other settings isn't obvious. This concept isn't present elsewhere in Spark. You will also kind of get this effect anyway; if tasks are finishing very quickly, and locality wait is at all positive, you'll find tasks tend to favor older executors with cached data, and the newer ones, dynamically allocated, may get few or no tasks and deallocate anyway. Allocation only happens when the task backlog builds up. > Allow tuning the number of dynamically allocated executors wrt task number > -- > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Priority: Minor > > let's say an executor has spark.executor.cores / spark.task.cpus taskSlots > The current dynamic allocation policy allocates enough executors > to have each taskSlot execute a single task, which minimizes latency, > but wastes resources when tasks are small regarding executor allocation > overhead. > By adding the tasksPerExecutorSlot, it is made possible to specify how many > tasks > a single slot should ideally execute to mitigate the overhead of executor > allocation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number
Julien Cuquemelle created SPARK-22683: - Summary: Allow tuning the number of dynamically allocated executors wrt task number Key: SPARK-22683 URL: https://issues.apache.org/jira/browse/SPARK-22683 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.0, 2.1.0 Reporter: Julien Cuquemelle Priority: Minor let's say an executor has spark.executor.cores / spark.task.cpus taskSlots The current dynamic allocation policy allocates enough executors to have each taskSlot execute a single task, which minimizes latency, but wastes resources when tasks are small regarding executor allocation overhead. By adding the tasksPerExecutorSlot, it is made possible to specify how many tasks a single slot should ideally execute to mitigate the overhead of executor allocation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22162) Executors and the driver use inconsistent Job IDs during the new RDD commit protocol
[ https://issues.apache.org/jira/browse/SPARK-22162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-22162. Resolution: Fixed Assignee: Reza Safi Fix Version/s: 2.3.0 > Executors and the driver use inconsistent Job IDs during the new RDD commit > protocol > > > Key: SPARK-22162 > URL: https://issues.apache.org/jira/browse/SPARK-22162 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Reza Safi >Assignee: Reza Safi > Fix For: 2.3.0 > > > After the SPARK-18191 commit in pull request 15769, using the new commit protocol > it is possible that the driver and executors use different jobIds during an RDD > commit. > In the old code, the variable stageId is part of the closure used to define > the task, as you can see here: > > [https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1098] > As a result, a TaskAttemptId is constructed on executors using the same > "stageId" as the driver, since it is a value that is serialized in the > driver. Also, the value of stageID is actually the rdd.id, which is assigned > here: > [https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1084] > However, after the change in pull request 15769, the value is no longer part > of the task closure, which gets serialized by the driver. Instead, it is > pulled from the taskContext, as you can see > here:[https://github.com/apache/spark/pull/15769/files#diff-dff185cb90c666bce445e3212a21d765R103] > and then that value is used to construct the TaskAttemptId on the executors: > [https://github.com/apache/spark/pull/15769/files#diff-dff185cb90c666bce445e3212a21d765R134] > taskContext has a stageID value, which is set in the DAGScheduler. So after > the change, unlike the old code where rdd.id was used, an actual stage.id is > used, which can differ between executors and the driver since it is no > longer serialized. > In summary, the old code consistently used rddId and just incorrectly named > it "stageId". > The new code uses a mix of rddId and stageId. There should be a consistent ID > between the executors and the driver. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
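The distinction above, an ID captured in the serialized task versus one read from ambient per-process state, can be shown with a toy sketch (plain Python, no Spark; all names here are invented for illustration):

```python
import pickle

# Ambient, per-process state: each process initializes this independently,
# so the driver and an executor may hold different values (the stage.id case).
local_context = {"id": None}

class Task:
    """A task that captures the commit ID at creation time (the rdd.id case)."""
    def __init__(self, commit_id):
        # An instance attribute is serialized with the task, so every
        # executor that deserializes it sees the driver's value.
        self.commit_id = commit_id

    def id_from_closure(self):
        return self.commit_id          # consistent everywhere

    def id_from_context(self):
        return local_context["id"]     # whatever this process happens to hold

# "Driver" side
local_context["id"] = 7                # driver's scheduler assigned 7
shipped = pickle.dumps(Task(commit_id=7))  # what gets sent to executors

# "Executor" side: its own context was initialized differently
local_context["id"] = 3
received = pickle.loads(shipped)
print(received.id_from_closure())      # 7, matches the driver
print(received.id_from_context())      # 3, diverges, like the reported bug
```

The closure-captured value behaves like the old rdd.id (serialized by the driver, identical everywhere), while the context lookup behaves like the post-change stage.id (whatever the local process happens to hold).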
[jira] [Updated] (SPARK-22162) Executors and the driver use inconsistent Job IDs during the new RDD commit protocol
[ https://issues.apache.org/jira/browse/SPARK-22162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-22162: --- Affects Version/s: (was: 2.3.0) > Executors and the driver use inconsistent Job IDs during the new RDD commit > protocol > > > Key: SPARK-22162 > URL: https://issues.apache.org/jira/browse/SPARK-22162 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Reza Safi >Assignee: Reza Safi > Fix For: 2.3.0 > > > After the SPARK-18191 commit in pull request 15769, using the new commit protocol > it is possible that the driver and executors use different jobIds during an RDD > commit. > In the old code, the variable stageId is part of the closure used to define > the task, as you can see here: > > [https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1098] > As a result, a TaskAttemptId is constructed on executors using the same > "stageId" as the driver, since it is a value that is serialized in the > driver. Also, the value of stageID is actually the rdd.id, which is assigned > here: > [https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1084] > However, after the change in pull request 15769, the value is no longer part > of the task closure, which gets serialized by the driver. Instead, it is > pulled from the taskContext, as you can see > here:[https://github.com/apache/spark/pull/15769/files#diff-dff185cb90c666bce445e3212a21d765R103] > and then that value is used to construct the TaskAttemptId on the executors: > [https://github.com/apache/spark/pull/15769/files#diff-dff185cb90c666bce445e3212a21d765R134] > taskContext has a stageID value, which is set in the DAGScheduler. So after > the change, unlike the old code where rdd.id was used, an actual stage.id is > used, which can differ between executors and the driver since it is no > longer serialized. > In summary, the old code consistently used rddId and just incorrectly named > it "stageId". > The new code uses a mix of rddId and stageId. There should be a consistent ID > between the executors and the driver. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22626) Wrong Hive table statistics may trigger OOM if enables CBO
[ https://issues.apache.org/jira/browse/SPARK-22626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276988#comment-16276988 ] Apache Spark commented on SPARK-22626: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/19880 > Wrong Hive table statistics may trigger OOM if enables CBO > -- > > Key: SPARK-22626 > URL: https://issues.apache.org/jira/browse/SPARK-22626 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Minor > Fix For: 2.3.0 > > > How to reproduce: > {code} > bin/spark-shell --conf spark.sql.cbo.enabled=true > {code} > {code:java} > import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec > spark.sql("CREATE TABLE small (c1 bigint) TBLPROPERTIES ('numRows'='3', > 'rawDataSize'='600','totalSize'='800')") > // Big table with wrong statistics, numRows=0 > spark.sql("CREATE TABLE big (c1 bigint) TBLPROPERTIES ('numRows'='0', > 'rawDataSize'='600', 'totalSize'='8')") > val plan = spark.sql("select * from small t1 join big t2 on (t1.c1 = > t2.c1)").queryExecution.executedPlan > val buildSide = > plan.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide > println(buildSide) > {code} > The result is {{BuildRight}}, but the right side is the big table. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20706) Spark-shell not overriding method/variable definition
[ https://issues.apache.org/jira/browse/SPARK-20706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20706: Assignee: (was: Apache Spark) > Spark-shell not overriding method/variable definition > - > > Key: SPARK-20706 > URL: https://issues.apache.org/jira/browse/SPARK-20706 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0, 2.1.1, 2.2.0 > Environment: Linux, Scala 2.11.8 >Reporter: Raphael Roth > Attachments: screenshot-1.png > > > !screenshot-1.png!In the following example, the definition of myMethod is not > correctly updated: > -- > def myMethod() = "first definition" > val tmp = myMethod(); val out = tmp > println(out) // prints "first definition" > def myMethod() = "second definition" // override above myMethod > val tmp = myMethod(); val out = tmp > println(out) // should be "second definition" but is "first definition" > -- > I'm using semicolon to force two statements to be compiled at the same time. > It's also possible to reproduce the behavior using :paste > So if I-redefine myMethod, the implementation seems not to be updated in this > case. I figured out that the second-last statement (val out = tmp) causes > this behavior, if this is moved in a separate block, the code works just fine. > EDIT: > The same behavior can be seen when declaring variables : > -- > val a = 1 > val b = a; val c = b; > println(b) // prints "1" > val a = 2 // override a > val b = a; val c = b; > println(b) // prints "1" instead of "2" > -- > Interestingly, if the second-last line "val b = a; val c = b;" is executed > twice, then I get the expected result -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20706) Spark-shell not overriding method/variable definition
[ https://issues.apache.org/jira/browse/SPARK-20706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276948#comment-16276948 ] Apache Spark commented on SPARK-20706: -- User 'mpetruska' has created a pull request for this issue: https://github.com/apache/spark/pull/19879 > Spark-shell not overriding method/variable definition > - > > Key: SPARK-20706 > URL: https://issues.apache.org/jira/browse/SPARK-20706 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0, 2.1.1, 2.2.0 > Environment: Linux, Scala 2.11.8 >Reporter: Raphael Roth > Attachments: screenshot-1.png > > > !screenshot-1.png!In the following example, the definition of myMethod is not > correctly updated: > -- > def myMethod() = "first definition" > val tmp = myMethod(); val out = tmp > println(out) // prints "first definition" > def myMethod() = "second definition" // override above myMethod > val tmp = myMethod(); val out = tmp > println(out) // should be "second definition" but is "first definition" > -- > I'm using semicolon to force two statements to be compiled at the same time. > It's also possible to reproduce the behavior using :paste > So if I-redefine myMethod, the implementation seems not to be updated in this > case. I figured out that the second-last statement (val out = tmp) causes > this behavior, if this is moved in a separate block, the code works just fine. > EDIT: > The same behavior can be seen when declaring variables : > -- > val a = 1 > val b = a; val c = b; > println(b) // prints "1" > val a = 2 // override a > val b = a; val c = b; > println(b) // prints "1" instead of "2" > -- > Interestingly, if the second-last line "val b = a; val c = b;" is executed > twice, then I get the expected result -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20706) Spark-shell not overriding method/variable definition
[ https://issues.apache.org/jira/browse/SPARK-20706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20706: Assignee: Apache Spark > Spark-shell not overriding method/variable definition > - > > Key: SPARK-20706 > URL: https://issues.apache.org/jira/browse/SPARK-20706 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0, 2.1.1, 2.2.0 > Environment: Linux, Scala 2.11.8 >Reporter: Raphael Roth >Assignee: Apache Spark > Attachments: screenshot-1.png > > > !screenshot-1.png!In the following example, the definition of myMethod is not > correctly updated: > -- > def myMethod() = "first definition" > val tmp = myMethod(); val out = tmp > println(out) // prints "first definition" > def myMethod() = "second definition" // override above myMethod > val tmp = myMethod(); val out = tmp > println(out) // should be "second definition" but is "first definition" > -- > I'm using semicolon to force two statements to be compiled at the same time. > It's also possible to reproduce the behavior using :paste > So if I-redefine myMethod, the implementation seems not to be updated in this > case. I figured out that the second-last statement (val out = tmp) causes > this behavior, if this is moved in a separate block, the code works just fine. > EDIT: > The same behavior can be seen when declaring variables : > -- > val a = 1 > val b = a; val c = b; > println(b) // prints "1" > val a = 2 // override a > val b = a; val c = b; > println(b) // prints "1" instead of "2" > -- > Interestingly, if the second-last line "val b = a; val c = b;" is executed > twice, then I get the expected result -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22682) HashExpression does not need to create global variables
[ https://issues.apache.org/jira/browse/SPARK-22682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22682: Assignee: Wenchen Fan (was: Apache Spark) > HashExpression does not need to create global variables > --- > > Key: SPARK-22682 > URL: https://issues.apache.org/jira/browse/SPARK-22682 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22682) HashExpression does not need to create global variables
[ https://issues.apache.org/jira/browse/SPARK-22682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276945#comment-16276945 ] Apache Spark commented on SPARK-22682: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/19878 > HashExpression does not need to create global variables > --- > > Key: SPARK-22682 > URL: https://issues.apache.org/jira/browse/SPARK-22682 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22682) HashExpression does not need to create global variables
[ https://issues.apache.org/jira/browse/SPARK-22682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22682: Assignee: Apache Spark (was: Wenchen Fan) > HashExpression does not need to create global variables > --- > > Key: SPARK-22682 > URL: https://issues.apache.org/jira/browse/SPARK-22682 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22682) HashExpression does not need to create global variables
Wenchen Fan created SPARK-22682: --- Summary: HashExpression does not need to create global variables Key: SPARK-22682 URL: https://issues.apache.org/jira/browse/SPARK-22682 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20706) Spark-shell not overriding method/variable definition
[ https://issues.apache.org/jira/browse/SPARK-20706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276921#comment-16276921 ] Mark Petruska commented on SPARK-20706: --- This is a Scala REPL bug, see: https://github.com/scala/bug/issues/9740. The fix for this made it into Scala 2.11.9. Basically it affects "class-based" Scala shells, which is what Spark-shell uses. Creating the PR for the fix. > Spark-shell not overriding method/variable definition > - > > Key: SPARK-20706 > URL: https://issues.apache.org/jira/browse/SPARK-20706 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0, 2.1.1, 2.2.0 > Environment: Linux, Scala 2.11.8 >Reporter: Raphael Roth > Attachments: screenshot-1.png > > > !screenshot-1.png! In the following example, the definition of myMethod is not > correctly updated: > -- > def myMethod() = "first definition" > val tmp = myMethod(); val out = tmp > println(out) // prints "first definition" > def myMethod() = "second definition" // override above myMethod > val tmp = myMethod(); val out = tmp > println(out) // should be "second definition" but is "first definition" > -- > I'm using semicolons to force two statements to be compiled at the same time. > It's also possible to reproduce the behavior using :paste > So if I re-define myMethod, the implementation seems not to be updated in this > case. I figured out that the second-to-last statement (val out = tmp) causes > this behavior; if this is moved into a separate block, the code works just fine. 
> EDIT: > The same behavior can be seen when declaring variables: > -- > val a = 1 > val b = a; val c = b; > println(b) // prints "1" > val a = 2 // override a > val b = a; val c = b; > println(b) // prints "1" instead of "2" > -- > Interestingly, if the second-to-last line "val b = a; val c = b;" is executed > twice, then I get the expected result -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276892#comment-16276892 ] Sasaki Toru commented on SPARK-20050: - Thank you for the comment. I think this patch can be backported to branch-2.1 and will fix the same issue. > Kafka 0.10 DirectStream doesn't commit last processed batch's offset when > graceful shutdown > --- > > Key: SPARK-20050 > URL: https://issues.apache.org/jira/browse/SPARK-20050 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Sasaki Toru > > I use Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' and > call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as > below > {code} > val kafkaStream = KafkaUtils.createDirectStream[String, String](...) > kafkaStream.map { input => > "key: " + input.key.toString + " value: " + input.value.toString + " > offset: " + input.offset.toString > }.foreachRDD { rdd => > rdd.foreach { input => > println(input) > } > } > kafkaStream.foreachRDD { rdd => > val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges > kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) > } > {code} > Some records which were processed in the last batch before a graceful > shutdown are reprocessed in the first batch after Spark Streaming restarts, as > below > * output of the first run of this application > {code} > key: null value: 1 offset: 101452472 > key: null value: 2 offset: 101452473 > key: null value: 3 offset: 101452474 > key: null value: 4 offset: 101452475 > key: null value: 5 offset: 101452476 > key: null value: 6 offset: 101452477 > key: null value: 7 offset: 101452478 > key: null value: 8 offset: 101452479 > key: null value: 9 offset: 101452480 // the last record before > shutting down Spark Streaming gracefully > {code} > * output of the re-run of this application > {code} > key: null value: 7 offset: 101452478 // duplication > key: null value: 8 offset: 101452479 // duplication > key: null value: 9 offset: 101452480 // duplication > key: null value: 10 offset: 101452481 > {code} > This may be because offsets specified in commitAsync are committed at the head of the next > batch. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276892#comment-16276892 ] Sasaki Toru edited comment on SPARK-20050 at 12/4/17 2:54 PM: -- Thank you for the comment. I think this patch can be backported to branch-2.1 and will fix the same issue in version 2.1. was (Author: sasakitoa): Thank you for the comment. I think this patch can be backported to branch-2.1 and will fix the same issue. > Kafka 0.10 DirectStream doesn't commit last processed batch's offset when > graceful shutdown > --- > > Key: SPARK-20050 > URL: https://issues.apache.org/jira/browse/SPARK-20050 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Sasaki Toru > > I use Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' and > call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as > below > {code} > val kafkaStream = KafkaUtils.createDirectStream[String, String](...) > kafkaStream.map { input => > "key: " + input.key.toString + " value: " + input.value.toString + " > offset: " + input.offset.toString > }.foreachRDD { rdd => > rdd.foreach { input => > println(input) > } > } > kafkaStream.foreachRDD { rdd => > val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges > kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) > } > {code} > Some records which were processed in the last batch before a graceful > shutdown are reprocessed in the first batch after Spark Streaming restarts, as > below > * output of the first run of this application > {code} > key: null value: 1 offset: 101452472 > key: null value: 2 offset: 101452473 > key: null value: 3 offset: 101452474 > key: null value: 4 offset: 101452475 > key: null value: 5 offset: 101452476 > key: null value: 6 offset: 101452477 > key: null value: 7 offset: 101452478 > key: null value: 8 offset: 101452479 > key: null value: 9 offset: 101452480 // the last record before > shutting down Spark Streaming gracefully > {code} > * output of the re-run of this application > {code} > key: null value: 7 offset: 101452478 // duplication > key: null value: 8 offset: 101452479 // duplication > key: null value: 9 offset: 101452480 // duplication > key: null value: 10 offset: 101452481 > {code} > This may be because offsets specified in commitAsync are committed at the head of the next > batch. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
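Until commit timing guarantees improve, a restarted consumer has to tolerate at-least-once delivery. The following is a minimal, Spark-independent sketch of the usual defensive pattern: drop replayed records whose offsets are at or below the last offset the application itself persisted. The class and method names are illustrative, not a Spark or Kafka API.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch: deduplicate replayed records after a restart by tracking
// the highest offset already processed (persisted by the application).
public class OffsetDedupe {
    private long lastProcessed;

    public OffsetDedupe(long lastCommittedOffset) {
        this.lastProcessed = lastCommittedOffset;
    }

    // Returns only the offsets the restarted job has not already processed,
    // advancing the high-water mark as it goes.
    public List<Long> filterReplayed(List<Long> offsets) {
        List<Long> fresh = new ArrayList<>();
        for (long off : offsets) {
            if (off > lastProcessed) {
                fresh.add(off);
                lastProcessed = off;
            }
        }
        return fresh;
    }
}
```

With lastCommittedOffset = 101452480, the three duplicated records from the re-run output above would be dropped and only offset 101452481 kept.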
[jira] [Commented] (SPARK-1940) Enable rolling of executor logs (stdout / stderr)
[ https://issues.apache.org/jira/browse/SPARK-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276759#comment-16276759 ] Apache Spark commented on SPARK-1940: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/895 > Enable rolling of executor logs (stdout / stderr) > - > > Key: SPARK-1940 > URL: https://issues.apache.org/jira/browse/SPARK-1940 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 1.1.0 > > > Currently, in the default log4j configuration, all the executor logs get sent > to the file [executor-working-dir]/stderr. This does not allow log > files to be rolled, so old logs cannot be removed. > Using log4j RollingFileAppender allows log4j logs to be rolled, but all the > logs get sent to a different set of files, other than the files > stdout and stderr. So the logs are no longer visible in > the Spark web UI, as the Spark web UI only reads the files > stdout and stderr. Furthermore, it still does not > allow stdout and stderr to be cleared periodically in case a large amount > of output gets written to them (e.g. by explicit println inside a map function). > Solving this requires rolling the logs in such a way that the Spark web UI is > aware of it and can retrieve the logs across the rolled-over files. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
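For reference, the log4j 1.x RollingFileAppender approach the description mentions looks roughly like the sketch below. The file path, size limits, and pattern are illustrative only, and (as the issue notes) files rolled this way are not automatically picked up by the Spark web UI:

```properties
# Hedged sketch of a log4j 1.x rolling-file configuration; values are
# illustrative, not Spark defaults.
log4j.rootCategory=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.File=/var/log/spark/executor.log
log4j.appender.rolling.MaxFileSize=50MB
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```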
[jira] [Assigned] (SPARK-22681) Accumulator should only be updated once for each task in result stage
[ https://issues.apache.org/jira/browse/SPARK-22681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22681: Assignee: Apache Spark > Accumulator should only be updated once for each task in result stage > - > > Key: SPARK-22681 > URL: https://issues.apache.org/jira/browse/SPARK-22681 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Carson Wang >Assignee: Apache Spark > > As the doc says "For accumulator updates performed inside actions only, Spark > guarantees that each task’s update to the accumulator will only be applied > once, i.e. restarted tasks will not update the value." > But currently the code doesn't guarantee this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22681) Accumulator should only be updated once for each task in result stage
[ https://issues.apache.org/jira/browse/SPARK-22681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276718#comment-16276718 ] Apache Spark commented on SPARK-22681: -- User 'carsonwang' has created a pull request for this issue: https://github.com/apache/spark/pull/19877 > Accumulator should only be updated once for each task in result stage > - > > Key: SPARK-22681 > URL: https://issues.apache.org/jira/browse/SPARK-22681 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Carson Wang > > As the doc says "For accumulator updates performed inside actions only, Spark > guarantees that each task’s update to the accumulator will only be applied > once, i.e. restarted tasks will not update the value." > But currently the code doesn't guarantee this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22681) Accumulator should only be updated once for each task in result stage
[ https://issues.apache.org/jira/browse/SPARK-22681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22681: Assignee: (was: Apache Spark) > Accumulator should only be updated once for each task in result stage > - > > Key: SPARK-22681 > URL: https://issues.apache.org/jira/browse/SPARK-22681 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Carson Wang > > As the doc says "For accumulator updates performed inside actions only, Spark > guarantees that each task’s update to the accumulator will only be applied > once, i.e. restarted tasks will not update the value." > But currently the code doesn't guarantee this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22681) Accumulator should only be updated once for each task in result stage
Carson Wang created SPARK-22681: --- Summary: Accumulator should only be updated once for each task in result stage Key: SPARK-22681 URL: https://issues.apache.org/jira/browse/SPARK-22681 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Carson Wang As the doc says "For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value." But currently the code doesn't guarantee this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
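The guarantee under discussion, that each result-stage task's update is applied at most once, amounts to deduplicating updates by the task's partition. A hedged, Spark-independent sketch of that idea (OnceOnlyAccumulator is a made-up name, not Spark's AccumulatorV2 API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: apply each partition's accumulator update at most once, so a
// resubmitted or speculative task for the same partition is ignored.
public class OnceOnlyAccumulator {
    private long value;
    private final Map<Integer, Boolean> applied = new HashMap<>();

    // Returns true if the update was applied, false if this partition's
    // result was already counted.
    public boolean add(int partitionId, long update) {
        if (applied.putIfAbsent(partitionId, Boolean.TRUE) != null) {
            return false;
        }
        value += update;
        return true;
    }

    public long value() {
        return value;
    }
}
```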
[jira] [Updated] (SPARK-22680) SparkSQL scans all partitions when the specified partitions do not exist in a Parquet-formatted table
[ https://issues.apache.org/jira/browse/SPARK-22680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaochen Ouyang updated SPARK-22680: Summary: SparkSQL scans all partitions when the specified partitions do not exist in a Parquet-formatted table (was: SparkSQL scans all partitions when the specified partition does not exist in a Parquet-formatted table) > SparkSQL scans all partitions when the specified partitions do not exist in > a Parquet-formatted table > > > Key: SPARK-22680 > URL: https://issues.apache.org/jira/browse/SPARK-22680 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.2.0 > Environment: spark2.0.2 spark2.2.0 >Reporter: Xiaochen Ouyang > > 1. spark-sql --master local[2] > 2. create external table test (id int, name string) partitioned by (country > string, province string, day string, hour int) stored as parquet location > '/warehouse/test'; > 3. produce data into table test > 4. select count(1) from test where country = '185' and province = '021' and > day = '2017-11-12' and hour = 10; if the 4 filter conditions do not exist > in HDFS and the MetaStore [MySQL], this SQL will scan all partitions in table test -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22680) SparkSQL scans all partitions when the specified partition does not exist in a Parquet-formatted table
Xiaochen Ouyang created SPARK-22680: --- Summary: SparkSQL scans all partitions when the specified partition does not exist in a Parquet-formatted table Key: SPARK-22680 URL: https://issues.apache.org/jira/browse/SPARK-22680 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0, 2.0.2 Environment: spark2.0.2 spark2.2.0 Reporter: Xiaochen Ouyang 1. spark-sql --master local[2] 2. create external table test (id int, name string) partitioned by (country string, province string, day string, hour int) stored as parquet location '/warehouse/test'; 3. produce data into table test 4. select count(1) from test where country = '185' and province = '021' and day = '2017-11-12' and hour = 10; if the 4 filter conditions do not exist in HDFS and the MetaStore [MySQL], this SQL will scan all partitions in table test -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
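The expected behavior the report implies is that pruning a partition list with a predicate matching nothing yields an empty scan set, never a full scan. A minimal sketch of that contract (class and method names are illustrative, not Spark's catalog code):

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hedged sketch of partition pruning semantics: a non-matching predicate
// must produce an empty list of partitions to scan, not "all partitions".
public class PartitionPruning {
    public static List<String> prune(List<String> partitions, Predicate<String> pred) {
        return partitions.stream().filter(pred).collect(Collectors.toList());
    }
}
```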
[jira] [Assigned] (SPARK-11239) PMML export for ML linear regression
[ https://issues.apache.org/jira/browse/SPARK-11239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11239: Assignee: Apache Spark > PMML export for ML linear regression > > > Key: SPARK-11239 > URL: https://issues.apache.org/jira/browse/SPARK-11239 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: holdenk >Assignee: Apache Spark > > Add PMML export for linear regression models from the ML pipeline. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11239) PMML export for ML linear regression
[ https://issues.apache.org/jira/browse/SPARK-11239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276643#comment-16276643 ] Apache Spark commented on SPARK-11239: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/19876 > PMML export for ML linear regression > > > Key: SPARK-11239 > URL: https://issues.apache.org/jira/browse/SPARK-11239 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: holdenk > > Add PMML export for linear regression models from the ML pipeline. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11239) PMML export for ML linear regression
[ https://issues.apache.org/jira/browse/SPARK-11239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11239: Assignee: (was: Apache Spark) > PMML export for ML linear regression > > > Key: SPARK-11239 > URL: https://issues.apache.org/jira/browse/SPARK-11239 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: holdenk > > Add PMML export for linear regression models from the ML pipeline. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11171) PMML for Pipelines API
[ https://issues.apache.org/jira/browse/SPARK-11171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276640#comment-16276640 ] Apache Spark commented on SPARK-11171: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/19876 > PMML for Pipelines API > -- > > Key: SPARK-11171 > URL: https://issues.apache.org/jira/browse/SPARK-11171 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: holdenk > > We need to add PMML export to the spark.ml Pipelines API. > We should make 1 subtask JIRA per model. Hopefully we can reuse the > underlying implementation, adding simple wrappers. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22473) Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date
[ https://issues.apache.org/jira/browse/SPARK-22473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276629#comment-16276629 ] Apache Spark commented on SPARK-22473: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/19875 > Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date > -- > > Key: SPARK-22473 > URL: https://issues.apache.org/jira/browse/SPARK-22473 > Project: Spark > Issue Type: Task > Components: Tests >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Trivial > Fix For: 2.3.0 > > > In `spark-sql` module tests there are deprecation warnings caused by the > usage of deprecated methods of `java.sql.Date` and the usage of the > deprecated `AsyncAssertions.Waiter` class. > This issue is to track their replacement with their respective non-deprecated > versions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
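For the java.sql.Date part, the non-deprecated replacements are the java.time bridges added in JDK 8: Date.valueOf(LocalDate) replaces the deprecated year/month/day constructor, and Date#toLocalDate replaces the deprecated getters. A small sketch of the migration (the helper class and method names are illustrative):

```java
import java.sql.Date;
import java.time.LocalDate;

// Hedged sketch of migrating off deprecated java.sql.Date APIs.
public class DateMigration {
    // Replaces the deprecated `new Date(year - 1900, month - 1, day)`
    // constructor with the java.time bridge.
    public static Date fromYmd(int year, int month, int day) {
        return Date.valueOf(LocalDate.of(year, month, day));
    }

    // Replaces the deprecated `d.getYear() + 1900` accessor.
    public static int yearOf(Date d) {
        return d.toLocalDate().getYear();
    }
}
```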
[jira] [Commented] (SPARK-22365) Spark UI executors empty list with 500 error
[ https://issues.apache.org/jira/browse/SPARK-22365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276618#comment-16276618 ] bruce xu commented on SPARK-22365: -- Hi [~dubovsky]. Glad to have your response. I hit this issue using the Spark Thrift Server as a JDBC service; the Spark version is 2.2.1-rc1. I will also try to find the cause. It looks like a bug in any case. > Spark UI executors empty list with 500 error > > > Key: SPARK-22365 > URL: https://issues.apache.org/jira/browse/SPARK-22365 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 >Reporter: Jakub Dubovsky > Attachments: spark-executor-500error.png > > > No data loaded on the "executors" tab in the Spark UI, with the stack trace below. Apart > from the exception I have nothing more. But if I can test something to make this > easier to resolve I am happy to help. > {code} > java.lang.NullPointerException > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1689) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:164) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > 
org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:524) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) > at > org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22660) Compile with scala-2.12 and JDK9
[ https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276583#comment-16276583 ] Sean Owen commented on SPARK-22660: --- You keep changing what this JIRA is about. There are too many JDK 9 issues for one JIRA. Please change this to match the scope of the PR you opened. After that, identify another logical change or fix. In any case, as noted here already, Hadoop 2 won't work with Java 9. > Compile with scala-2.12 and JDK9 > > > Key: SPARK-22660 > URL: https://issues.apache.org/jira/browse/SPARK-22660 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: liyunzhang >Priority: Minor > > Build with Scala 2.12 with the following steps > 1. change the pom.xml to scala-2.12 > ./dev/change-scala-version.sh 2.12 > 2. build with -Pscala-2.12 > for hive on spark > {code} > ./dev/make-distribution.sh --tgz -Pscala-2.12 -Phadoop-2.7 -Pyarn > -Pparquet-provided -Dhadoop.version=2.7.3 > {code} > for spark sql > {code} > ./dev/make-distribution.sh --tgz -Pscala-2.12 -Phadoop-2.7 -Pyarn -Phive > -Dhadoop.version=2.7.3>log.sparksql 2>&1 > {code} > get the following errors > #Error1 > {code} > /common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java:172: > error: cannot find symbol > Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory)); > {code} > This is because sun.misc.Cleaner has been moved to a new location in JDK9. > HADOOP-12760 will be the long term fix > #Error2 > {code} > spark_source/core/src/main/scala/org/apache/spark/executor/Executor.scala:455: > ambiguous reference to overloaded definition, method limit in class > ByteBuffer of type (x$1: Int)java.nio.ByteBuffer > method limit in class Buffer of type ()Int > match expected type ? > val resultSize = serializedDirectResult.limit > error > {code} > The limit method was moved from ByteBuffer to the superclass Buffer and it > can no longer be called without (). The same applies to the position method. 
> #Error3 > {code} > home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:415: > ambiguous reference to overloaded definition, [error] both method putAll in > class Properties of type (x$1: java.util.Map[_, _])Unit [error] and method > putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: > Object])Unit [error] match argument types (java.util.Map[String,String]) > [error] properties.putAll(propsMap.asJava) > [error]^ > [error] > /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:427: > ambiguous reference to overloaded definition, [error] both method putAll in > class Properties of type (x$1: java.util.Map[_, _])Unit [error] and method > putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: > Object])Unit [error] match argument types (java.util.Map[String,String]) > [error] props.putAll(outputSerdeProps.toMap.asJava) > [error] ^ > {code} > This is because the key type is Object instead of String, which is unsafe. > After solving these 3 errors, compilation succeeds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
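Errors 2 and 3 both stem from overload resolution that Scala 2.12 hits on JDK 9. The workarounds described above, calling limit() explicitly and copying properties entry by entry instead of putAll, can be sketched in plain Java (the class and method names here are illustrative):

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.Properties;

// Hedged sketch of the two workarounds described in the issue.
public class Jdk9Compat {
    // Scala's `serializedDirectResult.limit` is ambiguous between
    // Buffer.limit(): Int and ByteBuffer.limit(Int); writing the explicit
    // no-arg call resolves it.
    public static int sizeOf(ByteBuffer buf) {
        return buf.limit();
    }

    // Properties.putAll(Map[String, String]) is ambiguous on JDK 9; copying
    // entry by entry via setProperty sidesteps the overload problem and
    // keeps keys/values as Strings.
    public static Properties copyInto(Properties props, Map<String, String> m) {
        for (Map.Entry<String, String> e : m.entrySet()) {
            props.setProperty(e.getKey(), e.getValue());
        }
        return props;
    }
}
```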
[jira] [Commented] (SPARK-22634) Update Bouncy castle dependency
[ https://issues.apache.org/jira/browse/SPARK-22634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276569#comment-16276569 ] Omer van Kloeten commented on SPARK-22634: -- Understandable, but since Bouncy Castle may be used by users of Spark transitively, they either evict (in which case there may be unforeseen consequences) or are using a very old version with known CVEs which may affect their code. I'd recommend including it in a maintenance release and having it prominently displayed in the release notes. > Update Bouncy castle dependency > --- > > Key: SPARK-22634 > URL: https://issues.apache.org/jira/browse/SPARK-22634 > Project: Spark > Issue Type: Task > Components: Spark Core, SQL, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Lior Regev >Assignee: Sean Owen >Priority: Minor > Fix For: 2.3.0 > > > Spark's usage of the jets3t library, as well as Spark's own Flume and Kafka > streaming, uses Bouncy Castle version 1.51 > This is an outdated version, as the latest one is 1.58 > This, in turn, renders packages such as > [spark-hadoopcryptoledger-ds|https://github.com/ZuInnoTe/spark-hadoopcryptoledger-ds] > unusable, since these require 1.58 and Spark's distributions come along with > 1.51 > My own attempt was to run on EMR, and since I automatically get all of > Spark's dependencies (Bouncy Castle 1.51 being one of them) into the > classpath, using the library to parse blockchain data failed due to missing > functionality. > I have also opened an > [issue|https://bitbucket.org/jmurty/jets3t/issues/242/bouncycastle-dependency] > with jets3t to update their dependency as well, but along with that Spark > would have to update its own or at least be packaged with a newer version -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
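For users hitting this before a Spark release picks up the newer version, the usual workaround is to pin the Bouncy Castle version in their own build. A hedged Maven sketch; the 1.58 version number comes from this discussion, and bcprov-jdk15on is the standard provider artifact:

```xml
<dependencyManagement>
  <dependencies>
    <!-- Force the newer Bouncy Castle over the 1.51 pulled in transitively -->
    <dependency>
      <groupId>org.bouncycastle</groupId>
      <artifactId>bcprov-jdk15on</artifactId>
      <version>1.58</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```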
[jira] [Commented] (SPARK-22365) Spark UI executors empty list with 500 error
[ https://issues.apache.org/jira/browse/SPARK-22365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276564#comment-16276564 ] Jakub Dubovsky commented on SPARK-22365: In my instance it looks like it is a result of some dependency version conflict. I submit my spark using [spark notebook|https://github.com/spark-notebook/spark-notebook]. Since that is a web application as well it conflicts with the Spark UI somehow. I will dig deeper once this is closer to the top of my backlog... [~xwc3504] Thanks for posting this here! What kind of setup do you have? Do you use spark notebook as well? > Spark UI executors empty list with 500 error > > > Key: SPARK-22365 > URL: https://issues.apache.org/jira/browse/SPARK-22365 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 >Reporter: Jakub Dubovsky > Attachments: spark-executor-500error.png > > > No data loaded on the "executors" tab in the Spark UI, with the stack trace below. Apart > from the exception I have nothing more. But if I can test something to make this > easier to resolve I am happy to help. 
> {code} > java.lang.NullPointerException > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1689) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:164) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:524) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) > at > org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > 
org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:748) > {code}
[jira] [Resolved] (SPARK-22670) Not able to create table in Hive with SparkSession when JavaSparkContext is already initialized.
[ https://issues.apache.org/jira/browse/SPARK-22670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22670. --- Resolution: Not A Problem That's an issue with the design of your app then. > Not able to create table in HIve with SparkSession when JavaSparkContext is > already initialized. > > > Key: SPARK-22670 > URL: https://issues.apache.org/jira/browse/SPARK-22670 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: Naresh Meena > > Not able to create table in Hive with SparkSession when SparkContext is > already initialized. > Below is the code snippet and error logs. > JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf); > SparkSession hiveCtx = SparkSession > .builder() > > .config(HiveConf.ConfVars.METASTOREURIS.toString(), > "..:9083") > .config("spark.sql.warehouse.dir", > "/apps/hive/warehouse") > .enableHiveSupport().getOrCreate(); > 2017-11-29 13:11:33 Driver [ERROR] SparkBatchSubmitter - Failed to start the > driver for Batch_JDBC_PipelineTest > org.apache.spark.sql.AnalysisException: > Hive support is required to insert into the following tables: > `default`.`testhivedata` >;; > 'InsertIntoTable 'SimpleCatalogRelation default, CatalogTable( > Table: `default`.`testhivedata` > Created: Wed Nov 29 13:11:33 IST 2017 > Last Access: Thu Jan 01 05:29:59 IST 1970 > Type: MANAGED > Schema: [StructField(empID,LongType,true), > StructField(empDate,DateType,true), StructField(empName,StringType,true), > StructField(empSalary,DoubleType,true), > StructField(empLocation,StringType,true), > StructField(empConditions,BooleanType,true), > StructField(empCity,StringType,true), > StructField(empSystemIP,StringType,true)] > Provider: hive > Storage(Location: > file:/hadoop/yarn/local/usercache/sax/appcache/application_1511627000183_0190/container_e34_1511627000183_0190_01_01/spark-warehouse/testhivedata, > InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: > 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), > OverwriteOptions(false,Map()), false > +- LogicalRDD [empID#49L, empDate#50, empName#51, empSalary#52, > empLocation#53, empConditions#54, empCity#55, empSystemIP#56] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:405) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:76) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:76) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at > org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:73) > at > org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72) > at > org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78) > at > org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) > at > 
org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:263) > at > org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:243) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
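The underlying problem in the snippet above is that a plain {{JavaSparkContext}} already exists when the builder runs, so {{getOrCreate()}} reuses that context and the Hive catalog setting never takes effect. A sketch of one way around it (Spark 2.1-era API; the metastore host is a placeholder) is to build the Hive-enabled session first and derive the Java context from it:

```scala
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SparkSession

// Build the Hive-enabled SparkSession before any SparkContext exists, so
// enableHiveSupport() can actually configure the catalog implementation.
val spark = SparkSession.builder()
  .config("hive.metastore.uris", "thrift://metastore-host:9083") // placeholder host
  .config("spark.sql.warehouse.dir", "/apps/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()

// Reuse the session's context instead of constructing a separate one up front.
val jsc = new JavaSparkContext(spark.sparkContext)
```

This is a sketch, not a supported pattern from the Spark docs for this exact scenario; the point is only the ordering: the Hive-enabled session must be created before any other SparkContext.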
[jira] [Commented] (SPARK-22634) Update Bouncy castle dependency
[ https://issues.apache.org/jira/browse/SPARK-22634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276556#comment-16276556 ] Sean Owen commented on SPARK-22634: --- I'm hesitant to do that in a maintenance branch because it's a minor version change. I don't see info on CVEs relevant to Spark either. > Update Bouncy castle dependency > --- > > Key: SPARK-22634 > URL: https://issues.apache.org/jira/browse/SPARK-22634 > Project: Spark > Issue Type: Task > Components: Spark Core, SQL, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Lior Regev >Assignee: Sean Owen >Priority: Minor > Fix For: 2.3.0 > > > Spark's usage of the jets3t library, as well as Spark's own Flume and Kafka streaming, uses Bouncy Castle version 1.51 > This is an outdated version; the latest is 1.58 > This, in turn, renders packages such as > [spark-hadoopcryptoledger-ds|https://github.com/ZuInnoTe/spark-hadoopcryptoledger-ds] > unusable, since these require 1.58 and Spark's distributions ship with 1.51 > My own attempt was to run on EMR, and since I automatically get all of Spark's dependencies (Bouncy Castle 1.51 being one of them) on the classpath, using the library to parse blockchain data failed due to missing functionality. > I have also opened an > [issue|https://bitbucket.org/jmurty/jets3t/issues/242/bouncycastle-dependency] > with jets3t to update their dependency as well, but along with that Spark would have to update its own or at least be packaged with a newer version
[jira] [Commented] (SPARK-7953) Spark should cleanup output dir if job fails
[ https://issues.apache.org/jira/browse/SPARK-7953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276535#comment-16276535 ] Nandor Kollar commented on SPARK-7953: -- [~joshrosen] could you please help me with this issue? Is it still an outstanding bug? It looks like Spark 2.2 already includes SPARK-18219, and it seems that the new commit protocol calls abortJob and abortTask. > Spark should cleanup output dir if job fails > > > Key: SPARK-7953 > URL: https://issues.apache.org/jira/browse/SPARK-7953 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Mohit Sabharwal > > MR calls abortTask and abortJob on the {{OutputCommitter}} to clean up the temporary output directories, but Spark doesn't seem to do that (when outputting an RDD to a Hadoop FS) > For example: {{PairRDDFunctions.saveAsNewAPIHadoopDataset}} should call {{committer.abortTask(hadoopContext)}} in the finally block inside the writeShard closure, and {{jobCommitter.abortJob(jobTaskContext, JobStatus.State.FAILED)}} should also be called if the job fails. > Additionally, MR removes the output dir if the job fails, but Spark doesn't.
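The cleanup the description asks for boils down to a commit-or-abort discipline around each write. A minimal sketch against the Hadoop {{OutputCommitter}} API (the {{writeShard}} name mirrors the closure mentioned in the description; this is illustrative, not Spark's actual internals):

```scala
import org.apache.hadoop.mapreduce.{OutputCommitter, TaskAttemptContext}

// Per-task discipline: commit on success, abort on failure so the committer
// can delete the task attempt's temporary output directory.
def writeShard(committer: OutputCommitter, ctx: TaskAttemptContext)(write: => Unit): Unit = {
  committer.setupTask(ctx)
  try {
    write
    committer.commitTask(ctx)
  } catch {
    case e: Throwable =>
      committer.abortTask(ctx) // clean up temp output for this attempt
      throw e
  }
}
```

On the driver side, the corresponding job-level call would be {{jobCommitter.abortJob(jobContext, JobStatus.State.FAILED)}} when the job ultimately fails, which is what removes the job-level temporary output directory.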
[jira] [Commented] (SPARK-22634) Update Bouncy castle dependency
[ https://issues.apache.org/jira/browse/SPARK-22634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276492#comment-16276492 ] Omer van Kloeten commented on SPARK-22634: -- [~srowen], thanks for taking this up. However, this seems like more of a fix for 2.2.1 than for 2.3.0, since Bouncy Castle is a crypto library and 1.51 -> 1.58 contains fixes for numerous CVEs. > Update Bouncy castle dependency > --- > > Key: SPARK-22634 > URL: https://issues.apache.org/jira/browse/SPARK-22634 > Project: Spark > Issue Type: Task > Components: Spark Core, SQL, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Lior Regev >Assignee: Sean Owen >Priority: Minor > Fix For: 2.3.0 > > > Spark's usage of the jets3t library, as well as Spark's own Flume and Kafka streaming, uses Bouncy Castle version 1.51 > This is an outdated version; the latest is 1.58 > This, in turn, renders packages such as > [spark-hadoopcryptoledger-ds|https://github.com/ZuInnoTe/spark-hadoopcryptoledger-ds] > unusable, since these require 1.58 and Spark's distributions ship with 1.51 > My own attempt was to run on EMR, and since I automatically get all of Spark's dependencies (Bouncy Castle 1.51 being one of them) on the classpath, using the library to parse blockchain data failed due to missing functionality. > I have also opened an > [issue|https://bitbucket.org/jmurty/jets3t/issues/242/bouncycastle-dependency] > with jets3t to update their dependency as well, but along with that Spark would have to update its own or at least be packaged with a newer version
[jira] [Updated] (SPARK-22286) OutOfMemoryError caused by memory leak and large serializer batch size in ExternalAppendOnlyMap
[ https://issues.apache.org/jira/browse/SPARK-22286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lijie Xu updated SPARK-22286: - Description: *[Abstract]* I recently encountered an OOM error in a simple _groupByKey_ application. After profiling the application, I found that the OOM error is related to the shuffle spill and record (de)serialization. After analyzing the OOM heap dump, I found the root causes are (1) a memory leak in ExternalAppendOnlyMap, (2) the large static serializer batch size (_spark.shuffle.spill.batchSize_ = 10,000) defined in ExternalAppendOnlyMap, and (3) a memory leak in the deserializer. Since almost all Spark applications rely on ExternalAppendOnlyMap to perform shuffle and reduce, this is a critical bug/defect. In the following sections, I will detail the testing application, data, environment, failure symptoms, diagnosing procedure, identified root causes, and potential solutions. *[Application]* This is a simple GroupBy application as follows. {code} table.map(row => (row.sourceIP[1,7], row)).groupByKey().saveAsTextFile() {code} The _sourceIP_ (an IP address like 127.100.101.102) is a column of the _UserVisits_ table. This application has the same logic as the aggregation query in the Berkeley SQL benchmark (https://amplab.cs.berkeley.edu/benchmark/) as follows. {code} SELECT * FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7); {code} The application code is available at \[1\]. *[Data]* The UserVisits table size is 16GB (9 columns, 132,000,000 rows) with uniform distribution. The HDFS block size is 128MB. The data generator is available at \[2\]. *[Environment]* Spark 2.1 (Spark 2.2 may also have this error), Oracle Java Hotspot 1.8.0, 1 master and 8 workers as follows. !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/Workers.png|width=100%! This application launched 32 executors. Each executor has 1 core and 7GB memory. 
The detailed application configuration is {code} total-executor-cores = 32 executor-cores = 1 executor-memory = 7G spark.default.parallelism=32 spark.serializer = JavaSerializer (KryoSerializer also has OOM error) {code} *[Failure symptoms]* This application has a map stage and a reduce stage. An OOM error occurs in a reduce task (Task-17) as follows. !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/Stage.png|width=100%! !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/Tasks.png|width=100%! Task-17 generated an OOM error. It shuffled ~1GB data and spilled 3.6GB data onto the disk. The Task-17 log below shows that this task was reading the next record by invoking _ExternalAppendOnlyMap.hasNext_(). From the OOM stack traces and the above shuffle metrics, we cannot identify the OOM root causes. !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/OOMStackTrace.png|width=100%! The question is why Task-17 still suffered OOM errors even after spilling large in-memory data onto the disk. *[Diagnosing procedure]* Since each executor has 1 core and 7GB, it runs only one task at a time, so the OOM implies that a single task's memory usage exceeded 7GB. *1: Identify the error phase* I added some debug logs in Spark and found that the error occurs not in the spill phase but in the memory-disk-merge phase. In the memory-disk-merge phase, Spark reads back the spilled records (as shown in ① in Figure 1), merges the spilled records with the in-memory records (as shown in ②), generates new records, and outputs the new records onto HDFS (as shown in ③). *2. Dataflow and memory usage analysis* I added some profiling code and obtained the dataflow and memory usage metrics below. Ki represents the _i_-th key; Ri represents the _i_-th row in the table. !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/DataflowAndMemoryUsage.png|width=100%! 
Figure 1: Dataflow and Memory Usage Analysis (see https://github.com/JerryLead/Misc/blob/master/SparkPRFigures/OOM/SPARK-22286-OOM.pdf for the high-definition version) The concrete phases with metrics are as follows. *[Shuffle read]* records = 7,540,235, bytes = 903 MB *[In-memory store]* As shown in the following log, about 5,243,424 of the 7,540,235 records are aggregated into 60 records in AppendOnlyMap. Each record is about 60 MB, and there are only 60 distinct keys in the shuffled records. !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/SpilledRecords.png|width=100%! *[Spill]* Since the in-memory size (3.6 GB) has reached the spill threshold, Spark spills the 60 records onto the disk. Since _60 < serializerBatchSize_ (default 10,000), all 60 records are serialized into the SerializeBuffer and then written onto the disk as a file segment. The 60 serialized records are about 581 MB (this is an estimated size; the real size may be larger).
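The interaction between the record count and the batch size can be sketched in a few lines of self-contained Scala. This is only an illustration of the batching behavior described above, not Spark's actual ExternalAppendOnlyMap code:

```scala
import java.io.ByteArrayOutputStream
import scala.collection.mutable.ArrayBuffer

// Records accumulate in a serialization buffer and are flushed to disk only
// every `batchSize` records. With 60 huge records and batchSize = 10000, the
// batch boundary is never reached, so all 60 records sit in one in-memory
// serialized batch before the final flush.
def spill(records: Iterator[Array[Byte]], batchSize: Int): Seq[Int] = {
  val flushedSizes = ArrayBuffer[Int]()
  val buffer = new ByteArrayOutputStream() // stand-in for the SerializeBuffer
  var inBatch = 0
  for (rec <- records) {
    buffer.write(rec) // stand-in for serializer.writeObject(rec)
    inBatch += 1
    if (inBatch == batchSize) { flushedSizes += buffer.size(); buffer.reset(); inBatch = 0 }
  }
  if (inBatch > 0) flushedSizes += buffer.size() // final partial batch
  flushedSizes.toSeq
}

// 60 records below the 10,000-record boundary all land in a single batch.
val batches = spill(Iterator.fill(60)(new Array[Byte](1024)), batchSize = 10000)
```

With each real record around 60 MB, that single batch is the multi-GB buffer the report measures, which is why a smaller batch size (or a size-based flush) would bound the serializer's memory footprint.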
[jira] [Updated] (SPARK-22675) Refactoring PropagateTypes in TypeCoercion
[ https://issues.apache.org/jira/browse/SPARK-22675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22675: Description: PropagateTypes is called at the beginning of TypeCoercion and again at the end. Instead, we should call it in each rule that can change data types, so that the type changes are propagated up to the parents. (was: PropagateTypes are called twice in TypeCoercion. We do not need to call it twice. Instead, we should call it after each change on the types. ) > Refactoring PropagateTypes in TypeCoercion > -- > > Key: SPARK-22675 > URL: https://issues.apache.org/jira/browse/SPARK-22675 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > PropagateTypes is called at the beginning of TypeCoercion and again at the end. Instead, we should call it in each rule that can change data types, so that the type changes are propagated up to the parents.