[jira] [Commented] (SPARK-2652) Turning default configurations for PySpark
[ https://issues.apache.org/jira/browse/SPARK-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072909#comment-14072909 ] Apache Spark commented on SPARK-2652: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/1568 > Turning default configurations for PySpark > -- > > Key: SPARK-2652 > URL: https://issues.apache.org/jira/browse/SPARK-2652 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.0.0 >Reporter: Davies Liu >Assignee: Davies Liu > Labels: Configuration, Python > Fix For: 1.1.0 > > Original Estimate: 48h > Remaining Estimate: 48h > > Some default configuration values do not make sense for PySpark; change > them to reasonable ones, such as spark.serializer and > spark.kryo.referenceTracking -- This message was sent by Atlassian JIRA (v6.2#6252)
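For context, these properties can already be overridden per application, for example in conf/spark-defaults.conf; the values below are illustrative only and are not necessarily the defaults chosen by the pull request:
{code}
# Illustrative overrides only; the actual PySpark-friendly defaults may differ.
spark.serializer                org.apache.spark.serializer.KryoSerializer
spark.kryo.referenceTracking    false
{code}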
[jira] [Resolved] (SPARK-2661) Unpersist last RDD in bagel iteration
[ https://issues.apache.org/jira/browse/SPARK-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2661. -- Resolution: Fixed > Unpersist last RDD in bagel iteration > - > > Key: SPARK-2661 > URL: https://issues.apache.org/jira/browse/SPARK-2661 > Project: Spark > Issue Type: Improvement >Reporter: Adrian Wang >Assignee: Adrian Wang > Fix For: 1.1.0 > > > In bagel iteration, we only depend on RDD[n] to get RDD[n+1], so we can > unpersist RDD[n-1] after we get RDD[n]. -- This message was sent by Atlassian JIRA (v6.2#6252)
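A minimal Scala sketch of the idea (not the actual Bagel change; it assumes a SparkContext named sc, e.g. in spark-shell): once the current iteration's RDD has been materialized, the RDD from an earlier iteration is never read again and can be unpersisted.
{code}
// Sketch only: unpersist the RDD from the previous iteration once the new one
// has been materialized, since each iteration depends only on its predecessor.
var prev: org.apache.spark.rdd.RDD[(Int, Int)] = null
var curr = sc.parallelize(1 to 1000).map(i => (i, 0)).cache()

for (i <- 1 to 10) {
  val next = curr.map { case (k, v) => (k, v + 1) }.cache()
  next.count()                        // force materialization of RDD[n+1]
  if (prev != null) prev.unpersist()  // RDD[n-1] is no longer needed
  prev = curr
  curr = next
}
{code}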
[jira] [Commented] (SPARK-2664) Deal with `--conf` options in spark-submit that relate to flags
[ https://issues.apache.org/jira/browse/SPARK-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072925#comment-14072925 ] Sandy Ryza commented on SPARK-2664: --- I think the right behavior here is worth a little thought. What's the mental model we expect the user to have about the relationship between properties specified through --conf and properties that get their own flag? My first thought is - if we're OK with taking properties like master through --conf, is there a point (beyond compatibility) in having flags for these properties at all? Flags that aren't Spark confs are there because they impact what happens before the SparkContext is created. These fall into a couple of categories: 1. Flags that have no Spark conf equivalent, like --executor-cores 2. Flags that have a direct Spark conf equivalent, like --executor-memory (spark.executor.memory) 3. Flags that impact a Spark conf, like --deploy-mode (which can mean we set spark.master to yarn-cluster) I think the two ways to look at it are: 1. We're OK with taking properties that have related flags. For a property in the 2nd category, we have a policy over which takes precedence. For a property in the 3rd category, we have some (possibly complex) resolution logic. This approach would be the most accepting, but requires the user to have a model of how these conflicts get resolved. 2. We're not OK with taking properties that have related flags. --conf specifies a property that gets passed to the SparkContext and has no effect on anything that happens before it's created. To save users from themselves, if someone passes spark.master or spark.app.name through --conf, we ignore it or throw an error. I'm a little more partial to approach 2 because I think the mental model is a little simpler. Either way, we should probably enforce the same behavior when a config comes from the defaults file. Lastly, how do we allow setting a default for one of these special flags? E.g. make it so that all jobs run on YARN or Mesos by default. With approach 1, this is relatively straightforward - we use the same logic we'd use on a property that comes in through --conf for making defaults take effect. We might need to add Spark properties for flags that don't have them already, like --executor-cores. With approach 2, we'd need to add support in the defaults file or somewhere else for specifying flag defaults. > Deal with `--conf` options in spark-submit that relate to flags > --- > > Key: SPARK-2664 > URL: https://issues.apache.org/jira/browse/SPARK-2664 > Project: Spark > Issue Type: Bug >Reporter: Patrick Wendell >Assignee: Sandy Ryza >Priority: Blocker > > If someone sets a Spark conf that relates to an existing flag (e.g. `--master`), we > should set it correctly like we do with the defaults file. Otherwise it can > have confusing semantics. I noticed this after merging it, otherwise I would > have mentioned it in the review. > I think it's as simple as modifying loadDefaults to check the user-supplied > options also. We might rename it to loadUserProperties since it's no longer > just the defaults file. -- This message was sent by Atlassian JIRA (v6.2#6252)
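A minimal Scala sketch of the precedence policy described in approach 1, with hypothetical helper names (this is not the actual SparkSubmit code): an explicit flag wins over --conf, which wins over the defaults file.
{code}
// Sketch only: resolve an effective value for a setting that can arrive via a
// dedicated flag, a --conf property, or the defaults file. Names are illustrative.
def resolve(flagValue: Option[String],
            confValue: Option[String],
            defaultsFileValue: Option[String]): Option[String] = {
  flagValue.orElse(confValue).orElse(defaultsFileValue)
}

// e.g. --master beats --conf spark.master=..., which beats spark-defaults.conf
val effectiveMaster = resolve(Some("yarn-cluster"), Some("local[4]"), Some("spark://host:7077"))
// effectiveMaster == Some("yarn-cluster")
{code}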
[jira] [Comment Edited] (SPARK-2664) Deal with `--conf` options in spark-submit that relate to flags
[ https://issues.apache.org/jira/browse/SPARK-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072925#comment-14072925 ] Sandy Ryza edited comment on SPARK-2664 at 7/24/14 7:18 AM: I think the right behavior here is worth a little thought. What's the mental model we expect the user to have about the relationship between properties specified through --conf and properties that get their own flag? My first thought is - if we're ok with taking properties like master through --conf, is there a point (beyond compatibility) in having flags for these properties at all? Flags that aren't Spark confs are there because they impact what happens before the SparkContext is created. These fall into a couple categories: 1. Flags that have no property Spark conf equivalent like --executor-cores 2. Flags that have a direct Spark conf equivalent like --executor-cores (spark.executor.memory) 3. Flags that impact a Spark conf like --deploy-mode (which can mean we set spark.master to yarn-cluster) I think the two ways to look at it are: 1. We're OK with taking properties that have related flags. In the case of a property in the 2nd category, we have a policy over which takes precedence. In the case of a property in the 3rd category, we have some (possibly complex) resolution logic. This approach would be the most accepting, but requires the user to have a model of how these conflicts get resolved. 2. We're not OK with taking properties that have related flags. --conf specifies property that gets passed to the SparkContext and has no effect on anything that happens before it's created. To save users from themselves, if someone passes spark.master or spark.app.name through --conf, we ignore it or throw an error. I'm a little more partial to approach 2 because I think the mental model is a little simpler. Either way, we should probably enforce the same behavior when a config comes from the defaults file. Lastly, how do we allow setting a default for one of these special flags? E.g. make it so that all jobs run on YARN or Mesos by default. With approach 1, this is relatively straightforward - we use the same logic we'd use on a property that comes in through --conf for making defaults take effect. We might need to add spark properties for flags that don't have them already like --executor-cores. With approach 2, we'd need to add support in the defaults file or somewhere else for specifying flag defaults. was (Author: sandyr): I think the right behavior here is worth a little thought. What's the mental model we expect the user to have about the relationship between properties specified through --conf and properties that get their own flag? My first thought is - if we're ok with taking properties like master through --conf, is there a point (beyond compatibility) in having flags for these properties at all? Flags that aren't conf are there because they impact what happens before the SparkContext is created. These fall into a couple categories: 1. Flags that have no property Spark conf equivalent like --executor-cores 2. Flags that have a direct Spark conf equivalent like --executor-cores (spark.executor.memory) 3. Flags that impact a Spark conf like --deploy-mode (which can mean we set spark.master to yarn-cluster) I think the two ways to look at it are: 1. We're OK with taking properties that have related flags. In the case of a property in the 2nd category, we have a policy over which takes precedence. 
In the case of a property in the 3rd category, we have some (possibly complex) resolution logic. This approach would be the most accepting, but requires the user to have a model of how these conflicts get resolved. 2. We're not OK with taking properties that have related flags. --conf specifies property that gets passed to the SparkContext and has no effect on anything that happens before it's created. To save users from themselves, if someone passes spark.master or spark.app.name through --conf, we ignore it or throw an error. I'm a little more partial to approach 2 because I think the mental model is a little simpler. Either way, we should probably enforce the same behavior when a config comes from the defaults file. Lastly, how do we allow setting a default for one of these special flags? E.g. make it so that all jobs run on YARN or Mesos by default. With approach 1, this is relatively straightforward - we use the same logic we'd use on a property that comes in through --conf for making defaults take effect. We might need to add spark properties for flags that don't have them already like --executor-cores. With approach 2, we'd need to add support in the defaults file or somewhere else for specifying flag defaults. > Deal with `--conf` options in spark-submit that relate to flags >
[jira] [Created] (SPARK-2665) Add EqualNS support for HiveQL
Cheng Hao created SPARK-2665: Summary: Add EqualNS support for HiveQL Key: SPARK-2665 URL: https://issues.apache.org/jira/browse/SPARK-2665 Project: Spark Issue Type: New Feature Components: SQL Reporter: Cheng Hao Hive supports the operator "<=>", which returns the same result as the EQUAL (=) operator for non-null operands, but returns TRUE if both operands are NULL and FALSE if only one of them is NULL. -- This message was sent by Atlassian JIRA (v6.2#6252)
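A small Scala sketch of the null-safe equality semantics described above (illustrative only, not the Catalyst implementation):
{code}
// Null-safe equality ("<=>"): equal when both values are present and equal, or both are null.
def equalNullSafe(a: Option[Any], b: Option[Any]): Boolean = (a, b) match {
  case (None, None)          => true   // both NULL => TRUE
  case (None, _) | (_, None) => false  // exactly one NULL => FALSE
  case (Some(x), Some(y))    => x == y // neither NULL => behaves like "="
}

// equalNullSafe(Some(1), Some(1)) == true
// equalNullSafe(None, None)       == true
// equalNullSafe(Some(1), None)    == false
{code}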
[jira] [Commented] (SPARK-2665) Add EqualNS support for HiveQL
[ https://issues.apache.org/jira/browse/SPARK-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072977#comment-14072977 ] Apache Spark commented on SPARK-2665: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/1570 > Add EqualNS support for HiveQL > -- > > Key: SPARK-2665 > URL: https://issues.apache.org/jira/browse/SPARK-2665 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Cheng Hao > > Hive supports the operator "<=>", which returns the same result as the EQUAL (=) > operator for non-null operands, but returns TRUE if both operands are NULL and FALSE if > only one of them is NULL. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2414) Remove jquery
[ https://issues.apache.org/jira/browse/SPARK-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2414: --- Assignee: (was: Reynold Xin) > Remove jquery > - > > Key: SPARK-2414 > URL: https://issues.apache.org/jira/browse/SPARK-2414 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Reynold Xin >Priority: Minor > Labels: starter > > SPARK-2384 introduces jquery for tooltip display. We can probably just create > a very simple javascript for tooltip instead of pulling in jquery. > https://github.com/apache/spark/pull/1314 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2414) Remove jquery
[ https://issues.apache.org/jira/browse/SPARK-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2414: --- Labels: starter (was: ) > Remove jquery > - > > Key: SPARK-2414 > URL: https://issues.apache.org/jira/browse/SPARK-2414 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Reynold Xin >Priority: Minor > Labels: starter > > SPARK-2384 introduces jquery for tooltip display. We can probably just create > a very simple javascript for tooltip instead of pulling in jquery. > https://github.com/apache/spark/pull/1314 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073013#comment-14073013 ] lukovnikov commented on SPARK-1405: --- @Isaac, I think it's at https://github.com/yinxusen/spark/blob/lda/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala > parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib > - > > Key: SPARK-1405 > URL: https://issues.apache.org/jira/browse/SPARK-1405 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Xusen Yin >Assignee: Xusen Yin > Labels: features > Fix For: 0.9.0 > > Original Estimate: 336h > Remaining Estimate: 336h > > Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts > topics from text corpus. Different with current machine learning algorithms > in MLlib, instead of using optimization algorithms such as gradient desent, > LDA uses expectation algorithms such as Gibbs sampling. > In this PR, I prepare a LDA implementation based on Gibbs sampling, with a > wholeTextFiles API (solved yet), a word segmentation (import from Lucene), > and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073013#comment-14073013 ] lukovnikov edited comment on SPARK-1405 at 7/24/14 9:10 AM: @Isaac, I think it's at https://github.com/yinxusen/spark/blob/lda/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala and here (https://github.com/apache/spark/pull/476/files) for the other changed files as well was (Author: lukovnikov): @Isaac, I think it's at https://github.com/yinxusen/spark/blob/lda/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala > parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib > - > > Key: SPARK-1405 > URL: https://issues.apache.org/jira/browse/SPARK-1405 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Xusen Yin >Assignee: Xusen Yin > Labels: features > Fix For: 0.9.0 > > Original Estimate: 336h > Remaining Estimate: 336h > > Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts > topics from text corpus. Different with current machine learning algorithms > in MLlib, instead of using optimization algorithms such as gradient desent, > LDA uses expectation algorithms such as Gibbs sampling. > In this PR, I prepare a LDA implementation based on Gibbs sampling, with a > wholeTextFiles API (solved yet), a word segmentation (import from Lucene), > and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073020#comment-14073020 ] lukovnikov commented on SPARK-1405: --- btw, could this please be merged with the main? there are some conflicts > parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib > - > > Key: SPARK-1405 > URL: https://issues.apache.org/jira/browse/SPARK-1405 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Xusen Yin >Assignee: Xusen Yin > Labels: features > Fix For: 0.9.0 > > Original Estimate: 336h > Remaining Estimate: 336h > > Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts > topics from text corpus. Different with current machine learning algorithms > in MLlib, instead of using optimization algorithms such as gradient desent, > LDA uses expectation algorithms such as Gibbs sampling. > In this PR, I prepare a LDA implementation based on Gibbs sampling, with a > wholeTextFiles API (solved yet), a word segmentation (import from Lucene), > and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2604) Spark Application hangs on yarn in edge case scenario of executor memory requirement
[ https://issues.apache.org/jira/browse/SPARK-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073085#comment-14073085 ] Apache Spark commented on SPARK-2604: - User 'twinkle-sachdeva' has created a pull request for this issue: https://github.com/apache/spark/pull/1571 > Spark Application hangs on yarn in edge case scenario of executor memory > requirement > > > Key: SPARK-2604 > URL: https://issues.apache.org/jira/browse/SPARK-2604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Twinkle Sachdeva > > In a YARN environment, let's say: > MaxAM = maximum allocatable memory > ExecMem = executor memory > if (MaxAM > ExecMem && (MaxAM - ExecMem) > 384m) > then maximum resource validation fails w.r.t. executor memory, and the > application master gets launched, but when resources are allocated and validated again, > they are returned and the application appears to hang. > A typical use case is to ask for executor memory = the maximum allowed memory as > per the YARN config -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2604) Spark Application hangs on yarn in edge case scenario of executor memory requirement
[ https://issues.apache.org/jira/browse/SPARK-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073086#comment-14073086 ] Twinkle Sachdeva commented on SPARK-2604: - Please review the pull request: https://github.com/apache/spark/pull/1571 > Spark Application hangs on yarn in edge case scenario of executor memory > requirement > > > Key: SPARK-2604 > URL: https://issues.apache.org/jira/browse/SPARK-2604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Twinkle Sachdeva > > In a YARN environment, let's say: > MaxAM = maximum allocatable memory > ExecMem = executor memory > if (MaxAM > ExecMem && (MaxAM - ExecMem) > 384m) > then maximum resource validation fails w.r.t. executor memory, and the > application master gets launched, but when resources are allocated and validated again, > they are returned and the application appears to hang. > A typical use case is to ask for executor memory = the maximum allowed memory as > per the YARN config -- This message was sent by Atlassian JIRA (v6.2#6252)
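A rough Scala sketch of the kind of up-front check that would avoid the hang (illustrative only; 384 MB is assumed here as the default YARN memory overhead, and the actual fix is in the pull request above):
{code}
// Sketch only: validate requested executor memory against YARN's maximum
// allocatable container memory *including* the memory overhead, and fail fast
// instead of hanging while waiting for containers that can never be granted.
val memoryOverheadMb = 384  // assumed default overhead
def validateExecutorMemory(requestedMb: Int, yarnMaxAllocMb: Int): Unit = {
  val totalMb = requestedMb + memoryOverheadMb
  require(totalMb <= yarnMaxAllocMb,
    s"Executor memory ($requestedMb MB) + overhead ($memoryOverheadMb MB) = $totalMb MB " +
      s"exceeds the maximum YARN container allocation ($yarnMaxAllocMb MB)")
}
{code}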
[jira] [Commented] (SPARK-2575) SVMWithSGD throwing Input Validation failed
[ https://issues.apache.org/jira/browse/SPARK-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073111#comment-14073111 ] navanee commented on SPARK-2575: spark SVM supports multinomial or binomial classification? > SVMWithSGD throwing Input Validation failed > > > Key: SPARK-2575 > URL: https://issues.apache.org/jira/browse/SPARK-2575 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.1 >Reporter: navanee > > SVMWithSGD throwing Input Validation failed while using Sparse Array as > Input. Though SVMWihtSGD accepts LibSVM format. > Exception trace : > org.apache.spark.SparkException: Input validation failed. > at > org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:145) > at > org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:124) > at > org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:154) > at > org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:188) > at org.apache.spark.mllib.classification.SVMWithSGD.train(SVM.scala) > at com.xurmo.ai.hades.classification.algo.Svm.train(Svm.java:143) > at > com.xurmo.ai.hades.classification.algo.SimpleSVMTest.generateModelFile(SimpleSVMTest.java:172) > at > com.xurmo.ai.hades.classification.algo.SimpleSVMTest.trainSampleDataTest(SimpleSVMTest.java:65) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) > at > org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184) > at org.junit.runners.ParentRunner.run(ParentRunner.java:236) > at > org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) > at > org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) -- This message was sent by Atlassian JIRA (v6.2#6252)
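For reference, a minimal Scala sketch of training SVMWithSGD on LibSVM-format data, assuming a SparkContext named sc; note that MLlib's binary classifiers validate that labels are 0.0 or 1.0, which is a common cause of "Input validation failed" (path and iteration count below are placeholders):
{code}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.util.MLUtils

// Labels in the input must be 0.0 or 1.0 for SVMWithSGD, otherwise input
// validation fails. Path and iteration count are placeholders.
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
val model = SVMWithSGD.train(data, 100) // 100 iterations
{code}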
[jira] [Created] (SPARK-2666) when task is FetchFailed cancel running tasks of failedStage
Lianhui Wang created SPARK-2666: --- Summary: when task is FetchFailed cancel running tasks of failedStage Key: SPARK-2666 URL: https://issues.apache.org/jira/browse/SPARK-2666 Project: Spark Issue Type: Bug Reporter: Lianhui Wang In DAGScheduler's handleTaskCompletion, when the reason for a failed task is FetchFailed, cancel the running tasks of the failed stage before adding failedStage to the failedStages queue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file
[ https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073114#comment-14073114 ] Prashant Sharma commented on SPARK-2576: Looking at it. > slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark > QL query on HDFS CSV file > -- > > Key: SPARK-2576 > URL: https://issues.apache.org/jira/browse/SPARK-2576 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.0.1 > Environment: One Mesos 0.19 master without zookeeper and 4 mesos > slaves. > JDK 1.7.51 and Scala 2.10.4 on all nodes. > HDFS from CDH5.0.3 > Spark version: I tried both with the pre-built CDH5 spark package available > from http://spark.apache.org/downloads.html and by packaging spark with sbt > 0.13.2, JDK 1.7.51 and scala 2.10.4 as explained here > http://mesosphere.io/learn/run-spark-on-mesos/ > All nodes are running Debian 3.2.51-1 x86_64 GNU/Linux and have >Reporter: Svend Vanderveken >Assignee: Yin Huai >Priority: Blocker > > Execution of SQL query against HDFS systematically throws a class not found > exception on slave nodes when executing . > (this was originally reported on the user list: > http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html) > Sample code (ran from spark-shell): > {code} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.createSchemaRDD > case class Car(timestamp: Long, objectid: String, isGreen: Boolean) > // I get the same error when pointing to the folder > "hdfs://vm28:8020/test/cardata" > val data = sc.textFile("hdfs://vm28:8020/test/cardata/part-0") > val cars = data.map(_.split(",")).map ( ar => Car(ar(0).toLong, ar(1), > ar(2).toBoolean)) > cars.registerAsTable("mcars") > val allgreens = sqlContext.sql("SELECT objectid from mcars where isGreen = > true") > allgreens.collect.take(10).foreach(println) > {code} > Stack trace on the slave nodes: > {code} > I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0 > I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave > 20140714-142853-485682442-5050-25487-2 > 14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as > executor ID 20140714-142853-485682442-5050-25487-2 > 14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: > mesos,mnubohadoop > 14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(mesos, > mnubohadoop) > 14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started > 14/07/16 13:01:17 INFO Remoting: Starting remoting > 14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://spark@vm23:38230] > 14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: > [akka.tcp://spark@vm23:38230] > 14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: > akka.tcp://spark@vm28:41632/user/MapOutputTracker > 14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: > akka.tcp://spark@vm28:41632/user/BlockManagerMaster > 14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at > /tmp/spark-local-20140716130117-8ea0 > 14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 > MB. 
> 14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id > = ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501) > 14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager > 14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager > 14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is > /tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633 > 14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server > 14/07/16 13:01:18 INFO Executor: Using REPL class URI: http://vm28:33973 > 14/07/16 13:01:18 INFO Executor: Running task ID 2 > 14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0 > 14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with > curMem=0, maxMem=309225062 > 14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to > memory (estimated size 122.6 KB, free 294.8 MB) > 14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took > 0.294602722 s > 14/07/16 13:01:19 INFO HadoopRDD: Input split: > hdfs://vm28:8020/test/cardata/part-0:23960450+23960451 > I0716 13:01:19.905113 13657 exec.cpp:378] Executor asked to shutdown > 14/07/16 13:01:20 ERROR Executor: Exception in task ID 2 > java.lang.NoClassDefFoundError: $line11/$read$ > at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19) > at
[jira] [Commented] (SPARK-2666) when task is FetchFailed cancel running tasks of failedStage
[ https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073116#comment-14073116 ] Apache Spark commented on SPARK-2666: - User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/1572 > when task is FetchFailed cancel running tasks of failedStage > > > Key: SPARK-2666 > URL: https://issues.apache.org/jira/browse/SPARK-2666 > Project: Spark > Issue Type: Bug >Reporter: Lianhui Wang > > in DAGScheduler's handleTaskCompletion,when reason of failed task is > FetchFailed, cancel running tasks of failedStage before add failedStage to > failedStages queue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2456) Scheduler refactoring
[ https://issues.apache.org/jira/browse/SPARK-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073122#comment-14073122 ] Nan Zhu commented on SPARK-2456: maybe it's also related: https://github.com/apache/spark/pull/637 > Scheduler refactoring > - > > Key: SPARK-2456 > URL: https://issues.apache.org/jira/browse/SPARK-2456 > Project: Spark > Issue Type: Improvement >Reporter: Reynold Xin >Assignee: Reynold Xin > > This is an umbrella ticket to track scheduler refactoring. We want to clearly > define semantics and responsibilities of each component, and define explicit > public interfaces for them so it is easier to understand and to contribute > (also less buggy). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2667) getCallSiteInfo doesn't take into account that graphx is part of spark.
[ https://issues.apache.org/jira/browse/SPARK-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Budau updated SPARK-2667: Description: getCallSiteInfo from org.apache.spark.util.Utils uses a regex pattern to match when a function is part of spark or not. At the moment this does not include GraphX > getCallSiteInfo doesn't take into account that graphx is part of spark. > --- > > Key: SPARK-2667 > URL: https://issues.apache.org/jira/browse/SPARK-2667 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.0.0 > Environment: Mac Os X, although its on all versions >Reporter: Adrian Budau >Priority: Trivial > > getCallSiteInfo from org.apache.spark.util.Utils uses a regex pattern to > match when a function is part of spark or not. At the moment this does not > include GraphX -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2667) getCallSiteInfo doesn't take into account that graphx is part of spark.
Adrian Budau created SPARK-2667: --- Summary: getCallSiteInfo doesn't take into account that graphx is part of spark. Key: SPARK-2667 URL: https://issues.apache.org/jira/browse/SPARK-2667 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.0 Environment: Mac Os X, although its on all versions Reporter: Adrian Budau Priority: Trivial -- This message was sent by Atlassian JIRA (v6.2#6252)
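For illustration, the check in Utils is a regex of roughly the following shape (this is a sketch, not the exact pattern in Utils.scala); the fix amounts to making it also match org.apache.spark.graphx classes:
{code}
// Sketch only: a Spark-internal-class matcher that also accepts GraphX classes,
// so frames inside graphx are skipped when computing the user call site.
val SPARK_CLASS_REGEX = """^org\.apache\.spark(\.api\.java)?(\.util)?(\.rdd)?(\.graphx)?\.[A-Z]""".r

def isSparkClass(className: String): Boolean =
  SPARK_CLASS_REGEX.findFirstIn(className).isDefined

// isSparkClass("org.apache.spark.graphx.Graph") == true
// isSparkClass("com.example.MyApp")             == false
{code}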
[jira] [Created] (SPARK-2668) Support log4j log to yarn container log directory
Peng Zhang created SPARK-2668: - Summary: Support log4j log to yarn container log directory Key: SPARK-2668 URL: https://issues.apache.org/jira/browse/SPARK-2668 Project: Spark Issue Type: Improvement Components: YARN Reporter: Peng Zhang Fix For: 1.0.0 Assign the value of the YARN container log directory to the Java opt "spark.yarn.log.dir", so that a user-defined log4j.properties can reference this value and write logs to the YARN container log directory. Otherwise, a user-defined file appender will log to the CWD, and the files will neither be displayed in the YARN UI nor aggregated to the HDFS log directory after the job finishes. Example reference in a user-defined log4j.properties: {code} log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2668) Support log4j log to yarn container log directory
[ https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073199#comment-14073199 ] Apache Spark commented on SPARK-2668: - User 'renozhang' has created a pull request for this issue: https://github.com/apache/spark/pull/1573 > Support log4j log to yarn container log directory > - > > Key: SPARK-2668 > URL: https://issues.apache.org/jira/browse/SPARK-2668 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Peng Zhang > Fix For: 1.0.0 > > > Assign value of yarn container log directory to java opts > "spark.yarn.log.dir", So user defined log4j.properties can reference this > value and write log to YARN container directory. > Otherwise, user defined file append will log to CWD, and files will not be > displayed on YARN UI,and either cannot be aggregated to HDFS log directory > after job finished. > User defined log4j.properties reference example: > {code} > log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2150) Provide direct link to finished application UI in yarn resource manager UI
[ https://issues.apache.org/jira/browse/SPARK-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-2150: - Assignee: Rahul Singhal > Provide direct link to finished application UI in yarn resource manager UI > -- > > Key: SPARK-2150 > URL: https://issues.apache.org/jira/browse/SPARK-2150 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.0.0 >Reporter: Rahul Singhal >Assignee: Rahul Singhal >Priority: Minor > Fix For: 1.1.0 > > > Currently the link provided as the tracking URL for a finished > application in the YARN resource manager UI points to the Spark history server home > page. We should provide a direct link to the application UI so that the user > does not have to figure out the correspondence between the YARN application ID > and the link on the Spark history server home page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2150) Provide direct link to finished application UI in yarn resource manager UI
[ https://issues.apache.org/jira/browse/SPARK-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-2150. -- Resolution: Fixed Fix Version/s: 1.1.0 Target Version/s: 1.1.0 > Provide direct link to finished application UI in yarn resource manager UI > -- > > Key: SPARK-2150 > URL: https://issues.apache.org/jira/browse/SPARK-2150 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.0.0 >Reporter: Rahul Singhal >Priority: Minor > Fix For: 1.1.0 > > > Currently the link provided as the tracking URL for a finished > application in the YARN resource manager UI points to the Spark history server home > page. We should provide a direct link to the application UI so that the user > does not have to figure out the correspondence between the YARN application ID > and the link on the Spark history server home page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1112) When spark.akka.frameSize > 10, task results bigger than 10MiB block execution
[ https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073244#comment-14073244 ] DjvuLee commented on SPARK-1112: Has anyone tested this on version 0.9.2? I found it also fails there, while v1.0.1 and v1.1.0 are OK. > When spark.akka.frameSize > 10, task results bigger than 10MiB block execution > -- > > Key: SPARK-1112 > URL: https://issues.apache.org/jira/browse/SPARK-1112 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.9.0, 1.0.0 >Reporter: Guillaume Pitel >Assignee: Xiangrui Meng >Priority: Blocker > Fix For: 0.9.2 > > > When I set the spark.akka.frameSize to something over 10, the messages sent > from the executors to the driver completely block the execution if the > message is bigger than 10MiB and smaller than the frameSize (if it's above > the frameSize, it's ok) > Workaround is to set the spark.akka.frameSize to 10. In this case, since > 0.8.1, the blockManager deals with the data to be sent. It seems slower than > akka direct message though. > The configuration seems to be correctly read (see actorSystemConfig.txt), so > I don't see where the 10MiB could come from -- This message was sent by Atlassian JIRA (v6.2#6252)
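For reference, the workaround mentioned in the description can be applied programmatically (a sketch only; the same property can also be set in spark-defaults.conf):
{code}
import org.apache.spark.SparkConf

// Keep the Akka frame size at the 10 MB default so large task results go
// through the block manager instead of a single oversized Akka message.
val conf = new SparkConf().set("spark.akka.frameSize", "10")
{code}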
[jira] [Commented] (SPARK-2668) Support log4j log to yarn container log directory
[ https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073246#comment-14073246 ] Thomas Graves commented on SPARK-2668: -- Sorry I don't follow what you are saying here. spark on yarn uses the yarn approved logging directories and aggregation works fine. Perhaps YARN is misconfigured? > Support log4j log to yarn container log directory > - > > Key: SPARK-2668 > URL: https://issues.apache.org/jira/browse/SPARK-2668 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Peng Zhang > Fix For: 1.0.0 > > > Assign value of yarn container log directory to java opts > "spark.yarn.log.dir", So user defined log4j.properties can reference this > value and write log to YARN container directory. > Otherwise, user defined file append will log to CWD, and files will not be > displayed on YARN UI,and either cannot be aggregated to HDFS log directory > after job finished. > User defined log4j.properties reference example: > {code} > log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2668) Support log4j log to yarn container log directory
[ https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073278#comment-14073278 ] Peng Zhang commented on SPARK-2668: --- [~tgraves] Original log works fine, and log will be written to yarn container log directory and named as "stderr". But when I want to define my own log4j configuration, for example using RollingAppender to avoid log file too big, especially for spark Streaming(7 x 24 hours), I should can't specify the base directory for log. So adding "spark.yarn.log.dir" will help for reference in log4j.properties, like the example in description. Otherwise, log files will be located in container's working directory. > Support log4j log to yarn container log directory > - > > Key: SPARK-2668 > URL: https://issues.apache.org/jira/browse/SPARK-2668 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Peng Zhang > Fix For: 1.0.0 > > > Assign value of yarn container log directory to java opts > "spark.yarn.log.dir", So user defined log4j.properties can reference this > value and write log to YARN container directory. > Otherwise, user defined file append will log to CWD, and files will not be > displayed on YARN UI,and either cannot be aggregated to HDFS log directory > after job finished. > User defined log4j.properties reference example: > {code} > log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2668) Support log4j log to yarn container log directory
[ https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073278#comment-14073278 ] Peng Zhang edited comment on SPARK-2668 at 7/24/14 3:12 PM: [~tgraves] Original log works fine, and log will be written to yarn container log directory and named as "stderr". But when I want to define my own log4j configuration, for example using RollingAppender to avoid log file too big, especially for spark Streaming(7 x 24 hours), I should specify the base directory for log. So adding "spark.yarn.log.dir" will help for reference in log4j.properties, like the example in description. Otherwise, log files will be located in container's working directory. was (Author: peng.zhang): [~tgraves] Original log works fine, and log will be written to yarn container log directory and named as "stderr". But when I want to define my own log4j configuration, for example using RollingAppender to avoid log file too big, especially for spark Streaming(7 x 24 hours), I should can't specify the base directory for log. So adding "spark.yarn.log.dir" will help for reference in log4j.properties, like the example in description. Otherwise, log files will be located in container's working directory. > Support log4j log to yarn container log directory > - > > Key: SPARK-2668 > URL: https://issues.apache.org/jira/browse/SPARK-2668 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Peng Zhang > Fix For: 1.0.0 > > > Assign value of yarn container log directory to java opts > "spark.yarn.log.dir", So user defined log4j.properties can reference this > value and write log to YARN container directory. > Otherwise, user defined file append will log to CWD, and files will not be > displayed on YARN UI,and either cannot be aggregated to HDFS log directory > after job finished. > User defined log4j.properties reference example: > {code} > log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2668) Support log4j log to yarn container log directory
[ https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073310#comment-14073310 ] Thomas Graves commented on SPARK-2668: -- Oh, I see you just want a variable to reference from the log4j config. I understand the use case and really YARN should solve this for you. There are jira out there to support long running tasks on yarn, the one for logs is: https://issues.apache.org/jira/browse/YARN-1104 This might be ok for short term workaround for that since its just reading and not allowing user to set it. I need to look at it a bit closer. > Support log4j log to yarn container log directory > - > > Key: SPARK-2668 > URL: https://issues.apache.org/jira/browse/SPARK-2668 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Peng Zhang > Fix For: 1.0.0 > > > Assign value of yarn container log directory to java opts > "spark.yarn.log.dir", So user defined log4j.properties can reference this > value and write log to YARN container directory. > Otherwise, user defined file append will log to CWD, and files will not be > displayed on YARN UI,and either cannot be aggregated to HDFS log directory > after job finished. > User defined log4j.properties reference example: > {code} > log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2575) SVMWithSGD throwing Input Validation failed
[ https://issues.apache.org/jira/browse/SPARK-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073329#comment-14073329 ] Xiangrui Meng commented on SPARK-2575: -- [~dbtsai] sent a PR for multinomial logistic regression: https://github.com/apache/spark/pull/1379 Btw, is your problem solved? > SVMWithSGD throwing Input Validation failed > > > Key: SPARK-2575 > URL: https://issues.apache.org/jira/browse/SPARK-2575 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.1 >Reporter: navanee > > SVMWithSGD throwing Input Validation failed while using Sparse Array as > Input. Though SVMWihtSGD accepts LibSVM format. > Exception trace : > org.apache.spark.SparkException: Input validation failed. > at > org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:145) > at > org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:124) > at > org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:154) > at > org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:188) > at org.apache.spark.mllib.classification.SVMWithSGD.train(SVM.scala) > at com.xurmo.ai.hades.classification.algo.Svm.train(Svm.java:143) > at > com.xurmo.ai.hades.classification.algo.SimpleSVMTest.generateModelFile(SimpleSVMTest.java:172) > at > com.xurmo.ai.hades.classification.algo.SimpleSVMTest.trainSampleDataTest(SimpleSVMTest.java:65) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) > at > org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184) > at org.junit.runners.ParentRunner.run(ParentRunner.java:236) > at > org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) > at > org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2669) Hadoop configuration is not localised when submitting job in yarn-cluster mode
Maxim Ivanov created SPARK-2669: --- Summary: Hadoop configuration is not localised when submitting job in yarn-cluster mode Key: SPARK-2669 URL: https://issues.apache.org/jira/browse/SPARK-2669 Project: Spark Issue Type: Bug Reporter: Maxim Ivanov I'd like to propose a fix for a problem where the Hadoop configuration is not localized when a job is submitted in yarn-cluster mode. Here is a description from the GitHub pull request https://github.com/apache/spark/pull/1574 This patch fixes a problem where, when the Spark driver is run in a container managed by the YARN ResourceManager, it inherits its configuration from the NodeManager process, which can differ from the Hadoop configuration present on the client (submitting machine). The problem is most visible when the fs.defaultFS property differs between the two. Hadoop MR solves this by serializing the client's Hadoop configuration into job.xml in the application staging directory and then making the Application Master use it. That guarantees that, regardless of the execution nodes' configurations, all application containers use the same config, identical to the one on the client side. This patch uses a similar approach. YARN ClientBase serializes the configuration and adds it to the ClientDistributedCacheManager under the "job.xml" link name. ClientDistributedCacheManager then uses the Hadoop localizer to deliver it to whatever container is started by this application, including the one running the Spark driver. YARN ClientBase also adds a "SPARK_LOCAL_HADOOPCONF" env variable to the AM container request, which is then used by SparkHadoopUtil.newConfiguration to trigger the new behavior where the machine-wide Hadoop configuration is merged with the application-specific job.xml (exactly as it is done in Hadoop MR). SparkContext then follows the same approach, adding the SPARK_LOCAL_HADOOPCONF env variable to all spawned containers to make them use the client-side Hadoop configuration. Also, all references to "new Configuration()" that might be executed on the YARN cluster side are changed to use SparkHadoopUtil.get.conf. Please note that it fixes only core Spark, the part which I am comfortable testing and verifying. I didn't descend into the streaming/shark directories, so things might need to be changed there too. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2669) Hadoop configuration is not localised when submitting job in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073338#comment-14073338 ] Apache Spark commented on SPARK-2669: - User 'redbaron' has created a pull request for this issue: https://github.com/apache/spark/pull/1574 > Hadoop configuration is not localised when submitting job in yarn-cluster mode > -- > > Key: SPARK-2669 > URL: https://issues.apache.org/jira/browse/SPARK-2669 > Project: Spark > Issue Type: Bug >Reporter: Maxim Ivanov > > I'd like to propose a fix for a problem when Hadoop configuration is not > localized when job is submitted in yarn-cluster mode. Here is a description > from github pull request https://github.com/apache/spark/pull/1574 > This patch fixes a problem when Spark driver is run in the container > managed by YARN ResourceManager it inherits configuration from a > NodeManager process, which can be different from the Hadoop > configuration present on the client (submitting machine). Problem is > most vivid when fs.defaultFS property differs between these two. > Hadoop MR solves it by serializing client's Hadoop configuration into > job.xml in application staging directory and then making Application > Master to use it. That guarantees that regardless of execution nodes > configurations all application containers use same config identical to > one on the client side. > This patch uses similar approach. YARN ClientBase serializes > configuration and adds it to ClientDistributedCacheManager under > "job.xml" link name. ClientDistributedCacheManager is then utilizes > Hadoop localizer to deliver it to whatever container is started by this > application, including the one running Spark driver. > YARN ClientBase also adds "SPARK_LOCAL_HADOOPCONF" env variable to AM > container request which is then used by SparkHadoopUtil.newConfiguration > to trigger new behavior when machine-wide hadoop configuration is merged > with application specific job.xml (exactly how it is done in Hadoop MR). > SparkContext is then follows same approach, adding > SPARK_LOCAL_HADOOPCONF env to all spawned containers to make them use > client-side Hadopo configuration. > Also all the references to "new Configuration()" which might be executed > on YARN cluster side are changed to use SparkHadoopUtil.get.conf > Please note that it fixes only core Spark, the part which I am > comfortable to test and verify the result. I didn't descend into > steaming/shark directories, so things might need to be changed there too. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2668) Support log4j log to yarn container log directory
[ https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073369#comment-14073369 ] Peng Zhang commented on SPARK-2668: --- Yes, this is a common issue for long-running tasks on YARN. Our solution for Spark Streaming is to use a RollingAppender to keep only the latest 10 x 100M log files on disk. This helps when viewing logs through the YARN UI (a single file is not too big) and also avoids filling up the disk. Besides the file appender, we also send all log messages to a Scribe service which writes them to HDFS (using a log4j appender for Scribe). This helps with analysing all the logs generated during a run. > Support log4j log to yarn container log directory > - > > Key: SPARK-2668 > URL: https://issues.apache.org/jira/browse/SPARK-2668 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Peng Zhang > Fix For: 1.0.0 > > > Assign value of yarn container log directory to java opts > "spark.yarn.log.dir", So user defined log4j.properties can reference this > value and write log to YARN container directory. > Otherwise, user defined file append will log to CWD, and files will not be > displayed on YARN UI,and either cannot be aggregated to HDFS log directory > after job finished. > User defined log4j.properties reference example: > {code} > log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
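A log4j.properties sketch of that setup, combining the ${spark.yarn.log.dir} reference proposed in this issue with a size-capped rolling appender (the appender name and size limits are illustrative):
{code}
# Illustrative only: roll the log inside the YARN container log directory,
# keeping at most 10 files of 100MB each.
log4j.rootLogger=INFO, rolling_file
log4j.appender.rolling_file=org.apache.log4j.RollingFileAppender
log4j.appender.rolling_file.File=${spark.yarn.log.dir}/spark.log
log4j.appender.rolling_file.MaxFileSize=100MB
log4j.appender.rolling_file.MaxBackupIndex=10
log4j.appender.rolling_file.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling_file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
{code}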
[jira] [Updated] (SPARK-1264) Documentation for setting heap sizes across all configurations
[ https://issues.apache.org/jira/browse/SPARK-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-1264: -- Assignee: (was: Aaron Davidson) > Documentation for setting heap sizes across all configurations > -- > > Key: SPARK-1264 > URL: https://issues.apache.org/jira/browse/SPARK-1264 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.0.0 >Reporter: Andrew Ash > > As a user, there are lots of places to configure heap sizes, and it takes a > bit of trial and error to figure out how to configure what you want. > We need some more clear documentation on how set these for the cross product > of Spark components (master, worker, executor, driver, shell) and deployment > modes (Standalone, YARN, Mesos, EC2?). > I'm happy to do the authoring if someone can help pull together the relevant > details. > Here's the best I've got so far: > {noformat} > # Standalone cluster > Master - SPARK_DAEMON_MEMORY - default: 512mb > Worker - SPARK_DAEMON_MEMORY vs SPARK_WORKER_MEMORY? - default: ? See > WorkerArguments.inferDefaultMemory() > Executor - spark.executor.memory > Driver - SPARK_DRIVER_MEMORY - default: 512mb > Shell - A pre-built driver so SPARK_DRIVER_MEMORY - default: 512mb > # EC2 cluster > Master - ? > Worker - ? > Executor - ? > Driver - ? > Shell - ? > # Mesos cluster > Master - SPARK_DAEMON_MEMORY > Worker - SPARK_DAEMON_MEMORY > Executor - SPARK_EXECUTOR_MEMORY > Driver - SPARK_DRIVER_MEMORY > Shell - A pre-built driver so SPARK_DRIVER_MEMORY > # YARN cluster > Master - SPARK_MASTER_MEMORY ? > Worker - SPARK_WORKER_MEMORY ? > Executor - SPARK_EXECUTOR_MEMORY > Driver - SPARK_DRIVER_MEMORY > Shell - A pre-built driver so SPARK_DRIVER_MEMORY > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2583) ConnectionManager cannot distinguish whether error occurred or not
[ https://issues.apache.org/jira/browse/SPARK-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073419#comment-14073419 ] Kousuke Saruta commented on SPARK-2583: --- I have added some test cases to my PR for this issue. > ConnectionManager cannot distinguish whether error occurred or not > -- > > Key: SPARK-2583 > URL: https://issues.apache.org/jira/browse/SPARK-2583 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Critical > > ConnectionManager#handleMessage sent empty messages to another peer if some > error occurred or not in onReceiveCalback. > {code} > val ackMessage = if (onReceiveCallback != null) { > logDebug("Calling back") > onReceiveCallback(bufferMessage, connectionManagerId) > } else { > logDebug("Not calling back as callback is null") > None > } > if (ackMessage.isDefined) { > if (!ackMessage.get.isInstanceOf[BufferMessage]) { > logDebug("Response to " + bufferMessage + " is not a buffer > message, it is of type " > + ackMessage.get.getClass) > } else if (!ackMessage.get.asInstanceOf[BufferMessage].hasAckId) { > logDebug("Response to " + bufferMessage + " does not have ack > id set") > ackMessage.get.asInstanceOf[BufferMessage].ackId = > bufferMessage.id > } > } > // We have no way to tell peer whether error occurred or not > sendMessage(connectionManagerId, ackMessage.getOrElse { > Message.createBufferMessage(bufferMessage.id) > }) > } > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2479) Comparing floating-point numbers using relative error in UnitTests
[ https://issues.apache.org/jira/browse/SPARK-2479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073430#comment-14073430 ] Apache Spark commented on SPARK-2479: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/1576 > Comparing floating-point numbers using relative error in UnitTests > -- > > Key: SPARK-2479 > URL: https://issues.apache.org/jira/browse/SPARK-2479 > Project: Spark > Issue Type: Improvement >Reporter: DB Tsai >Assignee: DB Tsai > > Floating point math is not exact, and most floating-point numbers end up > being slightly imprecise due to rounding errors. Simple values like 0.1 > cannot be precisely represented using binary floating point numbers, and the > limited precision of floating point numbers means that slight changes in the > order of operations or the precision of intermediates can change the result. > That means that comparing two floats to see if they are equal is usually not > what we want. As long as this imprecision stays small, it can usually be > ignored. > See the following famous article for detail. > http://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/ > For example: > float a = 0.15 + 0.15 > float b = 0.1 + 0.2 > if(a == b) // can be false! > if(a >= b) // can also be false! -- This message was sent by Atlassian JIRA (v6.2#6252)
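A minimal Scala sketch of the relative-error comparison described above (the helper name and epsilon are illustrative, not the final test API):
{code}
// Compare two doubles using relative error, falling back to absolute error near zero.
def approxEqual(a: Double, b: Double, eps: Double = 1e-8): Boolean = {
  val diff = math.abs(a - b)
  if (a == b) true
  else if (math.max(math.abs(a), math.abs(b)) < eps) diff < eps
  else diff / math.max(math.abs(a), math.abs(b)) < eps
}

assert(approxEqual(0.15 + 0.15, 0.1 + 0.2))   // passes even though == may not
{code}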
[jira] [Updated] (SPARK-2538) External aggregation in Python
[ https://issues.apache.org/jira/browse/SPARK-2538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2538: - Priority: Critical (was: Major) > External aggregation in Python > -- > > Key: SPARK-2538 > URL: https://issues.apache.org/jira/browse/SPARK-2538 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.0.0, 1.0.1 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Labels: pyspark > Fix For: 1.0.0, 1.0.1 > > Original Estimate: 72h > Remaining Estimate: 72h > > For huge reduce tasks, users will get an out-of-memory exception when all the > data cannot fit in memory. > It should spill some of the data to disk and then merge it back, just > like what we do in Scala. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2670) FetchFailedException should be thrown when local fetch has failed
Kousuke Saruta created SPARK-2670: - Summary: FetchFailedException should be thrown when local fetch has failed Key: SPARK-2670 URL: https://issues.apache.org/jira/browse/SPARK-2670 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Kousuke Saruta In BasicBlockFetchIterator, when a remote fetch has failed, a FetchResult whose size is -1 is put into results. {code} case None => { logError("Could not get block(s) from " + cmId) for ((blockId, size) <- req.blocks) { results.put(new FetchResult(blockId, -1, null)) } {code} The size -1 means the fetch failed, and BlockStoreShuffleFetcher#unpackBlock throws FetchFailedException so that we can retry. But when a local fetch fails, no failed FetchResult is put into results, so we cannot retry for that block. -- This message was sent by Atlassian JIRA (v6.2#6252)
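A hedged sketch of the fix direction described above: mark a failed local fetch the same way a failed remote fetch is marked, so unpackBlock can raise FetchFailedException. Method names below are approximations of the 1.0 code, not the actual patch:
{code}
// Inside getLocalBlocks() (names approximate): record local failures with size -1,
// matching the remote-failure path, so downstream code can trigger a retry.
for (id <- localBlockIds) {
  try {
    val iter = getLocalFromDisk(id, serializer).get
    results.put(new FetchResult(id, 0, () => iter))
  } catch {
    case e: Exception =>
      logError("Could not get block(s) locally", e)
      results.put(new FetchResult(id, -1, null))   // size -1 => FetchFailedException upstream
  }
}
{code}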
[jira] [Updated] (SPARK-2619) Configurable file-mode for spark/bin folder in the .deb package.
[ https://issues.apache.org/jira/browse/SPARK-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2619: --- Assignee: Christian Tzolov > Configurable file-mode for spark/bin folder in the .deb package. > - > > Key: SPARK-2619 > URL: https://issues.apache.org/jira/browse/SPARK-2619 > Project: Spark > Issue Type: Improvement > Components: Build, Deploy >Reporter: Christian Tzolov >Assignee: Christian Tzolov > > Currently the /bin folder in the .deb package is hardcoded to 744. So only > the root user (deb.user defaults to root) can run Spark jobs. > If we make the /bin file mode a configurable Maven property, then we can easily generate > a package with less restrictive execution rights. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2619) Configurable file-mode for spark/bin folder in the .deb package.
[ https://issues.apache.org/jira/browse/SPARK-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2619. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1531 [https://github.com/apache/spark/pull/1531] > Configurable file-mode for spark/bin folder in the .deb package. > - > > Key: SPARK-2619 > URL: https://issues.apache.org/jira/browse/SPARK-2619 > Project: Spark > Issue Type: Improvement > Components: Build, Deploy >Reporter: Christian Tzolov >Assignee: Christian Tzolov > Fix For: 1.1.0 > > > Currently the /bin folder in the .dep package is hardcoded to 744. So only > the Root user (deb.user defaults to root) can run Spark jobs. > If we make /bin filemode a configural maven property then we easily generate > a package with less restrictive execution rights. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2671) BlockObjectWriter should create parent directory when the directory doesn't exist
Kousuke Saruta created SPARK-2671: - Summary: BlockObjectWriter should create parent directory when the directory doesn't exist Key: SPARK-2671 URL: https://issues.apache.org/jira/browse/SPARK-2671 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Kousuke Saruta Priority: Minor BlockObjectWriter#open expects the parent directory to be present. {code} override def open(): BlockObjectWriter = { fos = new FileOutputStream(file, true) ts = new TimeTrackingOutputStream(fos) channel = fos.getChannel() lastValidPosition = initialPosition bs = compressStream(new BufferedOutputStream(ts, bufferSize)) objOut = serializer.newInstance().serializeStream(bs) initialized = true this } {code} Normally, the parent directory is created by DiskBlockManager#createLocalDirs, but just in case, BlockObjectWriter#open should check whether the directory exists and create it if it does not. A recoverable error should be recovered from. -- This message was sent by Atlassian JIRA (v6.2#6252)
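A minimal sketch of the proposed guard (not the actual patch): ensure the parent directory exists before opening the stream, otherwise fail with a clear error.
{code}
// Sketch only: create the parent directory if it is missing before opening the file.
override def open(): BlockObjectWriter = {
  val parent = file.getParentFile
  if (parent != null && !parent.exists() && !parent.mkdirs() && !parent.exists()) {
    throw new java.io.IOException("Failed to create directory " + parent)
  }
  fos = new FileOutputStream(file, true)
  ts = new TimeTrackingOutputStream(fos)
  channel = fos.getChannel()
  lastValidPosition = initialPosition
  bs = compressStream(new BufferedOutputStream(ts, bufferSize))
  objOut = serializer.newInstance().serializeStream(bs)
  initialized = true
  this
}
{code}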
[jira] [Updated] (SPARK-2603) Remove unnecessary toMap and toList in converting Java collections to Scala collections JsonRDD.scala
[ https://issues.apache.org/jira/browse/SPARK-2603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2603: Fix Version/s: 1.0.2 1.1.0 > Remove unnecessary toMap and toList in converting Java collections to Scala > collections JsonRDD.scala > - > > Key: SPARK-2603 > URL: https://issues.apache.org/jira/browse/SPARK-2603 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Minor > Fix For: 1.1.0, 1.0.2 > > > In JsonRDD.scalafy, we are using toMap/toList to convert a Java Map/List to a > Scala one. These two operations are pretty expensive because they read > elements from a Java Map/List and then load to a Scala Map/List. We can use > Scala wrappers to wrap those Java collections instead of using toMap/toList. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2603) Remove unnecessary toMap and toList in converting Java collections to Scala collections JsonRDD.scala
[ https://issues.apache.org/jira/browse/SPARK-2603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2603. - Resolution: Fixed > Remove unnecessary toMap and toList in converting Java collections to Scala > collections JsonRDD.scala > - > > Key: SPARK-2603 > URL: https://issues.apache.org/jira/browse/SPARK-2603 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Minor > > In JsonRDD.scalafy, we are using toMap/toList to convert a Java Map/List to a > Scala one. These two operations are pretty expensive because they read > elements from a Java Map/List and then load to a Scala Map/List. We can use > Scala wrappers to wrap those Java collections instead of using toMap/toList. -- This message was sent by Atlassian JIRA (v6.2#6252)
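For reference, the wrapper approach described in SPARK-2603 can be sketched like this (standalone Scala example, not the JsonRDD code itself):
{code}
import scala.collection.JavaConverters._

val javaMap: java.util.Map[String, Object] = new java.util.HashMap()
javaMap.put("key", "value")

// asScala.toMap copies every entry into a new immutable Scala Map:
val copied: Map[String, Object] = javaMap.asScala.toMap

// asScala alone just wraps the Java map, with no per-element copy:
val wrapped: scala.collection.mutable.Map[String, Object] = javaMap.asScala
{code}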
[jira] [Created] (SPARK-2672) support compressed file in wholeFile()
Davies Liu created SPARK-2672: - Summary: support compressed file in wholeFile() Key: SPARK-2672 URL: https://issues.apache.org/jira/browse/SPARK-2672 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Davies Liu Fix For: 1.1.0 wholeFile() cannot read compressed files; it should be able to, just like textFile(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2673) Improve Spark so that we can attach Debugger to Executors easily
Kousuke Saruta created SPARK-2673: - Summary: Improve Spark so that we can attach Debugger to Executors easily Key: SPARK-2673 URL: https://issues.apache.org/jira/browse/SPARK-2673 Project: Spark Issue Type: Improvement Reporter: Kousuke Saruta In the current implementation, it is difficult to attach a debugger to each Executor in the cluster, for the following reasons. 1) It's difficult for Executors running on the same machine to open a debug port, because we can only pass the same JVM options to all executors. 2) Even if each Executor on a machine could open a unique debug port, it's a bother to check the debug port of every executor. To solve these problems, I think the following 2 improvements are needed. 1) Enable each executor to open a unique debug port on a machine. 2) Extend the WebUI to show the debug port open in each executor. -- This message was sent by Atlassian JIRA (v6.2#6252)
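As a point of comparison, the only knob available today is a single JVM option string shared by all executors. A hedged sketch (whether address=0 actually yields a free ephemeral port depends on the JVM in use):
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("DebugExecutors")
  // One shared option string for every executor; a fixed port here collides when
  // two executors land on the same machine. address=0 asks the JVM for a free
  // port, but the chosen port then has to be dug out of each executor's stderr.
  .set("spark.executor.extraJavaOptions",
       "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=0")
{code}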
[jira] [Commented] (SPARK-2464) Twitter Receiver does not stop correctly when streamingContext.stop is called
[ https://issues.apache.org/jira/browse/SPARK-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073518#comment-14073518 ] Apache Spark commented on SPARK-2464: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/1577 > Twitter Receiver does not stop correctly when streamingContext.stop is called > - > > Key: SPARK-2464 > URL: https://issues.apache.org/jira/browse/SPARK-2464 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.0.0, 1.0.1 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1154) Spark fills up disk with app-* folders
[ https://issues.apache.org/jira/browse/SPARK-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073521#comment-14073521 ] Andrew Ash commented on SPARK-1154: --- For the record, this is Evan's PR that closed this ticket: https://github.com/apache/spark/pull/288 > Spark fills up disk with app-* folders > -- > > Key: SPARK-1154 > URL: https://issues.apache.org/jira/browse/SPARK-1154 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: Evan Chan >Assignee: Mingyu Kim >Priority: Critical > Labels: starter > Fix For: 1.0.0 > > > Current version of Spark fills up the disk with many app-* folders: > $ ls /var/lib/spark > app-20140210022347-0597 app-20140212173327-0627 app-20140218154110-0657 > app-20140225232537-0017 app-20140225233548-0047 > app-20140210022407-0598 app-20140212173347-0628 app-20140218154130-0658 > app-20140225232551-0018 app-20140225233556-0048 > app-20140210022427-0599 app-20140212173754-0629 app-20140218164232-0659 > app-20140225232611-0019 app-20140225233603-0049 > app-20140210022447-0600 app-20140212182235-0630 app-20140218165133-0660 > app-20140225232802-0020 app-20140225233610-0050 > app-20140210022508-0601 app-20140212182256-0631 app-20140218165148-0661 > app-20140225232822-0021 app-20140225233617-0051 > app-20140210022528-0602 app-2014021314-0632 app-20140218165225-0662 > app-20140225232940-0022 app-20140225233624-0052 > app-20140211024356-0603 app-20140213002026-0633 app-20140218165249-0663 > app-20140225233002-0023 app-20140225233631-0053 > app-20140211024417-0604 app-20140213154948-0634 app-20140218172030-0664 > app-20140225233056-0024 app-20140225233725-0054 > app-20140211024437-0605 app-20140213171810-0635 app-20140218193853-0665 > app-20140225233108-0025 app-20140225233731-0055 > app-20140211024457-0606 app-20140213193637-0636 app-20140218194442-0666 > app-20140225233124-0026 app-20140225233733-0056 > app-20140211024517-0607 app-20140214011513-0637 app-20140218194746-0667 > app-20140225233133-0027 app-20140225233734-0057 > app-20140211024538-0608 app-20140214012151-0638 app-20140218194822-0668 > app-20140225233147-0028 app-20140225233749-0058 > app-20140211193443-0609 app-20140214013134-0639 app-20140218212317-0669 > app-20140225233208-0029 app-20140225233759-0059 > app-20140211195210-0610 app-20140214013332-0640 app-20140225180142- > app-20140225233215-0030 app-20140225233809-0060 > app-20140211213935-0611 app-20140214013642-0641 app-20140225180411-0001 > app-20140225233224-0031 app-20140225233828-0061 > app-20140211214227-0612 app-20140214014246-0642 app-20140225180431-0002 > app-20140225233232-0032 app-20140225234719-0062 > app-20140211215317-0613 app-20140214014607-0643 app-20140225180452-0003 > app-20140225233239-0033 app-20140226032845-0063 > app-20140211224601-0614 app-20140214184943-0644 app-20140225180512-0004 > app-20140225233320-0034 app-20140226033004-0064 > app-20140212022206-0615 app-20140214185118-0645 app-20140225180533-0005 > app-20140225233328-0035 app-20140226033119-0065 > app-2014021206-0616 app-20140214185851-0646 app-20140225180553-0006 > app-20140225233354-0036 app-2014022604-0066 > app-20140212022246-0617 app-20140214222856-0647 app-20140225181115-0007 > app-20140225233402-0037 app-20140226033354-0067 > app-20140212043704-0618 app-20140214231312-0648 app-20140225181244-0008 > app-20140225233409-0038 app-20140226033538-0068 > app-20140212043724-0619 app-20140214231434-0649 app-20140225182051-0009 > app-20140225233416-0039 app-20140226033826-0069 > app-20140212043745-0620 
app-20140214231542-0650 app-20140225183009-0010 > app-20140225233426-0040 app-20140226034002-0070 > app-20140212044016-0621 app-20140214231616-0651 app-20140225184133-0011 > app-20140225233432-0041 app-20140226034053-0071 > app-20140212044203-0622 app-20140214233016-0652 app-20140225184318-0012 > app-20140225233439-0042 app-20140226034234-0072 > app-20140212044224-0623 app-20140214233037-0653 app-20140225184709-0013 > app-20140225233447-0043 app-20140226034426-0073 > app-20140212045034-0624 app-20140218153242-0654 app-20140225184844-0014 > app-20140225233526-0044 app-20140226034447-0074 > app-20140212045119-0625 app-20140218153341-0655 app-20140225190051-0015 > app-20140225233534-0045 > app-20140212173310-0626 app-20140218153442-0656 app-20140225232516-0016 > app-20140225233540-0046 > This problem is particularly bad if you have a whole bunch of fast jobs. > Also what makes the problem worse is that any jars for jobs is downloaded > into the app-* folder, so that fills up the disk particularly fast. > I would like to propose two things: > 1) Spark should have a cl
[jira] [Commented] (SPARK-2670) FetchFailedException should be thrown when local fetch has failed
[ https://issues.apache.org/jira/browse/SPARK-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073539#comment-14073539 ] Apache Spark commented on SPARK-2670: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/1578 > FetchFailedException should be thrown when local fetch has failed > - > > Key: SPARK-2670 > URL: https://issues.apache.org/jira/browse/SPARK-2670 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Kousuke Saruta > > In BasicBlockFetchIterator, when remote fetch has failed, then FetchResult > which size is -1 is set to results. > {code} >case None => { > logError("Could not get block(s) from " + cmId) > for ((blockId, size) <- req.blocks) { > results.put(new FetchResult(blockId, -1, null)) > } > {code} > The size -1 means fetch fail and BlockStoreShuffleFetcher#unpackBlock throws > FetchFailedException so that we can retry. > But, when local fetch has failed, the failed FetchResult is not set. > So, we cannot retry for the FetchResult. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2674) Add date and time types to inferSchema
Hossein Falaki created SPARK-2674: - Summary: Add date and time types to inferSchema Key: SPARK-2674 URL: https://issues.apache.org/jira/browse/SPARK-2674 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.0.0 Reporter: Hossein Falaki When I try inferSchema in PySpark on an RDD of dictionaries that contain a datetime.datetime object, I get the following exception: {code} Object of type java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=2014,MONTH=3,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=22,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=0,HOUR_OF_DAY=0,MINUTE=0,SECOND=0,MILLISECOND=4,ZONE_OFFSET=?,DST_OFFSET=?] cannot be used {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2675) LiveListenerBus should set higher capacity for its event queue
Zongheng Yang created SPARK-2675: Summary: LiveListenerBus should set higher capacity for its event queue Key: SPARK-2675 URL: https://issues.apache.org/jira/browse/SPARK-2675 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Zongheng Yang -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2676) CLONE - LiveListenerBus should set higher capacity for its event queue
Zongheng Yang created SPARK-2676: Summary: CLONE - LiveListenerBus should set higher capacity for its event queue Key: SPARK-2676 URL: https://issues.apache.org/jira/browse/SPARK-2676 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Zongheng Yang Assignee: Zongheng Yang -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-2676) CLONE - LiveListenerBus should set higher capacity for its event queue
[ https://issues.apache.org/jira/browse/SPARK-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zongheng Yang closed SPARK-2676. Resolution: Duplicate > CLONE - LiveListenerBus should set higher capacity for its event queue > --- > > Key: SPARK-2676 > URL: https://issues.apache.org/jira/browse/SPARK-2676 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0, 1.0.1 >Reporter: Zongheng Yang >Assignee: Zongheng Yang > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2675) LiveListenerBus should set higher capacity for its event queue
[ https://issues.apache.org/jira/browse/SPARK-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073546#comment-14073546 ] Apache Spark commented on SPARK-2675: - User 'concretevitamin' has created a pull request for this issue: https://github.com/apache/spark/pull/1579 > LiveListenerBus should set higher capacity for its event queue > --- > > Key: SPARK-2675 > URL: https://issues.apache.org/jira/browse/SPARK-2675 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0, 1.0.1 >Reporter: Zongheng Yang >Assignee: Zongheng Yang > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2671) BlockObjectWriter should create parent directory when the directory doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073548#comment-14073548 ] Apache Spark commented on SPARK-2671: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/1580 > BlockObjectWriter should create parent directory when the directory doesn't > exist > - > > Key: SPARK-2671 > URL: https://issues.apache.org/jira/browse/SPARK-2671 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Kousuke Saruta >Priority: Minor > > BlockObjectWriter#open expects parent directory is present. > {code} > override def open(): BlockObjectWriter = { > fos = new FileOutputStream(file, true) > ts = new TimeTrackingOutputStream(fos) > channel = fos.getChannel() > lastValidPosition = initialPosition > bs = compressStream(new BufferedOutputStream(ts, bufferSize)) > objOut = serializer.newInstance().serializeStream(bs) > initialized = true > this > } > {code} > Normally, the parent directory is created by DiskBlockManager#createLocalDirs > but, just in case, BlockObjectWriter#open should check the existence of the > directory and create the directory if the directory does not exist. > I think, recoverable error should be recovered. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2037) yarn client mode doesn't support spark.yarn.max.executor.failures
[ https://issues.apache.org/jira/browse/SPARK-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-2037. -- Resolution: Fixed Fix Version/s: 1.1.0 > yarn client mode doesn't support spark.yarn.max.executor.failures > - > > Key: SPARK-2037 > URL: https://issues.apache.org/jira/browse/SPARK-2037 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.0.0 >Reporter: Thomas Graves >Assignee: Guoqiang Li > Fix For: 1.1.0 > > > yarn client mode doesn't support the config spark.yarn.max.executor.failures. > We should investigate if we need it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2674) Add date and time types to inferSchema
[ https://issues.apache.org/jira/browse/SPARK-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2674: Assignee: Davies Liu (was: Michael Armbrust) > Add date and time types to inferSchema > -- > > Key: SPARK-2674 > URL: https://issues.apache.org/jira/browse/SPARK-2674 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.0.0 >Reporter: Hossein Falaki >Assignee: Davies Liu > > When I try inferSchema in PySpark on an RDD of dictionary that contains a > datatime.datetime object, I get the following exception: > {code} > Object of type > java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=2014,MONTH=3,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=22,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=0,HOUR_OF_DAY=0,MINUTE=0,SECOND=0,MILLISECOND=4,ZONE_OFFSET=?,DST_OFFSET=?] > cannot be used > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-2674) Add date and time types to inferSchema
[ https://issues.apache.org/jira/browse/SPARK-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-2674: --- Assignee: Michael Armbrust > Add date and time types to inferSchema > -- > > Key: SPARK-2674 > URL: https://issues.apache.org/jira/browse/SPARK-2674 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.0.0 >Reporter: Hossein Falaki >Assignee: Michael Armbrust > > When I try inferSchema in PySpark on an RDD of dictionary that contains a > datatime.datetime object, I get the following exception: > {code} > Object of type > java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=2014,MONTH=3,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=22,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=0,HOUR_OF_DAY=0,MINUTE=0,SECOND=0,MILLISECOND=4,ZONE_OFFSET=?,DST_OFFSET=?] > cannot be used > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2674) Add date and time types to inferSchema
[ https://issues.apache.org/jira/browse/SPARK-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2674: Target Version/s: 1.1.0 > Add date and time types to inferSchema > -- > > Key: SPARK-2674 > URL: https://issues.apache.org/jira/browse/SPARK-2674 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.0.0 >Reporter: Hossein Falaki >Assignee: Davies Liu > > When I try inferSchema in PySpark on an RDD of dictionary that contains a > datatime.datetime object, I get the following exception: > {code} > Object of type > java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=2014,MONTH=3,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=22,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=0,HOUR_OF_DAY=0,MINUTE=0,SECOND=0,MILLISECOND=4,ZONE_OFFSET=?,DST_OFFSET=?] > cannot be used > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2387) Remove the stage barrier for better resource utilization
[ https://issues.apache.org/jira/browse/SPARK-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073597#comment-14073597 ] Kay Ousterhout commented on SPARK-2387: --- Have you done experiments to understand how much this improves performance? With Hadoop MapReduce, I've seen this behavior significantly worsen performance for a few reasons. Ultimately, the problem is that the "reduce" stage (the one the depends on the shuffle map stage) can't finish until all of the map tasks finish. So, if there is a long map straggler, the reduce tasks can't finish anyway -- and now many more slots are hogged by the early reducers, preventing other jobs from making progress. Even worse, if reduce tasks are launched before all map tasks have been launched, the early reducers keep map tasks from being launched, but can end up stopped waiting for input from mappers that haven't completed yet. (Although I didn't look closely at PR1328 so I'm not sure if the latter issue was explicitly prevented in your pull request.) As a result of the above issues, I've heard that many places (I think Facebook, for example) disable this behavior in Hadoop. So, we should make sure this will not hurt performance (and will significantly help!) before adding a lot of complexity to Spark in order to implement it. > Remove the stage barrier for better resource utilization > > > Key: SPARK-2387 > URL: https://issues.apache.org/jira/browse/SPARK-2387 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Rui Li > > DAGScheduler divides a Spark job into multiple stages according to RDD > dependencies. Whenever there’s a shuffle dependency, DAGScheduler creates a > shuffle map stage on the map side, and another stage depending on that stage. > Currently, the downstream stage cannot start until all its depended stages > have finished. This barrier between stages leads to idle slots when waiting > for the last few upstream tasks to finish and thus wasting cluster resources. > Therefore we propose to remove the barrier and pre-start the reduce stage > once there're free slots. This can achieve better resource utilization and > improve the overall job performance, especially when there're lots of > executors granted to the application. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2387) Remove the stage barrier for better resource utilization
[ https://issues.apache.org/jira/browse/SPARK-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073597#comment-14073597 ] Kay Ousterhout edited comment on SPARK-2387 at 7/24/14 8:23 PM: Have you done experiments to understand how much this improves performance? With Hadoop MapReduce, I've seen this behavior significantly worsen performance for a few reasons. Ultimately, the problem is that the "reduce" stage (the one the depends on the shuffle map stage) can't finish until all of the map tasks finish. So, if there is a long map straggler, the reduce tasks can't finish anyway -- and now many more slots are hogged by the early reducers, preventing other jobs from making progress. Even worse, if reduce tasks are launched before all map tasks have been launched, the early reducers keep map tasks from being launched, but can end up stopped waiting for input from mappers that haven't completed yet. (Although it looks like your pull request is done in a way that tries to avoid the latter problem.) As a result of the above issues, I've heard that many places (I think Facebook, for example) disable this behavior in Hadoop. So, we should make sure this will not hurt performance (and will significantly help!) before adding a lot of complexity to Spark in order to implement it. was (Author: kayousterhout): Have you done experiments to understand how much this improves performance? With Hadoop MapReduce, I've seen this behavior significantly worsen performance for a few reasons. Ultimately, the problem is that the "reduce" stage (the one the depends on the shuffle map stage) can't finish until all of the map tasks finish. So, if there is a long map straggler, the reduce tasks can't finish anyway -- and now many more slots are hogged by the early reducers, preventing other jobs from making progress. Even worse, if reduce tasks are launched before all map tasks have been launched, the early reducers keep map tasks from being launched, but can end up stopped waiting for input from mappers that haven't completed yet. (Although I didn't look closely at PR1328 so I'm not sure if the latter issue was explicitly prevented in your pull request.) As a result of the above issues, I've heard that many places (I think Facebook, for example) disable this behavior in Hadoop. So, we should make sure this will not hurt performance (and will significantly help!) before adding a lot of complexity to Spark in order to implement it. > Remove the stage barrier for better resource utilization > > > Key: SPARK-2387 > URL: https://issues.apache.org/jira/browse/SPARK-2387 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Rui Li > > DAGScheduler divides a Spark job into multiple stages according to RDD > dependencies. Whenever there’s a shuffle dependency, DAGScheduler creates a > shuffle map stage on the map side, and another stage depending on that stage. > Currently, the downstream stage cannot start until all its depended stages > have finished. This barrier between stages leads to idle slots when waiting > for the last few upstream tasks to finish and thus wasting cluster resources. > Therefore we propose to remove the barrier and pre-start the reduce stage > once there're free slots. This can achieve better resource utilization and > improve the overall job performance, especially when there're lots of > executors granted to the application. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2250) show stage RDDs in UI
[ https://issues.apache.org/jira/browse/SPARK-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2250: --- Assignee: Neville Li > show stage RDDs in UI > - > > Key: SPARK-2250 > URL: https://issues.apache.org/jira/browse/SPARK-2250 > Project: Spark > Issue Type: New Feature > Components: Web UI >Reporter: Neville Li >Assignee: Neville Li >Priority: Minor > Fix For: 1.1.0 > > > RDDs of each stage can be accessed from StageInfo#rddInfos. It'd be nice to > show them in the UI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2250) show stage RDDs in UI
[ https://issues.apache.org/jira/browse/SPARK-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2250. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1188 [https://github.com/apache/spark/pull/1188] > show stage RDDs in UI > - > > Key: SPARK-2250 > URL: https://issues.apache.org/jira/browse/SPARK-2250 > Project: Spark > Issue Type: New Feature > Components: Web UI >Reporter: Neville Li >Priority: Minor > Fix For: 1.1.0 > > > RDDs of each stage can be accessed from StageInfo#rddInfos. It'd be nice to > show them in the UI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2677) BasicBlockFetchIterator#next can be wait forever
Kousuke Saruta created SPARK-2677: - Summary: BasicBlockFetchIterator#next can be wait forever Key: SPARK-2677 URL: https://issues.apache.org/jira/browse/SPARK-2677 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Kousuke Saruta Priority: Critical In BasicBlockFetchIterator#next, it waits for a fetch result on results.take(). {code} override def next(): (BlockId, Option[Iterator[Any]]) = { resultsGotten += 1 val startFetchWait = System.currentTimeMillis() val result = results.take() val stopFetchWait = System.currentTimeMillis() _fetchWaitTime += (stopFetchWait - startFetchWait) if (! result.failed) bytesInFlight -= result.size while (!fetchRequests.isEmpty && (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) { sendRequest(fetchRequests.dequeue()) } (result.blockId, if (result.failed) None else Some(result.deserialize())) } {code} But results is implemented as a LinkedBlockingQueue, so if a remote executor hangs up, the fetching Executor waits forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2677) BasicBlockFetchIterator#next can wait forever
[ https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2677: -- Summary: BasicBlockFetchIterator#next can wait forever (was: BasicBlockFetchIterator#next can be wait forever) > BasicBlockFetchIterator#next can wait forever > - > > Key: SPARK-2677 > URL: https://issues.apache.org/jira/browse/SPARK-2677 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Kousuke Saruta >Priority: Critical > > In BasicBlockFetchIterator#next, it waits fetch result on result.take. > {code} > override def next(): (BlockId, Option[Iterator[Any]]) = { > resultsGotten += 1 > val startFetchWait = System.currentTimeMillis() > val result = results.take() > val stopFetchWait = System.currentTimeMillis() > _fetchWaitTime += (stopFetchWait - startFetchWait) > if (! result.failed) bytesInFlight -= result.size > while (!fetchRequests.isEmpty && > (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= > maxBytesInFlight)) { > sendRequest(fetchRequests.dequeue()) > } > (result.blockId, if (result.failed) None else > Some(result.deserialize())) > } > {code} > But, results is implemented as LinkedBlockingQueue so if remote executor hang > up, fetching Executor waits forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
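One hedged way to address the unbounded wait, sketched below, is to replace results.take() with a poll that times out and surfaces a failure. The timeout value and error handling are illustrative, not the actual fix:
{code}
import java.util.concurrent.TimeUnit

// Sketch: bound the wait instead of blocking forever on results.take().
val fetchTimeoutSeconds = 60L   // illustrative value; would come from a config in practice
val result = results.poll(fetchTimeoutSeconds, TimeUnit.SECONDS)
if (result == null) {
  throw new org.apache.spark.SparkException(
    "Timed out waiting for a fetch result after " + fetchTimeoutSeconds + " seconds")
}
{code}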
[jira] [Commented] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing
[ https://issues.apache.org/jira/browse/SPARK-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073784#comment-14073784 ] koert kuipers commented on SPARK-1855: -- i think this makes sense. we have iterative queries that should be very quick. in case of machine failure i am ok if query fails, we will simply repeat. so i do not care about checkpoint to disk in this situation. but i do care about checkpoint to memory to cut my dependencies, which means they get garbage collected and cached rdds get cleaned up. > Provide memory-and-local-disk RDD checkpointing > --- > > Key: SPARK-1855 > URL: https://issues.apache.org/jira/browse/SPARK-1855 > Project: Spark > Issue Type: New Feature > Components: MLlib, Spark Core >Affects Versions: 1.0.0 >Reporter: Xiangrui Meng > > Checkpointing is used to cut long lineage while maintaining fault tolerance. > The current implementation is HDFS-based. Using the BlockRDD we can create > in-memory-and-local-disk (with replication) checkpoints that are not as > reliable as HDFS-based solution but faster. > It can help applications that require many iterations. -- This message was sent by Atlassian JIRA (v6.2#6252)
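For contrast, the HDFS-based API the proposal would complement looks like this today (standard RDD API; the checkpoint directory and iteration counts are example values only):
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("CheckpointExample"))
sc.setCheckpointDir("hdfs://namenode:8020/checkpoints")   // today's reliable-but-slow path

var rdd = sc.parallelize(1 to 1000000).map(_ * 2)
for (i <- 1 to 50) {
  rdd = rdd.map(_ + 1).persist()
  if (i % 10 == 0) {
    rdd.checkpoint()   // cuts lineage; the proposal would let this data stay in memory/local disk
    rdd.count()        // force materialization so the checkpoint actually happens
  }
}
{code}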
[jira] [Created] (SPARK-2678) `Spark-submit` overrides user application options
Cheng Lian created SPARK-2678: - Summary: `Spark-submit` overrides user application options Key: SPARK-2678 URL: https://issues.apache.org/jira/browse/SPARK-2678 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Priority: Minor Here is an example: {code} ./bin/spark-submit --class Foo some.jar --help {code} Since {{--help}} appears behind the primary resource (i.e. {{some.jar}}), it should be recognized as a user application option. But it's actually overridden by {{spark-submit}} and will show the {{spark-submit}} help message. When directly invoking {{spark-submit}}, the constraints here are: # Options before the primary resource should be recognized as {{spark-submit}} options # Options after the primary resource should be recognized as user application options The tricky part is how to handle scripts like {{spark-shell}} that delegate to {{spark-submit}}. These scripts allow users to specify both {{spark-submit}} options like {{--master}} and user-defined application options together. For example, say we'd like to write a new script {{start-thriftserver.sh}} to start the Hive Thrift server; basically we may do this: {code} $SPARK_HOME/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal $@ {code} Then users may call this script like: {code} ./sbin/start-thriftserver.sh --master spark://some-host:7077 --hiveconf key=value {code} Notice that all options are captured by {{$@}}. If we put it before {{spark-internal}}, they are all recognized as {{spark-submit}} options, thus {{--hiveconf}} won't be passed to {{HiveThriftServer2}}; if we put it after {{spark-internal}}, they *should* all be recognized as options of {{HiveThriftServer2}}, but because of this bug, {{--master}} is still recognized as a {{spark-submit}} option and happens to lead to the right behavior. Although currently all scripts using {{spark-submit}} work correctly, we should still fix this bug, because it causes option name collisions between {{spark-submit}} and user applications, and every time we add a new option to {{spark-submit}}, some existing user applications may break. However, solving this bug may cause some incompatible changes. The suggested solution here is using {{--}} as the separator between {{spark-submit}} options and user application options. For the Hive Thrift server example above, users should call it this way: {code} ./sbin/start-thriftserver.sh --master spark://some-host:7077 -- --hiveconf key=value {code} And {{SparkSubmitArguments}} should be responsible for splitting the two sets of options and passing them on correctly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2678) `Spark-submit` overrides user application options
[ https://issues.apache.org/jira/browse/SPARK-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-2678: -- Priority: Major (was: Minor) > `Spark-submit` overrides user application options > - > > Key: SPARK-2678 > URL: https://issues.apache.org/jira/browse/SPARK-2678 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.1, 1.0.2 >Reporter: Cheng Lian > > Here is an example: > {code} > ./bin/spark-submit --class Foo some.jar --help > {code} > SInce {{--help}} appears behind the primary resource (i.e. {{some.jar}}), it > should be recognized as a user application option. But it's actually > overriden by {{spark-submit}} and will show {{spark-submit}} help message. > When directly invoking {{spark-submit}}, the constraints here are: > # Options before primary resource should be recognized as {{spark-submit}} > options > # Options after primary resource should be recognized as user application > options > The tricky part is how to handle scripts like {{spark-shell}} that delegate > {{spark-submit}}. These scripts allow users specify both {{spark-submit}} > options like {{--master}} and user defined application options together. For > example, say we'd like to write a new script {{start-thriftserver.sh}} to > start the Hive Thrift server, basically we may do this: > {code} > $SPARK_HOME/bin/spark-submit --class > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal $@ > {code} > Then user may call this script like: > {code} > ./sbin/start-thriftserver.sh --master spark://some-host:7077 --hiveconf > key=value > {code} > Notice that all options are captured by {{$@}}. If we put it before > {{spark-internal}}, they are all recognized as {{spark-submit}} options, thus > {{--hiveconf}} won't be passed to {{HiveThriftServer2}}; if we put it after > {{spark-internal}}, they *should* all be recognized as options of > {{HiveThriftServer2}}, but because of this bug, {{--master}} is still > recognized as {{spark-submit}} option and leads to the right behavior. > Although currently all scripts using {{spark-submit}} work correctly, we > still should fix this bug, because it causes option name collision between > {{spark-submit}} and user application, and every time we add a new option to > {{spark-submit}}, some existing user applications may break. However, solving > this bug may cause some incompatible changes. > The suggested solution here is using {{--}} as separator of {{spark-submit}} > options and user application options. For the Hive Thrift server example > above, user should call it in this way: > {code} > ./sbin/start-thriftserver.sh --master spark://some-host:7077 -- --hiveconf > key=value > {code} > And {{SparkSubmitArguments}} should be responsible for splitting two sets of > options and pass them correctly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2679) Ser/De for Double to enable calling Java API from python in MLlib
Doris Xin created SPARK-2679: Summary: Ser/De for Double to enable calling Java API from python in MLlib Key: SPARK-2679 URL: https://issues.apache.org/jira/browse/SPARK-2679 Project: Spark Issue Type: Sub-task Reporter: Doris Xin In order to enable Java/Scala APIs to be reused in the Python implementation of RandomRDD and Correlations, we need a set of ser/de for the type Double in _common.py and PythonMLLibAPI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2679) Ser/De for Double to enable calling Java API from python in MLlib
[ https://issues.apache.org/jira/browse/SPARK-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073833#comment-14073833 ] Apache Spark commented on SPARK-2679: - User 'dorx' has created a pull request for this issue: https://github.com/apache/spark/pull/1581 > Ser/De for Double to enable calling Java API from python in MLlib > - > > Key: SPARK-2679 > URL: https://issues.apache.org/jira/browse/SPARK-2679 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Doris Xin > > In order to enable Java/Scala APIs to be reused in the Python implementation > of RandomRDD and Correlations, we need a set of ser/de for the type Double in > _common.py and PythonMLLibAPI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2298) Show stage attempt in UI
[ https://issues.apache.org/jira/browse/SPARK-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2298: --- Priority: Critical (was: Major) > Show stage attempt in UI > > > Key: SPARK-2298 > URL: https://issues.apache.org/jira/browse/SPARK-2298 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Reynold Xin >Assignee: Masayoshi TSUZUKI >Priority: Critical > Attachments: Screen Shot 2014-06-25 at 4.54.46 PM.png > > > We should add a column to the web ui to show stage attempt id. Then tasks > should be grouped by (stageId, stageAttempt) tuple. > When a stage is resubmitted (e.g. due to fetch failures), we should get a > different entry in the web ui and tasks for the resubmission go there. > See the attached screenshot for the confusing status quo. We currently show > the same stage entry twice, and then tasks appear in both. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2515) Hypothesis testing
[ https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073879#comment-14073879 ] Doris Xin commented on SPARK-2515: -- Here's the proposed API for chi-squared tests (lives in org.apache.spark.mllib.stat.Statistics): {code} def chiSquare(X: RDD[Vector], method: String = “pearson”): ChiSquareTestResult def chiSquare(x: RDD[Double], y: RDD[Double], method: String = “pearson”): ChiSquareTestResult {code} where ChiSquareTestResult <: TestResult looks like: {code} pValue: Double df: Array[Int] //normally a single but need to be more for anova statistic: Double ChiSquareSummary <: Summary {code} So a couple points of discussion: 1. Of the many variants of the chi-squared test, what methods in addition to "pearson" do we want to support (hopefully based on popular demand)? http://en.wikipedia.org/wiki/Chi-squared_test 2. What special fields should ChiSquareSummary have? > Hypothesis testing > -- > > Key: SPARK-2515 > URL: https://issues.apache.org/jira/browse/SPARK-2515 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Doris Xin > > Support common statistical tests in Spark MLlib. -- This message was sent by Atlassian JIRA (v6.2#6252)
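Assuming the proposed signatures above, usage would look roughly like the following (hedged sketch only, since the API was still under discussion; sc is an existing SparkContext):
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 9.0)))

// Proposed call; method name, default, and result fields as sketched in the comment above.
val result = Statistics.chiSquare(observations, method = "pearson")
println(s"p-value: ${result.pValue}, statistic: ${result.statistic}")
{code}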
[jira] [Resolved] (SPARK-2464) Twitter Receiver does not stop correctly when streamingContext.stop is called
[ https://issues.apache.org/jira/browse/SPARK-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-2464. -- Resolution: Fixed > Twitter Receiver does not stop correctly when streamingContext.stop is called > - > > Key: SPARK-2464 > URL: https://issues.apache.org/jira/browse/SPARK-2464 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.0.0, 1.0.1 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2014) Make PySpark store RDDs in MEMORY_ONLY_SER with compression by default
[ https://issues.apache.org/jira/browse/SPARK-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2014. -- Resolution: Fixed Fix Version/s: 1.1.0 > Make PySpark store RDDs in MEMORY_ONLY_SER with compression by default > -- > > Key: SPARK-2014 > URL: https://issues.apache.org/jira/browse/SPARK-2014 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Matei Zaharia >Assignee: Prashant Sharma > Fix For: 1.1.0 > > > Since the data is serialized on the Python side, there's not much point in > keeping it as byte arrays in Java, or even in skipping compression. We should > make cache() in PySpark use MEMORY_ONLY_SER and turn on spark.rdd.compress > for it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1044) Default spark logs location in EC2 AMI leads to out-of-disk space pretty soon
[ https://issues.apache.org/jira/browse/SPARK-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073930#comment-14073930 ] Andrew Ash commented on SPARK-1044: --- Filling up the work dir could be alleviated by fixing https://issues.apache.org/jira/browse/SPARK-1860 so we could enable worker dir cleanup automatically. If we had automatic worker dir cleanup, would you still want to move the work directory to somewhere else? > Default spark logs location in EC2 AMI leads to out-of-disk space pretty soon > - > > Key: SPARK-1044 > URL: https://issues.apache.org/jira/browse/SPARK-1044 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Tathagata Das >Priority: Minor > > The default log location is SPARK_HOME/work/ and this leads to disk space > running out pretty quickly. The spark-ec2 scripts should configure the > cluster to automatically set the logging directory to /mnt/spark-work/ or > something like that on the mounted disks. The SPARK_HOME/work may also be > symlinked to that directory to maintain the existing setup. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-786) Clean up old work directories in standalone worker
[ https://issues.apache.org/jira/browse/SPARK-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073932#comment-14073932 ] Andrew Ash commented on SPARK-786: -- Agreed. With SPARK-1860 we could re-enable the features from that PR by default and be good here (it was disabled after it had negative effects with long-running transactions). I think this ticket can be closed as a dupe of that one. > Clean up old work directories in standalone worker > -- > > Key: SPARK-786 > URL: https://issues.apache.org/jira/browse/SPARK-786 > Project: Spark > Issue Type: New Feature > Components: Deploy >Affects Versions: 0.7.2 >Reporter: Matei Zaharia > > We should add a setting to clean old work directories after X days. > Otherwise, the directory gets filled forever with shuffle files and logs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1044) Default spark logs location in EC2 AMI leads to out-of-disk space pretty soon
[ https://issues.apache.org/jira/browse/SPARK-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073938#comment-14073938 ] Allan Douglas R. de Oliveira commented on SPARK-1044: - I think it is still a good idea even if the automatic cleanup is implemented. One large job or many small jobs can fill many gigabytes before the cleanup can kick in. > Default spark logs location in EC2 AMI leads to out-of-disk space pretty soon > - > > Key: SPARK-1044 > URL: https://issues.apache.org/jira/browse/SPARK-1044 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Tathagata Das >Priority: Minor > > The default log location is SPARK_HOME/work/ and this leads to disk space > running out pretty quickly. The spark-ec2 scripts should configure the > cluster to automatically set the logging directory to /mnt/spark-work/ or > something like that on the mounted disks. The SPARK_HOME/work may also be > symlinked to that directory to maintain the existing setup. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1030) unneeded file required when running pyspark program using yarn-client
[ https://issues.apache.org/jira/browse/SPARK-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-1030. --- Resolution: Fixed Fix Version/s: 1.0.0 Closing this now, since it was addressed as part of Spark 1.0's PySpark on YARN patches (including SPARK-1004). > unneeded file required when running pyspark program using yarn-client > - > > Key: SPARK-1030 > URL: https://issues.apache.org/jira/browse/SPARK-1030 > Project: Spark > Issue Type: Bug > Components: Deploy, PySpark, YARN >Affects Versions: 0.8.1 >Reporter: Diana Carroll >Assignee: Josh Rosen > Fix For: 1.0.0 > > > I can successfully run a pyspark program using the yarn-client master using > the following command: > {code} > SPARK_JAR=$SPARK_HOME/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.1-incubating-hadoop2.2.0.jar > \ > SPARK_YARN_APP_JAR=~/testdata.txt pyspark \ > test1.py > {code} > However, the SPARK_YARN_APP_JAR doesn't make any sense; it's a Python > program, and therefore there's no JAR. If I don't set the value, or if I set > the value to a non-existent files, Spark gives me an error message. > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext. > : org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:46) > {code} > or > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext. > : java.io.FileNotFoundException: File file:dummy.txt does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520) > {code} > My program is very simple: > {code} > from pyspark import SparkContext > def main(): > sc = SparkContext("yarn-client", "Simple App") > logData = > sc.textFile("hdfs://localhost/user/training/weblogs/2013-09-15.log") > numjpgs = logData.filter(lambda s: '.jpg' in s).count() > print "Number of JPG requests: " + str(numjpgs) > {code} > Although it reads the SPARK_YARN_APP_JAR file, it doesn't use the file at > all; I can point it at anything, as long as it's a valid, accessible file, > and it works the same. > Although there's an obvious workaround for this bug, it's high priority from > my perspective because I'm working on a course to teach people how to do > this, and it's really hard to explain why this variable is needed! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2680) Lower spark.shuffle.memoryFraction to 0.2 by default
Matei Zaharia created SPARK-2680: Summary: Lower spark.shuffle.memoryFraction to 0.2 by default Key: SPARK-2680 URL: https://issues.apache.org/jira/browse/SPARK-2680 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Priority: Minor Fix For: 1.1.0 Seems like it's good to be more conservative on this, in particular to try to fit all the data in the young gen. People can always increase it for performance. -- This message was sent by Atlassian JIRA (v6.2#6252)
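For users who want the previous behavior back after the default drops, it remains a one-line override (hedged example; 0.3 was the prior default this ticket proposes lowering):
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ShuffleHeavyJob")
  .set("spark.shuffle.memoryFraction", "0.3")   // restore the old default for shuffle-bound jobs
{code}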
[jira] [Updated] (SPARK-2529) Clean the closure in foreach and foreachPartition
[ https://issues.apache.org/jira/browse/SPARK-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2529: - Target Version/s: 1.1.0, 1.0.3 (was: 1.1.0, 1.0.2) > Clean the closure in foreach and foreachPartition > - > > Key: SPARK-2529 > URL: https://issues.apache.org/jira/browse/SPARK-2529 > Project: Spark > Issue Type: Bug >Reporter: Reynold Xin > > Somehow we didn't clean the closure for foreach and foreachPartition. Should > do that. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2531) Make BroadcastNestedLoopJoin take into account a BuildSide
[ https://issues.apache.org/jira/browse/SPARK-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2531: - Target Version/s: 1.1.0, 1.0.3 (was: 1.1.0, 1.0.2) > Make BroadcastNestedLoopJoin take into account a BuildSide > -- > > Key: SPARK-2531 > URL: https://issues.apache.org/jira/browse/SPARK-2531 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.0.1 >Reporter: Zongheng Yang >Assignee: Zongheng Yang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2548) JavaRecoverableWordCount is missing
[ https://issues.apache.org/jira/browse/SPARK-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2548: - Target Version/s: 1.1.0, 0.9.3, 1.0.3 (was: 1.1.0, 1.0.2, 0.9.3) > JavaRecoverableWordCount is missing > --- > > Key: SPARK-2548 > URL: https://issues.apache.org/jira/browse/SPARK-2548 > Project: Spark > Issue Type: Bug > Components: Documentation, Streaming >Affects Versions: 0.9.2, 1.0.1 >Reporter: Xiangrui Meng >Priority: Minor > > JavaRecoverableWordCount was mentioned in the doc but not in the codebase. We > need to rewrite the example because the code was lost during the migration > from spark/spark-incubating to apache/spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2506) In yarn-cluster mode, ApplicationMaster does not clean up correctly at the end of the job if users call sc.stop manually
[ https://issues.apache.org/jira/browse/SPARK-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2506: - Target Version/s: 1.0.3 (was: 1.0.2) > In yarn-cluster mode, ApplicationMaster does not clean up correctly at the > end of the job if users call sc.stop manually > > > Key: SPARK-2506 > URL: https://issues.apache.org/jira/browse/SPARK-2506 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core, YARN >Affects Versions: 1.0.1 >Reporter: uncleGen >Priority: Minor > > when i call sc.stop manually, some strange ERRORs will appear: > 1. in driver log: > INFO [Thread-116] YarnAllocationHandler: Completed container > container_1400565786114_79510_01_41 (state: COMPLETE, exit status: 0) > WARN [Thread-4] BlockManagerMaster: Error sending message to > BlockManagerMaster in 3 attempts > akka.pattern.AskTimeoutException: > Recipient[Actor[akka://spark/user/BlockManagerMaster#1994513092]] had already > been terminated. > at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134) > at > org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:236) > at > org.apache.spark.storage.BlockManagerMaster.tell(BlockManagerMaster.scala:216) > at > org.apache.spark.storage.BlockManagerMaster.stop(BlockManagerMaster.scala:208) > at org.apache.spark.SparkEnv.stop(SparkEnv.scala:86) > at org.apache.spark.SparkContext.stop(SparkContext.scala:993) > at TestWeibo$.main(TestWeibo.scala:46) > at TestWeibo.main(TestWeibo.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:192) > INFO [Thread-116] ApplicationMaster: Allocating 1 containers to make up for > (potentially) lost containers > INFO [Thread-116] YarnAllocationHandler: Will Allocate 1 executor containers, > each with 9600 memory > 2: in executor log: > WARN [Connection manager future execution context-13] BlockManagerMaster: > Error sending message to BlockManagerMaster in 1 attempts > java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at > org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:237) > at > org.apache.spark.storage.BlockManagerMaster.sendHeartBeat(BlockManagerMaster.scala:51) > at > org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$heartBeat(BlockManager.scala:113) > at > org.apache.spark.storage.BlockManager$$anonfun$initialize$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(BlockManager.scala:158) > at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:790) > at > org.apache.spark.storage.BlockManager$$anonfun$initialize$1.apply$mcV$sp(BlockManager.scala:158) > at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80) > at > akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > at java.lang.Thread.run(Thread.java:662) > WARN [Connection manager future execution context-13] BlockManagerMaster: > Error sending message to BlockManagerMaster in 2 attempts > java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at > org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:237) > at > org.apache.spark.storage.BlockManagerMaster.sendHeartBeat(BlockManagerMaster.scala:51)
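For context, the failure mode above is triggered by a driver program that calls sc.stop() itself at the end of main() in yarn-cluster mode. A minimal sketch of that pattern follows; the object name and input path are placeholders for illustration, not taken from the original report:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver illustrating the reported scenario: the application stops the
// SparkContext itself, and the ApplicationMaster's own shutdown path then races
// against the already-terminated BlockManagerMaster, producing the WARN/ERROR
// messages shown in the logs above.
object ManualStopApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ManualStopApp"))
    val lines = sc.textFile(args(0)).count()
    println(s"line count: $lines")
    sc.stop() // explicit stop before main() returns, as in the report
  }
}
{code}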
[jira] [Updated] (SPARK-1667) Jobs never finish successfully once bucket file missing occurred
[ https://issues.apache.org/jira/browse/SPARK-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1667: - Target Version/s: 1.1.0, 1.0.3 (was: 1.1.0, 1.0.2) > Jobs never finish successfully once bucket file missing occurred > > > Key: SPARK-1667 > URL: https://issues.apache.org/jira/browse/SPARK-1667 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.0.0 >Reporter: Kousuke Saruta > > If a job executes a shuffle, bucket files are created in a temporary directory > (named like spark-local-*). > When those bucket files go missing, caused by a disk failure or any other reason, jobs > can no longer execute the shuffle that has the same shuffle id as the missing bucket files. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2558) Mention --queue argument in YARN documentation
[ https://issues.apache.org/jira/browse/SPARK-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2558: - Target Version/s: 1.1.0, 1.0.3 (was: 1.1.0, 1.0.2) > Mention --queue argument in YARN documentation > --- > > Key: SPARK-2558 > URL: https://issues.apache.org/jira/browse/SPARK-2558 > Project: Spark > Issue Type: Documentation > Components: YARN >Reporter: Matei Zaharia >Priority: Trivial > Labels: Starter > > The docs about it went away when we updated the page to spark-submit. -- This message was sent by Atlassian JIRA (v6.2#6252)
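For reference, the flag being discussed is spark-submit's --queue option for YARN. A typical invocation might look like the following; the application class, jar, and queue name are placeholders, not part of the original issue:
{code}
# Submit to a specific YARN scheduler queue instead of the default queue.
./bin/spark-submit \
  --class com.example.MyApp \
  --master yarn-cluster \
  --queue production \
  my-app.jar
{code}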
[jira] [Updated] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file
[ https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2576: - Target Version/s: 1.1.0, 1.0.3 (was: 1.1.0, 1.0.2) > slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark > QL query on HDFS CSV file > -- > > Key: SPARK-2576 > URL: https://issues.apache.org/jira/browse/SPARK-2576 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.0.1 > Environment: One Mesos 0.19 master without zookeeper and 4 mesos > slaves. > JDK 1.7.51 and Scala 2.10.4 on all nodes. > HDFS from CDH5.0.3 > Spark version: I tried both with the pre-built CDH5 spark package available > from http://spark.apache.org/downloads.html and by packaging spark with sbt > 0.13.2, JDK 1.7.51 and scala 2.10.4 as explained here > http://mesosphere.io/learn/run-spark-on-mesos/ > All nodes are running Debian 3.2.51-1 x86_64 GNU/Linux and have >Reporter: Svend Vanderveken >Assignee: Yin Huai >Priority: Blocker > > Execution of SQL query against HDFS systematically throws a class not found > exception on slave nodes when executing . > (this was originally reported on the user list: > http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html) > Sample code (ran from spark-shell): > {code} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.createSchemaRDD > case class Car(timestamp: Long, objectid: String, isGreen: Boolean) > // I get the same error when pointing to the folder > "hdfs://vm28:8020/test/cardata" > val data = sc.textFile("hdfs://vm28:8020/test/cardata/part-0") > val cars = data.map(_.split(",")).map ( ar => Car(ar(0).toLong, ar(1), > ar(2).toBoolean)) > cars.registerAsTable("mcars") > val allgreens = sqlContext.sql("SELECT objectid from mcars where isGreen = > true") > allgreens.collect.take(10).foreach(println) > {code} > Stack trace on the slave nodes: > {code} > I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0 > I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave > 20140714-142853-485682442-5050-25487-2 > 14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as > executor ID 20140714-142853-485682442-5050-25487-2 > 14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: > mesos,mnubohadoop > 14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(mesos, > mnubohadoop) > 14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started > 14/07/16 13:01:17 INFO Remoting: Starting remoting > 14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://spark@vm23:38230] > 14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: > [akka.tcp://spark@vm23:38230] > 14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: > akka.tcp://spark@vm28:41632/user/MapOutputTracker > 14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: > akka.tcp://spark@vm28:41632/user/BlockManagerMaster > 14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at > /tmp/spark-local-20140716130117-8ea0 > 14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 > MB. 
> 14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id > = ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501) > 14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager > 14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager > 14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is > /tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633 > 14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server > 14/07/16 13:01:18 INFO Executor: Using REPL class URI: http://vm28:33973 > 14/07/16 13:01:18 INFO Executor: Running task ID 2 > 14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0 > 14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with > curMem=0, maxMem=309225062 > 14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to > memory (estimated size 122.6 KB, free 294.8 MB) > 14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took > 0.294602722 s > 14/07/16 13:01:19 INFO HadoopRDD: Input split: > hdfs://vm28:8020/test/cardata/part-0:23960450+23960451 > I0716 13:01:19.905113 13657 exec.cpp:378] Executor asked to shutdown > 14/07/16 13:01:20 ERROR Executor: Exception in task ID 2 > java.lang.NoClassDefFoundError: $line11/$read$ > at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19) > at $line12.$read$$iwC$$
[jira] [Updated] (SPARK-2425) Standalone Master is too aggressive in removing Applications
[ https://issues.apache.org/jira/browse/SPARK-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2425: - Target Version/s: 1.0.3 (was: 1.0.2) > Standalone Master is too aggressive in removing Applications > > > Key: SPARK-2425 > URL: https://issues.apache.org/jira/browse/SPARK-2425 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Mark Hamstra >Assignee: Mark Hamstra > > When standalone Executors trying to run a particular Application fail a > cumulative ApplicationState.MAX_NUM_RETRY times, the Master will remove the > Application. This holds even if a number of Executors are still > successfully running the Application. This makes > long-running standalone-mode Applications in particular unnecessarily > vulnerable to limited failures in the cluster -- e.g., a single bad node on > which Executors repeatedly fail for any reason can prevent an Application > from starting, or can result in a running Application being removed even > though it could continue to run successfully (just not making use of all > potential Workers and Executors). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2541) Standalone mode can't access secure HDFS anymore
[ https://issues.apache.org/jira/browse/SPARK-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2541: - Target Version/s: 1.1.0, 1.0.3 (was: 1.1.0, 1.0.2) > Standalone mode can't access secure HDFS anymore > > > Key: SPARK-2541 > URL: https://issues.apache.org/jira/browse/SPARK-2541 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0, 1.0.1 >Reporter: Thomas Graves > > In Spark 0.9.x you could access secure HDFS from a standalone deployment; that > no longer works in 1.x. > It looks like the issue is in SparkHadoopUtil.runAsSparkUser. Previously it > wouldn't do the doAs if currentUser == user. Not sure how this behaves > when the daemons run as a superuser but SPARK_USER is set to someone else. -- This message was sent by Atlassian JIRA (v6.2#6252)
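To make the suspected regression concrete, here is a minimal sketch of the kind of guard the description refers to, written against Hadoop's UserGroupInformation API. This is illustrative only, under the assumption that skipping doAs preserves the already-authenticated Kerberos UGI; it is not the actual SparkHadoopUtil.runAsSparkUser code:
{code}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// Sketch: only wrap the body in doAs when impersonating a *different* user.
def runAsUser(user: String)(body: => Unit): Unit = {
  val current = UserGroupInformation.getCurrentUser
  if (user == null || user == current.getShortUserName) {
    // Pre-1.x behaviour described above: keep the current (possibly
    // Kerberos-authenticated) UGI and its tokens intact.
    body
  } else {
    val proxy = UserGroupInformation.createRemoteUser(user)
    proxy.addCredentials(current.getCredentials) // carry over delegation tokens
    proxy.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = body
    })
  }
}
{code}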
[jira] [Commented] (SPARK-2529) Clean the closure in foreach and foreachPartition
[ https://issues.apache.org/jira/browse/SPARK-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073964#comment-14073964 ] Apache Spark commented on SPARK-2529: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/1583 > Clean the closure in foreach and foreachPartition > - > > Key: SPARK-2529 > URL: https://issues.apache.org/jira/browse/SPARK-2529 > Project: Spark > Issue Type: Bug >Reporter: Reynold Xin > > Somehow we didn't clean the closure for foreach and foreachPartition. Should > do that. -- This message was sent by Atlassian JIRA (v6.2#6252)
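For readers following along, the change amounts to passing the user function through the closure cleaner before the job runs, the same way other RDD actions already do. A rough sketch of the two methods inside RDD[T] (not necessarily the exact patch in the linked PR):
{code}
// Inside org.apache.spark.rdd.RDD[T] -- sketch only, not the exact patch.
def foreach(f: T => Unit): Unit = {
  val cleanF = sc.clean(f) // remove unused outer references so the closure serializes
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}

def foreachPartition(f: Iterator[T] => Unit): Unit = {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}
{code}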
[jira] [Updated] (SPARK-2668) Support log4j log to yarn container log directory
[ https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated SPARK-2668: -- Fix Version/s: (was: 1.0.0) > Support log4j log to yarn container log directory > - > > Key: SPARK-2668 > URL: https://issues.apache.org/jira/browse/SPARK-2668 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Peng Zhang > > Assign value of yarn container log directory to java opts > "spark.yarn.log.dir", So user defined log4j.properties can reference this > value and write log to YARN container directory. > Otherwise, user defined file append will log to CWD, and files will not be > displayed on YARN UI,and either cannot be aggregated to HDFS log directory > after job finished. > User defined log4j.properties reference example: > {code} > log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2668) Support log4j log to yarn container log directory
[ https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated SPARK-2668: -- Affects Version/s: 1.0.0 > Support log4j log to yarn container log directory > - > > Key: SPARK-2668 > URL: https://issues.apache.org/jira/browse/SPARK-2668 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.0.0 >Reporter: Peng Zhang > > Assign value of yarn container log directory to java opts > "spark.yarn.log.dir", So user defined log4j.properties can reference this > value and write log to YARN container directory. > Otherwise, user defined file append will log to CWD, and files will not be > displayed on YARN UI,and either cannot be aggregated to HDFS log directory > after job finished. > User defined log4j.properties reference example: > {code} > log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2668) Add variable of yarn log diectory to reference from the log4j configuration
[ https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated SPARK-2668: -- Summary: Add variable of yarn log diectory to reference from the log4j configuration (was: Support log4j log to yarn container log directory) > Add variable of yarn log diectory to reference from the log4j configuration > --- > > Key: SPARK-2668 > URL: https://issues.apache.org/jira/browse/SPARK-2668 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.0.0 >Reporter: Peng Zhang > > Assign value of yarn container log directory to java opts > "spark.yarn.log.dir", So user defined log4j.properties can reference this > value and write log to YARN container directory. > Otherwise, user defined file append will log to CWD, and files will not be > displayed on YARN UI,and either cannot be aggregated to HDFS log directory > after job finished. > User defined log4j.properties reference example: > {code} > log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2668) Add variable of yarn log diectory to reference from the log4j configuration
[ https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073982#comment-14073982 ] Peng Zhang commented on SPARK-2668: --- I changed the title to make it clearer. I also found that MapReduce added the variable "-Dyarn.app.mapreduce.container.log.dir" so that it can be referenced from the log4j configuration, so I think Spark on YARN should add an equivalent to solve the same issue. > Add variable of yarn log diectory to reference from the log4j configuration > --- > > Key: SPARK-2668 > URL: https://issues.apache.org/jira/browse/SPARK-2668 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.0.0 >Reporter: Peng Zhang > > Assign value of yarn container log directory to java opts > "spark.yarn.log.dir", So user defined log4j.properties can reference this > value and write log to YARN container directory. > Otherwise, user defined file append will log to CWD, and files will not be > displayed on YARN UI,and either cannot be aggregated to HDFS log directory > after job finished. > User defined log4j.properties reference example: > {code} > log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
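Along the lines of the MapReduce precedent mentioned above, the idea is to expand YARN's per-container log directory into a system property when the container launch command is built. A hedged sketch follows; the property name spark.yarn.log.dir is the one proposed in this issue, and the code is illustrative rather than the merged patch:
{code}
import org.apache.hadoop.yarn.api.ApplicationConstants
import scala.collection.mutable.ListBuffer

// Sketch only: append a -D option so log4j.properties can use ${spark.yarn.log.dir}.
// ApplicationConstants.LOG_DIR_EXPANSION_VAR is the "<LOG_DIR>" token that the
// NodeManager expands to the container's own log directory at launch time.
val javaOpts = new ListBuffer[String]()
javaOpts += "-Dspark.yarn.log.dir=" + ApplicationConstants.LOG_DIR_EXPANSION_VAR
{code}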
[jira] [Updated] (SPARK-2668) Add variable of yarn log diectory to reference from the log4j configuration
[ https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated SPARK-2668: -- Description: Assign value of yarn container log directory to java opts "spark.yarn.log.dir", So user defined log4j.properties can reference this value and write log to YARN container directory. Otherwise, user defined file appender will only write to container's CWD, and log files in CWD will not be displayed on YARN UI,and either cannot be aggregated to HDFS log directory after job finished. User defined log4j.properties reference example: {code} log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log {code} was: Assign value of yarn container log directory to java opts "spark.yarn.log.dir", So user defined log4j.properties can reference this value and write log to YARN container directory. Otherwise, user defined file appender will only write to container's CWD, and files will not be displayed on YARN UI,and either cannot be aggregated to HDFS log directory after job finished. User defined log4j.properties reference example: {code} log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log {code} > Add variable of yarn log diectory to reference from the log4j configuration > --- > > Key: SPARK-2668 > URL: https://issues.apache.org/jira/browse/SPARK-2668 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.0.0 >Reporter: Peng Zhang > > Assign value of yarn container log directory to java opts > "spark.yarn.log.dir", So user defined log4j.properties can reference this > value and write log to YARN container directory. > Otherwise, user defined file appender will only write to container's CWD, and > log files in CWD will not be displayed on YARN UI,and either cannot be > aggregated to HDFS log directory after job finished. > User defined log4j.properties reference example: > {code} > log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2668) Add variable of yarn log diectory to reference from the log4j configuration
[ https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated SPARK-2668: -- Description: Assign value of yarn container log directory to java opts "spark.yarn.log.dir", So user defined log4j.properties can reference this value and write log to YARN container directory. Otherwise, user defined file appender will only write to container's CWD, and files will not be displayed on YARN UI,and either cannot be aggregated to HDFS log directory after job finished. User defined log4j.properties reference example: {code} log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log {code} was: Assign value of yarn container log directory to java opts "spark.yarn.log.dir", So user defined log4j.properties can reference this value and write log to YARN container directory. Otherwise, user defined file append will log to CWD, and files will not be displayed on YARN UI,and either cannot be aggregated to HDFS log directory after job finished. User defined log4j.properties reference example: {code} log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log {code} > Add variable of yarn log diectory to reference from the log4j configuration > --- > > Key: SPARK-2668 > URL: https://issues.apache.org/jira/browse/SPARK-2668 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.0.0 >Reporter: Peng Zhang > > Assign value of yarn container log directory to java opts > "spark.yarn.log.dir", So user defined log4j.properties can reference this > value and write log to YARN container directory. > Otherwise, user defined file appender will only write to container's CWD, and > files will not be displayed on YARN UI,and either cannot be aggregated to > HDFS log directory after job finished. > User defined log4j.properties reference example: > {code} > log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2668) Add variable of yarn log directory to reference from the log4j configuration
[ https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated SPARK-2668: -- Summary: Add variable of yarn log directory to reference from the log4j configuration (was: Add variable of yarn log diectory to reference from the log4j configuration) > Add variable of yarn log directory to reference from the log4j configuration > > > Key: SPARK-2668 > URL: https://issues.apache.org/jira/browse/SPARK-2668 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.0.0 >Reporter: Peng Zhang > > Assign value of yarn container log directory to java opts > "spark.yarn.log.dir", So user defined log4j.properties can reference this > value and write log to YARN container directory. > Otherwise, user defined file appender will only write to container's CWD, and > log files in CWD will not be displayed on YARN UI,and either cannot be > aggregated to HDFS log directory after job finished. > User defined log4j.properties reference example: > {code} > log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2668) Add variable of yarn log directory for reference from the log4j configuration
[ https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated SPARK-2668: -- Summary: Add variable of yarn log directory for reference from the log4j configuration (was: Add variable of yarn log directory to reference from the log4j configuration) > Add variable of yarn log directory for reference from the log4j configuration > - > > Key: SPARK-2668 > URL: https://issues.apache.org/jira/browse/SPARK-2668 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.0.0 >Reporter: Peng Zhang > > Assign value of yarn container log directory to java opts > "spark.yarn.log.dir", So user defined log4j.properties can reference this > value and write log to YARN container directory. > Otherwise, user defined file appender will only write to container's CWD, and > log files in CWD will not be displayed on YARN UI,and either cannot be > aggregated to HDFS log directory after job finished. > User defined log4j.properties reference example: > {code} > log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2668) Add variable of yarn log directory for reference from the log4j configuration
[ https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated SPARK-2668: -- Description: Assign value of yarn container log directory to java opts "spark.yarn.log.dir", So user defined log4j.properties can reference this value and write log to YARN container's log directory. Otherwise, user defined file appender will only write to container's CWD, and log files in CWD will not be displayed on YARN UI,and either cannot be aggregated to HDFS log directory after job finished. User defined log4j.properties reference example: {code} log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log {code} was: Assign value of yarn container log directory to java opts "spark.yarn.log.dir", So user defined log4j.properties can reference this value and write log to YARN container directory. Otherwise, user defined file appender will only write to container's CWD, and log files in CWD will not be displayed on YARN UI,and either cannot be aggregated to HDFS log directory after job finished. User defined log4j.properties reference example: {code} log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log {code} > Add variable of yarn log directory for reference from the log4j configuration > - > > Key: SPARK-2668 > URL: https://issues.apache.org/jira/browse/SPARK-2668 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.0.0 >Reporter: Peng Zhang > > Assign value of yarn container log directory to java opts > "spark.yarn.log.dir", So user defined log4j.properties can reference this > value and write log to YARN container's log directory. > Otherwise, user defined file appender will only write to container's CWD, and > log files in CWD will not be displayed on YARN UI,and either cannot be > aggregated to HDFS log directory after job finished. > User defined log4j.properties reference example: > {code} > log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2387) Remove the stage barrier for better resource utilization
[ https://issues.apache.org/jira/browse/SPARK-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073989#comment-14073989 ] Rui Li commented on SPARK-2387: --- [~kayousterhout] Thanks for the review. I tested this PoC with graphx.SynthBenchmark; the test was done on a 7-node cluster, where each node runs an executor with 32 CPUs and 90GB of memory. For one case (-numEPart=112 -nverts=1000 -niter=3), it improves the job by about 10%. I did notice there is a regression for very small cases. I think this is because the shuffle map stage is quite short and the overlap is not as pronounced. Actually, I think a very long map straggler is the best use case for this PoC, in that all the pre-started reducers are waiting for output from just one map task. When that map task finishes, all the waiting reducers can finish at almost the same time. So in this case we can eliminate almost the whole execution time of the reduce stage, compared to the normal mode with stage barriers. I agree the early reducers can prevent other jobs from being launched, but I suppose that only happens if multiple jobs are submitted concurrently through one SparkContext. Not sure if this is a common case? The PoC may also provide flexibility for different shuffle implementations. For example, in a push-style shuffle, the pushed data wouldn't have to be stored to disk if the reducers have already started on the destination node. This PoC does have the potential to cause a kind of "deadlock", i.e. the pre-started reducers take up all the slots while there are still pending map tasks. I try to avoid this in the PR by checking free slots before launching the reduce stage and by giving the pre-started stage lower priority when scheduling. But that doesn't solve the issue perfectly because of delay scheduling, and unfortunately map tasks are more likely to be delayed because of their locality preferences. The deadlock may also occur with task failover, when a map task reports success but we fail to get the task result later (not common, because a map status is usually small enough to fit into a direct result). We talked about this problem a bit. It seems that to solve it we either have to feed some physical resource information (e.g. free/total slots) into the DAGScheduler, or push some dependency information down to the TaskScheduler. Either way seems to involve many modifications and somewhat violate the current design principles of the schedulers. So I left this open and want to see if you have any ideas on this... > Remove the stage barrier for better resource utilization > > > Key: SPARK-2387 > URL: https://issues.apache.org/jira/browse/SPARK-2387 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Rui Li > > DAGScheduler divides a Spark job into multiple stages according to RDD > dependencies. Whenever there's a shuffle dependency, DAGScheduler creates a > shuffle map stage on the map side, and another stage depending on that stage. > Currently, the downstream stage cannot start until all the stages it depends on > have finished. This barrier between stages leads to idle slots while waiting > for the last few upstream tasks to finish, thus wasting cluster resources. > Therefore we propose to remove the barrier and pre-start the reduce stage > once there are free slots. This can achieve better resource utilization and > improve the overall job performance, especially when there are lots of > executors granted to the application. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2515) Hypothesis testing
[ https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074003#comment-14074003 ] Hossein Falaki commented on SPARK-2515: --- If we really have to implement another chi-square test method, I think the likelihood-ratio test would be a good candidate. On the return type: * What is left for the Summary field? Why can't that just be the toString method? * I am not sure, but df may be too cryptic for non-experts. How about degreesOfFreedom? > Hypothesis testing > -- > > Key: SPARK-2515 > URL: https://issues.apache.org/jira/browse/SPARK-2515 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Doris Xin > > Support common statistical tests in Spark MLlib. -- This message was sent by Atlassian JIRA (v6.2#6252)
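To make the API discussion concrete, here is a purely illustrative sketch of the kind of result object being debated. The class and field names (including degreesOfFreedom and the toString-as-summary idea) are assumptions drawn from the comment above, not the final MLlib API:
{code}
// Hypothetical test-result container: toString serves as the human-readable summary,
// and the degrees-of-freedom field is spelled out rather than abbreviated to "df".
class ChiSqTestResult(
    val method: String,
    val statistic: Double,
    val degreesOfFreedom: Int,
    val pValue: Double,
    val nullHypothesis: String) {

  override def toString: String =
    s"""Chi squared test summary
       |method: $method
       |degrees of freedom = $degreesOfFreedom
       |statistic = $statistic
       |pValue = $pValue
       |$nullHypothesis""".stripMargin
}
{code}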
[jira] [Commented] (SPARK-2529) Clean the closure in foreach and foreachPartition
[ https://issues.apache.org/jira/browse/SPARK-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074002#comment-14074002 ] Mark Hamstra commented on SPARK-2529: - Actually, we were cleaning those closures, but that was removed in https://github.com/apache/spark/commit/6b288b75d4c05f42ad3612813dc77ff824bb6203 -- not sure why. > Clean the closure in foreach and foreachPartition > - > > Key: SPARK-2529 > URL: https://issues.apache.org/jira/browse/SPARK-2529 > Project: Spark > Issue Type: Bug >Reporter: Reynold Xin > > Somehow we didn't clean the closure for foreach and foreachPartition. Should > do that. -- This message was sent by Atlassian JIRA (v6.2#6252)