[jira] [Commented] (SPARK-2172) PySpark cannot import mllib modules in YARN-client mode
[ https://issues.apache.org/jira/browse/SPARK-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108783#comment-14108783 ] Joao Salcedo commented on SPARK-2172: - [~piotrszul] Until the fix is released, is there a workaround I can use in my Python script? PySpark cannot import mllib modules in YARN-client mode --- Key: SPARK-2172 URL: https://issues.apache.org/jira/browse/SPARK-2172 Project: Spark Issue Type: Bug Components: MLlib, PySpark, Spark Core, YARN Affects Versions: 1.0.0, 1.1.0 Environment: Ubuntu 14.04 Java 7 Python 2.7 CDH 5.0.2 (Hadoop 2.3.0): HDFS, YARN Spark 1.0.0 and git master Reporter: Vlad Frolov Labels: mllib, python Fix For: 1.0.1, 1.1.0 Here is simple code to reproduce the issue: {noformat} $ HADOOP_CONF_DIR=/etc/hadoop/conf MASTER=yarn-client ./bin/pyspark {noformat} {code:title=issue.py|borderStyle=solid} from pyspark.mllib.regression import LabeledPoint sc.parallelize([1,2,3]).map(lambda x: LabeledPoint(1, [2])).count() {code} Note: The same issue occurs with .collect() instead of .count() {code:title=TraceBack|borderStyle=solid} Py4JJavaError: An error occurred while calling o110.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 8.0:0 failed 4 times, most recent failure: Exception failure in TID 52 on host ares: org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/mnt/storage/bigisle/yarn/1/yarn/local/usercache/blb/filecache/18/spark-assembly-1.0.0-hadoop2.2.0.jar/pyspark/worker.py", line 73, in main command = pickleSer._read_with_length(infile) File "/mnt/storage/bigisle/yarn/1/yarn/local/usercache/blb/filecache/18/spark-assembly-1.0.0-hadoop2.2.0.jar/pyspark/serializers.py", line 146, in _read_with_length return self.loads(obj) ImportError: No module named mllib.regression org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115) org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:145) org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633) at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} However, this code works as expected: {code:title=noissue.py|borderStyle=solid} from pyspark.mllib.regression import LabeledPoint sc.parallelize([1,2,3]).map(lambda x: LabeledPoint(1, [2])).first() {code}
[jira] [Commented] (SPARK-3190) Creation of large graph (> 2.15 B nodes) seems to be broken: possible overflow somewhere
[ https://issues.apache.org/jira/browse/SPARK-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108794#comment-14108794 ] npanj commented on SPARK-3190: -- Thanks, Ankur, for the patch. I can confirm that this pull request fixed the issue. Creation of large graph (> 2.15 B nodes) seems to be broken: possible overflow somewhere --- Key: SPARK-3190 URL: https://issues.apache.org/jira/browse/SPARK-3190 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.3 Environment: Standalone mode running on EC2. Using latest code from master branch up to commit #db56f2df1b8027171da1b8d2571d1f2ef1e103b6. Reporter: npanj Assignee: Ankur Dave Priority: Critical While creating a graph with 6B nodes and 12B edges, I noticed that the 'numVertices' API returns an incorrect result; 'numEdges' reports the correct number. A few times (with a different dataset of ~2.5B nodes) I have also noticed that numVertices is returned as a negative number, so I suspect that there is some overflow (maybe we are using Int for some field?). Here are some details of the experiments I have done so far: 1. Input: numNodes=6101995593 ; noEdges=12163784626 Graph returns: numVertices=1807028297 ; numEdges=12163784626 2. Input : numNodes=2157586441 ; noEdges=2747322705 Graph Returns: numVertices=-2137380855 ; numEdges=2747322705 3. Input: numNodes=1725060105 ; noEdges=204176821 Graph: numVertices=1725060105 ; numEdges=2041768213 You can find the code to generate this bug here: https://gist.github.com/npanj/92e949d86d08715bf4bf Note: Nodes are labeled 1...6B. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
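The reported numbers are consistent with a 32-bit truncation: reinterpreting the 64-bit node counts as signed Ints reproduces both wrong values exactly. A quick check in plain Scala (illustrative arithmetic, not Spark code):
{code:title=OverflowCheck.scala|borderStyle=solid}
// Casting the reported 64-bit counts down to a signed 32-bit Int
// reproduces the buggy output bit-for-bit:
val n1 = 6101995593L
val n2 = 2157586441L
println(n1.toInt) // 1807028297  -- the wrong numVertices from experiment 1
println(n2.toInt) // -2137380855 -- the negative numVertices from experiment 2
{code}
This points at an Int being used somewhere a Long-sized vertex count is needed, matching the reporter's suspicion.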
[jira] [Commented] (SPARK-2805) update akka to version 2.3
[ https://issues.apache.org/jira/browse/SPARK-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108800#comment-14108800 ] Anand Avati commented on SPARK-2805: [~pwendell] ping update akka to version 2.3 -- Key: SPARK-2805 URL: https://issues.apache.org/jira/browse/SPARK-2805 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati akka-2.3 is the lowest akka version available for Scala 2.11. akka-2.3 depends on protobuf 2.5, while Hadoop-1 requires protobuf 2.4.1. In order to reconcile the conflicting dependencies, we need to release an akka-2.3.x-shaded-protobuf artifact which bundles protobuf 2.5. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
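Once such an artifact exists, consumers would swap the upstream coordinates for the shaded ones. A hedged sbt sketch follows; the group id and version are illustrative, since the ticket is precisely about the fact that this artifact has not been published yet:
{code:title=build.sbt|borderStyle=solid}
// Hypothetical shaded coordinates, following the pattern of Spark's earlier
// org.spark-project.akka "shaded-protobuf" releases:
libraryDependencies += "org.spark-project.akka" %% "akka-actor" % "2.3.x-shaded-protobuf"
{code}
Shading relocates the bundled protobuf 2.5 classes so they no longer clash with Hadoop-1's protobuf 2.4.1 on the classpath.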
[jira] [Created] (SPARK-3197) Reduce the expression tree object creation from the aggregation functions (min/max)
Cheng Hao created SPARK-3197: Summary: Reduce the expression tree object creation from the aggregation functions (min/max) Key: SPARK-3197 URL: https://issues.apache.org/jira/browse/SPARK-3197 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3196) Expression Evaluation Performance Improvement
Cheng Hao created SPARK-3196: Summary: Expression Evaluation Performance Improvement Key: SPARK-3196 URL: https://issues.apache.org/jira/browse/SPARK-3196 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Expression id generation depends on an atomic long object internally, which causes performance to drop dramatically under multi-threaded execution. I'd like to create 2 sub-tasks (maybe more) for the improvements: 1) Reduce the expression tree object creation in the aggregation functions (min/max), as they create expression trees for each single row. 2) Improve the expression id generation algorithm by not using the AtomicLong. Also, remove expression object creation as much as possible wherever we do expression evaluation. (I will create a couple of sub-tasks soon.) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3198) Improve the expression id generation algorithm
Cheng Hao created SPARK-3198: Summary: Improve the expression id generation algorithm Key: SPARK-3198 URL: https://issues.apache.org/jira/browse/SPARK-3198 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Currently, Catalyst uses an AtomicLong for expression id generation, which reduces performance dramatically in a multi-threaded environment. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3193) output error info when Process exit code is not zero
[ https://issues.apache.org/jira/browse/SPARK-3193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma resolved SPARK-3193. Resolution: Invalid [Verbatim from my comment on the PR] Hey! Thanks for raising this concern. The convention in Spark is that we look in [sub-project]/target/unit-tests.log, and this is applicable to all test suites. So when you see a particular test fail on Jenkins, you can rerun that test locally and then check the unit-tests.log file for that sub-project. I hope this helps. You can close this PR if you are convinced. P.S.: Maybe we can expand our wiki page with this information. output error info when Process exit code is not zero Key: SPARK-3193 URL: https://issues.apache.org/jira/browse/SPARK-3193 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.2 Reporter: wangfei I noticed that sometimes PR tests fail because the process exit code != 0: DriverSuite: Spark assembly has been built with Hive, including Datanucleus jars on classpath - driver should exit after finishing *** FAILED *** SparkException was thrown during property evaluation. (DriverSuite.scala:40) Message: Process List(./bin/spark-class, org.apache.spark.DriverWithoutCleanup, local) exited with code 1 Occurred at table row 0 (zero based, not counting headings), which had values ( master = local ) [info] SparkSubmitSuite: [info] - prints usage on empty input [info] - prints usage with only --help [info] - prints error with unrecognized options [info] - handle binary specified but not class [info] - handles arguments with --key=val [info] - handles arguments to user program [info] - handles arguments to user program with name collision [info] - handles YARN cluster mode [info] - handles YARN client mode [info] - handles standalone cluster mode [info] - handles standalone client mode [info] - handles mesos client mode [info] - handles confs with flag equivalents [info] - launch simple application with spark-submit *** FAILED *** [info] org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, local, file:/tmp/1408854098404-0/testJar-1408854098404.jar) exited with code 1 [info] at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:872) [info] at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) [info] at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) [info] at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) [info] at org.apac... Spark assembly has been built with Hive, including Datanucleus jars on classpath. Refer to https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18688/consoleFull and https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19118/consoleFull. We should output the process's error info when it fails; this can be helpful for diagnosis. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
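The proposal amounts to capturing the child process's stderr and surfacing it in the exception. A minimal sketch in plain Scala (an illustrative helper, not the actual patch to Utils.executeAndGetOutput):
{code:title=RunAndCheck.scala|borderStyle=solid}
import scala.sys.process._
import scala.collection.mutable.ArrayBuffer

// Run a command, returning stdout; on a non-zero exit code, fail with
// the captured stderr included so the Jenkins log is actionable.
def runAndCheck(cmd: Seq[String]): String = {
  val out = ArrayBuffer[String]()
  val err = ArrayBuffer[String]()
  val exitCode = Process(cmd).!(ProcessLogger(out += _, err += _))
  if (exitCode != 0) {
    throw new RuntimeException(
      s"Process ${cmd.mkString(" ")} exited with code $exitCode; stderr:\n${err.mkString("\n")}")
  }
  out.mkString("\n")
}
{code}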
[jira] [Commented] (SPARK-3197) Reduce the expression tree object creation from the aggregation functions (min/max)
[ https://issues.apache.org/jira/browse/SPARK-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108869#comment-14108869 ] Apache Spark commented on SPARK-3197: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/2113 Reduce the expression tree object creation from the aggregation functions (min/max) --- Key: SPARK-3197 URL: https://issues.apache.org/jira/browse/SPARK-3197 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3199) native Java spark listener API support
Chengxiang Li created SPARK-3199: Summary: native Java spark listener API support Key: SPARK-3199 URL: https://issues.apache.org/jira/browse/SPARK-3199 Project: Spark Issue Type: New Feature Components: Java API Reporter: Chengxiang Li The current Spark listener API is entirely Scala-style, full of case classes and Scala collections; a native Java Spark listener API would be much friendlier for Java users. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
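For context, this is roughly what registering a listener looks like today in Scala; the case-class events and the Scala collections in their fields are what make the API awkward from Java (a sketch -- the exact listener surface varies by version):
{code:title=ListenerExample.scala|borderStyle=solid}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

// Scala-style registration; from Java this requires subclassing a Scala
// trait and unpacking Scala case classes.
sc.addSparkListener(new SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    println(s"Job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
  }
})
{code}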
[jira] [Updated] (SPARK-2633) enhance spark listener API to gather more spark job information
[ https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated SPARK-2633: - Summary: enhance spark listener API to gather more spark job information (was: support register spark listener to listener bus with Java API) enhance spark listener API to gather more spark job information --- Key: SPARK-2633 URL: https://issues.apache.org/jira/browse/SPARK-2633 Project: Spark Issue Type: New Feature Components: Java API Reporter: Chengxiang Li Priority: Critical Labels: hive Attachments: Spark listener enhancement for Hive on Spark job monitor and statistic.docx Currently users can only register a Spark listener with the Scala API; we should add this feature to the Java API as well. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2633) enhance spark listener API to gather more spark job information
[ https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated SPARK-2633: - Description: Based on the Hive on Spark job status monitoring and statistics collection requirements, we should enhance the Spark listener API to gather more Spark job information. (was: Currently users can only register a Spark listener with the Scala API; we should add this feature to the Java API as well.) enhance spark listener API to gather more spark job information --- Key: SPARK-2633 URL: https://issues.apache.org/jira/browse/SPARK-2633 Project: Spark Issue Type: New Feature Components: Java API Reporter: Chengxiang Li Priority: Critical Labels: hive Attachments: Spark listener enhancement for Hive on Spark job monitor and statistic.docx Based on the Hive on Spark job status monitoring and statistics collection requirements, we should enhance the Spark listener API to gather more Spark job information. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3198) Generate the expression id only when necessary
[ https://issues.apache.org/jira/browse/SPARK-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-3198: - Summary: Generate the expression id only when necessary (was: Improve the expression id generation algorithm) Generate the expression id only when necessary --- Key: SPARK-3198 URL: https://issues.apache.org/jira/browse/SPARK-3198 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Currently, Catalyst uses an AtomicLong for expression id generation, which reduces performance dramatically in a multi-threaded environment. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2633) enhance spark listener API to gather more spark job information
[ https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108887#comment-14108887 ] Chengxiang Li commented on SPARK-2633: -- I will start working on this issue. For better isolation, this JIRA will focus on the Spark listener API enhancement, and I've created SPARK-3199 to track the native Java Spark listener API implementation. enhance spark listener API to gather more spark job information --- Key: SPARK-2633 URL: https://issues.apache.org/jira/browse/SPARK-2633 Project: Spark Issue Type: New Feature Components: Java API Reporter: Chengxiang Li Priority: Critical Labels: hive Attachments: Spark listener enhancement for Hive on Spark job monitor and statistic.docx Based on the Hive on Spark job status monitoring and statistics collection requirements, we should enhance the Spark listener API to gather more Spark job information. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3196) Expression Evaluation Performance Improvement
[ https://issues.apache.org/jira/browse/SPARK-3196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-3196: - Description: Expression id generation depends on an atomic long object internally, which causes performance to drop dramatically under multi-threaded execution. I'd like to create 2 sub-tasks (maybe more) for the improvements: 1) Reduce the expression tree object creation in the aggregation functions (min/max), as they create expression trees for each single row. 2) Improve the expression id generation algorithm by not using the AtomicLong, or by generating the expression id only when necessary. Also, remove expression object creation as much as possible wherever we do expression evaluation. (I will create a couple of sub-tasks soon.) was: Expression id generation depends on an atomic long object internally, which causes performance to drop dramatically under multi-threaded execution. I'd like to create 2 sub-tasks (maybe more) for the improvements: 1) Reduce the expression tree object creation in the aggregation functions (min/max), as they create expression trees for each single row. 2) Improve the expression id generation algorithm by not using the AtomicLong. Also, remove expression object creation as much as possible wherever we do expression evaluation. (I will create a couple of sub-tasks soon.) Expression Evaluation Performance Improvement - Key: SPARK-3196 URL: https://issues.apache.org/jira/browse/SPARK-3196 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Expression id generation depends on an atomic long object internally, which causes performance to drop dramatically under multi-threaded execution. I'd like to create 2 sub-tasks (maybe more) for the improvements: 1) Reduce the expression tree object creation in the aggregation functions (min/max), as they create expression trees for each single row. 2) Improve the expression id generation algorithm by not using the AtomicLong, or by generating the expression id only when necessary. Also, remove expression object creation as much as possible wherever we do expression evaluation. (I will create a couple of sub-tasks soon.) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3198) Generate the expression id only when necessary
[ https://issues.apache.org/jira/browse/SPARK-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108891#comment-14108891 ] Apache Spark commented on SPARK-3198: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/2114 Generate the expression id only when necessary --- Key: SPARK-3198 URL: https://issues.apache.org/jira/browse/SPARK-3198 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Currently, Catalyst uses an AtomicLong for expression id generation, which reduces performance dramatically in a multi-threaded environment. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3198) Generate the expression id only when necessary
[ https://issues.apache.org/jira/browse/SPARK-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108894#comment-14108894 ] Cheng Hao commented on SPARK-3198: -- Usually we need the expression id during logical plan analysis, not during evaluation, hence we can get a significant improvement by doing this. Generate the expression id only when necessary --- Key: SPARK-3198 URL: https://issues.apache.org/jira/browse/SPARK-3198 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Currently, Catalyst uses an AtomicLong for expression id generation, which reduces performance dramatically in a multi-threaded environment. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
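To make the idea concrete, here is a toy sketch in plain Scala (not Catalyst's actual classes): with a lazy id, the shared counter is only touched when the id is first read, e.g. during analysis, instead of on every expression instantiation during per-row evaluation.
{code:title=LazyExprId.scala|borderStyle=solid}
import java.util.concurrent.atomic.AtomicLong

object ExprIds { val curId = new AtomicLong(0) }

// Eager: every instantiation contends on the shared counter.
class EagerExpr { val id: Long = ExprIds.curId.getAndIncrement() }

// Lazy: the counter is touched only if `id` is actually read,
// so per-row expression construction never pays for id generation.
class LazyExpr { lazy val id: Long = ExprIds.curId.getAndIncrement() }
{code}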
[jira] [Commented] (SPARK-3173) Timestamp support in the parser
[ https://issues.apache.org/jira/browse/SPARK-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108936#comment-14108936 ] Teng Qiu commented on SPARK-3173: - Voting for this ticket; it seems it should be linked to this PR: https://github.com/apache/spark/pull/2084 Timestamp support in the parser --- Key: SPARK-3173 URL: https://issues.apache.org/jira/browse/SPARK-3173 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2, 1.1.0 Reporter: Zdenek Farana If you have a table with a TIMESTAMP column, that column can't be used in a WHERE clause properly - it is not evaluated properly. For example, SELECT * FROM a WHERE timestamp='2014-08-21 00:00:00.0' would return nothing even if there were a row with such a timestamp. The literal is not interpreted as a timestamp. The workaround SELECT * FROM a WHERE timestamp=CAST('2014-08-21 00:00:00.0' AS TIMESTAMP) fails because the parser does not allow anything but STRING in the CAST dataType expression. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108958#comment-14108958 ] Prashant Sharma commented on SPARK-2620: I just tried these test code snippets in the Spark REPL (built from master), and they pass with the expected results. case class cannot be used as key for reduce --- Key: SPARK-2620 URL: https://issues.apache.org/jira/browse/SPARK-2620 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: reproduced on spark-shell local[4] Reporter: Gerard Maas Priority: Critical Labels: case-class, core Using a case class as a key doesn't seem to work properly on Spark 1.0.0 A minimal example: case class P(name: String) val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")) sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect [Spark shell local mode] res: Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1)) In contrast, the expected behavior should be equivalent to: sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) groupByKey and distinct also present the same behavior. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3200) Class defined with reference to external variables crashes in REPL.
Prashant Sharma created SPARK-3200: -- Summary: Class defined with reference to external variables crashes in REPL. Key: SPARK-3200 URL: https://issues.apache.org/jira/browse/SPARK-3200 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Prashant Sharma Reproducer: {noformat} val a = sc.textFile("README.md").count case class A(i: Int) { val j = a } sc.parallelize(1 to 10).map(A(_)).collect() {noformat} This happens when one refers to something that (transitively) refers to sc, and not otherwise. There are many ways to work around this, like directly assigning a constant value instead of referring to the variable. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
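For illustration, the workaround mentioned in the description -- assigning a constant instead of referring to the sc-derived variable -- might look like this (a sketch of the reporter's suggestion, not a verified fix):
{code:title=Workaround.scala|borderStyle=solid}
val a = sc.textFile("README.md").count // refers to sc
case class A(i: Int) { val j = 42L }   // a constant: no reference chain back to sc
sc.parallelize(1 to 10).map(A(_)).collect()
{code}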
[jira] [Created] (SPARK-3201) Yarn Client does not support the -X java opts
hzw created SPARK-3201: -- Summary: Yarn Client does not support the -X java opts Key: SPARK-3201 URL: https://issues.apache.org/jira/browse/SPARK-3201 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: hzw In yarn-client mode, it's not allowed to set spark.driver.extraJavaOptions. I think this is very inconvenient when we want to set -X java opts for the ExecutorLauncher process. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
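For context, this is the setting the report refers to; per the description it is rejected in yarn-client mode at this time. The flag values below are only an illustration:
{code:title=ExtraJavaOptions.scala|borderStyle=solid}
import org.apache.spark.SparkConf

// What users would like to do (illustrative -X/-XX options):
val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions", "-XX:MaxPermSize=256m -verbose:gc")
{code}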
[jira] [Created] (SPARK-3203) ClassNotFound Exception
Rohit Kumar created SPARK-3203: -- Summary: ClassNotFound Exception Key: SPARK-3203 URL: https://issues.apache.org/jira/browse/SPARK-3203 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Environment: Ubuntu 12.04, openjdk 64 bit 7u65 Reporter: Rohit Kumar I am using Spark as a processing engine over Cassandra. I have only one master and a worker node. I am executing the following code in spark-shell: sc.stop import org.apache.spark.SparkContext import org.apache.spark.SparkConf import com.datastax.spark.connector._ val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1") val sc = new SparkContext("spark://L-BXP44Z1:7077", "Cassandra Connector Test", conf) val rdd = sc.cassandraTable("test", "kv") println(rdd.map(_.getInt("value")).sum) I am getting the following error: 14/08/25 18:47:17 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 14/08/25 18:49:39 INFO CoarseGrainedExecutorBackend: Got assigned task 0 14/08/25 18:49:39 INFO Executor: Running task ID 0 14/08/25 18:49:39 ERROR Executor: Exception in task ID 0 java.lang.ClassNotFoundException: $line29.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85) at
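One hedged guess at a workaround: the shell's original SparkConf carries REPL-specific settings (such as the class-server URI executors use to fetch REPL-generated $iwC... classes), which a bare new SparkConf() drops; reusing it when building the replacement context may avoid this ClassNotFoundException. This is an assumption, not a verified fix:
{code:title=ReuseShellConf.scala|borderStyle=solid}
// Clone the shell's conf (keeping REPL settings a fresh conf would lack),
// then override only what the Cassandra connector needs.
val conf = sc.getConf
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .setMaster("spark://L-BXP44Z1:7077")
  .setAppName("Cassandra Connector Test")
sc.stop()
val sc2 = new org.apache.spark.SparkContext(conf)
{code}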
[jira] [Created] (SPARK-3202) Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD
Hingorani, Vineet created SPARK-3202: Summary: Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD Key: SPARK-3202 URL: https://issues.apache.org/jira/browse/SPARK-3202 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Hingorani, Vineet Hello all, Could someone help me with the manipulation of CSV file data? I have semicolon-separated CSV data including doubles and strings. I want to calculate the maximum/average of a column. When I read the file using sc.textFile("test.csv").map(_.split(";")), each field is read as a string. Could someone help me with the above manipulation and how to do it? Or maybe there is some way to take the transpose of the data and then manipulate the rows? Thank you in advance; I have been struggling with this for quite some time. Regards, Vineet -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
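Since this is a how-to question, a minimal sketch of one answer (assuming the numeric column is at index 2 and the file has no header row):
{code:title=ColumnStats.scala|borderStyle=solid}
val rows = sc.textFile("test.csv").map(_.split(";"))
val col  = rows.map(fields => fields(2).toDouble) // parse the string field to a double
val max  = col.reduce(math.max)                   // maximum of the column
val avg  = col.sum() / col.count()                // average of the column
{code}
As the follow-up comments note, the user mailing list (rather than JIRA) is the right venue for this kind of question.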
[jira] [Closed] (SPARK-3202) Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD
[ https://issues.apache.org/jira/browse/SPARK-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hingorani, Vineet closed SPARK-3202. Resolution: Invalid Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD - Key: SPARK-3202 URL: https://issues.apache.org/jira/browse/SPARK-3202 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Hingorani, Vineet Hello all, Could someone help me with the manipulation of CSV file data? I have semicolon-separated CSV data including doubles and strings. I want to calculate the maximum/average of a column. When I read the file using sc.textFile("test.csv").map(_.split(";")), each field is read as a string. Could someone help me with the above manipulation and how to do it? Or maybe there is some way to take the transpose of the data and then manipulate the rows? Thank you in advance; I have been struggling with this for quite some time. Regards, Vineet -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3202) Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD
[ https://issues.apache.org/jira/browse/SPARK-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109129#comment-14109129 ] Hingorani, Vineet commented on SPARK-3202: -- Thank you, Sean, for the help regarding the platform. :) Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD - Key: SPARK-3202 URL: https://issues.apache.org/jira/browse/SPARK-3202 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Hingorani, Vineet Hello all, Could someone help me with the manipulation of CSV file data? I have semicolon-separated CSV data including doubles and strings. I want to calculate the maximum/average of a column. When I read the file using sc.textFile("test.csv").map(_.split(";")), each field is read as a string. Could someone help me with the above manipulation and how to do it? Or maybe there is some way to take the transpose of the data and then manipulate the rows? Thank you in advance; I have been struggling with this for quite some time. Regards, Vineet -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3202) Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD
[ https://issues.apache.org/jira/browse/SPARK-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109128#comment-14109128 ] Sean Owen commented on SPARK-3202: -- JIRA is not a good place to ask questions -- please use u...@spark.apache.org. This is for reporting issues, so I'd recommend closing this. Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD - Key: SPARK-3202 URL: https://issues.apache.org/jira/browse/SPARK-3202 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Hingorani, Vineet Hello all, Could someone help me with the manipulation of CSV file data? I have semicolon-separated CSV data including doubles and strings. I want to calculate the maximum/average of a column. When I read the file using sc.textFile("test.csv").map(_.split(";")), each field is read as a string. Could someone help me with the above manipulation and how to do it? Or maybe there is some way to take the transpose of the data and then manipulate the rows? Thank you in advance; I have been struggling with this for quite some time. Regards, Vineet -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3201) Yarn Client does not support the -X java opts
[ https://issues.apache.org/jira/browse/SPARK-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109131#comment-14109131 ] Apache Spark commented on SPARK-3201: - User 'hzw19900416' has created a pull request for this issue: https://github.com/apache/spark/pull/2115 Yarn Client does not support the -X java opts - Key: SPARK-3201 URL: https://issues.apache.org/jira/browse/SPARK-3201 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: hzw In yarn-client mode, it's not allowed to set spark.driver.extraJavaOptions. I think this is very inconvenient when we want to set -X java opts for the ExecutorLauncher process. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3204) MaxOf would be foldable if both left and right are foldable.
Takuya Ueshin created SPARK-3204: Summary: MaxOf would be foldable if both left and right are foldable. Key: SPARK-3204 URL: https://issues.apache.org/jira/browse/SPARK-3204 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
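A toy model of the proposed rule, in plain Scala standing in for Catalyst's Expression API (which is not reproduced here): a MaxOf node is constant-foldable exactly when both of its children are.
{code:title=FoldableSketch.scala|borderStyle=solid}
trait Expr { def foldable: Boolean }
case class Literal(value: Any) extends Expr { val foldable = true }
case class Attribute(name: String) extends Expr { val foldable = false }
case class MaxOf(left: Expr, right: Expr) extends Expr {
  // the proposed rule: foldable iff both operands are foldable
  val foldable: Boolean = left.foldable && right.foldable
}

assert(MaxOf(Literal(1), Literal(2)).foldable)      // can be constant-folded
assert(!MaxOf(Literal(1), Attribute("x")).foldable) // cannot
{code}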
[jira] [Updated] (SPARK-3206) Error in PageRank values
[ https://issues.apache.org/jira/browse/SPARK-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Fontana updated SPARK-3206: - Description: I have found a small example where the PageRank values using run and runUntilConvergence differ quite a bit. I am running the Pagerank module on the following graph: Edge Table: | Node1 | Node2 | |1 | 2 | |1 |3| |3 |2| |3 |4| |5 |3| |6 |7| |7 |8| |8 |9| |9 |7| Node Table (note the extra node): | NodeID | NodeName | |a |1| |b |2| |c |3| |d |4| |e |5| |f |6| |g |7| |h |8| |i |9| |j.longaddress.com |10| with a default resetProb of 0.15. When I compute the pageRank with runUntilConvergence, running ``` val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices ``` I get the ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,1.3299054047985106) (9,1.2381240056453071) (8,1.2803346052504254) (10,0.15) (5,0.15) (2,0.358781244) However, when I run page Rank with the run() method, running val ranksI = PageRank.run(graph,100).vertices I get the page ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,0.999387662847) (9,0.999256447741) (8,0.999256447741) (10,0.15) (5,0.15) (2,0.295031247) These are quite different, leading me to suspect that one of the PageRank methods is incorrect. I have examined the source, but I do not know what the correct fix is, or which set of values is correct. was: I have found a small example where the PageRank values using run and runUntilConvergence differ quite a bit. I am running the Pagerank module on the following graph: Edge Table: | Node1 | Node2 | |1 | 2 | |1 |3| 3 | 2 3 | 4 5 | 3 6 | 7 7 | 8 8 | 9 9 | 7 Node Table (note the extra node): | NodeID | NodeName | | - | - | a | 1 b | 2 c | 3 d | 4 e | 5 f | 6 g | 7 h | 8 i | 9 j.longaddress.com | 10 with a default resetProb of 0.15. When I compute the pageRank with runUntilConvergence, running val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices I get the ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,1.3299054047985106) (9,1.2381240056453071) (8,1.2803346052504254) (10,0.15) (5,0.15) (2,0.358781244) However, when I run page Rank with the run() method, running val ranksI = PageRank.run(graph,100).vertices I get the page ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,0.999387662847) (9,0.999256447741) (8,0.999256447741) (10,0.15) (5,0.15) (2,0.295031247) These are quite different, leading me to suspect that one of the PageRank methods is incorrect. I have examined the source, but I do not know what the correct fix is, or which set of values is correct. Error in PageRank values Key: SPARK-3206 URL: https://issues.apache.org/jira/browse/SPARK-3206 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.2 Environment: UNIX with Hadoop Reporter: Peter Fontana I have found a small example where the PageRank values using run and runUntilConvergence differ quite a bit. I am running the Pagerank module on the following graph: Edge Table: | Node1 | Node2 | |1 | 2 | |1 | 3| |3 | 2| |3 | 4| |5 | 3| |6 | 7| |7 | 8| |8 | 9| |9 | 7| Node Table (note the extra node): | NodeID | NodeName | |a | 1| |b | 2| |c | 3| |d | 4| |e | 5| |f | 6| |g | 7| |h | 8| |i | 9| |j.longaddress.com | 10| with a default resetProb of 0.15. 
When I compute the pageRank with runUntilConvergence, running ``` val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices ``` I get the ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,1.3299054047985106) (9,1.2381240056453071) (8,1.2803346052504254) (10,0.15) (5,0.15) (2,0.358781244) However, when I run page Rank with the run() method, running val ranksI = PageRank.run(graph,100).vertices I get the page ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,0.999387662847) (9,0.999256447741) (8,0.999256447741) (10,0.15) (5,0.15) (2,0.295031247) These are quite different, leading me to suspect that one of the PageRank methods is incorrect. I have examined the source, but I do not know what the correct fix is, or which set of values is correct. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3206) Error in PageRank values
[ https://issues.apache.org/jira/browse/SPARK-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Fontana updated SPARK-3206: - Description: I have found a small example where the PageRank values using run and runUntilConvergence differ quite a bit. I am running the Pagerank module on the following graph: Edge Table: | |Node1|||Node2 | | |1 | 2 | |1 |3| |3 |2| |3 |4| |5 |3| |6 |7| |7 |8| |8 |9| |9 |7| Node Table (note the extra node): || NodeID || NodeName || |a |1| |b |2| |c |3| |d |4| |e |5| |f |6| |g |7| |h |8| |i |9| |j.longaddress.com |10| with a default resetProb of 0.15. When I compute the pageRank with runUntilConvergence, running {{val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}} I get the ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,1.3299054047985106) (9,1.2381240056453071) (8,1.2803346052504254) (10,0.15) (5,0.15) (2,0.358781244) However, when I run page Rank with the run() method, running {{val ranksI = PageRank.run(graph,100).vertices}} I get the page ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,0.999387662847) (9,0.999256447741) (8,0.999256447741) (10,0.15) (5,0.15) (2,0.295031247) These are quite different, leading me to suspect that one of the PageRank methods is incorrect. I have examined the source, but I do not know what the correct fix is, or which set of values is correct. was: I have found a small example where the PageRank values using run and runUntilConvergence differ quite a bit. I am running the Pagerank module on the following graph: Edge Table: | |Node1|||Node2 | | |1 | 2 | |1 |3| |3 |2| |3 |4| |5 |3| |6 |7| |7 |8| |8 |9| |9 |7| Node Table (note the extra node): || NodeID || NodeName || |a |1| |b |2| |c |3| |d |4| |e |5| |f |6| |g |7| |h |8| |i |9| |j.longaddress.com |10| with a default resetProb of 0.15. When I compute the pageRank with runUntilConvergence, running {{ val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}} I get the ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,1.3299054047985106) (9,1.2381240056453071) (8,1.2803346052504254) (10,0.15) (5,0.15) (2,0.358781244) However, when I run page Rank with the run() method, running {{val ranksI = PageRank.run(graph,100).vertices}} I get the page ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,0.999387662847) (9,0.999256447741) (8,0.999256447741) (10,0.15) (5,0.15) (2,0.295031247) These are quite different, leading me to suspect that one of the PageRank methods is incorrect. I have examined the source, but I do not know what the correct fix is, or which set of values is correct. Error in PageRank values Key: SPARK-3206 URL: https://issues.apache.org/jira/browse/SPARK-3206 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.2 Environment: UNIX with Hadoop Reporter: Peter Fontana I have found a small example where the PageRank values using run and runUntilConvergence differ quite a bit. I am running the Pagerank module on the following graph: Edge Table: | |Node1|||Node2 | | |1 | 2 | |1 | 3| |3 | 2| |3 | 4| |5 | 3| |6 | 7| |7 | 8| |8 | 9| |9 | 7| Node Table (note the extra node): || NodeID || NodeName || |a | 1| |b | 2| |c | 3| |d | 4| |e | 5| |f | 6| |g | 7| |h | 8| |i | 9| |j.longaddress.com | 10| with a default resetProb of 0.15. 
When I compute the pageRank with runUntilConvergence, running {{val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}} I get the ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,1.3299054047985106) (9,1.2381240056453071) (8,1.2803346052504254) (10,0.15) (5,0.15) (2,0.358781244) However, when I run page Rank with the run() method, running {{val ranksI = PageRank.run(graph,100).vertices}} I get the page ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,0.999387662847) (9,0.999256447741) (8,0.999256447741) (10,0.15) (5,0.15) (2,0.295031247) These are quite different, leading me to suspect that one of the PageRank methods is incorrect. I have examined the source, but I do not know what the correct fix is, or which set of values is correct. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3206) Error in PageRank values
[ https://issues.apache.org/jira/browse/SPARK-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Fontana updated SPARK-3206: - Description: I have found a small example where the PageRank values using run and runUntilConvergence differ quite a bit. I am running the Pagerank module on the following graph: Edge Table: | |Node1|||Node2 | | |1 | 2 | |1 |3| |3 |2| |3 |4| |5 |3| |6 |7| |7 |8| |8 |9| |9 |7| Node Table (note the extra node): || NodeID || NodeName || |a |1| |b |2| |c |3| |d |4| |e |5| |f |6| |g |7| |h |8| |i |9| |j.longaddress.com |10| with a default resetProb of 0.15. When I compute the pageRank with runUntilConvergence, running {{ val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}} I get the ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,1.3299054047985106) (9,1.2381240056453071) (8,1.2803346052504254) (10,0.15) (5,0.15) (2,0.358781244) However, when I run page Rank with the run() method, running {{val ranksI = PageRank.run(graph,100).vertices}} I get the page ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,0.999387662847) (9,0.999256447741) (8,0.999256447741) (10,0.15) (5,0.15) (2,0.295031247) These are quite different, leading me to suspect that one of the PageRank methods is incorrect. I have examined the source, but I do not know what the correct fix is, or which set of values is correct. was: I have found a small example where the PageRank values using run and runUntilConvergence differ quite a bit. I am running the Pagerank module on the following graph: Edge Table: | Node1 | Node2 | |1 | 2 | |1 |3| |3 |2| |3 |4| |5 |3| |6 |7| |7 |8| |8 |9| |9 |7| Node Table (note the extra node): | NodeID | NodeName | |a |1| |b |2| |c |3| |d |4| |e |5| |f |6| |g |7| |h |8| |i |9| |j.longaddress.com |10| with a default resetProb of 0.15. When I compute the pageRank with runUntilConvergence, running ``` val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices ``` I get the ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,1.3299054047985106) (9,1.2381240056453071) (8,1.2803346052504254) (10,0.15) (5,0.15) (2,0.358781244) However, when I run page Rank with the run() method, running val ranksI = PageRank.run(graph,100).vertices I get the page ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,0.999387662847) (9,0.999256447741) (8,0.999256447741) (10,0.15) (5,0.15) (2,0.295031247) These are quite different, leading me to suspect that one of the PageRank methods is incorrect. I have examined the source, but I do not know what the correct fix is, or which set of values is correct. Error in PageRank values Key: SPARK-3206 URL: https://issues.apache.org/jira/browse/SPARK-3206 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.2 Environment: UNIX with Hadoop Reporter: Peter Fontana I have found a small example where the PageRank values using run and runUntilConvergence differ quite a bit. I am running the Pagerank module on the following graph: Edge Table: | |Node1|||Node2 | | |1 | 2 | |1 | 3| |3 | 2| |3 | 4| |5 | 3| |6 | 7| |7 | 8| |8 | 9| |9 | 7| Node Table (note the extra node): || NodeID || NodeName || |a | 1| |b | 2| |c | 3| |d | 4| |e | 5| |f | 6| |g | 7| |h | 8| |i | 9| |j.longaddress.com | 10| with a default resetProb of 0.15. 
When I compute the pageRank with runUntilConvergence, running {{ val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}} I get the ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,1.3299054047985106) (9,1.2381240056453071) (8,1.2803346052504254) (10,0.15) (5,0.15) (2,0.358781244) However, when I run page Rank with the run() method, running {{val ranksI = PageRank.run(graph,100).vertices}} I get the page ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,0.999387662847) (9,0.999256447741) (8,0.999256447741) (10,0.15) (5,0.15) (2,0.295031247) These are quite different, leading me to suspect that one of the PageRank methods is incorrect. I have examined the source, but I do not know what the correct fix is, or which set of values is correct. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3206) Error in PageRank values
[ https://issues.apache.org/jira/browse/SPARK-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Fontana updated SPARK-3206: - Description: I have found a small example where the PageRank values using run and runUntilConvergence differ quite a bit. I am running the Pagerank module on the following graph: Edge Table: || Node1 || Node2 || |1 | 2 | |1 |3| |3 |2| |3 |4| |5 |3| |6 |7| |7 |8| |8 |9| |9 |7| Node Table (note the extra node): || NodeID || NodeName || |a |1| |b |2| |c |3| |d |4| |e |5| |f |6| |g |7| |h |8| |i |9| |j.longaddress.com |10| with a default resetProb of 0.15. When I compute the pageRank with runUntilConvergence, running {{val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}} I get the ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,1.3299054047985106) (9,1.2381240056453071) (8,1.2803346052504254) (10,0.15) (5,0.15) (2,0.358781244) However, when I run page Rank with the run() method, running {{val ranksI = PageRank.run(graph,100).vertices}} I get the page ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,0.999387662847) (9,0.999256447741) (8,0.999256447741) (10,0.15) (5,0.15) (2,0.295031247) These are quite different, leading me to suspect that one of the PageRank methods is incorrect. I have examined the source, but I do not know what the correct fix is, or which set of values is correct. was: I have found a small example where the PageRank values using run and runUntilConvergence differ quite a bit. I am running the Pagerank module on the following graph: Edge Table: | |Node1|||Node2 | | |1 | 2 | |1 |3| |3 |2| |3 |4| |5 |3| |6 |7| |7 |8| |8 |9| |9 |7| Node Table (note the extra node): || NodeID || NodeName || |a |1| |b |2| |c |3| |d |4| |e |5| |f |6| |g |7| |h |8| |i |9| |j.longaddress.com |10| with a default resetProb of 0.15. When I compute the pageRank with runUntilConvergence, running {{val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}} I get the ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,1.3299054047985106) (9,1.2381240056453071) (8,1.2803346052504254) (10,0.15) (5,0.15) (2,0.358781244) However, when I run page Rank with the run() method, running {{val ranksI = PageRank.run(graph,100).vertices}} I get the page ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,0.999387662847) (9,0.999256447741) (8,0.999256447741) (10,0.15) (5,0.15) (2,0.295031247) These are quite different, leading me to suspect that one of the PageRank methods is incorrect. I have examined the source, but I do not know what the correct fix is, or which set of values is correct. Error in PageRank values Key: SPARK-3206 URL: https://issues.apache.org/jira/browse/SPARK-3206 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.2 Environment: UNIX with Hadoop Reporter: Peter Fontana I have found a small example where the PageRank values using run and runUntilConvergence differ quite a bit. I am running the Pagerank module on the following graph: Edge Table: || Node1 || Node2 || |1 | 2 | |1 | 3| |3 | 2| |3 | 4| |5 | 3| |6 | 7| |7 | 8| |8 | 9| |9 | 7| Node Table (note the extra node): || NodeID || NodeName || |a | 1| |b | 2| |c | 3| |d | 4| |e | 5| |f | 6| |g | 7| |h | 8| |i | 9| |j.longaddress.com | 10| with a default resetProb of 0.15. 
When I compute the PageRank with runUntilConvergence, running {{val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}} I get the ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,1.3299054047985106) (9,1.2381240056453071) (8,1.2803346052504254) (10,0.15) (5,0.15) (2,0.358781244) However, when I run PageRank with the run() method, running {{val ranksI = PageRank.run(graph,100).vertices}} I get the ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,0.999387662847) (9,0.999256447741) (8,0.999256447741) (10,0.15) (5,0.15) (2,0.295031247) These are quite different, leading me to suspect that one of the PageRank methods is incorrect. I have examined the source, but I do not know what the correct fix is, or which set of values is correct. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
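One way to get an independent reference for this small graph is a plain-Scala power iteration using GraphX's update rule, rank(v) = resetProb + (1 - resetProb) * sum over in-edges (u, v) of rank(u) / outDegree(u). This sketch does not say which GraphX method is correct; it only provides a third data point to compare against:
{code:title=ReferencePageRank.scala|borderStyle=solid}
val edges = Seq(1 -> 2, 1 -> 3, 3 -> 2, 3 -> 4, 5 -> 3, 6 -> 7, 7 -> 8, 8 -> 9, 9 -> 7)
val nodes = 1 to 10
val resetProb = 0.15
val outDeg = edges.groupBy(_._1).map { case (src, es) => src -> es.size }
var rank = nodes.map(_ -> 1.0).toMap
for (_ <- 1 to 100) { // fixed-point iteration of the update rule above
  val contrib = edges
    .map { case (src, dst) => dst -> rank(src) / outDeg(src) }
    .groupBy(_._1)
    .map { case (dst, cs) => dst -> cs.map(_._2).sum }
  rank = nodes.map(n => n -> (resetProb + (1 - resetProb) * contrib.getOrElse(n, 0.0))).toMap
}
rank.toSeq.sortBy(_._1).foreach(println)
{code}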
[jira] [Commented] (SPARK-2189) Method for removing temp tables created by registerAsTable
[ https://issues.apache.org/jira/browse/SPARK-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109378#comment-14109378 ] Michael Armbrust commented on SPARK-2189: - Thanks for offering to work on this. Can you briefly describe what you plan to do here? I think there are some subtle interface questions at the moment due to the way we handle cached tables vs temporary tables. Specifically, what happens when you cache a table and then call unregisterTempTable(cachedTableName). Method for removing temp tables created by registerAsTable -- Key: SPARK-2189 URL: https://issues.apache.org/jira/browse/SPARK-2189 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
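To spell out the scenario being asked about, using the Spark 1.0-era API (unregisterTempTable is the proposed method and does not exist yet, so it is left commented):
{code:title=CachedTempTable.scala|borderStyle=solid}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

case class Person(name: String)
val people = sc.parallelize(Seq(Person("alice")))
people.registerAsTable("people")
sqlContext.cacheTable("people")
// Proposed API under discussion -- what should this do to the cached entry?
// sqlContext.unregisterTempTable("people")
{code}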
[jira] [Created] (SPARK-3207) Choose splits for continuous features in DecisionTree more adaptively
Joseph K. Bradley created SPARK-3207: Summary: Choose splits for continuous features in DecisionTree more adaptively Key: SPARK-3207 URL: https://issues.apache.org/jira/browse/SPARK-3207 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Minor DecisionTree splits on continuous features by choosing an array of values from a subsample of the data. Currently, it does not check for identical values in the subsample, so it could end up having multiple copies of the same split. This is not an error, but it could be improved to be more adaptive to the data. Proposal: In findSplitsBins, check for identical values, and do some searching in order to find a set of unique splits. Reduce the number of splits if there are not enough unique candidates. This would require modifying findSplitsBins and making sure that the number of splits/bins (chosen adaptively) is set correctly elsewhere in the code (such as in DecisionTreeMetadata). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
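A minimal sketch of the adaptive selection the proposal describes (an illustration of the idea, not the DecisionTree code itself): pick candidate thresholds from quantiles of the subsample, then deduplicate, so repeated feature values yield fewer (but unique) splits.
{code:title=AdaptiveSplits.scala|borderStyle=solid}
def chooseSplits(sample: Array[Double], maxSplits: Int): Array[Double] = {
  val sorted = sample.sorted
  // candidate thresholds at evenly spaced quantiles of the subsample
  val candidates = (1 until maxSplits).map(i => sorted((i * sorted.length) / maxSplits))
  // identical values collapse to one split; fewer unique candidates => fewer splits
  candidates.distinct.toArray
}
{code}
The caller (DecisionTreeMetadata, in the proposal's terms) would then record the possibly reduced split count per feature.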
[jira] [Commented] (SPARK-3147) Implement A/B testing
[ https://issues.apache.org/jira/browse/SPARK-3147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109402#comment-14109402 ] Michael Yannakopoulos commented on SPARK-3147: -- Hi Xiangrui, It would be my pleasure to help in the implementation of this task. Not only would it enhance my coding skills, but it would also help me better learn the theory behind the statistical tests involved. If you have time and you would like to work together, I would be glad. Thanks, Michael Implement A/B testing - Key: SPARK-3147 URL: https://issues.apache.org/jira/browse/SPARK-3147 Project: Spark Issue Type: New Feature Components: MLlib, Streaming Reporter: Xiangrui Meng A/B testing is widely used to compare online models. We can implement A/B testing in MLlib and integrate it with Spark Streaming. For example, suppose we have a PairDStream[String, Double] whose keys are model ids and values are observations (click or not, or revenue associated with the event). With A/B testing, we can tell whether one model is significantly better than another at a certain time. There are some caveats. For example, we should avoid multiple testing and support A/A testing as a sanity check. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
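For a flavor of the statistics such a feature would need per batch, a minimal sketch of a two-sample comparison (an illustration using Welch's t statistic; this is not a design from the ticket):
{code}
// Running summary of one model's observations (clicks, revenue, ...).
case class Summary(n: Long, mean: Double, m2: Double) {
  def variance: Double = m2 / (n - 1)  // sample variance from the accumulated M2
}

// Welch's t statistic for "is model A's mean observably different from model B's?"
def welchT(a: Summary, b: Summary): Double = {
  val standardError = math.sqrt(a.variance / a.n + b.variance / b.n)
  (a.mean - b.mean) / standardError
}
{code}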
[jira] [Commented] (SPARK-3184) Allow user to specify num tasks to use for a table
[ https://issues.apache.org/jira/browse/SPARK-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109439#comment-14109439 ] Michael Armbrust commented on SPARK-3184: - Yeah, it looks like it is actually implemented. Though it would be nice for us to have a real way to do it (instead of hijacking Hive's way) and to also print a deprecation warning when using the Hive way. For those reasons I think we can leave this open but decrease the priority. Allow user to specify num tasks to use for a table -- Key: SPARK-3184 URL: https://issues.apache.org/jira/browse/SPARK-3184 Project: Spark Issue Type: Improvement Components: SQL Reporter: Andy Konwinski -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3184) Allow user to specify num tasks to use for a table
[ https://issues.apache.org/jira/browse/SPARK-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3184: Priority: Minor (was: Major) Allow user to specify num tasks to use for a table -- Key: SPARK-3184 URL: https://issues.apache.org/jira/browse/SPARK-3184 Project: Spark Issue Type: Improvement Components: SQL Reporter: Andy Konwinski Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3208) Hive Parquet SerDe returns null columns
Michael Armbrust created SPARK-3208: --- Summary: Hive Parquet SerDe returns null columns Key: SPARK-3208 URL: https://issues.apache.org/jira/browse/SPARK-3208 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Michael Armbrust Priority: Minor There is a workaround, which is to set 'spark.sql.hive.convertMetastoreParquet=true'. However, it would still be good to figure out what is going on here. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
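For reference, the workaround above can be applied per session, roughly like this (assuming a HiveContext bound to {{hiveContext}}):
{code}
// Route metastore Parquet tables through Spark SQL's native Parquet support
// instead of the Hive SerDe that is returning null columns.
hiveContext.sql("SET spark.sql.hive.convertMetastoreParquet=true")
{code}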
[jira] [Created] (SPARK-3209) bump the version in banner
Davies Liu created SPARK-3209: - Summary: bump the version in banner Key: SPARK-3209 URL: https://issues.apache.org/jira/browse/SPARK-3209 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Davies Liu Priority: Blocker daviesliu@dm:~/work/spark$ ../spark/bin/spark-shell Welcome to [Spark ASCII-art banner] version 1.0.0-SNAPSHOT daviesliu@dm:~/work/spark$ ./bin/pyspark Python 2.7.5 (default, Mar 9 2014, 22:15:05) [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin Type help, copyright, credits or license for more information. Welcome to [Spark ASCII-art banner] version 1.0.0-SNAPSHOT Using Python version 2.7.5 (default, Mar 9 2014 22:15:05) SparkContext available as sc. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3140) PySpark start-up throws confusing exception
[ https://issues.apache.org/jira/browse/SPARK-3140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3140. --- Resolution: Fixed Assignee: Andrew Or PySpark start-up throws confusing exception --- Key: SPARK-3140 URL: https://issues.apache.org/jira/browse/SPARK-3140 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.2 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Fix For: 1.1.0 Currently we read the pyspark port through stdout of the spark-submit subprocess. However, if there is stdout interference, e.g. spark-submit echoes something unexpected to stdout, we print the following: {code} Exception: Launching GatewayServer failed! (Warning: unexpected output detected.) {code} This condition is fine. However, we actually throw the same exception if there is *no* output from the subprocess as well. This is very confusing because it implies that the subprocess is outputting something (possibly whitespace, which is not visible) when it's actually not. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1764) EOF reached before Python server acknowledged
[ https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109512#comment-14109512 ] Davies Liu commented on SPARK-1764: --- This issue should be fixed in SPARK-2282 [1]; I ran the jobs above against mesos-0.19.1 for more than an hour without problems. [~therealnb] Could you also verify this? [1] https://github.com/apache/spark/commit/ef4ff00f87a4e8d38866f163f01741c2673e41da EOF reached before Python server acknowledged - Key: SPARK-1764 URL: https://issues.apache.org/jira/browse/SPARK-1764 Project: Spark Issue Type: Bug Components: Mesos, PySpark Affects Versions: 1.0.0 Reporter: Bouke van der Bijl Priority: Blocker Labels: mesos, pyspark I'm getting EOF reached before Python server acknowledged while using PySpark on Mesos. The error manifests itself in multiple ways. One is: 14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed due to the error EOF reached before Python server acknowledged; shutting down SparkContext And the other has a full stacktrace: 14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server acknowledged org.apache.spark.SparkException: EOF reached before Python server acknowledged at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416) at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387) at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.Accumulators$.add(Accumulators.scala:277) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This error causes the SparkContext to shut down. I have not been able to reliably reproduce this bug; it seems to happen randomly, but if you run enough tasks on a SparkContext it'll happen eventually. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3210) Flume Polling Receiver must be more tolerant to connection failures.
Hari Shreedharan created SPARK-3210: --- Summary: Flume Polling Receiver must be more tolerant to connection failures. Key: SPARK-3210 URL: https://issues.apache.org/jira/browse/SPARK-3210 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3209) bump the version in banner
[ https://issues.apache.org/jira/browse/SPARK-3209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-3209. - Resolution: Invalid The version number is correct in branch-1.1. bump the version in banner -- Key: SPARK-3209 URL: https://issues.apache.org/jira/browse/SPARK-3209 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Davies Liu Priority: Blocker daviesliu@dm:~/work/spark$ ../spark/bin/spark-shell Welcome to [Spark ASCII-art banner] version 1.0.0-SNAPSHOT daviesliu@dm:~/work/spark$ ./bin/pyspark Python 2.7.5 (default, Mar 9 2014, 22:15:05) [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin Type help, copyright, credits or license for more information. Welcome to [Spark ASCII-art banner] version 1.0.0-SNAPSHOT Using Python version 2.7.5 (default, Mar 9 2014 22:15:05) SparkContext available as sc. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2840) Improve documentation for decision tree
[ https://issues.apache.org/jira/browse/SPARK-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2840. --- Resolution: Fixed Fix Version/s: 1.1.0 Improve documentation for decision tree --- Key: SPARK-2840 URL: https://issues.apache.org/jira/browse/SPARK-2840 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Xiangrui Meng Assignee: Joseph K. Bradley Fix For: 1.1.0 1. add code examples for Python/Java 2. add documentation for multiclass classification -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3044) Create RSS feed for Spark News
[ https://issues.apache.org/jira/browse/SPARK-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109571#comment-14109571 ] Michael Yannakopoulos commented on SPARK-3044: -- Hi Nicholas, I am really interested in working on this issue. Do you know where I can find the source code of the official [Apache Spark site|http://spark.apache.org]? Thanks, Michael Create RSS feed for Spark News -- Key: SPARK-3044 URL: https://issues.apache.org/jira/browse/SPARK-3044 Project: Spark Issue Type: Documentation Reporter: Nicholas Chammas Priority: Minor Project updates are often posted here: http://spark.apache.org/news/ Currently, there is no way to subscribe to a feed of these updates. It would be nice if there were a way for people to be notified of new posts there without having to check manually. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3211) .take() is OOM-prone when there are empty partitions
Andrew Ash created SPARK-3211: - Summary: .take() is OOM-prone when there are empty partitions Key: SPARK-3211 URL: https://issues.apache.org/jira/browse/SPARK-3211 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Andrew Ash Filed on dev@ on 22 August by [~pnepywoda]: {quote} On line 777 https://github.com/apache/spark/commit/42571d30d0d518e69eecf468075e4c5a823a2ae8#diff-1d55e54678eff2076263f2fe36150c17R771 the logic for take() reads ALL partitions if the first one (or first k) are empty. This has actually led to OOMs when we had many partitions (thousands) and unfortunately the first one was empty. Wouldn't a better implementation strategy be numPartsToTry = partsScanned * 2 instead of numPartsToTry = totalParts - 1 (this doubling is similar to most memory allocation strategies) Thanks! - Paul {quote} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
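A compact sketch of the growth policy Paul is suggesting (illustrative only; the real logic lives inside RDD.take):
{code}
// Instead of jumping from "the first partitions were empty" straight to
// scanning all remaining partitions, grow the scan window geometrically.
def nextPartsToTry(partsScanned: Int, totalParts: Int): Int = {
  val doubled = math.max(1, partsScanned * 2)
  math.min(doubled, totalParts - partsScanned)  // never overshoot the end
}
{code}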
[jira] [Commented] (SPARK-1764) EOF reached before Python server acknowledged
[ https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109583#comment-14109583 ] nigel commented on SPARK-1764: -- Hi; Sadly I moved jobs and I don't have a working Spark environment at the moment (I will be doing some Spark work soon :-). I'll pass this on to the guys that are still there and get them to confirm. Cheers EOF reached before Python server acknowledged - Key: SPARK-1764 URL: https://issues.apache.org/jira/browse/SPARK-1764 Project: Spark Issue Type: Bug Components: Mesos, PySpark Affects Versions: 1.0.0 Reporter: Bouke van der Bijl Assignee: Davies Liu Priority: Blocker Labels: mesos, pyspark I'm getting EOF reached before Python server acknowledged while using PySpark on Mesos. The error manifests itself in multiple ways. One is: 14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed due to the error EOF reached before Python server acknowledged; shutting down SparkContext And the other has a full stacktrace: 14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server acknowledged org.apache.spark.SparkException: EOF reached before Python server acknowledged at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416) at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387) at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.Accumulators$.add(Accumulators.scala:277) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This error causes the SparkContext to shut down. I have not been able to reliably reproduce this bug; it seems to happen randomly, but if you run enough tasks on a SparkContext it'll happen eventually. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2087) Clean Multi-user semantics for thrift JDBC/ODBC server.
[ https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109587#comment-14109587 ] Michael Armbrust commented on SPARK-2087: - You can't make temporary tables yet, but you will be able to when we add the CACHE TABLE ... AS SELECT... syntax https://issues.apache.org/jira/browse/SPARK-2594. Clean Multi-user semantics for thrift JDBC/ODBC server. --- Key: SPARK-2087 URL: https://issues.apache.org/jira/browse/SPARK-2087 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Zongheng Yang Priority: Minor Configuration and temporary tables should exist per-user. Cached tables should be shared across users. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
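Once that lands, the intended usage would look roughly like this (the table names here are hypothetical; the exact syntax is whatever SPARK-2594 settles on):
{code}
// Creates a cached temporary table from a query in one step.
sqlContext.sql("CACHE TABLE recent_logs AS SELECT * FROM logs WHERE day >= '2014-08-01'")
{code}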
[jira] [Commented] (SPARK-3211) .take() is OOM-prone when there are empty partitions
[ https://issues.apache.org/jira/browse/SPARK-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109591#comment-14109591 ] Apache Spark commented on SPARK-3211: - User 'ash211' has created a pull request for this issue: https://github.com/apache/spark/pull/2117 .take() is OOM-prone when there are empty partitions Key: SPARK-3211 URL: https://issues.apache.org/jira/browse/SPARK-3211 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Andrew Ash Filed on dev@ on 22 August by [~pnepywoda]: {quote} On line 777 https://github.com/apache/spark/commit/42571d30d0d518e69eecf468075e4c5a823a2ae8#diff-1d55e54678eff2076263f2fe36150c17R771 the logic for take() reads ALL partitions if the first one (or first k) are empty. This has actually led to OOMs when we had many partitions (thousands) and unfortunately the first one was empty. Wouldn't a better implementation strategy be numPartsToTry = partsScanned * 2 instead of numPartsToTry = totalParts - 1 (this doubling is similar to most memory allocation strategies) Thanks! - Paul {quote} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3205) input format for text records saved with in-record delimiter and newline characters escaped
[ https://issues.apache.org/jira/browse/SPARK-3205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109592#comment-14109592 ] Apache Spark commented on SPARK-3205: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/2118 input format for text records saved with in-record delimiter and newline characters escaped --- Key: SPARK-3205 URL: https://issues.apache.org/jira/browse/SPARK-3205 Project: Spark Issue Type: New Feature Components: Spark Core, SQL Reporter: Xiangrui Meng Assignee: Xiangrui Meng Text records may contain in-record delimiters or newline characters. In such cases, we can either encode them or escape them. The latter is simpler and is used by Redshift's UNLOAD with the ESCAPE option. The problem is that a record will then span multiple lines. We need an input format for it. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
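To illustrate why a plain line-oriented reader is not enough, here is a sketch of the unescaping scan such an input format has to perform (a simplified single-buffer version; the hard part in practice is doing this across split boundaries):
{code}
// Split a buffer into records on unescaped newlines; delimiters, newlines,
// and the escape character itself arrive backslash-escaped (Redshift ESCAPE style).
def splitEscapedRecords(buf: String): Seq[String] = {
  val records = scala.collection.mutable.ArrayBuffer.empty[String]
  val current = new StringBuilder
  var i = 0
  while (i < buf.length) {
    buf(i) match {
      case '\\' if i + 1 < buf.length => current += buf(i + 1); i += 2  // unescape
      case '\n' => records += current.toString(); current.clear(); i += 1
      case c => current += c; i += 1
    }
  }
  if (current.nonEmpty) records += current.toString()
  records
}
{code}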
[jira] [Resolved] (SPARK-2495) Ability to re-create ML models
[ https://issues.apache.org/jira/browse/SPARK-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2495. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 2112 [https://github.com/apache/spark/pull/2112] Ability to re-create ML models -- Key: SPARK-2495 URL: https://issues.apache.org/jira/browse/SPARK-2495 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.1 Reporter: Alexander Albul Assignee: Alexander Albul Fix For: 1.1.0 Hi everyone. Previously (prior to Spark 1.0) we were working with MLlib like this: 1) Calculate a model (costly operation) 2) Take the model and collect its fields like weights, intercept, etc. 3) Store the model somewhere in our own format 4) Do predictions by loading the model attributes, creating a new model, and predicting with it. Now I see that the models' constructors have a *private* modifier and cannot be invoked from outside. If you want to hide implementation details and keep these constructors as developer API, why not at least create a method that takes weights and an intercept (for example) and materializes that model? A good example of the model I am talking about is *LinearRegressionModel*. I know that the *LinearRegressionWithSGD* class has a *createModel* method, but the problem is that it has a *protected* modifier as well. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
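With this resolved for 1.1.0, the store-and-recreate flow described above becomes possible; a minimal sketch (assuming the now-public constructor that this ticket's pull request exposes):
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionModel

// Rebuild a model from externally stored attributes instead of retraining.
val weights = Vectors.dense(0.5, -0.3)  // loaded from your own storage format
val intercept = 0.1
val model = new LinearRegressionModel(weights, intercept)
val prediction = model.predict(Vectors.dense(1.0, 2.0))
{code}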
[jira] [Resolved] (SPARK-2798) Correct several small errors in Flume module pom.xml files
[ https://issues.apache.org/jira/browse/SPARK-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-2798. -- Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Sean Owen Correct several small errors in Flume module pom.xml files -- Key: SPARK-2798 URL: https://issues.apache.org/jira/browse/SPARK-2798 Project: Spark Issue Type: Bug Components: Build Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.1.0 (EDIT) The scalatest issue has since been resolved, so this is now about a few small problems in the Flume Sink pom.xml: - scalatest is not declared as a test-scope dependency - Its Avro version doesn't match the rest of the build - Its Flume version is not synced with the other Flume module - The other Flume module declares its dependency on Flume Sink slightly incorrectly, hard-coding the Scala 2.10 version - It depends on Scala Lang directly, which it shouldn't -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2798) Correct several small errors in Flume module pom.xml files
[ https://issues.apache.org/jira/browse/SPARK-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2798: - Affects Version/s: (was: 1.0.1) Correct several small errors in Flume module pom.xml files -- Key: SPARK-2798 URL: https://issues.apache.org/jira/browse/SPARK-2798 Project: Spark Issue Type: Bug Components: Build Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.1.0 (EDIT) The scalatest issue has since been resolved, so this is now about a few small problems in the Flume Sink pom.xml: - scalatest is not declared as a test-scope dependency - Its Avro version doesn't match the rest of the build - Its Flume version is not synced with the other Flume module - The other Flume module declares its dependency on Flume Sink slightly incorrectly, hard-coding the Scala 2.10 version - It depends on Scala Lang directly, which it shouldn't -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2798) Correct several small errors in Flume module pom.xml files
[ https://issues.apache.org/jira/browse/SPARK-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109675#comment-14109675 ] Sean Owen commented on SPARK-2798: -- [~tdas] Cool, I think this closes SPARK-3169 too, if I understand correctly. Correct several small errors in Flume module pom.xml files -- Key: SPARK-2798 URL: https://issues.apache.org/jira/browse/SPARK-2798 Project: Spark Issue Type: Bug Components: Build Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.1.0 (EDIT) The scalatest issue has since been resolved, so this is now about a few small problems in the Flume Sink pom.xml: - scalatest is not declared as a test-scope dependency - Its Avro version doesn't match the rest of the build - Its Flume version is not synced with the other Flume module - The other Flume module declares its dependency on Flume Sink slightly incorrectly, hard-coding the Scala 2.10 version - It depends on Scala Lang directly, which it shouldn't -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3044) Create RSS feed for Spark News
[ https://issues.apache.org/jira/browse/SPARK-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109677#comment-14109677 ] Nicholas Chammas commented on SPARK-3044: - Hi Michael, I don't know if the site itself is open-source. We might need someone from Databricks to update it. [~pwendell], [~rxin] - Is it possible for contributors to contribute to the [main Spark site|http://spark.apache.org/]? Create RSS feed for Spark News -- Key: SPARK-3044 URL: https://issues.apache.org/jira/browse/SPARK-3044 Project: Spark Issue Type: Documentation Reporter: Nicholas Chammas Priority: Minor Project updates are often posted here: http://spark.apache.org/news/ Currently, there is no way to subscribe to a feed of these updates. It would be nice if there were a way for people to be notified of new posts there without having to check manually. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3213) spark_ec2.py cannot find slave instances
Joseph K. Bradley created SPARK-3213: Summary: spark_ec2.py cannot find slave instances Key: SPARK-3213 URL: https://issues.apache.org/jira/browse/SPARK-3213 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Priority: Blocker spark_ec2.py cannot find all slave instances. In particular: * I created a master and a slave and configured them. * I created new slave instances from the original slave (Launch More Like This). * I tried to relaunch the cluster, and it could only find the original slave. Old versions of the script worked. The latest working commit which edited that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1 There may be a problem with this PR: [https://github.com/apache/spark/pull/1899]. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3213) spark_ec2.py cannot find slave instances
[ https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109697#comment-14109697 ] Joseph K. Bradley commented on SPARK-3213: -- [~vidaha] Please take a look. Thanks! spark_ec2.py cannot find slave instances Key: SPARK-3213 URL: https://issues.apache.org/jira/browse/SPARK-3213 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Priority: Blocker spark_ec2.py cannot find all slave instances. In particular: * I created a master and a slave and configured them. * I created new slave instances from the original slave (Launch More Like This). * I tried to relaunch the cluster, and it could only find the original slave. Old versions of the script worked. The latest working commit which edited that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1 There may be a problem with this PR: [https://github.com/apache/spark/pull/1899]. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3213) spark_ec2.py cannot find slave instances
[ https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109700#comment-14109700 ] Joseph K. Bradley commented on SPARK-3213: -- The security group name I was using was joseph-r3.2xlarge-slaves. It may be a regex/matching issue. spark_ec2.py cannot find slave instances Key: SPARK-3213 URL: https://issues.apache.org/jira/browse/SPARK-3213 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Priority: Blocker spark_ec2.py cannot find all slave instances. In particular: * I created a master and a slave and configured them. * I created new slave instances from the original slave (Launch More Like This). * I tried to relaunch the cluster, and it could only find the original slave. Old versions of the script worked. The latest working commit which edited that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1 There may be a problem with this PR: [https://github.com/apache/spark/pull/1899]. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3156) DecisionTree: Order categorical features adaptively
[ https://issues.apache.org/jira/browse/SPARK-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3156: - Assignee: Joseph K. Bradley DecisionTree: Order categorical features adaptively --- Key: SPARK-3156 URL: https://issues.apache.org/jira/browse/SPARK-3156 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Improvement: accuracy Currently, ordered categorical features use a fixed bin ordering chosen before training based on a subsample of the data. (See the code using centroids in findSplitsBins().) Proposal: Choose the ordering adaptively for every split. This would require a bit more computation on the master, but could improve results by splitting more intelligently. Required changes: The result of aggregation is used in findAggForOrderedFeatureClassification() to compute running totals over the pre-set ordering of categorical feature values. The stats should instead be used to choose a new ordering of categories, before computing running totals. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
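A tiny sketch of the adaptive-ordering idea (an illustration with invented names; the real change would live in the aggregation path described above): order the categories by their average label at the current node, then treat them as an ordered feature.
{code}
// stats: category value -> (sum of labels, sample count) at the current node.
// Sorting by mean label gives the ordering under which contiguous splits are
// optimal for binary classification and regression.
def adaptiveCategoryOrder(stats: Map[Int, (Double, Long)]): Seq[Int] =
  stats.toSeq
    .sortBy { case (_, (labelSum, count)) => labelSum / count }
    .map(_._1)
{code}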
[jira] [Resolved] (SPARK-3180) Better control of security groups
[ https://issues.apache.org/jira/browse/SPARK-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3180. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 2088 [https://github.com/apache/spark/pull/2088] Better control of security groups - Key: SPARK-3180 URL: https://issues.apache.org/jira/browse/SPARK-3180 Project: Spark Issue Type: Improvement Reporter: Allan Douglas R. de Oliveira Fix For: 1.3.0 Two features can be combined together to provide better control of security group policies: - The ability to specify the address authorized to access the default security group (instead of allowing everyone: 0.0.0.0/0) - The possibility to place the created machines in a custom security group One can use combinations of the two flags to restrict external access to the provided security group (e.g. by setting the authorized address to 127.0.0.1/32) while maintaining compatibility with the current behavior. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3214) Argument parsing loop in make-distribution.sh ends prematurely
Cheng Lian created SPARK-3214: - Summary: Argument parsing loop in make-distribution.sh ends prematurely Key: SPARK-3214 URL: https://issues.apache.org/jira/browse/SPARK-3214 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Cheng Lian Priority: Minor Running {{make-distribution.sh}} in this way: {code} ./make-distribution.sh --hadoop -Pyarn {code} results in a proper error message: {code} Error: '--hadoop' is no longer supported: Error: use Maven options -Phadoop.version and -Pyarn.version {code} But if you run it with the options in reverse order, it just passes: {code} ./make-distribution.sh -Pyarn --hadoop {code} The reason is that the {{while}} loop ends prematurely, before checking all potentially deprecated command line options. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2798) Correct several small errors in Flume module pom.xml files
[ https://issues.apache.org/jira/browse/SPARK-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109735#comment-14109735 ] Tathagata Das commented on SPARK-2798: -- Naah, that was already closed by the fix I did on Friday (https://github.com/apache/spark/pull/2101). Maven, and therefore make-distribution, should work fine with that fix. Correct several small errors in Flume module pom.xml files -- Key: SPARK-2798 URL: https://issues.apache.org/jira/browse/SPARK-2798 Project: Spark Issue Type: Bug Components: Build Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.1.0 (EDIT) The scalatest issue has since been resolved, so this is now about a few small problems in the Flume Sink pom.xml: - scalatest is not declared as a test-scope dependency - Its Avro version doesn't match the rest of the build - Its Flume version is not synced with the other Flume module - The other Flume module declares its dependency on Flume Sink slightly incorrectly, hard-coding the Scala 2.10 version - It depends on Scala Lang directly, which it shouldn't -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3214) Argument parsing loop in make-distribution.sh ends prematurely
[ https://issues.apache.org/jira/browse/SPARK-3214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109765#comment-14109765 ] Cheng Lian commented on SPARK-3214: --- Didn't realize all Maven options must go after other {{make-distribution.sh}} options. Closing this. Argument parsing loop in make-distribution.sh ends prematurely -- Key: SPARK-3214 URL: https://issues.apache.org/jira/browse/SPARK-3214 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Cheng Lian Priority: Minor Running {{make-distribution.sh}} in this way: {code} ./make-distribution.sh --hadoop -Pyarn {code} results in a proper error message: {code} Error: '--hadoop' is no longer supported: Error: use Maven options -Phadoop.version and -Pyarn.version {code} But if you run it with the options in reverse order, it just passes: {code} ./make-distribution.sh -Pyarn --hadoop {code} The reason is that the {{while}} loop ends prematurely, before checking all potentially deprecated command line options. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3214) Argument parsing loop in make-distribution.sh ends prematurely
[ https://issues.apache.org/jira/browse/SPARK-3214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian closed SPARK-3214. - Resolution: Not a Problem Argument parsing loop in make-distribution.sh ends prematurely -- Key: SPARK-3214 URL: https://issues.apache.org/jira/browse/SPARK-3214 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Cheng Lian Priority: Minor Running {{make-distribution.sh}} in this way: {code} ./make-distribution.sh --hadoop -Pyarn {code} results in a proper error message: {code} Error: '--hadoop' is no longer supported: Error: use Maven options -Phadoop.version and -Pyarn.version {code} But if you run it with the options in reverse order, it just passes: {code} ./make-distribution.sh -Pyarn --hadoop {code} The reason is that the {{while}} loop ends prematurely, before checking all potentially deprecated command line options. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3213) spark_ec2.py cannot find slave instances
[ https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109818#comment-14109818 ] Vida Ha commented on SPARK-3213: Joseph, Josh, and I discussed in person. There is a quick workaround: 1) Use an old version of the spark_ec2 scripts that uses security groups to identify the slaves, if using Launch More Like This. But now I need to investigate: if using Launch More Like This, it does seem like Amazon tries to reuse the tags, but I'm wondering if it doesn't like having multiple machines with the same Name tag. I will try using a different tag, like spark-ec2-cluster-id or something like that, to identify the machine. If that tag does copy over, then we can properly support Launch More Like This. spark_ec2.py cannot find slave instances Key: SPARK-3213 URL: https://issues.apache.org/jira/browse/SPARK-3213 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Priority: Blocker spark_ec2.py cannot find all slave instances. In particular: * I created a master and a slave and configured them. * I created new slave instances from the original slave (Launch More Like This). * I tried to relaunch the cluster, and it could only find the original slave. Old versions of the script worked. The latest working commit which edited that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1 There may be a problem with this PR: [https://github.com/apache/spark/pull/1899]. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3213) spark_ec2.py cannot find slave instances
[ https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109818#comment-14109818 ] Vida Ha edited comment on SPARK-3213 at 8/25/14 9:57 PM: - Joseph, Josh, and I discussed in person. There are two quick workarounds: 1) Use an old version of the spark_ec2 scripts that uses security groups to identify the slaves, if using Launch More Like This. 2) Avoid using Launch More Like This. But now I need to investigate: if using Launch More Like This, it does seem like Amazon tries to reuse the tags, but I'm wondering if it doesn't like having multiple machines with the same Name tag. I will try using a different tag, like spark-ec2-cluster-id or something like that, to identify the machine. If that tag does copy over, then we can properly support Launch More Like This. was (Author: vidaha): Joseph, Josh, I discussed in person. There is a quick workarounds: 1) Use an old version of the spark_ec2 scripts that uses security groups to identify the slaves, if using Launch more like this But now I need to investigate: If using launch more like this, it does seem like amazon tries to reuse the tags, but I'm wondering if it doesn't like having multiple machines with the same Name tag. I will try using a different tag, like spark-ec2-cluster-id or something like that to identify the machine. If that tag does copy over, then we can properly support Launch more like this. spark_ec2.py cannot find slave instances Key: SPARK-3213 URL: https://issues.apache.org/jira/browse/SPARK-3213 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Priority: Blocker spark_ec2.py cannot find all slave instances. In particular: * I created a master and a slave and configured them. * I created new slave instances from the original slave (Launch More Like This). * I tried to relaunch the cluster, and it could only find the original slave. Old versions of the script worked. The latest working commit which edited that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1 There may be a problem with this PR: [https://github.com/apache/spark/pull/1899]. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3213) spark_ec2.py cannot find slave instances
[ https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109828#comment-14109828 ] Vida Ha commented on SPARK-3213: Can someone rename this issue to: spark_ec2.py cannot find slave instances launched with Launch More Like This? I think that's more indicative of the issue - it's not wider than that. spark_ec2.py cannot find slave instances Key: SPARK-3213 URL: https://issues.apache.org/jira/browse/SPARK-3213 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Priority: Blocker spark_ec2.py cannot find all slave instances. In particular: * I created a master and a slave and configured them. * I created new slave instances from the original slave (Launch More Like This). * I tried to relaunch the cluster, and it could only find the original slave. Old versions of the script worked. The latest working commit which edited that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1 There may be a problem with this PR: [https://github.com/apache/spark/pull/1899]. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3213) spark_ec2.py cannot find slave instances launched with Launch More Like This
[ https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3213: - Summary: spark_ec2.py cannot find slave instances launched with Launch More Like This (was: spark_ec2.py cannot find slave instances) spark_ec2.py cannot find slave instances launched with Launch More Like This -- Key: SPARK-3213 URL: https://issues.apache.org/jira/browse/SPARK-3213 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Priority: Blocker spark_ec2.py cannot find all slave instances. In particular: * I created a master and a slave and configured them. * I created new slave instances from the original slave (Launch More Like This). * I tried to relaunch the cluster, and it could only find the original slave. Old versions of the script worked. The latest working commit which edited that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1 There may be a problem with this PR: [https://github.com/apache/spark/pull/1899]. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3215) Add remote interface for SparkContext
Marcelo Vanzin created SPARK-3215: - Summary: Add remote interface for SparkContext Key: SPARK-3215 URL: https://issues.apache.org/jira/browse/SPARK-3215 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Marcelo Vanzin A quick description of the issue: as part of running Hive jobs on top of Spark, it's desirable to have a SparkContext that is running in the background and listening for job requests for a particular user session. Running multiple contexts in the same JVM is not a very good solution. Not only does SparkContext currently have issues sharing the same JVM among multiple instances, but it also turns the JVM running the contexts into a huge bottleneck in the system. So I'm proposing a solution where we have a SparkContext that is running in a separate process and listening for requests from the client application via some RPC interface (most probably Akka). I'll attach a document shortly with the current proposal. Let's use this bug to discuss the proposal and any other suggestions. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
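To make the shape of the proposal concrete, a hypothetical message vocabulary for such an RPC interface might look like this (all names are invented here, not taken from the attached document):
{code}
// Client -> remote context process
sealed trait RemoteContextRequest
case class SubmitJob(jobId: String, serializedJob: Array[Byte]) extends RemoteContextRequest
case class CancelJob(jobId: String) extends RemoteContextRequest

// Remote context process -> client
sealed trait RemoteContextResponse
case class JobSubmitted(jobId: String) extends RemoteContextResponse
case class JobFinished(jobId: String, result: Either[String, Array[Byte]]) extends RemoteContextResponse
{code}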
[jira] [Updated] (SPARK-3215) Add remote interface for SparkContext
[ https://issues.apache.org/jira/browse/SPARK-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-3215: -- Attachment: RemoteSparkContext.pdf Initial proposal for a remote context interface. Note that this is not a formal design document, just a high-level proposal, so it doesn't go deeply into what APIs would be exposed or anything like that. Add remote interface for SparkContext - Key: SPARK-3215 URL: https://issues.apache.org/jira/browse/SPARK-3215 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Marcelo Vanzin Labels: hive Attachments: RemoteSparkContext.pdf A quick description of the issue: as part of running Hive jobs on top of Spark, it's desirable to have a SparkContext that is running in the background and listening for job requests for a particular user session. Running multiple contexts in the same JVM is not a very good solution. Not only does SparkContext currently have issues sharing the same JVM among multiple instances, but it also turns the JVM running the contexts into a huge bottleneck in the system. So I'm proposing a solution where we have a SparkContext that is running in a separate process and listening for requests from the client application via some RPC interface (most probably Akka). I'll attach a document shortly with the current proposal. Let's use this bug to discuss the proposal and any other suggestions. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3216) Spark-shell is broken for branch-1.0
Andrew Or created SPARK-3216: Summary: Spark-shell is broken for branch-1.0 Key: SPARK-3216 URL: https://issues.apache.org/jira/browse/SPARK-3216 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Andrew Or Priority: Blocker This fails when EC2 tries to clone the most recent version of Spark from branch-1.0. I marked this a blocker because this is completely broken, but it is technically not blocking anything. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build
Cheng Lian created SPARK-3217: - Summary: Shaded Guava jar doesn't play well with Maven build Key: SPARK-3217 URL: https://issues.apache.org/jira/browse/SPARK-3217 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Cheng Lian Priority: Blocker PR [#1813|https://github.com/apache/spark/pull/1813] shaded the Guava jar file and moved Guava classes to package {{org.spark-project.guava}} when Spark is built by Maven. But code in {{org.apache.spark.util.Utils}} still refers to classes (e.g. {{ThreadFactoryBuilder}}) in package {{com.google.common}}. The result is that, when Spark is built with Maven (or {{make-distribution.sh}}), commands like {{bin/spark-shell}} throw {{ClassNotFoundException}}: {code} # Build Spark with Maven $ mvn clean package -Phive,hadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests ... # Then spark-shell complains $ ./bin/spark-shell Spark assembly has been built with Hive, including Datanucleus jars on classpath Exception in thread main java.lang.NoClassDefFoundError: com/google/common/util/concurrent/ThreadFactoryBuilder at org.apache.spark.util.Utils$.init(Utils.scala:636) at org.apache.spark.util.Utils$.clinit(Utils.scala) at org.apache.spark.repl.SparkILoop.init(SparkILoop.scala:134) at org.apache.spark.repl.SparkILoop.init(SparkILoop.scala:65) at org.apache.spark.repl.Main$.main(Main.scala:30) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:317) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: com.google.common.util.concurrent.ThreadFactoryBuilder at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 13 more # Check the assembly jar file $ jar tf assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar | grep -i ThreadFactoryBuilder org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder$1.class org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder.class {code} SBT build is fine since we don't shade Guava with SBT right now (and that's why Jenkins didn't complain about this). Possible solutions can be: # revert PR #1813 to be safe, or # also shade Guava in the SBT build and only use {{org.spark-project.guava}} in Spark -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3216) Spark-shell is broken for branch-1.0
[ https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109947#comment-14109947 ] Apache Spark commented on SPARK-3216: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/2122 Spark-shell is broken for branch-1.0 Key: SPARK-3216 URL: https://issues.apache.org/jira/browse/SPARK-3216 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Andrew Or Priority: Blocker This fails when EC2 tries to clone the most recent version of Spark from branch-1.0. I marked this a blocker because this is completely broken, but it is technically not blocking anything. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3189) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)
[ https://issues.apache.org/jira/browse/SPARK-3189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-3189: - Issue Type: Sub-task (was: New Feature) Parent: SPARK-3188 Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates) --- Key: SPARK-3189 URL: https://issues.apache.org/jira/browse/SPARK-3189 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.0.2 Reporter: Fan Jiang Priority: Critical Labels: features Fix For: 1.1.1, 1.2.0 Original Estimate: 0h Remaining Estimate: 0h Linear least squares estimates assume the error has a normal distribution and can behave badly when the errors are heavy-tailed. In practice we get various types of data. We need to include Robust Regression to employ a fitting criterion that is not as vulnerable as least squares. The Tukey bisquare weight function, also referred to as the biweight function, produces an M-estimator that is more resistant to regression outliers than the Huber M-estimator (Andersen 2008: 19). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3216) Spark-shell is broken for branch-1.0
[ https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3216: - Description: This fails when EC2 tries to clone the most recent version of Spark from branch-1.0. This does not actually affect any released distributions, and so I did not set the affected/fix/target versions. I marked this a blocker because this is completely broken, but it is technically not blocking anything. This was caused by https://github.com/apache/spark/pull/1831, which broke spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 was only merged into branch-1.1 and master, but not branch-1.0. was: This fails when EC2 tries to clone the most recent version of Spark from branch-1.0. I marked this a blocker because this is completely broken, but it is technically not blocking anything. This was caused by https://github.com/apache/spark/pull/1831, which broke spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 was only merged into branch-1.1 and master, but not branch-1.0 Spark-shell is broken for branch-1.0 Key: SPARK-3216 URL: https://issues.apache.org/jira/browse/SPARK-3216 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Andrew Or Priority: Blocker This fails when EC2 tries to clone the most recent version of Spark from branch-1.0. This does not actually affect any released distributions, and so I did not set the affected/fix/target versions. I marked this a blocker because this is completely broken, but it is technically not blocking anything. This was caused by https://github.com/apache/spark/pull/1831, which broke spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 was only merged into branch-1.1 and master, but not branch-1.0. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3216) Spark-shell is broken for branch-1.0
[ https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3216: - Description: This fails when EC2 tries to clone the most recent version of Spark from branch-1.0. I marked this a blocker because this is completely broken, but it is technically not blocking anything. This was caused by https://github.com/apache/spark/pull/1831, which broke spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 was only merged into branch-1.1 and master, but not branch-1.0. was: This fails when EC2 tries to clone the most recent version of Spark from branch-1.0. I marked this a blocker because this is completely broken, but it is technically not blocking anything. Spark-shell is broken for branch-1.0 Key: SPARK-3216 URL: https://issues.apache.org/jira/browse/SPARK-3216 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Andrew Or Priority: Blocker This fails when EC2 tries to clone the most recent version of Spark from branch-1.0. I marked this a blocker because this is completely broken, but it is technically not blocking anything. This was caused by https://github.com/apache/spark/pull/1831, which broke spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 was only merged into branch-1.1 and master, but not branch-1.0. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3188) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)
[ https://issues.apache.org/jira/browse/SPARK-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-3188: - Summary: Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates) (was: Add Robust Regression Algorithm with Turkey bisquare weight function (Biweight Estimates) ) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates) -- Key: SPARK-3188 URL: https://issues.apache.org/jira/browse/SPARK-3188 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.2 Reporter: Fan Jiang Priority: Critical Labels: features Fix For: 1.1.1, 1.2.0 Original Estimate: 0h Remaining Estimate: 0h Linear least squares estimates assume that the errors are normally distributed and can behave badly when the errors are heavy-tailed. In practice we encounter various types of data, so we need to include robust regression to employ a fitting criterion that is not as vulnerable as least squares. The Tukey bisquare weight function, also referred to as the biweight function, produces an M-estimator that is more resistant to regression outliers than the Huber M-estimator (Andersen 2008: 19). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3188) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)
[ https://issues.apache.org/jira/browse/SPARK-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-3188: - Description: Linear least squares estimates assume that the errors are normally distributed and can behave badly when the errors are heavy-tailed. In practice we encounter various types of data, so we need to include robust regression to employ a fitting criterion that is not as vulnerable as least squares. The Tukey bisquare weight function, also referred to as the biweight function, produces an M-estimator that is more resistant to regression outliers than the Huber M-estimator (Andersen 2008: 19). was: Linear least square estimates assume the error has normal distribution and can behave badly when the errors are heavy-tailed. In practical we get various types of data. We need to include Robust Regression to employ a fitting criterion that is not as vulnerable as least square. The Turkey bisquare weight function, also referred to as the biweight function, produces an M-estimator that is more resistant to regression outliers than the Huber M-estimator (Andersen 2008: 19). Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates) -- Key: SPARK-3188 URL: https://issues.apache.org/jira/browse/SPARK-3188 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.2 Reporter: Fan Jiang Priority: Critical Labels: features Fix For: 1.1.1, 1.2.0 Original Estimate: 0h Remaining Estimate: 0h Linear least squares estimates assume that the errors are normally distributed and can behave badly when the errors are heavy-tailed. In practice we encounter various types of data, so we need to include robust regression to employ a fitting criterion that is not as vulnerable as least squares. The Tukey bisquare weight function, also referred to as the biweight function, produces an M-estimator that is more resistant to regression outliers than the Huber M-estimator (Andersen 2008: 19). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
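For reference, the bisquare weight discussed above has a simple closed form. A minimal sketch, assuming the conventional tuning constant k = 4.685 (the value usually quoted for ~95% efficiency under Gaussian errors); this is illustrative, not MLlib code:
{code}
// Tukey bisquare (biweight) weight for a scaled residual.
def tukeyBisquareWeight(residual: Double, k: Double = 4.685): Double = {
  val u = residual / k
  if (math.abs(u) < 1.0) {
    val t = 1.0 - u * u
    t * t            // decays smoothly to 0 as |residual| approaches k
  } else {
    0.0              // points beyond k get zero weight, so outliers are ignored
  }
}
{code}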
[jira] [Resolved] (SPARK-3204) MaxOf would be foldable if both left and right are foldable.
[ https://issues.apache.org/jira/browse/SPARK-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3204. - Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Takuya Ueshin MaxOf would be foldable if both left and right are foldable. Key: SPARK-3204 URL: https://issues.apache.org/jira/browse/SPARK-3204 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
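The improvement is small but concrete. A hedged sketch of the idea (simplified stand-in types, not the actual Catalyst source):
{code}
// A binary expression can be constant-folded exactly when both children can.
trait Expression { def foldable: Boolean }
case class Literal(value: Any) extends Expression { val foldable = true }
case class MaxOf(left: Expression, right: Expression) extends Expression {
  // With this rule, MaxOf(Literal(1), Literal(2)) is foldable and the
  // optimizer can evaluate it once at planning time.
  lazy val foldable: Boolean = left.foldable && right.foldable
}
{code}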
[jira] [Resolved] (SPARK-2929) Rewrite HiveThriftServer2Suite and CliSuite
[ https://issues.apache.org/jira/browse/SPARK-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2929. - Resolution: Fixed Fix Version/s: 1.1.0 Rewrite HiveThriftServer2Suite and CliSuite --- Key: SPARK-2929 URL: https://issues.apache.org/jira/browse/SPARK-2929 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Assignee: Cheng Lian Fix For: 1.1.0 {{HiveThriftServer2Suite}} and {{CliSuite}} were inherited from Shark and contain too many hard-coded timeouts and timing assumptions when doing IPC. This makes these tests both flaky and slow. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3179) Add task OutputMetrics
[ https://issues.apache.org/jira/browse/SPARK-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109997#comment-14109997 ] Michael Yannakopoulos commented on SPARK-3179: -- Hi Sandy, I am willing to help with this issue. I am new to Apache Spark and have made a few contributions so far. Under your supervision I can work on this issue. Thanks, Michael Add task OutputMetrics -- Key: SPARK-3179 URL: https://issues.apache.org/jira/browse/SPARK-3179 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza Track the bytes that tasks write to HDFS or other output destinations. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
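As a rough sense of scope, the feature amounts to a write-side counterpart of the existing input metrics. One plausible, purely hypothetical shape (the eventual TaskMetrics API may differ):
{code}
// Hypothetical sketch only.
case class OutputMetrics(var bytesWritten: Long = 0L)

// A Hadoop output writer would then increment it as records are committed:
//   outputMetrics.bytesWritten += bytesJustWritten
{code}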
[jira] [Updated] (SPARK-3061) Maven build fails in Windows OS
[ https://issues.apache.org/jira/browse/SPARK-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3061: -- Affects Version/s: 1.1.0 Maybe we can use a Maven plugin to unzip? http://stackoverflow.com/questions/3264064/unpack-zip-in-zip-with-maven Maven build fails in Windows OS --- Key: SPARK-3061 URL: https://issues.apache.org/jira/browse/SPARK-3061 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2, 1.1.0 Environment: Windows Reporter: Masayoshi TSUZUKI Priority: Minor Maven build fails in Windows OS with this error message. {noformat} [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec (default) on project spark-core_2.10: Command execution failed. Cannot run program unzip (in directory C:\path\to\gitofspark\python): CreateProcess error=2, w肳ꂽt@ - [Help 1] {noformat} (CreateProcess error=2 is the Windows "file not found" error; the garbled text is the OS's localized error message rendered in a mismatched encoding. The build invokes the Unix {{unzip}} command, which is typically absent on Windows.) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2087) Clean Multi-user semantics for thrift JDBC/ODBC server.
[ https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110026#comment-14110026 ] Yi Tian commented on SPARK-2087: You mean the CACHE TABLE ... AS SELECT ... syntax will create a temporary table that cannot be found by other sessions? I'm still confused about the difference between temporary tables and cached tables. Clean Multi-user semantics for thrift JDBC/ODBC server. --- Key: SPARK-2087 URL: https://issues.apache.org/jira/browse/SPARK-2087 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Zongheng Yang Priority: Minor Configuration and temporary tables should exist per-user. Cached tables should be shared across users. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
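To make the distinction concrete, a hedged Spark 1.x sketch: registering a temp table only adds metadata to the current session's catalog, while caching materializes in-memory data that the issue proposes should be shared across users.
{code}
// Illustrative sketch against the Spark 1.x SQL API.
case class Rec(key: Int, value: String)
val rdd = sc.parallelize(1 to 10).map(i => Rec(i, i.toString))

import sqlContext.createSchemaRDD   // implicit RDD -> SchemaRDD conversion
rdd.registerTempTable("t")          // per-session metadata: other sessions cannot see "t"
sqlContext.cacheTable("t")          // in-memory columnar data: the part meant to be shared
{code}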
[jira] [Commented] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build
[ https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110035#comment-14110035 ] Marcelo Vanzin commented on SPARK-3217: --- Just did a git clean -dfx on master and rebuilt using maven. This works fine for me. Did you by any chance do one of the following: - forget to clean after pulling that change - mix sbt- and mvn-built artifacts in the same build - set SPARK_PREPEND_CLASSES I can see any of those causing this issue. I think only the last one is something we need to worry about; we now need to figure out a way to add the guava jar to the classpath when using that option. Shaded Guava jar doesn't play well with Maven build --- Key: SPARK-3217 URL: https://issues.apache.org/jira/browse/SPARK-3217 Project: Spark Issue Type: Bug Components: Build Reporter: Cheng Lian Priority: Blocker PR [#1813|https://github.com/apache/spark/pull/1813] shaded the Guava jar file and moved Guava classes to the package {{org.spark-project.guava}} when Spark is built by Maven. But code in {{org.apache.spark.util.Utils}} still refers to classes (e.g. {{ThreadFactoryBuilder}}) in the package {{com.google.common}}. The result is that, when Spark is built with Maven (or {{make-distribution.sh}}), commands like {{bin/spark-shell}} throw a {{ClassNotFoundException}}: {code} # Build Spark with Maven $ mvn clean package -Phive,hadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests ... # Then spark-shell complains $ ./bin/spark-shell Spark assembly has been built with Hive, including Datanucleus jars on classpath Exception in thread main java.lang.NoClassDefFoundError: com/google/common/util/concurrent/ThreadFactoryBuilder at org.apache.spark.util.Utils$.init(Utils.scala:636) at org.apache.spark.util.Utils$.clinit(Utils.scala) at org.apache.spark.repl.SparkILoop.init(SparkILoop.scala:134) at org.apache.spark.repl.SparkILoop.init(SparkILoop.scala:65) at org.apache.spark.repl.Main$.main(Main.scala:30) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:317) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: com.google.common.util.concurrent.ThreadFactoryBuilder at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 13 more # Check the assembly jar file $ jar tf assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar | grep -i ThreadFactoryBuilder org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder$1.class org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder.class {code} SBT build is fine since we don't shade Guava with SBT right now (and that's why Jenkins didn't complain about this).
Possible solutions: # revert PR #1813 to be safe, or # also shade Guava in the SBT build and only use {{org.spark-project.guava}} in Spark -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
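The failure mode is easy to reproduce from any code compiled against the assembly; a hedged illustration (the import below is exactly the original class name that the Maven-built assembly no longer contains):
{code}
// After shading, the assembly ships this class only under
// org.spark-project.guava.*, so resolving the original name fails at runtime.
import com.google.common.util.concurrent.ThreadFactoryBuilder

val factory = new ThreadFactoryBuilder()
  .setDaemon(true)
  .setNameFormat("spark-demo-%d")
  .build()   // NoClassDefFoundError against a Maven-built assembly
{code}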
[jira] [Updated] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build
[ https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3217: -- Affects Version/s: (was: 1.0.2) Shaded Guava jar doesn't play well with Maven build --- Key: SPARK-3217 URL: https://issues.apache.org/jira/browse/SPARK-3217 Project: Spark Issue Type: Bug Components: Build Reporter: Cheng Lian Priority: Blocker -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build
[ https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3217: -- Labels: 1.2.0 (was: ) Shaded Guava jar doesn't play well with Maven build --- Key: SPARK-3217 URL: https://issues.apache.org/jira/browse/SPARK-3217 Project: Spark Issue Type: Bug Components: Build Reporter: Cheng Lian Priority: Blocker Labels: 1.2.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build
[ https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3217: -- Target Version/s: 1.2.0 (was: 1.1.0) Shaded Guava jar doesn't play well with Maven build --- Key: SPARK-3217 URL: https://issues.apache.org/jira/browse/SPARK-3217 Project: Spark Issue Type: Bug Components: Build Reporter: Cheng Lian Priority: Blocker Labels: 1.2.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3058) Support EXTENDED for EXPLAIN command
[ https://issues.apache.org/jira/browse/SPARK-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3058. - Resolution: Fixed Fix Version/s: 1.1.0 Support EXTENDED for EXPLAIN command Key: SPARK-3058 URL: https://issues.apache.org/jira/browse/SPARK-3058 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Minor Fix For: 1.1.0 Currently there is no difference when running the EXPLAIN command with or without the EXTENDED keyword; this patch shows more details of the query plan when the EXTENDED keyword is provided. {panel:title=EXPLAIN with EXTENDED} explain extended select key as a1, value as a2 from src where key=1; == Parsed Logical Plan == Project ['key AS a1#3,'value AS a2#4] Filter ('key = 1) UnresolvedRelation None, src, None == Analyzed Logical Plan == Project [key#8 AS a1#3,value#9 AS a2#4] Filter (CAST(key#8, DoubleType) = CAST(1, DoubleType)) MetastoreRelation default, src, None == Optimized Logical Plan == Project [key#8 AS a1#3,value#9 AS a2#4] Filter (CAST(key#8, DoubleType) = 1.0) MetastoreRelation default, src, None == Physical Plan == Project [key#8 AS a1#3,value#9 AS a2#4] Filter (CAST(key#8, DoubleType) = 1.0) HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None Code Generation: false == RDD == (2) MappedRDD[14] at map at HiveContext.scala:350 MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:42 MapPartitionsRDD[12] at mapPartitions at basicOperators.scala:57 MapPartitionsRDD[11] at mapPartitions at TableReader.scala:112 MappedRDD[10] at map at TableReader.scala:240 HadoopRDD[9] at HadoopRDD at TableReader.scala:230 {panel} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build
[ https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3217: --- Labels: (was: 1.2.0) Shaded Guava jar doesn't play well with Maven build --- Key: SPARK-3217 URL: https://issues.apache.org/jira/browse/SPARK-3217 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.2.0 Reporter: Cheng Lian Priority: Blocker -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build
[ https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3217: --- Affects Version/s: 1.2.0 Shaded Guava jar doesn't play well with Maven build --- Key: SPARK-3217 URL: https://issues.apache.org/jira/browse/SPARK-3217 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.2.0 Reporter: Cheng Lian Priority: Blocker -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3178) setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the worker memory limit to zero
[ https://issues.apache.org/jira/browse/SPARK-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110116#comment-14110116 ] Helena Edelson commented on SPARK-3178: --- +1, it doesn't look like the input is validated to fail fast when no m/g label is given setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the worker memory limit to zero Key: SPARK-3178 URL: https://issues.apache.org/jira/browse/SPARK-3178 Project: Spark Issue Type: Bug Environment: osx Reporter: Jon Haddad This should either default to m or just completely fail. Starting a worker with zero memory isn't very helpful. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
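A minimal sketch of the fail-fast validation being asked for, using a hypothetical helper (not the actual Worker launch code):
{code}
// Reject a bare number instead of silently treating it as 0 MB.
def parseWorkerMemoryMb(s: String): Int = s.trim.toLowerCase match {
  case m if m.endsWith("g") => m.dropRight(1).toInt * 1024
  case m if m.endsWith("m") => m.dropRight(1).toInt
  case m if m.nonEmpty && m.forall(_.isDigit) =>
    sys.error(s"SPARK_WORKER_MEMORY=$s has no unit; did you mean ${s}m or ${s}g?")
  case other =>
    sys.error(s"Cannot parse SPARK_WORKER_MEMORY=$other")
}
{code}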
[jira] [Created] (SPARK-3218) K-Means clusterer can fail on degenerate data
Derrick Burns created SPARK-3218: Summary: K-Means clusterer can fail on degenerate data Key: SPARK-3218 URL: https://issues.apache.org/jira/browse/SPARK-3218 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2 Reporter: Derrick Burns The KMeans parallel implementation selects points to be cluster centers with probability weighted by their distance to cluster centers. However, if there are fewer than k DISTINCT points in the data set, this approach will fail. Further, the recent checkin to work around this problem results in selection of the same point repeatedly as a cluster center. The fix is to allow fewer than k cluster centers to be selected. This requires several changes to the code, as the number of cluster centers is woven into the implementation. I have a version of the code that addresses this problem, AND generalizes the distance metric. However, I see that there are literally hundreds of outstanding pull requests. If someone will commit to working with me to sponsor the pull request, I will create it. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
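Until a fix lands, a hedged user-side workaround consistent with the proposal above is to cap k at the number of distinct points before calling the trainer:
{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(1.0, 1.0), Vectors.dense(1.0, 1.0), Vectors.dense(2.0, 2.0)))

// Compare on values, not object identity, when counting distinct points.
val distinct = points.map(_.toArray.toSeq).distinct().count().toInt  // 2 here
val k = math.min(5, distinct)        // asking for k = 5 would be degenerate
val model = KMeans.train(points, k, 10)
{code}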
[jira] [Reopened] (SPARK-3193) output error info when Process exit code is not zero
[ https://issues.apache.org/jira/browse/SPARK-3193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei reopened SPARK-3193: output error info when Process exit code is not zero Key: SPARK-3193 URL: https://issues.apache.org/jira/browse/SPARK-3193 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.2 Reporter: wangfei I noticed that sometimes PR tests fail because a Process exit code != 0: DriverSuite: Spark assembly has been built with Hive, including Datanucleus jars on classpath - driver should exit after finishing *** FAILED *** SparkException was thrown during property evaluation. (DriverSuite.scala:40) Message: Process List(./bin/spark-class, org.apache.spark.DriverWithoutCleanup, local) exited with code 1 Occurred at table row 0 (zero based, not counting headings), which had values ( master = local ) [info] SparkSubmitSuite: [info] - prints usage on empty input [info] - prints usage with only --help [info] - prints error with unrecognized options [info] - handle binary specified but not class [info] - handles arguments with --key=val [info] - handles arguments to user program [info] - handles arguments to user program with name collision [info] - handles YARN cluster mode [info] - handles YARN client mode [info] - handles standalone cluster mode [info] - handles standalone client mode [info] - handles mesos client mode [info] - handles confs with flag equivalents [info] - launch simple application with spark-submit *** FAILED *** [info] org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, local, file:/tmp/1408854098404-0/testJar-1408854098404.jar) exited with code 1 [info] at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:872) [info] at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) [info] at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) [info] at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) [info] at org.apacSpark assembly has been built with Hive, including Datanucleus jars on classpath Refer to https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18688/consoleFull and https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19118/consoleFull We should output the process's error output when it fails; this can be helpful for diagnosis. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
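The requested behavior is straightforward to sketch with scala.sys.process (a hypothetical helper, not the actual Utils.executeAndGetOutput):
{code}
import scala.sys.process._

def runAndCapture(cmd: Seq[String]): String = {
  val out = new StringBuilder
  val err = new StringBuilder
  val code = Process(cmd) ! ProcessLogger(
    line => out.append(line).append('\n'),
    line => err.append(line).append('\n'))
  if (code != 0) {
    // Surface stderr in the failure instead of only the command and code.
    throw new RuntimeException(s"Process $cmd exited with code $code:\n$err")
  }
  out.toString
}
{code}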
[jira] [Commented] (SPARK-2921) Mesos doesn't handle spark.executor.extraJavaOptions correctly (among other things)
[ https://issues.apache.org/jira/browse/SPARK-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110127#comment-14110127 ] Cheng Lian commented on SPARK-2921: --- [~andrewor14] {{spark.executor.extraLibraryPath}} is affected. But {{spark.executor.extraClassPath}} should be OK since it's eventually added to the environment variable {{SPARK_CLASSPATH}}. Mesos doesn't handle spark.executor.extraJavaOptions correctly (among other things) --- Key: SPARK-2921 URL: https://issues.apache.org/jira/browse/SPARK-2921 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.2 Reporter: Andrew Or Priority: Blocker Fix For: 1.1.0 The code path to handle this exists only for the coarse-grained mode, and even in this mode the Java options aren't passed to the executors properly. We currently pass the entire value of spark.executor.extraJavaOptions to the executors as a string without splitting it. We need to use Utils.splitCommandString as in standalone mode. I have not confirmed this, but I would assume spark.executor.extraClassPath and spark.executor.extraLibraryPath are also not propagated correctly in either mode. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
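Since Utils is private[spark], here is a minimal, hedged re-implementation of the quote-aware splitting that Utils.splitCommandString performs, to show why passing the raw string as one argument is wrong:
{code}
// "-XX:+UseConcMarkSweepGC -Dkey=\"a b\"" must become two JVM arguments,
// not one; the split respects quotes, which a plain .split(" ") would not.
def splitCommandString(s: String): Seq[String] = {
  val args = scala.collection.mutable.ArrayBuffer.empty[String]
  val cur = new StringBuilder
  var quote: Option[Char] = None
  for (c <- s) c match {
    case q @ ('"' | '\'') if quote.isEmpty => quote = Some(q)   // open quote
    case q if quote == Some(q)             => quote = None      // close quote
    case ' ' if quote.isEmpty =>
      if (cur.nonEmpty) { args += cur.toString; cur.clear() }   // token boundary
    case other => cur += other
  }
  if (cur.nonEmpty) args += cur.toString
  args.toSeq
}

splitCommandString("-XX:+UseConcMarkSweepGC -Dkey=\"a b\"")
// -> Seq("-XX:+UseConcMarkSweepGC", "-Dkey=a b")
{code}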
[jira] [Created] (SPARK-3219) K-Means clusterer should support Bregman distance metrics
Derrick Burns created SPARK-3219: Summary: K-Means clusterer should support Bregman distance metrics Key: SPARK-3219 URL: https://issues.apache.org/jira/browse/SPARK-3219 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Derrick Burns The K-Means clusterer supports the Euclidean distance metric. However, it is rather straightforward to support Bregman (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) distance functions, which would increase the utility of the clusterer tremendously. I have modified the clusterer to support pluggable distance functions. However, I notice that there are hundreds of outstanding pull requests. If someone is willing to work with me to sponsor the work through the process, I will create a pull request. Otherwise, I will just keep my own fork. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
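A hedged sketch of what "pluggable" could mean here, with squared Euclidean and generalized KL (both Bregman divergences) as instances; the trait is hypothetical, not the MLlib API:
{code}
trait PointDivergence extends Serializable {
  def divergence(x: Array[Double], y: Array[Double]): Double
}

object SquaredEuclidean extends PointDivergence {
  def divergence(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
}

object GeneralizedKL extends PointDivergence {
  // Bregman divergence of negative entropy; assumes strictly positive inputs.
  def divergence(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => a * math.log(a / b) - a + b }.sum
}
{code}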
[jira] [Created] (SPARK-3220) K-Means clusterer should perform K-Means initialization in parallel
Derrick Burns created SPARK-3220: Summary: K-Means clusterer should perform K-Means initialization in parallel Key: SPARK-3220 URL: https://issues.apache.org/jira/browse/SPARK-3220 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Derrick Burns The LocalKMeans method should be replaced with a parallel implementation. As it stands now, it becomes a bottleneck for large data sets. I have implemented this functionality in my version of the clusterer. However, I see that there are hundreds of outstanding pull requests. If someone on the team wants to sponsor the pull request, I will create one. Otherwise, I will just maintain my own private fork of the clusterer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
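One hedged way to remove the LocalKMeans bottleneck is to run the weighted re-clustering of the k-means|| candidates as RDD operations instead of on the driver. A sketch under that assumption (hypothetical signatures, not the proposed patch):
{code}
import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x
import org.apache.spark.rdd.RDD

def closest(centers: Array[Array[Double]], p: Array[Double]): Int =
  centers.indices.minBy { i =>
    centers(i).zip(p).map { case (a, b) => (a - b) * (a - b) }.sum
  }

// Weighted Lloyd iterations over (point, weight) candidates, distributed.
def recluster(candidates: RDD[(Array[Double], Double)],
              k: Int, iters: Int): Array[Array[Double]] = {
  var centers = candidates.takeSample(withReplacement = false, k).map(_._1)
  for (_ <- 1 to iters) {
    val current = centers
    centers = candidates
      .map { case (p, w) => (closest(current, p), (p.map(_ * w), w)) }
      .reduceByKey { case ((s1, w1), (s2, w2)) =>
        (s1.zip(s2).map(t => t._1 + t._2), w1 + w2) }
      .map { case (_, (sum, w)) => sum.map(_ / w) }   // weighted mean per center
      .collect()
  }
  centers
}
{code}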
[jira] [Commented] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build
[ https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110142#comment-14110142 ] Cheng Lian commented on SPARK-3217: --- [~vanzin] Thanks, I did set {{SPARK_PREPEND_CLASSES}}. Will change the title and description of this issue after verifying it. Shaded Guava jar doesn't play well with Maven build --- Key: SPARK-3217 URL: https://issues.apache.org/jira/browse/SPARK-3217 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.2.0 Reporter: Cheng Lian Priority: Blocker -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3213) spark_ec2.py cannot find slave instances launched with Launch More Like This
[ https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110149#comment-14110149 ] Vida Ha edited comment on SPARK-3213 at 8/26/14 1:49 AM: - Hi Joseph, Can you tell me more about how you launched these without copying the tags? I used Launch More Like This, and the name and tags were copied over correctly. I'm wondering whether, when you were using EC2, you could have been so unlucky as to have triggered a temporary outage in copying tags... Let's sync up in person tomorrow and figure out if this was a one-time problem or happens each time Launch More Like This is used. was (Author: vidaha): Hi Joseph, Can you tell me more about how you launched these, without copying the tags? I used Launch More Like This, and the name and tags were copied over correctly. I'm wondering if maybe when you were using EC2, if perhaps you could have been so unlucky as to have trigger a temporary outage in copying tags... Let's sync up in person tomorrow and figure out if this was a one time problem or happens each time Launch spark_ec2.py cannot find slave instances launched with Launch More Like This -- Key: SPARK-3213 URL: https://issues.apache.org/jira/browse/SPARK-3213 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Priority: Blocker spark_ec2.py cannot find all slave instances. In particular: * I created a master and a slave and configured them. * I created new slave instances from the original slave (Launch More Like This). * I tried to relaunch the cluster, and it could only find the original slave. Old versions of the script worked. The latest working commit that edited that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1 There may be a problem with this PR: [https://github.com/apache/spark/pull/1899]. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3213) spark_ec2.py cannot find slave instances launched with Launch More Like This
[ https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110149#comment-14110149 ] Vida Ha commented on SPARK-3213: Hi Joseph, Can you tell me more about how you launched these without copying the tags? I used Launch More Like This, and the name and tags were copied over correctly. I'm wondering whether, when you were using EC2, you could have been so unlucky as to have triggered a temporary outage in copying tags... Let's sync up in person tomorrow and figure out if this was a one-time problem or happens each time Launch spark_ec2.py cannot find slave instances launched with Launch More Like This -- Key: SPARK-3213 URL: https://issues.apache.org/jira/browse/SPARK-3213 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Priority: Blocker -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3213) spark_ec2.py cannot find slave instances launched with Launch More Like This
[ https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vida Ha updated SPARK-3213: --- Attachment: Screen Shot 2014-08-25 at 6.45.35 PM.png spark_ec2.py cannot find slave instances launched with Launch More Like This -- Key: SPARK-3213 URL: https://issues.apache.org/jira/browse/SPARK-3213 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Priority: Blocker Attachments: Screen Shot 2014-08-25 at 6.45.35 PM.png -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3213) spark_ec2.py cannot find slave instances launched with Launch More Like This
[ https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110149#comment-14110149 ] Vida Ha edited comment on SPARK-3213 at 8/26/14 1:51 AM: - Hi Joseph, Can you tell me more about how you launched these without copying the tags? I used Launch More Like This, and the name and tags were copied over correctly - see my screenshot above. I'm wondering whether, when you were using EC2, you could have been so unlucky as to have triggered a temporary outage in copying tags... Let's sync up in person tomorrow and figure out if this was a one-time problem, happens each time Launch More Like This is used, or if we used different ways to launch the extra slaves. was (Author: vidaha): Hi Joseph, Can you tell me more about how you launched these, without copying the tags? I used Launch More Like This, and the name and tags were copied over correctly. I'm wondering if maybe when you were using EC2, if perhaps you could have been so unlucky as to have trigger a temporary outage in copying tags... Let's sync up in person tomorrow and figure out if this was a one time problem or happens each time Launch More Like This is used. spark_ec2.py cannot find slave instances launched with Launch More Like This -- Key: SPARK-3213 URL: https://issues.apache.org/jira/browse/SPARK-3213 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Priority: Blocker Attachments: Screen Shot 2014-08-25 at 6.45.35 PM.png -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org