[jira] [Resolved] (SPARK-5366) check for mode of private key file
[ https://issues.apache.org/jira/browse/SPARK-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5366. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4162 [https://github.com/apache/spark/pull/4162] check for mode of private key file -- Key: SPARK-5366 URL: https://issues.apache.org/jira/browse/SPARK-5366 Project: Spark Issue Type: Improvement Components: EC2 Reporter: liu chang Priority: Minor Fix For: 1.4.0 Check the mode of the private key file. The user should set it to 600. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
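For illustration only: the actual check lives in the Python spark_ec2 script, but a minimal sketch of the kind of permission check described above, written in Scala to match the code samples elsewhere in this digest, could look like the following. The helper name and the exact set of accepted permissions are assumptions, not taken from the pull request.
{code}
import java.nio.file.{Files, Paths}
import java.nio.file.attribute.PosixFilePermission._
import scala.collection.JavaConverters._

// Hypothetical helper: refuse a private key whose permissions are broader
// than owner-only read/write (i.e. anything other than mode 600 or 400).
def checkKeyFileMode(path: String): Unit = {
  val perms = Files.getPosixFilePermissions(Paths.get(path)).asScala
  val tooOpen = perms.exists(p => p != OWNER_READ && p != OWNER_WRITE)
  require(!tooOpen, s"Private key file $path is too permissive; run: chmod 600 $path")
}
{code}
Failing fast with a clear message like this is preferable to letting ssh reject the key later with a less obvious error.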
[jira] [Resolved] (SPARK-5656) NegativeArraySizeException in EigenValueDecomposition.symmetricEigs for large n and/or large k
[ https://issues.apache.org/jira/browse/SPARK-5656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5656. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4433 [https://github.com/apache/spark/pull/4433] NegativeArraySizeException in EigenValueDecomposition.symmetricEigs for large n and/or large k -- Key: SPARK-5656 URL: https://issues.apache.org/jira/browse/SPARK-5656 Project: Spark Issue Type: Bug Components: MLlib Reporter: Mark Bittmann Priority: Minor Fix For: 1.4.0 Large values of n or k in EigenValueDecomposition.symmetricEigs will fail with a NegativeArraySizeException. Specifically, this occurs when 2*n*k > Integer.MAX_VALUE. These values are currently unchecked and allow for the array to be initialized to a value greater than Integer.MAX_VALUE. I have written the below 'require' to fail this condition gracefully. I will submit a pull request. require(ncv * n.toLong < Integer.MAX_VALUE, s"Product of 2*k*n must be smaller than Integer.MAX_VALUE. Found required eigenvalues k = $k and matrix dimension n = $n") Here is the exception that occurs from computeSVD with large k and/or n: Exception in thread "main" java.lang.NegativeArraySizeException at org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:85) at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:258) at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:190) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
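A self-contained sketch of the guard quoted above, with illustrative values for n and k (and ncv taken as 2*k, capped at n, which is the usual ARPACK choice); with these values the product overflows Int, so the guard fires, which is exactly the graceful failure the issue asks for:
{code}
// Sketch of the overflow guard: ncv * n must fit in an Int because ARPACK
// allocates a working array of that length. Computing the product as a Long
// avoids the Int overflow that would otherwise make it look negative.
val n = 100000                 // matrix dimension (illustrative)
val k = 20000                  // requested eigenvalues (illustrative)
val ncv = math.min(2 * k, n)
require(ncv * n.toLong < Int.MaxValue,
  s"Product of 2*k*n must be smaller than ${Int.MaxValue}. " +
  s"Found required eigenvalues k = $k and matrix dimension n = $n")
{code}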
[jira] [Resolved] (SPARK-5672) Don't return `ERROR 500` when args are missing
[ https://issues.apache.org/jira/browse/SPARK-5672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5672. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4239 [https://github.com/apache/spark/pull/4239] Don't return `ERROR 500` when args are missing --- Key: SPARK-5672 URL: https://issues.apache.org/jira/browse/SPARK-5672 Project: Spark Issue Type: Bug Components: Web UI Reporter: Kirill A. Korinskiy Fix For: 1.3.0 The Spark web UI returns HTTP ERROR 500 when a GET argument is missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
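Not the actual Spark UI code, but a generic sketch of the fix pattern: validate the query parameter and answer with 400 and a helpful message instead of letting an exception bubble up as ERROR 500. The parameter name and handler shape are illustrative.
{code}
import javax.servlet.http.{HttpServletRequest, HttpServletResponse}

// Illustrative handler: reply with 400 Bad Request when a required query
// parameter is absent, rather than throwing and letting the servlet
// container answer with ERROR 500.
def render(request: HttpServletRequest, response: HttpServletResponse): Unit = {
  Option(request.getParameter("id")) match {
    case Some(id) =>
      response.getWriter.println(s"Details for $id")
    case None =>
      response.sendError(HttpServletResponse.SC_BAD_REQUEST,
        "Missing required parameter: id")
  }
}
{code}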
[jira] [Updated] (SPARK-5672) Don't return `ERROR 500` when args are missing
[ https://issues.apache.org/jira/browse/SPARK-5672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5672: - Priority: Minor (was: Major) Assignee: Kirill A. Korinskiy Don't return `ERROR 500` when args are missing --- Key: SPARK-5672 URL: https://issues.apache.org/jira/browse/SPARK-5672 Project: Spark Issue Type: Bug Components: Web UI Reporter: Kirill A. Korinskiy Assignee: Kirill A. Korinskiy Priority: Minor Fix For: 1.3.0 The Spark web UI returns HTTP ERROR 500 when a GET argument is missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5673) Implement Streaming wrapper for all linear methods
[ https://issues.apache.org/jira/browse/SPARK-5673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311163#comment-14311163 ] Apache Spark commented on SPARK-5673: - User 'catap' has created a pull request for this issue: https://github.com/apache/spark/pull/4456 Implement Streaming wrapper for all linear methods - Key: SPARK-5673 URL: https://issues.apache.org/jira/browse/SPARK-5673 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Kirill A. Korinskiy Spark currently has streaming wrappers only for logistic regression and linear regression. Implementing wrappers for SVM, Lasso and Ridge Regression as well would make the streaming API more useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5673) Implement Streaming wrapper for all linear methods
[ https://issues.apache.org/jira/browse/SPARK-5673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kirill A. Korinskiy updated SPARK-5673: --- Component/s: MLlib Implement Streaming wrapper for all linear methods - Key: SPARK-5673 URL: https://issues.apache.org/jira/browse/SPARK-5673 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Kirill A. Korinskiy Spark currently has streaming wrappers only for logistic regression and linear regression. Implementing wrappers for SVM, Lasso and Ridge Regression as well would make the streaming API more useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
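This is not the MLlib implementation (which extends an internal streaming base class); it is a rough sketch of what such a wrapper does, built only from the public batch trainer applied to each micro-batch. The class name, the retrain-from-scratch behaviour and the default prediction for an untrained model are simplifications for illustration.
{code}
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.dstream.DStream

// Sketch of a streaming SVM wrapper: refit on every micro-batch and keep the
// latest model for prediction. MLlib's real streaming wrappers update one
// model incrementally; this simplified version just retrains.
class StreamingSVMSketch(numIterations: Int = 100) extends Serializable {
  @volatile private var model: Option[SVMModel] = None

  def trainOn(data: DStream[LabeledPoint]): Unit = {
    data.foreachRDD { rdd =>
      if (rdd.count() > 0) {
        model = Some(SVMWithSGD.train(rdd, numIterations))
      }
    }
  }

  def predictOn(data: DStream[Vector]): DStream[Double] = {
    data.map(features => model.map(_.predict(features)).getOrElse(0.0))
  }
}
{code}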
[jira] [Commented] (SPARK-3431) Parallelize Scala/Java test execution
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311270#comment-14311270 ] Sean Owen commented on SPARK-3431: -- Haven't tried anything recently, no. Parallelize Scala/Java test execution - Key: SPARK-3431 URL: https://issues.apache.org/jira/browse/SPARK-3431 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: SPARK-3431-srowen-attempt.patch Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common strategy to cut test time down is to parallelize the execution of the tests. Doing that may in turn require some prerequisite changes to be made to how certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5625) Spark binaries do not include Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311272#comment-14311272 ] Sean Owen commented on SPARK-5625: -- I think you may be running into problems with an older version of zip that can't uncompress a zip file with more than 65535 files. Spark binaries do not include Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-761) Print a nicer error message when incompatible Spark binaries try to talk
[ https://issues.apache.org/jira/browse/SPARK-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-761: Description: Not sure what component this falls under, or if this is still an issue. Patrick Wendell / Matei Zaharia? was:I don't know how possible this is, as incompatibilities manifest in many and low-level ways. I don't know how possible this is, as incompatibilities manifest in many and low-level ways. Print a nicer error message when incompatible Spark binaries try to talk Key: SPARK-761 URL: https://issues.apache.org/jira/browse/SPARK-761 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia Not sure what component this falls under, or if this is still an issue. Patrick Wendell / Matei Zaharia? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
[ https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5140: - Component/s: Spark Core Two RDDs which are scheduled concurrently should be able to wait on parent in all cases --- Key: SPARK-5140 URL: https://issues.apache.org/jira/browse/SPARK-5140 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Corey J. Nolet Labels: features Not sure if this would change too much of the internals to be included in the 1.2.1 but it would be very helpful if it could be. This ticket is from a discussion between myself and [~ilikerps]. Here's the result of some testing that [~ilikerps] did: bq. I did some testing as well, and it turns out the "wait for other guy to finish caching" logic is on a per-task basis, and it only works on tasks that happen to be executing on the same machine. bq. Once a partition is cached, we will schedule tasks that touch that partition on that executor. The problem here, though, is that the cache is in progress, and so the tasks are still scheduled randomly (or with whatever locality the data source has), so tasks which end up on different machines will not see that the cache is already in progress.
{code}
Here was my test, by the way:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
import scala.concurrent.duration._
val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(1); i }).cache()
val futures = (0 until 4).map { _ => Future { rdd.count } }
Await.result(Future.sequence(futures), 120.second)
{code}
bq. Note that I run the future 4 times in parallel. I found that the first run has all tasks take 10 seconds. The second has about 50% of its tasks take 10 seconds, and the rest just wait for the first stage to finish. The last two runs have no tasks that take 10 seconds; all wait for the first two stages to finish. What we want is the ability to fire off a job and have the DAG figure out that two RDDs depend on the same parent so that when the children are scheduled concurrently, the first one to start will activate the parent and both will wait on the parent. When the parent is done, they will both be able to finish their work concurrently. We are trying to use this pattern by having the parent cache results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5334) NullPointerException when getting files from S3 (hadoop 2.3+)
[ https://issues.apache.org/jira/browse/SPARK-5334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5334: - Component/s: Input/Output Related to, or resolved by, SPARK-5671? NullPointerException when getting files from S3 (hadoop 2.3+) - Key: SPARK-5334 URL: https://issues.apache.org/jira/browse/SPARK-5334 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.2.0 Environment: Spark 1.2 built with Hadoop 2.3+ Reporter: Kevin (Sangwoo) Kim In Spark 1.2 built with Hadoop 2.3+, unable to get files from AWS S3. Same codes works well with same setup in Spark built with Hadoop 2.2-. I saw that jets3t version changed in profile with Hadoop 2.3+, I guess there might be an issue with it. === scala sc.textFile(s3n://logs/log.2014-12-05.gz).count 15/01/20 11:22:40 INFO MemoryStore: ensureFreeSpace(104533) called with curMem=0, maxMem=27783541555 15/01/20 11:22:40 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 102.1 KB, free 25.9 GB) java.lang.NullPointerException at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433) at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57) at org.apache.hadoop.fs.Globber.glob(Globber.java:248) at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1642) at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1157) at org.apache.spark.rdd.RDD.count(RDD.scala:904) at $iwC$$iwC$$iwC$$iwC.init(console:13) at $iwC$$iwC$$iwC.init(console:18) at $iwC$$iwC.init(console:20) at $iwC.init(console:22) at init(console:24) at .init(console:28) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:625) at 
org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:633) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:638) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:963) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:911) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1006) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
[jira] [Updated] (SPARK-4229) Create hadoop configuration in a consistent way
[ https://issues.apache.org/jira/browse/SPARK-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4229: - Component/s: Spark Core Create hadoop configuration in a consistent way --- Key: SPARK-4229 URL: https://issues.apache.org/jira/browse/SPARK-4229 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Cody Koeninger Priority: Minor Some places use SparkHadoopUtil.get.conf, some create a new hadoop config. Prefer SparkHadoopUtil so that spark.hadoop.* properties are pulled in. http://apache-spark-developers-list.1001551.n3.nabble.com/Hadoop-configuration-for-checkpointing-td9084.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4540) Improve Executor ID Logging
[ https://issues.apache.org/jira/browse/SPARK-4540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4540: - Component/s: Spark Core Improve Executor ID Logging --- Key: SPARK-4540 URL: https://issues.apache.org/jira/browse/SPARK-4540 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Arun Ahuja Priority: Minor A few things that would be useful here: - An executor should log what executor it is running, AFAICT this does not help and only the driver reports that executor 10 is running on xyz.host.com - For YARN, when an executor fails, in addition to reporting the executor ID of the lost executor, report the container ID as well The latter is useful for multiple executors running on the same machine where it may be more useful to find the container directly than the executor ID or host. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4321) Make Kryo serialization work for closures
[ https://issues.apache.org/jira/browse/SPARK-4321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4321: - Component/s: Spark Core Make Kryo serialization work for closures - Key: SPARK-4321 URL: https://issues.apache.org/jira/browse/SPARK-4321 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Jeff Hammerbacher -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4563) Allow spark driver to bind to a different ip than the advertised ip
[ https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4563: - Component/s: Deploy Allow spark driver to bind to a different ip than the advertised ip Key: SPARK-4563 URL: https://issues.apache.org/jira/browse/SPARK-4563 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Long Nguyen Priority: Minor The Spark driver's bind IP and advertised IP are not separately configurable: spark.driver.host only sets the bind IP, and SPARK_PUBLIC_DNS does not work for the driver. Allow an option to set the advertised IP/hostname. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5412) Cannot bind Master to a specific hostname as per the documentation
[ https://issues.apache.org/jira/browse/SPARK-5412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5412: - Component/s: Deploy Cannot bind Master to a specific hostname as per the documentation -- Key: SPARK-5412 URL: https://issues.apache.org/jira/browse/SPARK-5412 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.2.0 Reporter: Alexis Seigneurin Documentation on http://spark.apache.org/docs/latest/spark-standalone.html indicates: {quote} You can start a standalone master server by executing: ./sbin/start-master.sh ... the following configuration options can be passed to the master and worker: ... -h HOST, --host HOST Hostname to listen on {quote} The \-h or --host parameter actually doesn't work with the start-master.sh script. Instead, one has to set the SPARK_MASTER_IP variable prior to executing the script. Either the script or the documentation should be updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5360) For CoGroupedRDD, rdds for narrow dependencies and shuffle handles are included twice in serialized task
[ https://issues.apache.org/jira/browse/SPARK-5360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5360: - Component/s: Spark Core For CoGroupedRDD, rdds for narrow dependencies and shuffle handles are included twice in serialized task Key: SPARK-5360 URL: https://issues.apache.org/jira/browse/SPARK-5360 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor CoGroupPartition, part of CoGroupedRDD, includes references to each RDD that the CoGroupedRDD narrowly depends on, and a reference to the ShuffleHandle. The partition is serialized separately from the RDD, so when the RDD and partition arrive on the worker, the references in the partition and in the RDD no longer point to the same object. This is a relatively minor performance issue (the closure can be 2x larger than it needs to be because the rdds and partitions are serialized twice; see numbers below) but is more annoying as a developer issue (this is where I ran into it): if any state is stored in the RDD or ShuffleHandle on the worker side, subtle bugs can appear due to the fact that the references to the RDD / ShuffleHandle in the RDD and in the partition point to separate objects. I'm not sure if this is enough of a potential future problem to fix this old and central part of the code, so hoping to get input from others here. I did some simple experiments to see how much this affects closure size. For this example: $ val a = sc.parallelize(1 to 10).map((_, 1)) $ val b = sc.parallelize(1 to 2).map(x => (x, 2*x)) $ a.cogroup(b).collect() the closure was 1902 bytes with current Spark, and 1129 bytes after my change. The difference comes from eliminating duplicate serialization of the shuffle handle. For this example: $ val sortedA = a.sortByKey() $ val sortedB = b.sortByKey() $ sortedA.cogroup(sortedB).collect() the closure was 3491 bytes with current Spark, and 1333 bytes after my change. Here, the difference comes from eliminating duplicate serialization of the two RDDs for the narrow dependencies. The ShuffleHandle includes the ShuffleDependency, so this difference will get larger if a ShuffleDependency includes a serializer, a key ordering, or an aggregator (all set to None by default). However, the difference is not affected by the size of the function the user specifies, which (based on my understanding) is typically the source of large task closures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5423) ExternalAppendOnlyMap won't delete the temp spilled file if an exception occurs while using it
[ https://issues.apache.org/jira/browse/SPARK-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5423: - Component/s: Shuffle ExternalAppendOnlyMap won't delete the temp spilled file if an exception occurs while using it --- Key: SPARK-5423 URL: https://issues.apache.org/jira/browse/SPARK-5423 Project: Spark Issue Type: Improvement Components: Shuffle Reporter: Shixiong Zhu Priority: Minor ExternalAppendOnlyMap won't delete the temp spilled file if an exception occurs while using it. There is already a TODO in the comment:
{code}
// TODO: Ensure this gets called even if the iterator isn't drained.
private def cleanup() {
  batchIndex = batchOffsets.length // Prevent reading any other batch
  val ds = deserializeStream
  deserializeStream = null
  fileStream = null
  ds.close()
  file.delete()
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
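One common way to make such cleanup robust, sketched generically below (not the actual Spark code), is to wrap the spill-file iterator so the cleanup runs on normal exhaustion and can also be invoked explicitly, for example from a task-completion callback, when iteration is abandoned early.
{code}
// Generic sketch: wrap an iterator so a cleanup action (e.g. deleting the
// spill file) runs when the iterator is exhausted, and can also be called
// explicitly if iteration is abandoned partway through.
class CleanupIterator[A](underlying: Iterator[A], cleanup: () => Unit)
  extends Iterator[A] {

  private var cleaned = false

  def close(): Unit = {
    if (!cleaned) {
      cleaned = true
      cleanup()
    }
  }

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more) close() // normal exhaustion
    more
  }

  override def next(): A = underlying.next()
}
{code}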
[jira] [Updated] (SPARK-3132) Avoid serialization for Array[Byte] in TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-3132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3132: - Component/s: Spark Core Avoid serialization for Array[Byte] in TorrentBroadcast --- Key: SPARK-3132 URL: https://issues.apache.org/jira/browse/SPARK-3132 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Davies Liu If the input data is a byte array, we should allow TorrentBroadcast to skip serializing and compressing the input. To do this, we should add a new parameter (shortCircuitByteArray) to TorrentBroadcast, and then avoid serialization in if the input is byte array and shortCircuitByteArray is true. We should then also do compression in task serialization itself instead of doing it in TorrentBroadcast. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1866) Closure cleaner does not null shadowed fields when outer scope is referenced
[ https://issues.apache.org/jira/browse/SPARK-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1866: - Component/s: Spark Core Closure cleaner does not null shadowed fields when outer scope is referenced Key: SPARK-1866 URL: https://issues.apache.org/jira/browse/SPARK-1866 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Aaron Davidson Assignee: Kan Zhang Priority: Critical Take the following example:
{code}
val x = 5
val instances = new org.apache.hadoop.fs.Path("/") /* non-serializable */
sc.parallelize(0 until 10).map { _ =>
  val instances = 3
  (instances, x)
}.collect
{code}
This produces a java.io.NotSerializableException: org.apache.hadoop.fs.Path, despite the fact that the outer "instances" is not actually used within the closure. If you change the name of the outer variable "instances" to something else, the code executes correctly, indicating that it is the fact that the two variables share a name that causes the issue. Additionally, if the outer scope is not used (i.e., we do not reference x in the above example), the issue does not appear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3039. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4315 [https://github.com/apache/spark/pull/4315] Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API -- Key: SPARK-3039 URL: https://issues.apache.org/jira/browse/SPARK-3039 Project: Spark Issue Type: Bug Components: Build, Input/Output, Spark Core Affects Versions: 0.9.1, 1.0.0, 1.1.0, 1.2.0 Environment: hadoop2, hadoop-2.4.0, HDP-2.1 Reporter: Bertrand Bossy Assignee: Bertrand Bossy Priority: Critical Fix For: 1.3.0 The spark assembly contains the artifact org.apache.avro:avro-mapred as a dependency of org.spark-project.hive:hive-serde. The avro-mapred package provides a hadoop FileInputFormat to read and write avro files. There are two versions of this package, distinguished by a classifier. avro-mapred for the new Hadoop API uses the classifier "hadoop2". avro-mapred for the old Hadoop API uses no classifier. E.g. when reading avro files using
{code}
sc.newAPIHadoopFile[AvroKey[SomeClass], NullWritable, AvroKeyInputFormat[SomeClass]]("hdfs://path/to/file.avro")
{code}
The following error occurs:
{code}
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
  at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
  at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:111)
  at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
  at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  at org.apache.spark.scheduler.Task.run(Task.scala:51)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:744)
{code}
This error usually is a hint that there was a mix up of the old and the new Hadoop API. As a work-around, if avro-mapred for hadoop2 is forced to appear before the version that is bundled with Spark, reading avro files works fine. Also, if Spark is built using avro-mapred for hadoop2, it works fine as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
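As a build-side illustration of the work-around (forcing the hadoop2-classified artifact ahead of the transitively bundled hadoop1 one), an sbt fragment might look like this; the Avro version shown is an assumption, not taken from the issue:
{code}
// build.sbt fragment (illustrative): depend explicitly on the hadoop2 build of
// avro-mapred so it wins over the hadoop1 artifact pulled in transitively.
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.7.7" classifier "hadoop2"
{code}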
[jira] [Commented] (SPARK-5668) spark_ec2.py region parameter could be either mandatory or its value displayed
[ https://issues.apache.org/jira/browse/SPARK-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311264#comment-14311264 ] Apache Spark commented on SPARK-5668: - User 'MiguelPeralvo' has created a pull request for this issue: https://github.com/apache/spark/pull/4457 spark_ec2.py region parameter could be either mandatory or its value displayed -- Key: SPARK-5668 URL: https://issues.apache.org/jira/browse/SPARK-5668 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.2.0, 1.3.0, 1.4.0 Reporter: Miguel Peralvo Priority: Minor Labels: starter If the region parameter is not specified when invoking spark-ec2 (spark-ec2.py behind the scenes) it defaults to us-east-1. When the cluster doesn't belong to that region, after showing the "Searching for existing cluster Spark..." message, it causes an "ERROR: Could not find any existing cluster" exception because it doesn't find your cluster in the default region. As it doesn't tell you anything about the region, it can be a small headache for new users. In http://stackoverflow.com/questions/21171576/why-does-spark-ec2-fail-with-error-could-not-find-any-existing-cluster, Dmitriy Selivanov explains it. I propose that: 1. Either we make the search message a little bit more informative with something like "Searching for existing cluster Spark in region " + opts.region. 2. Or we remove us-east-1 as the default and make the --region parameter mandatory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4440) Enhance the job progress API to expose more information
[ https://issues.apache.org/jira/browse/SPARK-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4440: - Component/s: Spark Core Enhance the job progress API to expose more information --- Key: SPARK-4440 URL: https://issues.apache.org/jira/browse/SPARK-4440 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Rui Li The progress API introduced in SPARK-2321 provides a new way for user to monitor job progress. However the information exposed in the API is relatively limited. It'll be much more useful if we can enhance the API to expose more data. Some improvement for example may include but not limited to: 1. Stage submission and completion time. 2. Task metrics. The requirement is initially identified for the hive on spark project(HIVE-7292), other application should benefit as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-761) Print a nicer error message when incompatible Spark binaries try to talk
[ https://issues.apache.org/jira/browse/SPARK-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-761: Component/s: Spark Core Description: I don't know how possible this is, as incompatibilities manifest in many and low-level ways. Print a nicer error message when incompatible Spark binaries try to talk Key: SPARK-761 URL: https://issues.apache.org/jira/browse/SPARK-761 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia I don't know how possible this is, as incompatibilities manifest in many and low-level ways. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-672) Executor gets stuck in a zombie state after running out of memory
[ https://issues.apache.org/jira/browse/SPARK-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-672: Component/s: Spark Core Executor gets stuck in a zombie state after running out of memory --- Key: SPARK-672 URL: https://issues.apache.org/jira/browse/SPARK-672 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Mikhail Bautin Attachments: executor_jstack.txt, executor_stderr.txt, standalone_worker_jstack.txt As a result of running a workload, an executor ran out of memory, but the executor process stayed up. Also (not sure this is related) the standalone worker process stayed up but disappeared from the master web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-761) Print a nicer error message when incompatible Spark binaries try to talk
[ https://issues.apache.org/jira/browse/SPARK-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-761: Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) Print a nicer error message when incompatible Spark binaries try to talk Key: SPARK-761 URL: https://issues.apache.org/jira/browse/SPARK-761 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Priority: Minor Not sure what component this falls under, or if this is still an issue. Patrick Wendell / Matei Zaharia? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-672) Executor gets stuck in a zombie state after running out of memory
[ https://issues.apache.org/jira/browse/SPARK-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-672. - Resolution: Duplicate The right-er answer is to fail for lack of memory faster, per SPARK-1989. Executor gets stuck in a zombie state after running out of memory --- Key: SPARK-672 URL: https://issues.apache.org/jira/browse/SPARK-672 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Mikhail Bautin Attachments: executor_jstack.txt, executor_stderr.txt, standalone_worker_jstack.txt As a result of running a workload, an executor ran out of memory, but the executor process stayed up. Also (not sure this is related) the standalone worker process stayed up but disappeared from the master web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-704) ConnectionManager sometimes cannot detect loss of sending connections
[ https://issues.apache.org/jira/browse/SPARK-704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-704: Component/s: Spark Core ConnectionManager sometimes cannot detect loss of sending connections - Key: SPARK-704 URL: https://issues.apache.org/jira/browse/SPARK-704 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Charles Reiss Assignee: Henry Saputra ConnectionManager currently does not detect when SendingConnections disconnect except if it is trying to send through them. As a result, a node failure just after a connection is initiated but before any acknowledgement messages can be sent may result in a hang. ConnectionManager has code intended to detect this case by detecting the failure of a corresponding ReceivingConnection, but this code assumes that the remote host:port of the ReceivingConnection is the same as the ConnectionManagerId, which is almost never true. Additionally, there does not appear to be any reason to assume a corresponding ReceivingConnection will exist. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5065) Broadcast can still work after sc has been stopped.
[ https://issues.apache.org/jira/browse/SPARK-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5065: - Component/s: Spark Core Priority: Minor (was: Major) Broadcast can still work after sc has been stopped. --- Key: SPARK-5065 URL: https://issues.apache.org/jira/browse/SPARK-5065 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: SaintBacchus Priority: Minor Code as follows:
{code:borderStyle=solid}
val sc1 = new SparkContext
val sc2 = new SparkContext
sc1.stop
sc1.broadcast(1)
{code}
This still works, because sc1.broadcast will reuse the BlockManager in sc2. To fix it, throw a SparkException when the BroadcastManager has been stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
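A minimal sketch of the guard the reporter suggests, with an illustrative class and flag name rather than the actual BroadcastManager internals:
{code}
import org.apache.spark.SparkException

// Illustrative guard: once the manager is stopped, refuse new broadcasts
// instead of silently reusing another context's BlockManager.
class BroadcastManagerSketch {
  @volatile private var stopped = false

  def stop(): Unit = { stopped = true }

  def newBroadcast[T](value: T): T = {
    if (stopped) {
      throw new SparkException("Cannot create a broadcast: BroadcastManager has been stopped")
    }
    value // a real implementation would hand the value to the broadcast factory
  }
}
{code}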
[jira] [Updated] (SPARK-5332) Efficient way to deal with ExecutorLost
[ https://issues.apache.org/jira/browse/SPARK-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5332: - Component/s: Spark Core Priority: Minor (was: Major) Efficient way to deal with ExecutorLost --- Key: SPARK-5332 URL: https://issues.apache.org/jira/browse/SPARK-5332 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Liang-Chi Hsieh Priority: Minor Currently, the handler for the case when an executor is lost in DAGScheduler (handleExecutorLost) is not efficient. This PR adds a bit of extra information to the Stage class to improve that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4895) Support a shared RDD store among different Spark contexts
[ https://issues.apache.org/jira/browse/SPARK-4895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4895. -- Resolution: Duplicate Since I don't see additional work here, and this covers almost exactly the same ground as SPARK-2389, and I don't imagine Spark Core will do anything to share RDDs that Tachyon isn't already providing, I think this should be closed. Support a shared RDD store among different Spark contexts - Key: SPARK-4895 URL: https://issues.apache.org/jira/browse/SPARK-4895 Project: Spark Issue Type: New Feature Reporter: Zane Hu It seems a valid requirement to allow jobs from different Spark contexts to share RDDs. It would be limited if we only allow sharing RDDs within a SparkContext, as in Ooyala (SPARK-818). A more generic way for collaboration among jobs from different Spark contexts is to support a shared RDD store managed by a RDD store master and workers running in separate processes from SparkContext and executor JVMs. This shared RDD store doesn't do any RDD transformations, but accepts requests from jobs of different Spark contexts to read and write shared RDDs in memory or on disks on distributed machines, and manages the life cycle of these RDDs. Tachyon might be used for sharing data in this case. But I think Tachyon is more designed as an in-memory distributed file system for any applications, not only for RDDs and Spark. If people agree, I may draft out a design document for further discussions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4681) Turn on host level blacklisting by default
[ https://issues.apache.org/jira/browse/SPARK-4681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4681: - Component/s: Scheduler Turn on host level blacklisting by default -- Key: SPARK-4681 URL: https://issues.apache.org/jira/browse/SPARK-4681 Project: Spark Issue Type: Improvement Components: Scheduler Reporter: Patrick Wendell Assignee: Davies Liu Per discussion in https://github.com/apache/spark/pull/3541. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5311) EventLoggingListener throws exception if log directory does not exist
[ https://issues.apache.org/jira/browse/SPARK-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5311: - Component/s: Spark Core EventLoggingListener throws exception if log directory does not exist - Key: SPARK-5311 URL: https://issues.apache.org/jira/browse/SPARK-5311 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Josh Rosen Priority: Blocker If the log directory does not exist, EventLoggingListener throws an IllegalArgumentException. Here's a simple reproduction (using the master branch (1.3.0)): {code} ./bin/spark-shell --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/tmp/nonexistent-dir {code} where /tmp/nonexistent-dir is a directory that doesn't exist and /tmp exists. This results in the following exception: {code} 15/01/18 17:10:44 INFO HttpServer: Starting HTTP Server 15/01/18 17:10:44 INFO Utils: Successfully started service 'HTTP file server' on port 62729. 15/01/18 17:10:44 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041. 15/01/18 17:10:44 INFO Utils: Successfully started service 'SparkUI' on port 4041. 15/01/18 17:10:44 INFO SparkUI: Started SparkUI at http://joshs-mbp.att.net:4041 15/01/18 17:10:45 INFO Executor: Using REPL class URI: http://192.168.1.248:62726 15/01/18 17:10:45 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkdri...@joshs-mbp.att.net:62728/user/HeartbeatReceiver 15/01/18 17:10:45 INFO NettyBlockTransferService: Server created on 62730 15/01/18 17:10:45 INFO BlockManagerMaster: Trying to register BlockManager 15/01/18 17:10:45 INFO BlockManagerMasterActor: Registering block manager localhost:62730 with 265.4 MB RAM, BlockManagerId(driver, localhost, 62730) 15/01/18 17:10:45 INFO BlockManagerMaster: Registered BlockManager java.lang.IllegalArgumentException: Log directory /tmp/nonexistent-dir does not exist. 
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:90) at org.apache.spark.SparkContext.init(SparkContext.scala:363) at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:986) at $iwC$$iwC.init(console:9) at $iwC.init(console:18) at init(console:20) at .init(console:24) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:123) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122) at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:270) at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122) at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:60) at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:147) at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:60) at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106) at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:60) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:962) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at
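Regarding SPARK-5311 above, a friendlier startup check using the Hadoop FileSystem API could either create the missing directory or fail with a clearer message; whether to create or fail is a design choice, and this sketch is illustrative rather than what Spark ultimately adopted:
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative startup check for the event-log directory: create it if it is
// missing rather than failing later with an IllegalArgumentException.
def ensureLogDir(logDir: String, hadoopConf: Configuration): Unit = {
  val path = new Path(logDir)
  val fs = path.getFileSystem(hadoopConf)
  if (!fs.exists(path)) {
    if (!fs.mkdirs(path)) {
      throw new IllegalArgumentException(
        s"Log directory $logDir does not exist and could not be created.")
    }
  } else if (!fs.getFileStatus(path).isDirectory) {
    throw new IllegalArgumentException(s"Log directory $logDir is not a directory.")
  }
}
{code}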
[jira] [Updated] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark
[ https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4783: - Component/s: Spark Core System.exit() calls in SparkContext disrupt applications embedding Spark Key: SPARK-4783 URL: https://issues.apache.org/jira/browse/SPARK-4783 Project: Spark Issue Type: Bug Components: Spark Core Reporter: David Semeria A common architectural choice for integrating Spark within a larger application is to employ a gateway to handle Spark jobs. The gateway is a server which contains one or more long-running SparkContexts. A typical server is created with the following pseudo code:
var continue = true
while (continue) {
  try {
    server.run()
  } catch (e) {
    continue = log_and_examine_error(e)
  }
}
The problem is that SparkContext frequently calls System.exit when it encounters a problem, which means the server can only be re-spawned at the process level, which is much messier than the simple code above. Therefore, I believe it makes sense to replace all System.exit calls in SparkContext with the throwing of a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
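A runnable rendering of that pseudo code, assuming hypothetical runServer() and logAndExamineError() helpers; the key point is that this loop can only recover if failures surface as exceptions rather than System.exit() calls inside SparkContext:
{code}
object GatewaySketch {
  // Hypothetical stand-ins for the application's long-running Spark gateway.
  def runServer(): Unit = { /* create a SparkContext, serve jobs ... */ }
  def logAndExamineError(e: Throwable): Boolean = { e.printStackTrace(); true }

  def main(args: Array[String]): Unit = {
    var keepRunning = true
    while (keepRunning) {
      try {
        runServer()
        keepRunning = false // clean shutdown
      } catch {
        case e: Throwable =>
          // Only reachable if SparkContext signals fatal errors by throwing
          // instead of calling System.exit(), which is what the issue asks for.
          keepRunning = logAndExamineError(e)
      }
    }
  }
}
{code}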
[jira] [Updated] (SPARK-4723) Abort stages which have been attempted too many times
[ https://issues.apache.org/jira/browse/SPARK-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4723: - Component/s: Scheduler Abort stages which have been attempted too many times --- Key: SPARK-4723 URL: https://issues.apache.org/jira/browse/SPARK-4723 Project: Spark Issue Type: Improvement Components: Scheduler Reporter: YanTang Zhai Priority: Minor For some reason, some stages may be attempted many times. A threshold could be added, and stages which have been attempted more times than the threshold could be aborted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1346) Backport SPARK-1210 into 0.9 branch
[ https://issues.apache.org/jira/browse/SPARK-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1346: - Component/s: Spark Core Backport SPARK-1210 into 0.9 branch --- Key: SPARK-1346 URL: https://issues.apache.org/jira/browse/SPARK-1346 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Tathagata Das Labels: backport-needed We should backport this after the 0.9.1 release happens. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2958) FileClientHandler should not be shared in the pipeline
[ https://issues.apache.org/jira/browse/SPARK-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2958: - Component/s: Spark Core FileClientHandler should not be shared in the pipeline -- Key: SPARK-2958 URL: https://issues.apache.org/jira/browse/SPARK-2958 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Reynold Xin Netty module creates a single FileClientHandler and shares it in all threads. We should create a new one for each pipeline thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-839) Bug in how failed executors are removed by ID from standalone cluster
[ https://issues.apache.org/jira/browse/SPARK-839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-839: Component/s: Spark Core Bug in how failed executors are removed by ID from standalone cluster - Key: SPARK-839 URL: https://issues.apache.org/jira/browse/SPARK-839 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.0, 0.7.3 Reporter: Mark Hamstra Priority: Critical ClearStory data reported the following issue, where some hashmaps are indexed by executorId and some by appId/executorId, and we use the wrong string to search for an executor: https://github.com/clearstorydata/spark/pull/9. This affects FT on the standalone mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3965) Spark assembly for hadoop2 contains avro-mapred for hadoop1
[ https://issues.apache.org/jira/browse/SPARK-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3965. -- Resolution: Duplicate Spark assembly for hadoop2 contains avro-mapred for hadoop1 --- Key: SPARK-3965 URL: https://issues.apache.org/jira/browse/SPARK-3965 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.0.2, 1.1.0, 1.2.0 Environment: hadoop2, HDP2.1 Reporter: David Jacot When building the Spark assembly for hadoop2, org.apache.avro:avro-mapred for hadoop1 is picked up and added to the assembly, which leads to the following exception at runtime.
{code}
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
  at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
  at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:111)
  ...
{code}
The patch for SPARK-3039 works well at compile time, but the artifact's classifier is not applied when the assembly is built. I'm not a Maven expert, but I don't think that classifiers are applied to transitive dependencies. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5334) NullPointerException when getting files from S3 (hadoop 2.3+)
[ https://issues.apache.org/jira/browse/SPARK-5334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311324#comment-14311324 ] Kevin (Sangwoo) Kim commented on SPARK-5334: [~srowen] Oh thanks! I'll test it. NullPointerException when getting files from S3 (hadoop 2.3+) - Key: SPARK-5334 URL: https://issues.apache.org/jira/browse/SPARK-5334 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.2.0 Environment: Spark 1.2 built with Hadoop 2.3+ Reporter: Kevin (Sangwoo) Kim In Spark 1.2 built with Hadoop 2.3+, unable to get files from AWS S3. Same codes works well with same setup in Spark built with Hadoop 2.2-. I saw that jets3t version changed in profile with Hadoop 2.3+, I guess there might be an issue with it. === scala sc.textFile(s3n://logs/log.2014-12-05.gz).count 15/01/20 11:22:40 INFO MemoryStore: ensureFreeSpace(104533) called with curMem=0, maxMem=27783541555 15/01/20 11:22:40 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 102.1 KB, free 25.9 GB) java.lang.NullPointerException at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433) at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57) at org.apache.hadoop.fs.Globber.glob(Globber.java:248) at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1642) at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1157) at org.apache.spark.rdd.RDD.count(RDD.scala:904) at $iwC$$iwC$$iwC$$iwC.init(console:13) at $iwC$$iwC$$iwC.init(console:18) at $iwC$$iwC.init(console:20) at $iwC.init(console:22) at init(console:24) at .init(console:28) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:625) 
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:633) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:638) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:963) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:911) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1006) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
[jira] [Updated] (SPARK-5225) Support coalesced Input Metrics from different sources
[ https://issues.apache.org/jira/browse/SPARK-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5225: - Component/s: Spark Core Support coalesced Input Metrics from different sources - Key: SPARK-5225 URL: https://issues.apache.org/jira/browse/SPARK-5225 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Kostas Sakellis Currently, if a task reads data from more than one block and the blocks are read through different read methods, we ignore the bytes from the second read method. For example:
{noformat}
   CoalescedRDD
        |
      Task1
     /  |  \
hadoop hadoop cached
{noformat}
If Task1 starts reading from the hadoop blocks first, then the input metrics for Task1 will only contain the input metrics from the hadoop blocks and ignore the input metrics from the cached blocks. We need to change the way we collect input metrics so that it is not a single value but rather a collection of input metrics for a task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
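To make the SPARK-5225 proposal concrete, here is a minimal sketch of what per-read-method input metrics could look like; ReadMethod and TaskInputMetrics are hypothetical names, not the actual Spark metrics API.
{code}
// Sketch only: per-read-method byte counts instead of a single value per task.
object ReadMethod extends Enumeration {
  val Hadoop, Memory, Disk, Network = Value
}

class TaskInputMetrics {
  private val bytesByMethod = scala.collection.mutable.Map.empty[ReadMethod.Value, Long]

  // Called for every block the task reads, whichever source it came from.
  def recordRead(method: ReadMethod.Value, bytes: Long): Unit =
    bytesByMethod(method) = bytesByMethod.getOrElse(method, 0L) + bytes

  // The task's total input is the sum over all read methods.
  def totalBytesRead: Long = bytesByMethod.values.sum

  def byMethod: Map[ReadMethod.Value, Long] = bytesByMethod.toMap
}
{code}
With something like this, a task that reads two hadoop blocks and one cached block would report bytes under both read methods instead of dropping the second source.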
[jira] [Updated] (SPARK-2666) when task is FetchFailed cancel running tasks of failedStage
[ https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2666: - Component/s: Spark Core when task is FetchFailed cancel running tasks of failedStage Key: SPARK-2666 URL: https://issues.apache.org/jira/browse/SPARK-2666 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Lianhui Wang In DAGScheduler's handleTaskCompletion, when the failure reason of a task is FetchFailed, cancel the running tasks of the failed stage before adding it to the failedStages queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
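A simplified sketch of the ordering SPARK-2666 asks for, using stand-in types rather than the real DAGScheduler: on a FetchFailed completion, the stage's still-running tasks are cancelled before the stage is queued for resubmission.
{code}
// Simplified stand-in types; the point is the ordering: cancel before queueing.
case class Stage(id: Int)

sealed trait TaskEndReason
case object Success extends TaskEndReason
case class FetchFailed(failedStage: Stage) extends TaskEndReason

class MiniScheduler(cancelRunningTasks: Stage => Unit) {
  val failedStages = scala.collection.mutable.Queue.empty[Stage]

  def handleTaskCompletion(reason: TaskEndReason): Unit = reason match {
    case FetchFailed(failedStage) =>
      // Cancel the stage's still-running tasks first so they stop consuming
      // executors while the stage waits to be resubmitted.
      cancelRunningTasks(failedStage)
      if (!failedStages.contains(failedStage)) failedStages.enqueue(failedStage)
    case Success =>
      () // nothing to do in this sketch
  }
}
{code}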
[jira] [Updated] (SPARK-4087) Only use broadcast for large tasks
[ https://issues.apache.org/jira/browse/SPARK-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4087: - Component/s: Spark Core Only use broadcast for large tasks -- Key: SPARK-4087 URL: https://issues.apache.org/jira/browse/SPARK-4087 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Davies Liu Priority: Critical After we switched to broadcasting every task, some regressions were introduced because broadcast is not stable enough. So we would like to use broadcast only for large tasks, which keeps the same behaviour as 1.0 for most cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
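A hedged sketch of the size-threshold decision SPARK-4087 describes; the threshold value and the surrounding names are assumptions, not Spark's actual task-launch internals.
{code}
// Sketch of a size-threshold check; names and the 100 KB default are assumptions.
object TaskLauncher {
  val taskBroadcastThreshold: Long = 100L * 1024 // e.g. 100 KB, an assumed default

  sealed trait TaskPayload
  case class DirectPayload(bytes: Array[Byte]) extends TaskPayload
  case class BroadcastPayload(bytes: Array[Byte]) extends TaskPayload // stand-in for a Broadcast handle

  def prepare(serializedTask: Array[Byte]): TaskPayload =
    if (serializedTask.length >= taskBroadcastThreshold)
      BroadcastPayload(serializedTask) // large task: worth the broadcast overhead
    else
      DirectPayload(serializedTask)    // small task: ship directly, as in Spark 1.0
}
{code}
Small serialized tasks are shipped directly as in 1.0, and only tasks above the threshold pay the broadcast path.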
[jira] [Updated] (SPARK-4059) spark-master/spark-worker may use SPARK_MASTER_IP/STANDALONE_SPARK_MASTER_HOST
[ https://issues.apache.org/jira/browse/SPARK-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4059: - Priority: Minor (was: Major) It's not clear what this means. spark-master/spark-worker may use SPARK_MASTER_IP/STANDALONE_SPARK_MASTER_HOST -- Key: SPARK-4059 URL: https://issues.apache.org/jira/browse/SPARK-4059 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Guo Ruijing Priority: Minor spark-master/spark-worker may use SPARK_MASTER_IP/STANDALONE_SPARK_MASTER_HOST Existing implementation: spark-master uses SPARK_MASTER_IP and spark-worker uses STANDALONE_SPARK_MASTER_HOST. Proposed implementation: spark-master/spark-worker may both use SPARK_MASTER_IP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4059) spark-master/spark-worker may use SPARK_MASTER_IP/STANDALONE_SPARK_MASTER_HOST
[ https://issues.apache.org/jira/browse/SPARK-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4059: - Component/s: Deploy spark-master/spark-worker may use SPARK_MASTER_IP/STANDALONE_SPARK_MASTER_HOST -- Key: SPARK-4059 URL: https://issues.apache.org/jira/browse/SPARK-4059 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Guo Ruijing Priority: Minor spark-master/spark-worker may use SPARK_MASTER_IP/STANDALONE_SPARK_MASTER_HOST Existing implementation: spark-master uses SPARK_MASTER_IP and spark-worker uses STANDALONE_SPARK_MASTER_HOST. Proposed implementation: spark-master/spark-worker may both use SPARK_MASTER_IP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4539) History Server counts incomplete applications against the retainedApplications total, fails to show eligible completed applications
[ https://issues.apache.org/jira/browse/SPARK-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4539: - Component/s: Spark Core History Server counts incomplete applications against the retainedApplications total, fails to show eligible completed applications - Key: SPARK-4539 URL: https://issues.apache.org/jira/browse/SPARK-4539 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams I have observed the history server return 0 or 1 applications from a directory that contains many complete and incomplete applications (the latter being application directories that are missing the {{APPLICATION_COMPLETE}} file). Without having dug too much, my theory is that HistoryServer is seeing the incomplete directories and counting them against the {{retainedApplications}} maximum but not displaying them. One supporting anecdote for this is that I loaded HS against a directory that had one complete application and nothing else, and HS worked as expected (I saw the one application in the web UI). I then copied ~100 other application directories in, the majority of which were incomplete (in particular, most of the ones that had the earliest timestamps), and still only saw the one original completed application via the web UI. Finally, I restarted the same server with the {{retainedApplications}} set to 1000 (instead of 50; the directory at this point had ~10 completed applications and 90 incomplete ones), and saw all/exactly the completed applications, leading me to believe that they were being boxed out of the maximum-50-retained-applications iteration of the history server. Silently failing on incomplete directories while still docking the count, if that is indeed what is happening, is a pretty confusing failure mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
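If the theory in SPARK-4539 is right, the fix amounts to filtering incomplete application directories out before applying the retainedApplications cap. A minimal sketch, with a hypothetical AppInfo type rather than the real HistoryServer code:
{code}
// Hypothetical AppInfo, not the real HistoryServer types. Incomplete applications
// are dropped before the retainedApplications limit is applied, so they no longer
// "dock the count" for completed ones.
case class AppInfo(id: String, completed: Boolean, endTime: Long)

def selectRetained(scanned: Seq[AppInfo], retainedApplications: Int): Seq[AppInfo] =
  scanned
    .filter(_.completed)        // ignore directories missing APPLICATION_COMPLETE
    .sortBy(-_.endTime)         // keep the most recently finished applications
    .take(retainedApplications)
{code}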
[jira] [Updated] (SPARK-4560) Lambda deserialization error
[ https://issues.apache.org/jira/browse/SPARK-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4560: - Component/s: Spark Core Lambda deserialization error Key: SPARK-4560 URL: https://issues.apache.org/jira/browse/SPARK-4560 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.1.1 Environment: Java 8.0.25 Reporter: Alexis Seigneurin Attachments: IndexTweets.java, pom.xml I'm getting an error saying a lambda could not be deserialized. Here is the code:
{code}
TwitterUtils.createStream(sc, twitterAuth, filters)
    .map(t -> t.getText())
    .foreachRDD(tweets -> {
        tweets.foreach(x -> System.out.println(x));
        return null;
    });
{code}
Here is the exception: {noformat} java.io.IOException: unexpected exception type at java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1538) at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1110) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1810) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104) ... 27 more Caused by: java.lang.IllegalArgumentException: Invalid lambda deserialization at com.seigneurin.spark.IndexTweets.$deserializeLambda$(IndexTweets.java:1) ... 37 more {noformat} The weird thing is, if I write the following code (the map operation is inside the foreachRDD), it works without problem.
{code}
TwitterUtils.createStream(sc, twitterAuth, filters)
    .foreachRDD(tweets -> {
        tweets.map(t -> t.getText())
              .foreach(x -> System.out.println(x));
        return null;
    });
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)
[ https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5594: - Component/s: Spark Core SparkException: Failed to get broadcast (TorrentBroadcast) -- Key: SPARK-5594 URL: https://issues.apache.org/jira/browse/SPARK-5594 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: John Sandiford Priority: Critical I am uncertain whether this is a bug, however I am getting the error below when running on a cluster (works locally), and have no idea what is causing it, or where to look for more information. Any help is appreciated. Others appear to experience the same issue, but I have not found any solutions online. Please note that this only happens with certain code and is repeatable, all my other spark jobs work fine. ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: Lost task 3.3 in stage 6.0 (TID 24, removed): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of broadcast_6 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87) at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of broadcast_6 at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008) ... 
11 more Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
[jira] [Updated] (SPARK-2319) Number of tasks on executors become negative after executor failures
[ https://issues.apache.org/jira/browse/SPARK-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2319: - Component/s: Web UI Number of tasks on executors become negative after executor failures Key: SPARK-2319 URL: https://issues.apache.org/jira/browse/SPARK-2319 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Andrew Or Attachments: num active tasks become negative (-16).jpg See attached screenshot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2520) the executor is thrown java.io.StreamCorruptedException
[ https://issues.apache.org/jira/browse/SPARK-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2520: - Component/s: Shuffle the executor is thrown java.io.StreamCorruptedException --- Key: SPARK-2520 URL: https://issues.apache.org/jira/browse/SPARK-2520 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.0.0 Reporter: Guoqiang Li Priority: Critical This issue occurs with a very small probability. I can not reproduce it. The executor log: {code} 14/07/15 21:54:50 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@sanshan:34429 14/07/15 21:54:50 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@sanshan:31934 14/07/15 21:54:50 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@sanshan:30557 14/07/15 21:54:50 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@sanshan:42606 14/07/15 21:54:50 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@sanshan:37314 14/07/15 21:54:50 INFO scheduler.TaskSetManager: Starting task 0.0:166 as TID 4948 on executor 20: tuan221 (PROCESS_LOCAL) 14/07/15 21:54:50 INFO scheduler.TaskSetManager: Serialized task 0.0:166 as 3129 bytes in 1 ms 14/07/15 21:54:50 WARN scheduler.TaskSetManager: Lost TID 4868 (task 0.0:86) 14/07/15 21:54:50 WARN scheduler.TaskSetManager: Loss was due to java.io.StreamCorruptedException java.io.StreamCorruptedException: invalid type code: AC at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1377) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:87) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$3.apply(PairRDDFunctions.scala:101) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$3.apply(PairRDDFunctions.scala:100) at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582) at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 14/07/15 21:54:50 INFO scheduler.TaskSetManager: Starting task 0.0:86 as TID 4949 on executor 20: tuan221 (PROCESS_LOCAL) 14/07/15 21:54:50 INFO 
scheduler.TaskSetManager: Serialized task 0.0:86 as 3129 bytes in 0 ms 14/07/15 21:54:50 WARN scheduler.TaskSetManager: Lost TID 4785 (task 0.0:3) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5377) Dynamically add jar into Spark Driver's classpath.
[ https://issues.apache.org/jira/browse/SPARK-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5377: - Component/s: Spark Core Dynamically add jar into Spark Driver's classpath. -- Key: SPARK-5377 URL: https://issues.apache.org/jira/browse/SPARK-5377 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Chengxiang Li Spark supports dynamically adding a jar to the executor classpath through SparkContext::addJar(), but it does not support dynamically adding a jar to the driver classpath. In most cases (if not all), a user dynamically adds a jar with SparkContext::addJar() because some classes from the jar will be referenced in an upcoming Spark job, which means the classes need to be loaded on the Spark driver side as well, e.g. during serialization. I think it makes sense to add an API that adds a jar to the driver classpath, or to just make this part of SparkContext::addJar(). HIVE-9410 is a real case from Hive on Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
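For illustration of what "adding a jar to the driver classpath" means at the JVM level, here is a sketch using a plain URLClassLoader; this is not Spark's API, just the mechanism SPARK-5377 asks Spark to expose.
{code}
import java.net.{URL, URLClassLoader}

// Illustration of the JVM-level mechanism only, not Spark's API: wrap the current
// context class loader in a URLClassLoader that also knows about the new jar, so
// classes from it become loadable in this (driver) process.
def addJarToDriverClasspath(jarPath: String): Unit = {
  val current = Thread.currentThread().getContextClassLoader
  val withJar = new URLClassLoader(Array(new URL("file:" + jarPath)), current)
  Thread.currentThread().setContextClassLoader(withJar)
}
{code}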
[jira] [Updated] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark
[ https://issues.apache.org/jira/browse/SPARK-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5113: - Component/s: Spark Core Audit and document use of hostnames and IP addresses in Spark - Key: SPARK-5113 URL: https://issues.apache.org/jira/browse/SPARK-5113 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Priority: Critical Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. That hostname is also the one used for the akka system identifier (akka supports only supplying a single name which it uses both as the bind interface and as the actor identifier). In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. The best outcome would be to have three configs that can be set on each machine: {code} SPARK_LOCAL_IP # Ip address we bind to for all services SPARK_INTERNAL_HOSTNAME # Hostname we advertise to remote processes within the cluster SPARK_EXTERNAL_HOSTNAME # Hostname we advertise to processes outside the cluster (e.g. the UI) {code} It's not clear how easily we can support that scheme while providing backwards compatibility. The last one (SPARK_EXTERNAL_HOSTNAME) is easy - it's just an alias for what is now SPARK_PUBLIC_DNS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
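The lookup behaviour described in SPARK-5113 (pick the first non-loopback interface, then reverse-resolve a hostname) can be sketched with plain java.net calls; this is an illustration only, not Spark's actual Utils code.
{code}
import java.net.{InetAddress, NetworkInterface}
import scala.jdk.CollectionConverters._

// Plain java.net sketch of the lookup described above, not Spark's Utils code:
// take the first non-loopback address on any interface, then use its reverse-DNS
// name as the hostname advertised to other processes.
def findLocalHostname(): String = {
  val candidates = for {
    iface <- NetworkInterface.getNetworkInterfaces.asScala
    addr  <- iface.getInetAddresses.asScala
    if !addr.isLoopbackAddress
  } yield addr
  val chosen = if (candidates.hasNext) candidates.next() else InetAddress.getLocalHost
  chosen.getCanonicalHostName // reverse DNS lookup for the advertised name
}
{code}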
[jira] [Updated] (SPARK-2913) Spark's log4j.properties should always appear ahead of Hadoop's on classpath
[ https://issues.apache.org/jira/browse/SPARK-2913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2913: - Component/s: Deploy Spark's log4j.properties should always appear ahead of Hadoop's on classpath Key: SPARK-2913 URL: https://issues.apache.org/jira/browse/SPARK-2913 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0, 1.0.2, 1.1.0 Reporter: Josh Rosen Assignee: Josh Rosen In the current {{compute-classpath}} scripts, the Hadoop conf directory may appear before Spark's conf directory in the computed classpath. This leads to Hadoop's log4j.properties being used instead of Spark's, preventing users from easily changing Spark's logging settings. To fix this, we should add a new classpath entry for Spark's log4j.properties file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4351) Record cacheable RDD reads and display RDD miss rates
[ https://issues.apache.org/jira/browse/SPARK-4351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4351: - Component/s: Spark Core Record cacheable RDD reads and display RDD miss rates - Key: SPARK-4351 URL: https://issues.apache.org/jira/browse/SPARK-4351 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Charles Reiss Priority: Minor Currently, when Spark fails to keep an RDD cached, there is little visibility to the user (beyond performance effects), especially if the user is not reading executor logs. We could expose this information to the Web UI and the event log like we do for RDD storage information by reporting RDD reads and their results with task metrics. From this, live computation of RDD miss rates is straightforward, and information in the event log would enable more complicated post-hoc analyses. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
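A sketch of the reporting SPARK-4351 proposes, with hypothetical names rather than Spark's task-metrics API: each cacheable read records whether the block was a cache hit, and a miss rate is computed from those events.
{code}
// Hypothetical event shape, not Spark's task-metrics API: one record per cacheable
// RDD read, noting whether the block was found in the cache.
case class RddReadEvent(rddId: Int, partition: Int, cacheHit: Boolean)

// Miss rate for one RDD, computable live in the UI or post hoc from the event log.
def missRate(events: Seq[RddReadEvent], rddId: Int): Double = {
  val reads = events.filter(_.rddId == rddId)
  if (reads.isEmpty) 0.0
  else reads.count(!_.cacheHit).toDouble / reads.size
}
{code}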
[jira] [Updated] (SPARK-4605) Proposed Contribution: Spark Kernel to enable interactive Spark applications
[ https://issues.apache.org/jira/browse/SPARK-4605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4605: - Component/s: Project Infra Proposed Contribution: Spark Kernel to enable interactive Spark applications Key: SPARK-4605 URL: https://issues.apache.org/jira/browse/SPARK-4605 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Chip Senkbeil Attachments: Kernel Architecture Widescreen.pdf, Kernel Architecture.pdf Project available on Github: https://github.com/ibm-et/spark-kernel This architecture is describing running kernel code that was demonstrated at the StrataConf in Barcelona, Spain. Enables applications to interact with a Spark cluster using Scala in several ways: * Defining and running core Spark Tasks * Collecting results from a cluster without needing to write to external data store ** Ability to stream results using well-defined protocol * Arbitrary Scala code definition and execution (without submitting heavy-weight jars) Applications can be hosted and managed separate from the Spark cluster using the kernel as a proxy to communicate requests. The Spark Kernel implements the server side of the IPython Kernel protocol, the rising “de-facto” protocol for language (Python, Haskell, etc.) execution. Inherits a suite of industry adopted clients such as the IPython Notebook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5581) When writing sorted map output file, avoid open / close between each partition
[ https://issues.apache.org/jira/browse/SPARK-5581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5581: - Component/s: Shuffle When writing sorted map output file, avoid open / close between each partition -- Key: SPARK-5581 URL: https://issues.apache.org/jira/browse/SPARK-5581 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.3.0 Reporter: Sandy Ryza
{code}
// Bypassing merge-sort; get an iterator by partition and just write everything directly.
for ((id, elements) <- this.partitionedIterator) {
  if (elements.hasNext) {
    val writer = blockManager.getDiskWriter(
      blockId, outputFile, ser, fileBufferSize, context.taskMetrics.shuffleWriteMetrics.get)
    for (elem <- elements) {
      writer.write(elem)
    }
    writer.commitAndClose()
    val segment = writer.fileSegment()
    lengths(id) = segment.length
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
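One possible shape for the improvement in the SPARK-5581 title, assuming a hypothetical SegmentWriter rather than Spark's real disk writer: keep a single writer open across partitions and derive each partition's segment length from byte offsets, instead of an open / commitAndClose cycle per partition.
{code}
// Hypothetical SegmentWriter, not Spark's BlockObjectWriter: one writer stays open
// for the whole output file, and per-partition segment lengths come from byte offsets.
trait SegmentWriter {
  def write(elem: Any): Unit
  def bytesWritten: Long
  def close(): Unit
}

def writeAllPartitions(
    partitionedIterator: Iterator[(Int, Iterator[Any])],
    writer: SegmentWriter,
    lengths: Array[Long]): Unit = {
  try {
    for ((id, elements) <- partitionedIterator) {
      val start = writer.bytesWritten
      elements.foreach(writer.write)
      lengths(id) = writer.bytesWritten - start // segment length for this partition
    }
  } finally {
    writer.close() // a single close instead of one commitAndClose per partition
  }
}
{code}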
[jira] [Updated] (SPARK-5607) NullPointerException in objenesis
[ https://issues.apache.org/jira/browse/SPARK-5607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5607: - Component/s: Tests NullPointerException in objenesis - Key: SPARK-5607 URL: https://issues.apache.org/jira/browse/SPARK-5607 Project: Spark Issue Type: Bug Components: Tests Reporter: Reynold Xin Assignee: Patrick Wendell Fix For: 1.3.0 Tests are sometimes failing with the following exception. The problem might be that Kryo is using a different version of objenesis from Mockito. {code} [info] - Process succeeds instantly *** FAILED *** (107 milliseconds) [info] java.lang.NullPointerException: [info] at org.objenesis.strategy.StdInstantiatorStrategy.newInstantiatorOf(StdInstantiatorStrategy.java:52) [info] at org.objenesis.ObjenesisBase.getInstantiatorOf(ObjenesisBase.java:90) [info] at org.objenesis.ObjenesisBase.newInstance(ObjenesisBase.java:73) [info] at org.mockito.internal.creation.jmock.ClassImposterizer.createProxy(ClassImposterizer.java:111) [info] at org.mockito.internal.creation.jmock.ClassImposterizer.imposterise(ClassImposterizer.java:51) [info] at org.mockito.internal.util.MockUtil.createMock(MockUtil.java:52) [info] at org.mockito.internal.MockitoCore.mock(MockitoCore.java:41) [info] at org.mockito.Mockito.mock(Mockito.java:1014) [info] at org.mockito.Mockito.mock(Mockito.java:909) [info] at org.apache.spark.deploy.worker.DriverRunnerTest$$anonfun$1.apply$mcV$sp(DriverRunnerTest.scala:50) [info] at org.apache.spark.deploy.worker.DriverRunnerTest$$anonfun$1.apply(DriverRunnerTest.scala:47) [info] at org.apache.spark.deploy.worker.DriverRunnerTest$$anonfun$1.apply(DriverRunnerTest.scala:47) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) [info] at scala.collection.immutable.List.foreach(List.scala:318) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) [info] at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) [info] at org.scalatest.Suite$class.run(Suite.scala:1424) 
[info] at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) [info] at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:545) [info] at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) [info] at org.scalatest.FunSuite.run(FunSuite.scala:1555) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:294) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:284) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:262) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [info] at java.lang.Thread.run(Thread.java:745) {code} More
[jira] [Updated] (SPARK-5654) Integrate SparkR into Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5654: - Component/s: Project Infra Integrate SparkR into Apache Spark -- Key: SPARK-5654 URL: https://issues.apache.org/jira/browse/SPARK-5654 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Shivaram Venkataraman The SparkR project [1] provides a light-weight frontend to launch Spark jobs from R. The project was started at the AMPLab around a year ago and has been incubated as its own project to make sure it can be easily merged into upstream Spark, i.e. not introduce any external dependencies etc. SparkR’s goals are similar to PySpark and shares a similar design pattern as described in our meetup talk[2], Spark Summit presentation[3]. Integrating SparkR into the Apache project will enable R users to use Spark out of the box and given R’s large user base, it will help the Spark project reach more users. Additionally, work in progress features like providing R integration with ML Pipelines and Dataframes can be better achieved by development in a unified code base. SparkR is available under the Apache 2.0 License and does not have any external dependencies other than requiring users to have R and Java installed on their machines. SparkR’s developers come from many organizations including UC Berkeley, Alteryx, Intel and we will support future development, maintenance after the integration. [1] https://github.com/amplab-extras/SparkR-pkg [2] http://files.meetup.com/3138542/SparkR-meetup.pdf [3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5319) Choosing partition size instead of count
[ https://issues.apache.org/jira/browse/SPARK-5319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5319: - Component/s: Spark Core Choosing partition size instead of count Key: SPARK-5319 URL: https://issues.apache.org/jira/browse/SPARK-5319 Project: Spark Issue Type: Brainstorming Components: Spark Core Reporter: Idan Zalzberg With the current API, there are multiple places where you can set the partition count when reading from sources. However, in my experience it is sometimes more useful to set the partition size (in MB) and infer the count from that. In my experience Spark is sensitive to the partition size: if partitions are too big, the amount of memory needed per core goes up, and if they are too small, stage times increase significantly. So I'd like to stay in the sweet spot of partition size without changing the partition count around until I find it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
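A sketch of the SPARK-5319 idea under the current API: pick a target partition size, derive a partition count from the total input size, and pass it to an existing call such as SparkContext.textFile(path, minPartitions). The Hadoop FileSystem size lookup is just one assumed way to get the total size; it is not part of the proposal itself.
{code}
import org.apache.spark.SparkContext

// Sketch: derive a partition count from a target partition size and hand it to the
// existing textFile(path, minPartitions) API.
def readWithTargetPartitionSize(sc: SparkContext, path: String, targetPartitionMB: Int) = {
  val fs = org.apache.hadoop.fs.FileSystem.get(sc.hadoopConfiguration)
  val totalBytes = fs.getContentSummary(new org.apache.hadoop.fs.Path(path)).getLength
  val targetBytes = targetPartitionMB.toLong * 1024 * 1024
  val numPartitions = math.max(1, math.ceil(totalBytes.toDouble / targetBytes).toInt)
  sc.textFile(path, numPartitions)
}
{code}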
[jira] [Resolved] (SPARK-5340) Spark startup in local mode should not always create HTTP file server
[ https://issues.apache.org/jira/browse/SPARK-5340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5340. -- Resolution: Won't Fix Per PR discussion, WontFix. Spark startup in local mode should not always create HTTP file server - Key: SPARK-5340 URL: https://issues.apache.org/jira/browse/SPARK-5340 Project: Spark Issue Type: Improvement Reporter: Paul R. Brown In particular, I don't want the HTTP file server. The ui and other components can be disabled via configuration parameters, and the HTTP file server should receive similar treatment (IMHO). Created PR to just never create it in local mode: https://github.com/apache/spark/pull/4125 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5616) Add examples for PySpark API
[ https://issues.apache.org/jira/browse/SPARK-5616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dongxu updated SPARK-5616: -- Description: PySpark API examples are less than Spark scala API. For example: 1.Broadcast: how to use broadcast operation API 2.Module: how to import a other python file in zip file. Add more examples for freshman who wanna use PySpark. was: PySpark API examples are less than Spark scala API. For example: 1.Boardcast: how to use boardcast operation APi 2.Module: how to import a other python file in zip file. Add more examples for freshman who wanna use PySpark. Add examples for PySpark API Key: SPARK-5616 URL: https://issues.apache.org/jira/browse/SPARK-5616 Project: Spark Issue Type: Improvement Components: PySpark Reporter: dongxu Priority: Minor Labels: examples, pyspark, python PySpark API examples are less than Spark scala API. For example: 1.Broadcast: how to use broadcast operation API 2.Module: how to import a other python file in zip file. Add more examples for freshman who wanna use PySpark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5625) Spark binaries do not incude Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311447#comment-14311447 ] DeepakVohra commented on SPARK-5625: Extracting/opening with WinZip is only to verify the archive is valid. The following indicate that the spark assembly jar is not a valid archive. 1. Even though the assembly jar is in the classpath, a Spark application does not find the classes in the assembly jar. 2. The assembly jar does not get opened/extracted with WinZip which generates the error: http://s763.photobucket.com/user/dvohra10/media/SparkAssembly_zps4319294c.jpg.html?o=0 All indicators suggest the assembly jar is not a valid archive. Adding a Spark core artifact jar to the same directory, the lib directory of Spark binaries, adds the classes from the Spark Core to the classpath. Could it be verified: 1. The assembly jar gets extracted and is a valid archive? 2. Adding the jar in the classpath adds the classes to classpath? Spark binaries do not incude Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5625) Spark binaries do not incude Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311447#comment-14311447 ] DeepakVohra edited comment on SPARK-5625 at 2/8/15 5:56 PM: The jar tf does list the Spark classes, which verifies the Binaries include the Spark artifact classes. The issue subject should be modified to: Is the Spark Assembly a Valid Archive? Extracting/opening with WinZip is only to verify the archive is valid. The following indicate that the spark assembly jar is not a valid archive. 1. Even though the assembly jar is in the classpath, a Spark application does not find the classes in the assembly jar. 2. The assembly jar does not get opened/extracted with WinZip which generates the error: http://s763.photobucket.com/user/dvohra10/media/SparkAssembly_zps4319294c.jpg.html?o=0 All indicators suggest the assembly jar is not a valid archive. Adding a Spark core artifact jar to the same directory, the lib directory of Spark binaries, adds the classes from the Spark Core to the classpath. Could it be verified: 1. The assembly jar gets extracted and is a valid archive? 2. Adding the jar in the classpath adds the classes to classpath? was (Author: dvohra): Extracting/opening with WinZip is only to verify the archive is valid. The following indicate that the spark assembly jar is not a valid archive. 1. Even though the assembly jar is in the classpath, a Spark application does not find the classes in the assembly jar. 2. The assembly jar does not get opened/extracted with WinZip which generates the error: http://s763.photobucket.com/user/dvohra10/media/SparkAssembly_zps4319294c.jpg.html?o=0 All indicators suggest the assembly jar is not a valid archive. Adding a Spark core artifact jar to the same directory, the lib directory of Spark binaries, adds the classes from the Spark Core to the classpath. Could it be verified: 1. The assembly jar gets extracted and is a valid archive? 2. Adding the jar in the classpath adds the classes to classpath? Spark binaries do not incude Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-761) Print a nicer error message when incompatible Spark binaries try to talk
[ https://issues.apache.org/jira/browse/SPARK-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311475#comment-14311475 ] Patrick Wendell commented on SPARK-761: --- I think the main thing to catch would be Akka. I.e. try connecting different versions and seeing what happens as an exploratory step. For instance, if akka has a standard exception which says you had an incompatible message type, we can wrap that and give an outer exception explaining that the spark version is likely wrong. So maybe we can see if someone wants to explore this a bit as a starter task. Print a nicer error message when incompatible Spark binaries try to talk Key: SPARK-761 URL: https://issues.apache.org/jira/browse/SPARK-761 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Priority: Minor Labels: starter Not sure what component this falls under, or if this is still an issue. Patrick Wendell / Matei Zaharia? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4687) SparkContext#addFile doesn't keep file folder information
[ https://issues.apache.org/jira/browse/SPARK-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4687: --- Component/s: Spark Core SparkContext#addFile doesn't keep file folder information - Key: SPARK-4687 URL: https://issues.apache.org/jira/browse/SPARK-4687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Jimmy Xiang Assignee: Sandy Ryza Fix For: 1.3.0, 1.4.0 Files added with SparkContext#addFile are loaded with Utils#fetchFile before a task starts. However, Utils#fetchFile puts all files under the Spark root on the worker node. We should have an option to keep the folder information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5299) Is http://www.apache.org/dist/spark/KEYS out of date?
[ https://issues.apache.org/jira/browse/SPARK-5299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5299: --- Component/s: (was: Deploy) Build Is http://www.apache.org/dist/spark/KEYS out of date? - Key: SPARK-5299 URL: https://issues.apache.org/jira/browse/SPARK-5299 Project: Spark Issue Type: Question Components: Build Reporter: David Shaw Assignee: Patrick Wendell The keys contained in http://www.apache.org/dist/spark/KEYS do not appear to match the keys used to sign the releases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3033) [Hive] java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal
[ https://issues.apache.org/jira/browse/SPARK-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3033: --- Component/s: (was: Spark Core) [Hive] java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal Key: SPARK-3033 URL: https://issues.apache.org/jira/browse/SPARK-3033 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: pengyanhong run a complex HiveQL via yarn-cluster, got error as below: {quote} 14/08/14 15:05:24 WARN org.apache.spark.Logging$class.logWarning(Logging.scala:70): Loss was due to java.lang.ClassCastException java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaHiveDecimalObjectInspector.getPrimitiveJavaObject(JavaHiveDecimalObjectInspector.java:51) at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.getHiveDecimal(PrimitiveObjectInspectorUtils.java:1022) at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorConverter$HiveDecimalConverter.convert(PrimitiveObjectInspectorConverter.java:306) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ReturnObjectInspectorResolver.convertIfNecessary(GenericUDFUtils.java:179) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf.evaluate(GenericUDFIf.java:82) at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:276) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84) at org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:62) at org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:51) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:309) at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:303) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-761) Print a nicer error message when incompatible Spark binaries try to talk
[ https://issues.apache.org/jira/browse/SPARK-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-761: -- Labels: starter (was: ) Print a nicer error message when incompatible Spark binaries try to talk Key: SPARK-761 URL: https://issues.apache.org/jira/browse/SPARK-761 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Priority: Minor Labels: starter Not sure what component this falls under, or if this is still an issue. Patrick Wendell / Matei Zaharia? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-761) Print a nicer error message when incompatible Spark binaries try to talk
[ https://issues.apache.org/jira/browse/SPARK-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-761: -- Description: As a starter task, it would be good to audit the current behavior for different client - server pairs with respect to how exceptions occur. (was: Not sure what component this falls under, or if this is still an issue. Patrick Wendell / Matei Zaharia?) Print a nicer error message when incompatible Spark binaries try to talk Key: SPARK-761 URL: https://issues.apache.org/jira/browse/SPARK-761 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Priority: Minor Labels: starter As a starter task, it would be good to audit the current behavior for different client - server pairs with respect to how exceptions occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-761) Print a nicer error message when incompatible Spark binaries try to talk
[ https://issues.apache.org/jira/browse/SPARK-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311485#comment-14311485 ] Andrew Ash commented on SPARK-761: -- Another thing could be a basic check for version number mismatches. E.g. a warning log from both server and client could say: Version mismatch between server (1.2.0) and client (1.1.1); proceeding anyway Print a nicer error message when incompatible Spark binaries try to talk Key: SPARK-761 URL: https://issues.apache.org/jira/browse/SPARK-761 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Priority: Minor Labels: starter As a starter task, it would be good to audit the current behavior for different client - server pairs with respect to how exceptions occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-761) Print a nicer error message when incompatible Spark binaries try to talk
[ https://issues.apache.org/jira/browse/SPARK-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311490#comment-14311490 ] Patrick Wendell commented on SPARK-761: --- [~aash] right now we don't explicitly encode the spark version anywhere in the RPC. The best possible thing is to give an explicit version number like you said, but we don't have the plumbing to do that at the moment and IMO that's worth punting until we decide to standardize the RPC format. Print a nicer error message when incompatible Spark binaries try to talk Key: SPARK-761 URL: https://issues.apache.org/jira/browse/SPARK-761 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Priority: Minor Labels: starter As a starter task, it would be good to audit the current behavior for different client - server pairs with respect to how exceptions occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
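A sketch of the warning [~aash] suggests for SPARK-761, assuming a future handshake carried a version string on both sides (which, per the comment above, the RPC does not do today); the names here are hypothetical.
{code}
import org.slf4j.LoggerFactory

// Hypothetical helper: compares version strings that a future RPC handshake would
// need to carry on both sides, and logs a warning instead of failing outright.
object VersionCheck {
  private val log = LoggerFactory.getLogger(getClass)

  def warnIfMismatched(serverVersion: String, clientVersion: String): Unit =
    if (serverVersion != clientVersion) {
      log.warn(s"Version mismatch between server ($serverVersion) and client ($clientVersion); " +
        "proceeding anyway")
    }
}
{code}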
[jira] [Commented] (SPARK-3242) Spark 1.0.2 ec2 scripts creates clusters with Spark 1.0.1 installed by default
[ https://issues.apache.org/jira/browse/SPARK-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311328#comment-14311328 ] Apache Spark commented on SPARK-3242: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4458 Spark 1.0.2 ec2 scripts creates clusters with Spark 1.0.1 installed by default -- Key: SPARK-3242 URL: https://issues.apache.org/jira/browse/SPARK-3242 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.2 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3033) [Hive] java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal
[ https://issues.apache.org/jira/browse/SPARK-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3033: - Priority: Major (was: Blocker) [Hive] java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal Key: SPARK-3033 URL: https://issues.apache.org/jira/browse/SPARK-3033 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.0.2 Reporter: pengyanhong run a complex HiveQL via yarn-cluster, got error as below: {quote} 14/08/14 15:05:24 WARN org.apache.spark.Logging$class.logWarning(Logging.scala:70): Loss was due to java.lang.ClassCastException java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaHiveDecimalObjectInspector.getPrimitiveJavaObject(JavaHiveDecimalObjectInspector.java:51) at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.getHiveDecimal(PrimitiveObjectInspectorUtils.java:1022) at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorConverter$HiveDecimalConverter.convert(PrimitiveObjectInspectorConverter.java:306) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ReturnObjectInspectorResolver.convertIfNecessary(GenericUDFUtils.java:179) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf.evaluate(GenericUDFIf.java:82) at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:276) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84) at org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:62) at org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:51) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:309) at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:303) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5625) Spark binaries do not incude Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311348#comment-14311348 ] DeepakVohra commented on SPARK-5625: The other jars in the Spark binaries lib directory get opened/extracted except the assembly jar. Could it be verified that the assembly jar gets extracted? And which extraction tool is used? Spark binaries do not incude Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4869) The variable names in IF statement of Spark SQL doesn't resolve to its value.
[ https://issues.apache.org/jira/browse/SPARK-4869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4869: - Component/s: (was: Spark Core) SQL Priority: Major (was: Blocker) The variable names in IF statement of Spark SQL doesn't resolve to its value. -- Key: SPARK-4869 URL: https://issues.apache.org/jira/browse/SPARK-4869 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1 Reporter: Ajay We got stuck with the “IF-THEN” statement in Spark SQL. Per our use case, we need nested “if” statements, but Spark SQL cannot resolve column names in the final evaluation, while literal values work; an Unresolved Attributes error is thrown. Please fix this bug. This works: sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE, IF(PAST_DUE = 'CURRENT_MONTH', 0, 1) as ROLL_BACKWARD FROM OUTER_RDD") This doesn’t: sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE, IF(PAST_DUE = 'CURRENT_MONTH', 0, DAYS_30) as ROLL_BACKWARD FROM OUTER_RDD") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
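For context, a self-contained sketch of the reported behavior, assuming the Spark 1.1/1.2-era SQL API; the case class, sample rows, app name, and local master are invented here for illustration, while the table and column names follow the queries in the ticket.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical row type matching the columns referenced in the report.
case class PastDueRow(UNIT: String, PAST_DUE: String, DAYS_30: Int)

object IfStatementRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("if-repro").setMaster("local[2]"))
    val sqlSC = new SQLContext(sc)
    import sqlSC.createSchemaRDD // implicit RDD -> SchemaRDD conversion in Spark 1.x

    val rows = sc.parallelize(Seq(
      PastDueRow("U1", "CURRENT_MONTH", 30),
      PastDueRow("U2", "PAST_DUE_30", 30)))
    rows.registerTempTable("OUTER_RDD")

    // Works: literal value in the else branch of IF.
    sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE, IF(PAST_DUE = 'CURRENT_MONTH', 0, 1) as ROLL_BACKWARD FROM OUTER_RDD")
      .collect().foreach(println)

    // Reported to fail with an Unresolved Attributes error: column name in the else branch.
    sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE, IF(PAST_DUE = 'CURRENT_MONTH', 0, DAYS_30) as ROLL_BACKWARD FROM OUTER_RDD")
      .collect().foreach(println)

    sc.stop()
  }
}
{code}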
[jira] [Updated] (SPARK-5139) select table_alias.* with joins and selecting column names from inner queries not supported
[ https://issues.apache.org/jira/browse/SPARK-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5139: - Priority: Major (was: Blocker) Issue Type: Improvement (was: Bug) select table_alias.* with joins and selecting column names from inner queries not supported Key: SPARK-5139 URL: https://issues.apache.org/jira/browse/SPARK-5139 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1 Environment: Eclipse + SBT as well as linux cluster Reporter: Sunita Koppar There are 2 issues here: 1. select table_alias.* on a joined query is not supported The exception thrown is as below: at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:73) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:260) at croevss.WfPlsRej$.plsrej(WfPlsRej.scala:80) at croevss.WfPlsRej$.main(WfPlsRej.scala:40) at croevss.WfPlsRej.main(WfPlsRej.scala) 2. Multilevel nesting chokes up with messages like this: Exception in thread main org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: Below is a sample query which runs on hive, but fails due to the above reasons with Spark SQL. SELECT sq.* ,r.* FROM (SELECT cs.*, w.primary_key, w.id AS s_id1, w.d_cd, w.d_name, w.rd, w.completion_date AS completion_date1, w.sales_type AS sales_type1 FROM (SELECT stg.s_id, stg.c_id, stg.v, stg.flg1, stg.flg2, comstg.d1, comstg.d2, comstg.d3, FROM croe_rej_stage_pq stg JOIN croe_rej_stage_comments_pq comstg ON ( stg.s_id = comstg.s_id ) WHERE comstg.valid_flg_txt = 'Y' AND stg.valid_flg_txt = 'Y' ORDER BY stg.s_id) cs JOIN croe_rej_work_pq w ON ( cs.s_id = w.s_id )) sq JOIN CROE_rdr_pq r ON ( sq.d_cd = r.d_number ) This is very cumbersome to deal with and we end up creating StructTypes for every level. If there is a better way to deal with this, please let us know regards Sunita -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5625) Spark binaries do not incude Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311378#comment-14311378 ] Sean Owen commented on SPARK-5625: -- As I've said, the assembly is a JAR file. You do not extract it in order to use it; you don't extract any JAR file to use it. However it is just a zip file. {{jar xf}} and {{unzip}} both successfully extract it. But to be clear, you do not need to do so. Spark binaries do not incude Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5625) Spark binaries do not incude Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311347#comment-14311347 ] DeepakVohra commented on SPARK-5625: The WinZip version is the latest, 18.5. Spark binaries do not incude Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5625) Spark binaries do not incude Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311348#comment-14311348 ] DeepakVohra edited comment on SPARK-5625 at 2/8/15 3:21 PM: The error is not too many files. The error is the archive is not valid as in the screenshot. http://s763.photobucket.com/user/dvohra10/media/SparkAssembly_zps4319294c.jpg.html?o=0 The other jars in the Spark binaries lib directory get opened/extracted except the assembly jar. Could it be verified that the assembly jar gets extracted? And which extraction tool is used? was (Author: dvohra): The other jars in the Spark binaries lib directory get opened/extracted except the assembly jar. Could it be verified that the assembly jar gets extracted? And which extraction tool is used? Spark binaries do not incude Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5659) Flaky Test: org.apache.spark.streaming.ReceiverSuite.block
[ https://issues.apache.org/jira/browse/SPARK-5659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5659: --- Component/s: Tests Flaky Test: org.apache.spark.streaming.ReceiverSuite.block -- Key: SPARK-5659 URL: https://issues.apache.org/jira/browse/SPARK-5659 Project: Spark Issue Type: Bug Components: Streaming, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Tathagata Das Priority: Critical Labels: flaky-test {code} Error Message recordedBlocks.drop(1).dropRight(1).forall(((block: scala.collection.mutable.ArrayBuffer[Int]) = block.size.=(minExpectedMessagesPerBlock).(block.size.=(maxExpectedMessagesPerBlock was false # records in received blocks = [11,10,10,10,10,10,10,10,10,10,10,4,16,10,10,10,10,10,10,10], not between 7 and 11 Stacktrace sbt.ForkMain$ForkError: recordedBlocks.drop(1).dropRight(1).forall(((block: scala.collection.mutable.ArrayBuffer[Int]) = block.size.=(minExpectedMessagesPerBlock).(block.size.=(maxExpectedMessagesPerBlock was false # records in received blocks = [11,10,10,10,10,10,10,10,10,10,10,4,16,10,10,10,10,10,10,10], not between 7 and 11 at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) at org.apache.spark.streaming.ReceiverSuite$$anonfun$3.apply$mcV$sp(ReceiverSuite.scala:200) at org.apache.spark.streaming.ReceiverSuite$$anonfun$3.apply(ReceiverSuite.scala:158) at org.apache.spark.streaming.ReceiverSuite$$anonfun$3.apply(ReceiverSuite.scala:158) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.ReceiverSuite.org$scalatest$BeforeAndAfter$$super$runTest(ReceiverSuite.scala:39) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.ReceiverSuite.runTest(ReceiverSuite.scala:39) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at 
org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.streaming.ReceiverSuite.org$scalatest$BeforeAndAfter$$super$run(ReceiverSuite.scala:39) at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241) at org.apache.spark.streaming.ReceiverSuite.run(ReceiverSuite.scala:39) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671) at sbt.ForkMain$Run$2.call(ForkMain.java:294) at
[jira] [Commented] (SPARK-5625) Spark binaries do not incude Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311500#comment-14311500 ] DeepakVohra commented on SPARK-5625: On re-test Spark classes get found in Spark application. But the following error is still generated with RunRecommender. Exception in thread main org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at org.apache.hadoop.ipc.Client.call(Client.java:1113) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229) at com.sun.proxy.$Proxy6.getProtocolVersion(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62) at com.sun.proxy.$Proxy6.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422) at org.apache.hadoop.hdfs.DFSClient.createNamenode(DFSClient.java:183) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:281) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:245) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:100) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1351) at org.apache.spark.rdd.RDD.reduce(RDD.scala:867) at org.apache.spark.rdd.DoubleRDDFunctions.stats(DoubleRDDFunctions.scala:43) at com.cloudera.datascience.recommender.RunRecommender$.preparation(RunRecommender.scala:63) at 
com.cloudera.datascience.recommender.RunRecommender$.main(RunRecommender.scala:29) at com.cloudera.datascience.recommender.RunRecommender.main(RunRecommender.scala) Spark binaries do not incude Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5631) Server IPC version 7 cannot communicate with client version 4
[ https://issues.apache.org/jira/browse/SPARK-5631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311506#comment-14311506 ] DeepakVohra commented on SPARK-5631: {quote}This means you have mismatched Hadoop versions, either between your Spark and Hadoop deployment,{quote} The Hadoop version is hadoop-2.0.0-cdh4.2.0.tar.gz and the Spark binaries are compiled with the same version: spark-1.2.0-bin-cdh4.tgz. {quote}or because you included Hadoop code in your app.{quote} The Spark application is the RunRecommender application. Server IPC version 7 cannot communicate with client version 4 -- Key: SPARK-5631 URL: https://issues.apache.org/jira/browse/SPARK-5631 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: Scala 2.10.4 Spark 1.2 CDH4.2 Reporter: DeepakVohra A Spark application generates the error Server IPC version 7 cannot communicate with client version 4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311530#comment-14311530 ] Apache Spark commented on SPARK-5021: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/4459 GaussianMixtureEM should be faster for SparseVector input - Key: SPARK-5021 URL: https://issues.apache.org/jira/browse/SPARK-5021 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar GaussianMixtureEM currently converts everything to dense vectors. It would be nice if it were faster for SparseVectors (running in time linear in the number of non-zero values). However, this may not be too important since clustering should rarely be done in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
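As a side note, a small self-contained sketch of the dense-versus-sparse cost this ticket is about, using only the MLlib vector API; the dimension and values are made up, and the GaussianMixtureEM call itself is omitted since this only illustrates what densifying a SparseVector implies.
{code}
import org.apache.spark.mllib.linalg.{SparseVector, Vectors}

object SparseVsDense {
  def main(args: Array[String]): Unit = {
    val n = 10000
    // A single non-zero entry out of 10,000 dimensions.
    val sv = Vectors.sparse(n, Array(42), Array(1.0)).asInstanceOf[SparseVector]

    // Converting to a dense representation allocates and touches all n entries ...
    val dense = Vectors.dense(sv.toArray)

    // ... whereas the sparse form stores only the non-zero values, which is why
    // an implementation linear in the number of non-zeros would be cheaper
    // for high-dimensional data.
    println(s"values stored sparsely: ${sv.values.length}")
    println(s"values stored densely:  ${dense.size}")
  }
}
{code}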
[jira] [Commented] (SPARK-5631) Server IPC version 7 cannot communicate with client version 4
[ https://issues.apache.org/jira/browse/SPARK-5631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311531#comment-14311531 ] Sean Owen commented on SPARK-5631: -- So, one problem is that the {{cdh4}} binary is compiled vs {{2.0.0-mr1-cdh4.2.0}}. This may be the problem, that the build you downloaded is for a different flavor of CDH4. Although none of those are officially supported, I don't see why it wouldn't work to build Spark with {{-Pyarn -Phive -Phive-thriftserver -Dhadoop.version=2.0.0-cdh4.2.0}}. That would rule out that difference. The second potential difference, your app vs server, is avoided if you do not bundle Spark or Hadoop with your app, and run it with spark-submit. It doesn't matter what your app is. Server IPC version 7 cannot communicate with client version 4 -- Key: SPARK-5631 URL: https://issues.apache.org/jira/browse/SPARK-5631 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: Scala 2.10.4 Spark 1.2 CDH4.2 Reporter: DeepakVohra A Spark application generates the error Server IPC version 7 cannot communicate with client version 4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5625) Spark binaries do not incude Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311533#comment-14311533 ] Sean Owen commented on SPARK-5625: -- You asked this in a separate issue and it is discussed in SPARK-5631. Spark binaries do not incude Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311532#comment-14311532 ] Manoj Kumar commented on SPARK-5021: I have created a working pull request. Let us please take the discussion there. GaussianMixtureEM should be faster for SparseVector input - Key: SPARK-5021 URL: https://issues.apache.org/jira/browse/SPARK-5021 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar GaussianMixtureEM currently converts everything to dense vectors. It would be nice if it were faster for SparseVectors (running in time linear in the number of non-zero values). However, this may not be too important since clustering should rarely be done in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5273) Improve documentation examples for LinearRegression
[ https://issues.apache.org/jira/browse/SPARK-5273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dev Lakhani updated SPARK-5273: --- Affects Version/s: (was: 1.2.0) Improve documentation examples for LinearRegression Key: SPARK-5273 URL: https://issues.apache.org/jira/browse/SPARK-5273 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Dev Lakhani Priority: Minor In the document: https://spark.apache.org/docs/1.1.1/mllib-linear-methods.html Under Linear least squares, Lasso, and ridge regression The suggested method to use LinearRegressionWithSGD.train() // Building the model val numIterations = 100 val model = LinearRegressionWithSGD.train(parsedData, numIterations) is not ideal even for simple examples such as y=x. This should be replaced with more real world parameters with step size: val lr = new LinearRegressionWithSGD() lr.optimizer.setStepSize(0.0001) lr.optimizer.setNumIterations(100) or LinearRegressionWithSGD.train(input,100,0.0001) To create a reasonable MSE. It took me a while using the dev forum to learn that the step size should be really small. Might help save someone the same effort when learning mllib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5273) Improve documentation examples for LinearRegression
[ https://issues.apache.org/jira/browse/SPARK-5273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dev Lakhani updated SPARK-5273: --- Affects Version/s: 1.2.0 Improve documentation examples for LinearRegression Key: SPARK-5273 URL: https://issues.apache.org/jira/browse/SPARK-5273 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Dev Lakhani Priority: Minor In the document: https://spark.apache.org/docs/1.1.1/mllib-linear-methods.html Under Linear least squares, Lasso, and ridge regression The suggested method to use LinearRegressionWithSGD.train() // Building the model val numIterations = 100 val model = LinearRegressionWithSGD.train(parsedData, numIterations) is not ideal even for simple examples such as y=x. This should be replaced with more real world parameters with step size: val lr = new LinearRegressionWithSGD() lr.optimizer.setStepSize(0.0001) lr.optimizer.setNumIterations(100) or LinearRegressionWithSGD.train(input,100,0.0001) To create a reasonable MSE. It took me a while using the dev forum to learn that the step size should be really small. Might help save someone the same effort when learning mllib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
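For reference, a self-contained sketch of the kind of complete example the ticket asks for; the synthetic y = x data, app name, and local master are placeholders, and the step size of 0.0001 follows the description above.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // double-RDD implicits (needed on older 1.x releases)
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object LinearRegressionStepSize {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lr-stepsize").setMaster("local[2]"))

    // Toy y = x data; with feature values this large the default step size of 1.0
    // diverges, which is what the ticket is pointing out.
    val data = sc.parallelize((1 to 1000).map { i =>
      LabeledPoint(i.toDouble, Vectors.dense(i.toDouble))
    }).cache()

    // Explicit, small step size as suggested in the description.
    val lr = new LinearRegressionWithSGD()
    lr.optimizer.setStepSize(0.0001).setNumIterations(100)
    val model = lr.run(data)
    // Equivalent shorthand: LinearRegressionWithSGD.train(data, 100, 0.0001)

    val mse = data.map { p =>
      val err = model.predict(p.features) - p.label
      err * err
    }.mean()
    println(s"training MSE = $mse")

    sc.stop()
  }
}
{code}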
[jira] [Comment Edited] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form
[ https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310130#comment-14310130 ] Sandy Ryza edited comment on SPARK-4550 at 2/8/15 9:07 PM: --- I got a working prototype and benchmarked the ExternalSorter changes on my laptop. Each run inserts a bunch of records, each a (Int, (10-character string, Int)) tuple, into an ExternalSorter and then calls writePartitionedFile. The reported memory size is the sum of the shuffle bytes spilled (mem) metric and the remaining size of the collection after insertion has completed. Results are averaged over three runs. Keep in mind that the primary goal here is to reduce GC pressure, so any speed improvements are icing. ||Number of Records||Storing as Serialized||Memory Size||Number of Spills||Insert Time (ms)||Write Time (ms)||Total Time|| |1 million|false|194923217|0|1123|3442|4566| |1 million|true|48694072|0|1315|2652|3967| |10 million|false|2050514159|3|26723|17418|44141| |10 million|true|613614392|1|16501|17151|33652| |50 million|false|10166122563|17|101831|89960|191791| |50 million|true|3067937592|5|76801|78361|155161| was (Author: sandyr): I got a working prototype and benchmarked the ExternalSorter changes on my laptop. Each run inserts a bunch of records, each a (Int, (10-character string, Int)) tuple, into an ExternalSorter and then calls writePartitionedFile. The reported memory size is the sum of the shuffle bytes spilled (mem) metric and the remaining size of the collection after insertion has completed. Results are averaged over three runs. Keep in mind that the primary goal here is to reduce GC pressure, so any speed improvements are icing. ||Number of Records||Storing as Serialized||Memory Size||Number of Spills||Insert Time (ms)||Write Time (ms)||Total Time|| |1 million|false|194923217|0|1123|3442|4566| |1 million|true|48694072|0|1315|2652|3967| |10 million|false|2050514159|3|26723|17418|44141| |10 million|true|613614392|1|16501|17151|33652| |10 million|false|10166122563|17|101831|89960|191791| |10 million|true|3067937592|5|76801|78361|155161| In sort-based shuffle, store map outputs in serialized form --- Key: SPARK-4550 URL: https://issues.apache.org/jira/browse/SPARK-4550 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Critical Attachments: SPARK-4550-design-v1.pdf One drawback with sort-based shuffle compared to hash-based shuffle is that it ends up storing many more java objects in memory. If Spark could store map outputs in serialized form, it could * spill less often because the serialized form is more compact * reduce GC pressure This will only work when the serialized representations of objects are independent from each other and occupy contiguous segments of memory. E.g. when Kryo reference tracking is left on, objects may contain pointers to objects farther back in the stream, which means that the sort can't relocate objects without corrupting them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
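As an aside, a toy sketch of the property this design relies on, not Spark's actual ExternalSorter code: records are kept as opaque serialized byte chunks paired with their partition ids, so sorting by partition relocates bytes without deserializing anything; plain Java serialization and the (String, Int) payload shape are used here only to keep the example self-contained and roughly match the benchmark records above.
{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

object SerializedSortSketch {
  // Serialize one record into its own independent byte chunk.
  def serialize(record: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(record)
    out.close()
    bytes.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 4
    // (partitionId, serializedRecord) pairs, inserted in arbitrary order.
    val buffer = ArrayBuffer[(Int, Array[Byte])]()
    for (i <- 1 to 20) {
      val record = (i.toString.padTo(10, '0'), i) // a (10-character String, Int) tuple
      buffer += ((i % numPartitions, serialize(record)))
    }

    // Sorting moves only (partitionId, bytes) pairs; the payloads stay as byte
    // arrays, so no per-record Java objects are re-created during the sort.
    val sorted = buffer.sortBy(_._1)
    sorted.groupBy(_._1).toSeq.sortBy(_._1).foreach { case (p, recs) =>
      println(s"partition $p: ${recs.size} records, ${recs.map(_._2.length).sum} serialized bytes")
    }
  }
}
{code}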
[jira] [Commented] (SPARK-4588) Add API for feature attributes
[ https://issues.apache.org/jira/browse/SPARK-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311586#comment-14311586 ] Apache Spark commented on SPARK-4588: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4460 Add API for feature attributes -- Key: SPARK-4588 URL: https://issues.apache.org/jira/browse/SPARK-4588 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Sean Owen Feature attributes, e.g., continuous/categorical, feature names, feature dimension, number of categories, number of nonzeros (support) could be useful for ML algorithms. In SPARK-3569, we added metadata to schema, which can be used to store feature attributes along with the dataset. We need to provide a wrapper over the Metadata class for ML usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5674) Spark Job Explain Plan Proof of Concept
Kostas Sakellis created SPARK-5674: -- Summary: Spark Job Explain Plan Proof of Concept Key: SPARK-5674 URL: https://issues.apache.org/jira/browse/SPARK-5674 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kostas Sakellis This is just a prototype of creating an explain plan for a job. Code can be found here: https://github.com/ksakellis/spark/tree/kostas-explainPlan-poc The code was written very quickly and so doesn't have any comments or tests and is probably buggy - hence it being a proof of concept. *How to Use* # {code}sc.explainOn / sc.explainOff{code} This will generate the explain plan and print it in the logs # {code}sc.enableExecution / sc.disableExecution{code} This will disable execution of the job. Using these two knobs a user can choose to print the explain plan and/or disable the running of the job if they only want to see the plan. *Implementation* This is only a prototype and it is by no means production ready. The code is pretty hacky in places and a few shortcuts were made just to get the prototype working. The most interesting part of this commit is in the ExecutionPlanner.scala class. This class creates its own private instance of the DAGScheduler and passes into it a NoopTaskScheduler. The NoopTaskScheduler receives the created TaskSets from the DAGScheduler and records the stages and their TaskSets. The NoopTaskScheduler also creates fake CompletionEvents and sends them to the DAGScheduler to move the scheduling along. It is done this way so that we can use the DAGScheduler unmodified, thus reducing code divergence. The rest of the code is about processing the information produced by the ExecutionPlanner, creating a DAG with a bunch of metadata and printing it as a pretty ascii drawing. For drawing the DAG, https://github.com/mdr/ascii-graphs is used. Again, this was just easier for prototyping. *How is this different from RDD#toDebugString?* The execution planner runs the job through the entire DAGScheduler, so we can collect some metrics that are not presently available in the debug string. For example, we can report the binary size of the task, which might be important if the closures are referencing large objects. In addition, because we execute the scheduler code from an action, we can get a more accurate picture of the stage boundaries and dependencies. An action such as treeReduce will generate a number of stages that you can't get just by doing .toDebugString on the rdd. *Limitations of this Implementation* Because this is a prototype there is a lot of lame stuff in this commit. # All of the code in SparkContext in particular sucks. This adds some code in the runJob() call, and when it gets the plan it just writes it to the INFO log. We need to find a better way of exposing the plan to the caller so that they can print it, analyze it, etc. Maybe we can use implicits or something? Not sure how best to do this yet. # Some of the actions will return through exceptions because we are basically faking a runJob(). If you want to try this, it is best to just use count() instead of, say, collect(). This will get fixed when we fix 1). # Because the ExplainPlanner creates its own DAGScheduler, there currently is no way to map the real stages to the explain-plan stages. So if a user turns on explain plan and doesn't disable execution, we can't automatically add more metrics to the explain plan as they become available. The stageId in the plan and the stageId in the real scheduler will be different. This is important for when we add it to the web UI and users can track progress on the DAG. # We are using https://github.com/mdr/ascii-graphs to draw the DAG - not sure if we want to depend on that project. *Next Steps* # It would be good to get a few people to take a look at the code, specifically at how the plan gets generated. Clone the package and give it a try with some of your jobs. # If the approach looks okay overall, I can put together a mini design doc and add some answers to the above limitations of this approach. # Feedback most welcome. *Example Code:* {code}
sc.explainOn
sc.disableExecution
val rdd = sc.parallelize(1 to 10, 4).map(key => (key.toString, key))
val rdd2 = sc.parallelize(1 to 5, 2).map(key => (key.toString, key))
rdd.join(rdd2)
  .count()
{code} *Example Output:* {noformat}
EXPLAIN PLAN:
+----------------+   +----------------+
|                |   |                |
| Stage: 0 @ map |   | Stage: 1 @ map |
|  Tasks: 4      |   |  Tasks: 2      |
|                |   |                |
+----------------+   +----------------+
         |                    |
         v                    v
       +-------------------+
       |                   |
       | Stage: 2 @ count  |
       | Tasks: 4          |
       |                   |
       +-------------------+
STAGE DETAILS:
-- Stage: 0
[jira] [Commented] (SPARK-5635) Allow users to run .scala files directly from spark-submit
[ https://issues.apache.org/jira/browse/SPARK-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311588#comment-14311588 ] Grant Henke commented on SPARK-5635: I thought the method I listed was a workaround and not necessarily intended functionality, especially because I need to add exit at the bottom of the script to be sure I break out of interactive mode. I suggest adding the functionality to spark-submit because spark-shell does not share/support all of spark-submit's features; instead it supports uses and features around interactive/client use. This functionality is very similar to passing a Python script to spark-submit, so it appeared to be the correct place to run a Scala script as well. Allow users to run .scala files directly from spark-submit -- Key: SPARK-5635 URL: https://issues.apache.org/jira/browse/SPARK-5635 Project: Spark Issue Type: New Feature Components: Spark Core, Spark Shell Reporter: Grant Henke Priority: Minor Similar to the Python functionality, allow users to submit .scala files. Currently the way I simulate this is to use spark-shell and run: `spark-shell -i myscript.scala` Note: the user needs to add exit to the bottom of the script. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
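For concreteness, a minimal sketch of the workaround script described above; the file name and the toy job inside it are hypothetical, and spark-shell provides the sc binding.
{code}
// myscript.scala, run with: spark-shell -i myscript.scala
// Inside spark-shell the SparkContext is already available as `sc`.
val evens = sc.parallelize(1 to 100).filter(_ % 2 == 0).count()
println(s"even numbers counted: $evens")

// Without an explicit exit the shell drops into interactive mode after the
// script finishes; the comment above uses `exit`, and sys.exit(0) is equivalent.
sys.exit(0)
{code}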
[jira] [Commented] (SPARK-5625) Spark binaries do not incude Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311596#comment-14311596 ] DeepakVohra commented on SPARK-5625: Thanks Sean. Spark binaries do not incude Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5631) Server IPC version 7 cannot communicate with client version 4
[ https://issues.apache.org/jira/browse/SPARK-5631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311597#comment-14311597 ] DeepakVohra commented on SPARK-5631: Thanks for the clarification. Server IPC version 7 cannot communicate with client version 4 -- Key: SPARK-5631 URL: https://issues.apache.org/jira/browse/SPARK-5631 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: Scala 2.10.4 Spark 1.2 CDH4.2 Reporter: DeepakVohra A Spark application generates the error Server IPC version 7 cannot communicate with client version 4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5631) Server IPC version 7 cannot communicate with client version 4
[ https://issues.apache.org/jira/browse/SPARK-5631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311597#comment-14311597 ] DeepakVohra edited comment on SPARK-5631 at 2/8/15 10:38 PM: - Thanks for the clarification. The error gets removed. was (Author: dvohra): Thanks for the clarification. Server IPC version 7 cannot communicate with client version 4 -- Key: SPARK-5631 URL: https://issues.apache.org/jira/browse/SPARK-5631 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: Scala 2.10.4 Spark 1.2 CDH4.2 Reporter: DeepakVohra A Spark application generates the error Server IPC version 7 cannot communicate with client version 4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2958) FileClientHandler should not be shared in the pipeline
[ https://issues.apache.org/jira/browse/SPARK-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311602#comment-14311602 ] Reynold Xin commented on SPARK-2958: cc [~adav] this is no longer a problem in the new shuffle module, is it? FileClientHandler should not be shared in the pipeline -- Key: SPARK-2958 URL: https://issues.apache.org/jira/browse/SPARK-2958 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Reynold Xin Netty module creates a single FileClientHandler and shares it in all threads. We should create a new one for each pipeline thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3991) Not Serializable , Nullpinter Exceptions in SQL server mode
[ https://issues.apache.org/jira/browse/SPARK-3991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3991: - Priority: Major (was: Blocker) Downgrading until it's clear what the issue is. There are several items here. 1. This sounds like the same issue raised in SPARK-4944. 2. You might need to provide more info, like what the nature of the join is. 3. This sounds related to SPARK-3914, and may be solved by it. I suggest tracking one issue per JIRA. If one of these is still relevant and not a duplicate, maybe this issue can change to track that one; if more than one is, track one here and create another JIRA for the others. Not Serializable , Nullpinter Exceptions in SQL server mode --- Key: SPARK-3991 URL: https://issues.apache.org/jira/browse/SPARK-3991 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: eblaas Attachments: not_serializable_exception.patch I'm working on connecting Mondrian with Spark SQL via JDBC. Good news: it works, but there are some bugs to fix. I customized the HiveThriftServer2 class to load, transform and register tables (ETL) with the HiveContext. Data tables are generated from Cassandra and from a relational database. * 1st problem: hiveContext.registerRDDAsTable(treeSchema, tree) does not register the table in the Hive metastore (show tables; via JDBC does not list the table, but I can query it, e.g. select * from tree). Dirty workaround: create a table with the same name and schema; this was necessary because Mondrian validates table existence: hiveContext.sql("CREATE TABLE tree (dp_id BIGINT, h1 STRING, h2 STRING, h3 STRING)") * 2nd problem: Mondrian creates complex joins, which results in serialization exceptions. Two classes in hiveUdfs.scala have to be made serializable - DeferredObjectAdapter and HiveGenericUdaf. * 3rd problem: NullPointerException in InMemoryRelation, line 42: override lazy val statistics = Statistics(sizeInBytes = child.sqlContext.defaultSizeInBytes) The sqlContext in child was null; quick fix: set a default value from SparkContext: override lazy val statistics = Statistics(sizeInBytes = 1) I'm not sure how to fix these bugs, but with the patch file it works at least. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3034) [HIve] java.sql.Date cannot be cast to java.sql.Timestamp
[ https://issues.apache.org/jira/browse/SPARK-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3034: - Component/s: (was: Spark Core) Priority: Major (was: Blocker) Can you provide steps to reproduce this, and/or check whether it's still an issue? downgrading until there is more info. [HIve] java.sql.Date cannot be cast to java.sql.Timestamp - Key: SPARK-3034 URL: https://issues.apache.org/jira/browse/SPARK-3034 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: pengyanhong run a simple HiveQL via yarn-cluster, got error as below: {quote} Exception in thread Thread-2 java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:199) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:127 failed 3 times, most recent failure: Exception failure in TID 141 on host A01-R06-I147-41.jd.local: java.lang.ClassCastException: java.sql.Date cannot be cast to java.sql.Timestamp org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaTimestampObjectInspector.getPrimitiveWritableObject(JavaTimestampObjectInspector.java:33) org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:251) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:486) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:439) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:423) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:200) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:192) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:662) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at
[jira] [Resolved] (SPARK-2998) scala.collection.mutable.HashSet cannot be cast to scala.collection.mutable.BitSet
[ https://issues.apache.org/jira/browse/SPARK-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2998. -- Resolution: Duplicate scala.collection.mutable.HashSet cannot be cast to scala.collection.mutable.BitSet -- Key: SPARK-2998 URL: https://issues.apache.org/jira/browse/SPARK-2998 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: pengyanhong Priority: Blocker run a HiveQL via yarn-cluster, got error as below: {quote} 14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): Serialized task 8.0:2 as 20849 bytes in 0 ms 14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): Finished TID 812 in 24 ms on A01-R06-I149-32.jd.local (progress: 2/200) 14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): Completed ResultTask(8, 1) 14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): Failed to run reduce at joins.scala:336 14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): finishApplicationMaster with FAILED Exception in thread Thread-2 java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:199) Caused by: org.apache.spark.SparkDriverExecutionException: Execution error at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:849) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1231) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.lang.ClassCastException: scala.collection.mutable.HashSet cannot be cast to scala.collection.mutable.BitSet at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$7.apply(joins.scala:336) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:813) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:810) at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:845) ... 10 more 14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): Invoking sc stop from shutdown hook 14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): AppMaster received a signal. 
14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): Starting task 8.0:3 as TID 814 on executor 1: A01-R06-I149-32.jd.local (PROCESS_LOCAL) 14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): Serialized task 8.0:3 as 20849 bytes in 0 ms 14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): Finished TID 813 in 25 ms on A01-R06-I149-32.jd.local (progress: 3/200) 14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): Completed ResultTask(8, 2) .. {quote} It runs successfully if removing the configuration about Kryo -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4840) Incorrect documentation of master url on Running Spark on Mesos page
[ https://issues.apache.org/jira/browse/SPARK-4840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4840. -- Resolution: Not a Problem OK, if that doesn't prove to be the answer, reopen with more info. Incorrect documentation of master url on Running Spark on Mesos page Key: SPARK-4840 URL: https://issues.apache.org/jira/browse/SPARK-4840 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Sam Stoelinga Priority: Minor In the paragraph "Using a Mesos Master URL" there is currently the sentence: "or mesos://zk://host:2181 for a multi-master Mesos cluster using ZooKeeper." This should be: "or mesos://zk://host:2181/mesos for a multi-master Mesos cluster using ZooKeeper." If you don't add /mesos to the end of the URL, spark-shell would not start for me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
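For reference, a minimal sketch of where the corrected URL form would be used when setting the master programmatically rather than via spark-shell; the host and application name are placeholders, and this obviously needs a reachable Mesos/ZooKeeper cluster to actually start.
{code}
import org.apache.spark.{SparkConf, SparkContext}

object MesosZkMasterUrl {
  def main(args: Array[String]): Unit = {
    // Note the trailing /mesos znode, which is the documentation fix requested here;
    // "host" and the app name are placeholders.
    val conf = new SparkConf()
      .setAppName("mesos-zk-example")
      .setMaster("mesos://zk://host:2181/mesos")
    val sc = new SparkContext(conf)
    println(sc.master)
    sc.stop()
  }
}
{code}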