[jira] [Commented] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159453#comment-14159453 ]

Davies Liu commented on SPARK-3721:
---

[~bmiller1] Here is a PR for this: https://github.com/apache/spark/pull/2659. Could you help test it?

Broadcast Variables above 2GB break in PySpark
--
Key: SPARK-3721
URL: https://issues.apache.org/jira/browse/SPARK-3721
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.1.0
Reporter: Brad Miller
Assignee: Davies Liu

The bug displays 3 distinct failure modes in PySpark, all of which seem to be related to broadcast variable size. Note that the tests below ran Python 2.7.3 on all machines and used the Spark 1.1.0 binaries.

**BLOCK 1** [no problem]
{noformat}
import cPickle
from pyspark import SparkContext

def check_pre_serialized(size):
    msg = cPickle.dumps(range(2 ** size))
    print 'serialized length:', len(msg)
    bvar = sc.broadcast(msg)
    print 'length recovered from broadcast variable:', len(bvar.value)
    print 'correct value recovered:', msg == bvar.value
    bvar.unpersist()

def check_unserialized(size):
    msg = range(2 ** size)
    bvar = sc.broadcast(msg)
    print 'correct value recovered:', msg == bvar.value
    bvar.unpersist()

SparkContext.setSystemProperty('spark.executor.memory', '15g')
SparkContext.setSystemProperty('spark.cores.max', '5')
sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug')
{noformat}

**BLOCK 2** [no problem]
{noformat}
check_pre_serialized(20)
serialized length: 9374656
length recovered from broadcast variable: 9374656
correct value recovered: True
{noformat}

**BLOCK 3** [no problem]
{noformat}
check_unserialized(20)
correct value recovered: True
{noformat}

**BLOCK 4** [no problem]
{noformat}
check_pre_serialized(27)
serialized length: 1499501632
length recovered from broadcast variable: 1499501632
correct value recovered: True
{noformat}

**BLOCK 5** [no problem]
{noformat}
check_unserialized(27)
correct value recovered: True
{noformat}

**BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]**
{noformat}
check_pre_serialized(28)
.
/home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj)
    354
    355     def dumps(self, obj):
--> 356         return cPickle.dumps(obj, 2)
    357
    358     loads = cPickle.loads

SystemError: error return without exception set
{noformat}

**BLOCK 7** [no problem]
{noformat}
check_unserialized(28)
correct value recovered: True
{noformat}

**BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]**
{noformat}
check_pre_serialized(29)
serialized length: 6331339840
length recovered from broadcast variable: 2036372544
correct value recovered: False
{noformat}

**BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]**
{noformat}
check_unserialized(29)
..
/home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj)
    418
    419     def dumps(self, obj):
--> 420         return zlib.compress(self.serializer.dumps(obj), 1)
    421
    422     def loads(self, obj):

OverflowError: size does not fit in an int
{noformat}

**BLOCK 10** [ERROR 1]
{noformat}
check_pre_serialized(30)
...same as above...
{noformat}

**BLOCK 11** [ERROR 3]
{noformat}
check_unserialized(30)
...same as above...
{noformat}

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
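One detail in the report above is worth calling out: in BLOCK 8 the length recovered from the broadcast variable is exactly the serialized length truncated to an unsigned 32-bit value, which suggests a 32-bit size field somewhere on the serialization path. A quick standalone check of that arithmetic (an illustrative sketch, not part of the reported test):

{code}
// Standalone arithmetic check of the BLOCK 8 numbers (illustrative only).
object Block8LengthCheck extends App {
  val serializedLength = 6331339840L   // reported by check_pre_serialized(29)
  val recoveredLength  = 2036372544L   // length recovered from the broadcast variable
  // Truncating the real length to 32 bits reproduces the bogus recovered length.
  println(serializedLength % (1L << 32) == recoveredLength)   // prints: true
}
{code}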
[jira] [Commented] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159454#comment-14159454 ] Apache Spark commented on SPARK-3721: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2659 Broadcast Variables above 2GB break in PySpark -- Key: SPARK-3721 URL: https://issues.apache.org/jira/browse/SPARK-3721 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Reporter: Brad Miller Assignee: Davies Liu The bug displays 3 unique failure modes in PySpark, all of which seem to be related to broadcast variable size. Note that the tests below ran python 2.7.3 on all machines and used the Spark 1.1.0 binaries. **BLOCK 1** [no problem] {noformat} import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') {noformat} **BLOCK 2** [no problem] {noformat} check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True {noformat} **BLOCK 3** [no problem] {noformat} check_unserialized(20) correct value recovered: True {noformat} **BLOCK 4** [no problem] {noformat} check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True {noformat} **BLOCK 5** [no problem] {noformat} check_unserialized(27) correct value recovered: True {noformat} **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** {noformat} check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set {noformat} **BLOCK 7** [no problem] {noformat} check_unserialized(28) correct value recovered: True {noformat} **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** {noformat} check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False {noformat} **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** {noformat} check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): -- 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int {noformat} **BLOCK 10** [ERROR 1] {noformat} check_pre_serialized(30) ...same as above... {noformat} **BLOCK 11** [ERROR 3] {noformat} check_unserialized(30) ...same as above... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3801) Make app dir cleanup more efficient
[ https://issues.apache.org/jira/browse/SPARK-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159458#comment-14159458 ] Apache Spark commented on SPARK-3801: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/2660 Make app dir cleanup more efficient --- Key: SPARK-3801 URL: https://issues.apache.org/jira/browse/SPARK-3801 Project: Spark Issue Type: Improvement Reporter: Liang-Chi Hsieh Priority: Minor The newly merged, more conservative app directory cleanup can be made more efficient. Since the newer files do not need to be collected, using 'exists' instead of 'filter' is more efficient: 'exists' short-circuits on the first match and skips the redundant items. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
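To make the exists-vs-filter point concrete, a minimal sketch (a hypothetical helper, not the actual worker cleanup code): 'filter' materializes every matching file before the emptiness check, while 'exists' stops at the first recent file.

{code}
import java.io.File

// Minimal sketch of the exists-vs-filter point (hypothetical helper, not the actual cleanup code).
def appDirStillInUse(dir: File, cutoffMs: Long): Boolean = {
  val files = Option(dir.listFiles()).getOrElse(Array.empty[File])
  // filter-based check: builds the whole collection of recent files just to test emptiness
  //   files.filter(_.lastModified() >= cutoffMs).nonEmpty
  // exists-based check: short-circuits on the first recent file
  files.exists(_.lastModified() >= cutoffMs)
}
{code}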
[jira] [Created] (SPARK-3802) Scala version is wrong in dev/audit-release/blank_sbt_build/build.sbt
Kousuke Saruta created SPARK-3802: - Summary: Scala version is wrong in dev/audit-release/blank_sbt_build/build.sbt Key: SPARK-3802 URL: https://issues.apache.org/jira/browse/SPARK-3802 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.2.0 Reporter: Kousuke Saruta Priority: Minor In dev/audit-release/blank_sbt_build/build.sbt, scalaVersion indicates 2.9.3 but I think 2.10.4 is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
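For reference, the fix would amount to updating the scalaVersion setting along these lines (a sketch of the expected change, not necessarily the exact contents of the eventual patch):

{code}
// dev/audit-release/blank_sbt_build/build.sbt
scalaVersion := "2.10.4"  // was "2.9.3"
{code}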
[jira] [Commented] (SPARK-3802) Scala version is wrong in dev/audit-release/blank_sbt_build/build.sbt
[ https://issues.apache.org/jira/browse/SPARK-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159501#comment-14159501 ] Apache Spark commented on SPARK-3802: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2661 Scala version is wrong in dev/audit-release/blank_sbt_build/build.sbt - Key: SPARK-3802 URL: https://issues.apache.org/jira/browse/SPARK-3802 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.2.0 Reporter: Kousuke Saruta Priority: Minor In dev/audit-release/blank_sbt_build/build.sbt, scalaVersion indicates 2.9.3 but I think 2.10.4 is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3803) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents
[ https://issues.apache.org/jira/browse/SPARK-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masaru Dobashi updated SPARK-3803: -- Description: When I executed computePrincipalComponents method of RowMatrix, I got java.lang.ArrayIndexOutOfBoundsException. {quote} 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at RDDFunctions.scala:111 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161 org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157) scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) scala.collection.AbstractIterator.aggregate(Iterator.scala:1157) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {quote} The RowMatrix instance was generated from the result of TF-IDF like the following. 
{quote} scala val hashingTF = new HashingTF() scala val tf = hashingTF.transform(texts) scala import org.apache.spark.mllib.feature.IDF scala tf.cache() scala val idf = new IDF().fit(tf) scala val tfidf: RDD[Vector] = idf.transform(tf) scala import org.apache.spark.mllib.linalg.distributed.RowMatrix scala val mat = new RowMatrix(tfidf) scala val pc = mat.computePrincipalComponents(2) {quote} ArrayIndexOutOfBoundsException found in executing computePrincipalComponents Key: SPARK-3803 URL: https://issues.apache.org/jira/browse/SPARK-3803 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Masaru Dobashi When I executed computePrincipalComponents method of RowMatrix, I got java.lang.ArrayIndexOutOfBoundsException. {quote} 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at RDDFunctions.scala:111 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161 org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
[jira] [Updated] (SPARK-3803) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents
[ https://issues.apache.org/jira/browse/SPARK-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masaru Dobashi updated SPARK-3803: -- Description: When I executed computePrincipalComponents method of RowMatrix, I got java.lang.ArrayIndexOutOfBoundsException. {code} 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at RDDFunctions.scala:111 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161 org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157) scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) scala.collection.AbstractIterator.aggregate(Iterator.scala:1157) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} The RowMatrix instance was generated from the result of TF-IDF like the following. {code} scala val hashingTF = new HashingTF() scala val tf = hashingTF.transform(texts) scala import org.apache.spark.mllib.feature.IDF scala tf.cache() scala val idf = new IDF().fit(tf) scala val tfidf: RDD[Vector] = idf.transform(tf) scala import org.apache.spark.mllib.linalg.distributed.RowMatrix scala val mat = new RowMatrix(tfidf) scala val pc = mat.computePrincipalComponents(2) {code} was: When I executed computePrincipalComponents method of RowMatrix, I got java.lang.ArrayIndexOutOfBoundsException. 
{quote} 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at RDDFunctions.scala:111 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161 org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157) scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
[jira] [Commented] (SPARK-3794) Building spark core fails with specific hadoop version
[ https://issues.apache.org/jira/browse/SPARK-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159524#comment-14159524 ] Sean Owen commented on SPARK-3794: -- The real problem here is that commons-io is not a dependency of Spark, and should not be, but it began to be used a few days ago in commit https://github.com/apache/spark/commit/cf1d32e3e1071829b152d4b597bf0a0d7a5629a2 for SPARK-1860. So Spark is accidentally depending on whatever version of Commons IO is brought in by third-party dependencies. I will propose a PR that removes this usage in favor of Guava or Java APIs.
Building spark core fails with specific hadoop version -- Key: SPARK-3794 URL: https://issues.apache.org/jira/browse/SPARK-3794 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5 Reporter: cocoatomo Labels: spark
As of commit cf1d32e3e1071829b152d4b597bf0a0d7a5629a2, building Spark core results in a compilation error when certain Hadoop versions are specified. To reproduce this issue, execute the following command with hadoop.version set to 1.1.0, 1.1.1, 1.1.2, 1.2.0, 1.2.1, or 2.2.0.
{noformat}
$ cd ./core
$ mvn -Dhadoop.version=<hadoop.version> -DskipTests clean compile
...
[ERROR] /Users/tomohiko/MyRepos/Scala/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:720: value listFilesAndDirs is not a member of object org.apache.commons.io.FileUtils
[ERROR]     val files = FileUtils.listFilesAndDirs(dir, TrueFileFilter.TRUE, TrueFileFilter.TRUE)
[ERROR]                           ^
{noformat}
Because that compilation uses commons-io version 2.1 and the FileUtils#listFilesAndDirs method was only added in commons-io version 2.2, the compilation always fails. FileUtils#listFilesAndDirs → http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/FileUtils.html#listFilesAndDirs%28java.io.File,%20org.apache.commons.io.filefilter.IOFileFilter,%20org.apache.commons.io.filefilter.IOFileFilter%29 Because hadoop-client in those problematic versions depends on commons-io 2.1, not 2.4, we should assume that commons-io is at version 2.1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
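To make the direction of the proposed fix concrete, a commons-io-free replacement could recursively walk the directory with plain java.io (a hedged sketch of one option, not necessarily what the eventual PR does):

{code}
import java.io.File

// Hypothetical stand-in for FileUtils.listFilesAndDirs(dir, TrueFileFilter.TRUE, TrueFileFilter.TRUE):
// returns the directory itself plus every file and directory underneath it.
def listFilesAndDirs(dir: File): Seq[File] = {
  val children = Option(dir.listFiles()).getOrElse(Array.empty[File])
  dir +: children.toSeq.flatMap { child =>
    if (child.isDirectory) listFilesAndDirs(child) else Seq(child)
  }
}
{code}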
[jira] [Commented] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159559#comment-14159559 ] Oleg Zhurakousky commented on SPARK-3561: - Sandy, one other thing: While I understand the reasoning for changes to the title and the description of the JIRA, it would probably be better to coordinate this with the original submitter before making such changes in the future (similar to the way Patric suggested in SPARK-3174). This would alleviate potential discrepancies in the overall message and intentions of the JIRA. Anyway, I’ve edited both the title and the description taking into consideration your edits. Expose pluggable architecture to facilitate native integration with third-party execution environments. --- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark _integrates with external resource-managing platforms_ such as Apache Hadoop YARN and Mesos to facilitate execution of Spark DAG in a distributed environment provided by those platforms. However, this integration is tightly coupled within Spark's implementation making it rather difficult to introduce integration points with other resource-managing platforms without constant modifications to Spark's core (see comments below for more details). In addition, Spark _does not provide any integration points to a third-party **DAG-like** and **DAG-capable** execution environments_ native to those platforms, thus limiting access to some of their native features (e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management and monitoring and more) as well as specialization aspects of such execution environments (open source and proprietary). As an example, inability to gain access to such features are starting to affect Spark's viability in large scale, batch and/or ETL applications. Introducing a pluggable architecture would solve both of the issues mentioned above ultimately benefitting Spark's technology and community by allowing it to venture into co-existence and collaboration with a variety of existing Big Data platforms as well as the once yet to come to the market. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - as a non-public api (@DeveloperAPI). The trait will define 4 only operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as _execution-context:foo.bar.MyJobExecutionContext_ with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation ensuring binary and source compatibility with older versions of Spark. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending form DefaultExecutionContext. Please see the attached design doc and pull request for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3801) Make app dir cleanup more efficient
[ https://issues.apache.org/jira/browse/SPARK-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh closed SPARK-3801. -- Resolution: Duplicate Make app dir cleanup more efficient --- Key: SPARK-3801 URL: https://issues.apache.org/jira/browse/SPARK-3801 Project: Spark Issue Type: Improvement Reporter: Liang-Chi Hsieh Priority: Minor The newly merged, more conservative app directory cleanup can be made more efficient. Since the newer files do not need to be collected, using 'exists' instead of 'filter' is more efficient: 'exists' short-circuits on the first match and skips the redundant items. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3801) Make app dir cleanup more efficient
[ https://issues.apache.org/jira/browse/SPARK-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159561#comment-14159561 ] Liang-Chi Hsieh commented on SPARK-3801: I think so. Thanks. Make app dir cleanup more efficient --- Key: SPARK-3801 URL: https://issues.apache.org/jira/browse/SPARK-3801 Project: Spark Issue Type: Improvement Reporter: Liang-Chi Hsieh Priority: Minor The newly merged, more conservative app directory cleanup can be made more efficient. Since the newer files do not need to be collected, using 'exists' instead of 'filter' is more efficient: 'exists' short-circuits on the first match and skips the redundant items. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159573#comment-14159573 ] Sean Owen commented on SPARK-3561: -- I'd be interested to see a more specific motivating use case. Is this about using Tez for example, and where does it help to stack Spark on Tez on YARN? or MR2, etc. Spark Core and Tez overlap, to be sure, and I'm not sure how much value it adds to run one on the other. Kind of like running Oracle on MySQL or something. For whatever it is: is it maybe not more natural to integrate the feature into Spark itself? It would be great if it this were all just a matter of one extra trait and interface. In practice I suspect there are a number of hidden assumptions throughout the code that may leak through attempts at this abstraction. I am definitely asking rather than asserting, curious to see more specifics about the upside. Expose pluggable architecture to facilitate native integration with third-party execution environments. --- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark _integrates with external resource-managing platforms_ such as Apache Hadoop YARN and Mesos to facilitate execution of Spark DAG in a distributed environment provided by those platforms. However, this integration is tightly coupled within Spark's implementation making it rather difficult to introduce integration points with other resource-managing platforms without constant modifications to Spark's core (see comments below for more details). In addition, Spark _does not provide any integration points to a third-party **DAG-like** and **DAG-capable** execution environments_ native to those platforms, thus limiting access to some of their native features (e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management and monitoring and more) as well as specialization aspects of such execution environments (open source and proprietary). As an example, inability to gain access to such features are starting to affect Spark's viability in large scale, batch and/or ETL applications. Introducing a pluggable architecture would solve both of the issues mentioned above ultimately benefitting Spark's technology and community by allowing it to venture into co-existence and collaboration with a variety of existing Big Data platforms as well as the once yet to come to the market. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - as a non-public api (@DeveloperAPI). The trait will define 4 only operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as _execution-context:foo.bar.MyJobExecutionContext_ with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation ensuring binary and source compatibility with older versions of Spark. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending form DefaultExecutionContext. Please see the attached design doc and pull request for more details. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3007) Add Dynamic Partition support to Spark Sql hive
[ https://issues.apache.org/jira/browse/SPARK-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159578#comment-14159578 ] Apache Spark commented on SPARK-3007: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/2663 Add Dynamic Partition support to Spark Sql hive --- Key: SPARK-3007 URL: https://issues.apache.org/jira/browse/SPARK-3007 Project: Spark Issue Type: Improvement Components: SQL Reporter: baishuo Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleg Zhurakousky updated SPARK-3561: Description: Currently Spark _integrates with external resource-managing platforms_ such as Apache Hadoop YARN and Mesos to facilitate execution of Spark DAG in a distributed environment provided by those platforms. However, this integration is tightly coupled within Spark's implementation making it rather difficult to introduce integration points with other resource-managing platforms without constant modifications to Spark's core (see comments below for more details). In addition, Spark _does not provide any integration points to a third-party **DAG-like** and **DAG-capable** execution environments_ native to those platforms, thus limiting access to some of their native features (e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management and monitoring and more) as well as specialization aspects of such execution environments (open source and proprietary). As an example, inability to gain access to such features are starting to affect Spark's viability in large scale, batch and/or ETL applications. Introducing a pluggable architecture would solve both of the issues mentioned above ultimately benefitting Spark's technology and community by allowing it to venture into co-existence and collaboration with a variety of existing Big Data platforms as well as the once yet to come to the market. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - as a non-public api (@DeveloperAPI). The trait will define 4 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as _execution-context:foo.bar.MyJobExecutionContext_ with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation ensuring binary and source compatibility with older versions of Spark. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending form DefaultExecutionContext. Please see the attached design doc and pull request for more details. was: Currently Spark _integrates with external resource-managing platforms_ such as Apache Hadoop YARN and Mesos to facilitate execution of Spark DAG in a distributed environment provided by those platforms. However, this integration is tightly coupled within Spark's implementation making it rather difficult to introduce integration points with other resource-managing platforms without constant modifications to Spark's core (see comments below for more details). In addition, Spark _does not provide any integration points to a third-party **DAG-like** and **DAG-capable** execution environments_ native to those platforms, thus limiting access to some of their native features (e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management and monitoring and more) as well as specialization aspects of such execution environments (open source and proprietary). As an example, inability to gain access to such features are starting to affect Spark's viability in large scale, batch and/or ETL applications. 
Introducing a pluggable architecture would solve both of the issues mentioned above ultimately benefitting Spark's technology and community by allowing it to venture into co-existence and collaboration with a variety of existing Big Data platforms as well as the once yet to come to the market. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - as a non-public api (@DeveloperAPI). The trait will define 4 only operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as _execution-context:foo.bar.MyJobExecutionContext_ with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation ensuring binary and source compatibility with older versions of Spark. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending form DefaultExecutionContext. Please see the attached design doc and pull request for more details. Expose pluggable architecture to facilitate native integration with third-party execution environments.
[jira] [Updated] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleg Zhurakousky updated SPARK-3561: Description: Currently Spark _integrates with external resource-managing platforms_ such as Apache Hadoop YARN and Mesos to facilitate execution of Spark DAG in a distributed environment provided by those platforms. However, this integration is tightly coupled within Spark's implementation making it rather difficult to introduce integration points with other resource-managing platforms without constant modifications to Spark's core (see comments below for more details). In addition, Spark _does not provide any integration points to a third-party **DAG-like** and **DAG-capable** execution environments_ native to those platforms, thus limiting access to some of their native features (e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management and monitoring and more) as well as specialization aspects of such execution environments (open source and proprietary). As an example, inability to gain access to such features are starting to affect Spark's viability in large scale, batch and/or ETL applications. Introducing a pluggable architecture would solve both of the issues mentioned above ultimately benefitting Spark's technology and community by allowing it to venture into co-existence and collaboration with a variety of existing Big Data platforms as well as the once yet to come to the market. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - as a non-public api (@DeveloperAPI). The trait will define 4 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as _execution-context:foo.bar.MyJobExecutionContext_ with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation ensuring binary and source compatibility with older versions of Spark. An integrator will now have an option to provide custom implementation of JobExecutionContext by either implementing it from scratch or extending form DefaultExecutionContext. Please see the attached design doc and pull request for more details. was: Currently Spark _integrates with external resource-managing platforms_ such as Apache Hadoop YARN and Mesos to facilitate execution of Spark DAG in a distributed environment provided by those platforms. However, this integration is tightly coupled within Spark's implementation making it rather difficult to introduce integration points with other resource-managing platforms without constant modifications to Spark's core (see comments below for more details). In addition, Spark _does not provide any integration points to a third-party **DAG-like** and **DAG-capable** execution environments_ native to those platforms, thus limiting access to some of their native features (e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management and monitoring and more) as well as specialization aspects of such execution environments (open source and proprietary). As an example, inability to gain access to such features are starting to affect Spark's viability in large scale, batch and/or ETL applications. 
Introducing a pluggable architecture would solve both of the issues mentioned above ultimately benefitting Spark's technology and community by allowing it to venture into co-existence and collaboration with a variety of existing Big Data platforms as well as the once yet to come to the market. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - as a non-public api (@DeveloperAPI). The trait will define 4 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as _execution-context:foo.bar.MyJobExecutionContext_ with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation ensuring binary and source compatibility with older versions of Spark. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending form DefaultExecutionContext. Please see the attached design doc and pull request for more details. Expose pluggable architecture to facilitate native integration with third-party execution environments.
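Reading the proposal above, the pluggable trait would look roughly like the following (a sketch with deliberately simplified signatures; the real signatures would have to mirror SparkContext's and are spelled out in the attached design doc and pull request):

{code}
import scala.reflect.ClassTag
import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Sketch only: the four delegation points named in the proposal, with simplified signatures.
trait JobExecutionContext {
  def hadoopFile[K, V](sc: SparkContext, path: String /* ...input format, key/value classes... */): RDD[(K, V)]
  def newAPIHadoopFile[K, V](sc: SparkContext, path: String /* ... */): RDD[(K, V)]
  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]
  def runJob[T, U: ClassTag](sc: SparkContext, rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U): Array[U]
}
{code}

Per the description, SparkContext would pick the implementation from a master URL of the form execution-context:foo.bar.MyJobExecutionContext and delegate the corresponding methods to it, with the default implementation preserving today's behavior.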
[jira] [Updated] (SPARK-3803) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents
[ https://issues.apache.org/jira/browse/SPARK-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masaru Dobashi updated SPARK-3803: -- Description: When I executed computePrincipalComponents method of RowMatrix, I got java.lang.ArrayIndexOutOfBoundsException. {code} 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at RDDFunctions.scala:111 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161 org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157) scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) scala.collection.AbstractIterator.aggregate(Iterator.scala:1157) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} The RowMatrix instance was generated from the result of TF-IDF like the following. {code} scala val hashingTF = new HashingTF() scala val tf = hashingTF.transform(texts) scala import org.apache.spark.mllib.feature.IDF scala tf.cache() scala val idf = new IDF().fit(tf) scala val tfidf: RDD[Vector] = idf.transform(tf) scala import org.apache.spark.mllib.linalg.distributed.RowMatrix scala val mat = new RowMatrix(tfidf) scala val pc = mat.computePrincipalComponents(2) {code} I think this was because I created HashingTF instance with default numFeatures and Array is used in RowMatrix#computeGramianMatrix method like the following. {code} /** * Computes the Gramian matrix `A^T A`. 
*/ def computeGramianMatrix(): Matrix = { val n = numCols().toInt val nt: Int = n * (n + 1) / 2 // Compute the upper triangular part of the gram matrix. val GU = rows.treeAggregate(new BDV[Double](new Array[Double](nt)))( seqOp = (U, v) = { RowMatrix.dspr(1.0, v, U.data) U }, combOp = (U1, U2) = U1 += U2) RowMatrix.triuToFull(n, GU.data) } {code} When the size of Vectors generated by TF-IDF is too large, it makes nt to have undesirable value (and undesirable size of Array used in treeAggregate), since n * (n + 1) / 2 exceeded Int.MaxValue. Is this surmise correct? And, of course, I could avoid this situation by creating instance of HashingTF with smaller numFeatures. But this seems to be not fundamental solution. was: When I executed computePrincipalComponents method of RowMatrix, I got java.lang.ArrayIndexOutOfBoundsException. {code} 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at RDDFunctions.scala:111 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException:
[jira] [Updated] (SPARK-3803) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents
[ https://issues.apache.org/jira/browse/SPARK-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masaru Dobashi updated SPARK-3803: -- Description: When I executed computePrincipalComponents method of RowMatrix, I got java.lang.ArrayIndexOutOfBoundsException. {code} 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at RDDFunctions.scala:111 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161 org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157) scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) scala.collection.AbstractIterator.aggregate(Iterator.scala:1157) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} The RowMatrix instance was generated from the result of TF-IDF like the following. {code} scala val hashingTF = new HashingTF() scala val tf = hashingTF.transform(texts) scala import org.apache.spark.mllib.feature.IDF scala tf.cache() scala val idf = new IDF().fit(tf) scala val tfidf: RDD[Vector] = idf.transform(tf) scala import org.apache.spark.mllib.linalg.distributed.RowMatrix scala val mat = new RowMatrix(tfidf) scala val pc = mat.computePrincipalComponents(2) {code} I think this was because I created HashingTF instance with default numFeatures and Array is used in RowMatrix#computeGramianMatrix method like the following. {code} /** * Computes the Gramian matrix `A^T A`. 
*/ def computeGramianMatrix(): Matrix = { val n = numCols().toInt val nt: Int = n * (n + 1) / 2 // Compute the upper triangular part of the gram matrix. val GU = rows.treeAggregate(new BDV[Double](new Array[Double](nt)))( seqOp = (U, v) = { RowMatrix.dspr(1.0, v, U.data) U }, combOp = (U1, U2) = U1 += U2) RowMatrix.triuToFull(n, GU.data) } {code} When the size of Vectors generated by TF-IDF is too large, it makes nt to have undesirable value (and undesirable size of Array used in treeAggregate), since n * (n + 1) / 2 exceeded Int.MaxValue. Is this surmise correct? And, of course, I could avoid this situation by creating instance of HashingTF with smaller numFeatures. But this does not seems to be fundamental solution. was: When I executed computePrincipalComponents method of RowMatrix, I got java.lang.ArrayIndexOutOfBoundsException. {code} 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at RDDFunctions.scala:111 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException:
[jira] [Updated] (SPARK-3803) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents
[ https://issues.apache.org/jira/browse/SPARK-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masaru Dobashi updated SPARK-3803: -- Description: GrWhen I executed computePrincipalComponents method of RowMatrix, I got java.lang.ArrayIndexOutOfBoundsException. {code} 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at RDDFunctions.scala:111 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161 org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157) scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) scala.collection.AbstractIterator.aggregate(Iterator.scala:1157) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} The RowMatrix instance was generated from the result of TF-IDF like the following. {code} scala val hashingTF = new HashingTF() scala val tf = hashingTF.transform(texts) scala import org.apache.spark.mllib.feature.IDF scala tf.cache() scala val idf = new IDF().fit(tf) scala val tfidf: RDD[Vector] = idf.transform(tf) scala import org.apache.spark.mllib.linalg.distributed.RowMatrix scala val mat = new RowMatrix(tfidf) scala val pc = mat.computePrincipalComponents(2) {code} I think this was because I created HashingTF instance with default numFeatures and Array is used in RowMatrix#computeGramianMatrix method like the following. {code} /** * Computes the Gramian matrix `A^T A`. 
*/ def computeGramianMatrix(): Matrix = { val n = numCols().toInt val nt: Int = n * (n + 1) / 2 // Compute the upper triangular part of the gram matrix. val GU = rows.treeAggregate(new BDV[Double](new Array[Double](nt)))( seqOp = (U, v) = { RowMatrix.dspr(1.0, v, U.data) U }, combOp = (U1, U2) = U1 += U2) RowMatrix.triuToFull(n, GU.data) } {code} When the size of Vectors generated by TF-IDF is too large, it makes nt to have undesirable value (and undesirable size of Array used in treeAggregate), since n * (n + 1) / 2 exceeded Int.MaxValue. Is this surmise correct? And, of course, I could avoid this situation by creating instance of HashingTF with smaller numFeatures. But this may not be fundamental solution. was: GrWhen I executed computePrincipalComponents method of RowMatrix, I got java.lang.ArrayIndexOutOfBoundsException. {code} 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at RDDFunctions.scala:111 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException:
[jira] [Updated] (SPARK-3803) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents
[ https://issues.apache.org/jira/browse/SPARK-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masaru Dobashi updated SPARK-3803: -- Description: When I executed computePrincipalComponents method of RowMatrix, I got java.lang.ArrayIndexOutOfBoundsException. {code} 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at RDDFunctions.scala:111 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161 org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157) scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) scala.collection.AbstractIterator.aggregate(Iterator.scala:1157) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} The RowMatrix instance was generated from the result of TF-IDF like the following. {code} scala val hashingTF = new HashingTF() scala val tf = hashingTF.transform(texts) scala import org.apache.spark.mllib.feature.IDF scala tf.cache() scala val idf = new IDF().fit(tf) scala val tfidf: RDD[Vector] = idf.transform(tf) scala import org.apache.spark.mllib.linalg.distributed.RowMatrix scala val mat = new RowMatrix(tfidf) scala val pc = mat.computePrincipalComponents(2) {code} I think this was because I created HashingTF instance with default numFeatures and Array is used in RowMatrix#computeGramianMatrix method like the following. {code} /** * Computes the Gramian matrix `A^T A`. 
*/ def computeGramianMatrix(): Matrix = { val n = numCols().toInt val nt: Int = n * (n + 1) / 2 // Compute the upper triangular part of the gram matrix. val GU = rows.treeAggregate(new BDV[Double](new Array[Double](nt)))( seqOp = (U, v) => { RowMatrix.dspr(1.0, v, U.data) U }, combOp = (U1, U2) => U1 += U2) RowMatrix.triuToFull(n, GU.data) } {code} When the Vectors generated by TF-IDF are too large, nt ends up with an undesirable value (and the Array used in treeAggregate an undesirable size), because n * (n + 1) / 2 exceeds Int.MaxValue. Is this surmise correct? Of course, I could avoid this situation by creating the HashingTF instance with a smaller numFeatures, but that may not be a fundamental solution. was: When I executed the computePrincipalComponents method of RowMatrix, I got java.lang.ArrayIndexOutOfBoundsException. {code} 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at RDDFunctions.scala:111 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161
[jira] [Commented] (SPARK-3803) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents
[ https://issues.apache.org/jira/browse/SPARK-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159587#comment-14159587 ] Sean Owen commented on SPARK-3803: -- I agree with your assessment. It would take some work, though not terribly much, to rewrite this method to correctly handle A with more than 46340 columns. At n = 46340, the Gramian already consumes about 8.5GB of memory, so it's kinda getting big to realistically use in core anyway. At the least, an error should be raised if n is too large. Any one else think this should be supported though? Would be nice, but, practically helpful? ArrayIndexOutOfBoundsException found in executing computePrincipalComponents Key: SPARK-3803 URL: https://issues.apache.org/jira/browse/SPARK-3803 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Masaru Dobashi When I executed computePrincipalComponents method of RowMatrix, I got java.lang.ArrayIndexOutOfBoundsException. {code} 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at RDDFunctions.scala:111 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161 org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157) scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) scala.collection.AbstractIterator.aggregate(Iterator.scala:1157) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} The RowMatrix instance was generated from the result of TF-IDF like 
the following. {code} scala val hashingTF = new HashingTF() scala val tf = hashingTF.transform(texts) scala import org.apache.spark.mllib.feature.IDF scala tf.cache() scala val idf = new IDF().fit(tf) scala val tfidf: RDD[Vector] = idf.transform(tf) scala import org.apache.spark.mllib.linalg.distributed.RowMatrix scala val mat = new RowMatrix(tfidf) scala val pc = mat.computePrincipalComponents(2) {code} I think this was because I created HashingTF instance with default numFeatures and Array is used in RowMatrix#computeGramianMatrix method like the following. {code} /** * Computes the Gramian matrix `A^T A`. */ def computeGramianMatrix(): Matrix = { val n = numCols().toInt val nt: Int = n * (n + 1) / 2 // Compute the upper triangular part of the gram matrix. val GU = rows.treeAggregate(new BDV[Double](new Array[Double](nt)))( seqOp = (U, v) =
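To make the overflow suspected in SPARK-3803 concrete, here is a small standalone Scala sketch (not Spark code). It assumes 1 << 20 is HashingTF's default numFeatures in 1.1.0 and shows how n * (n + 1) / 2 wraps around in Int arithmetic, yielding an array far too small for the dspr update, which is consistent with the ArrayIndexOutOfBoundsException reported above.

{code}
// Standalone sketch, not Spark code: illustrates the suspected Int overflow.
// The value 1 << 20 is assumed to be HashingTF's default numFeatures in 1.1.0.
object GramianSizeCheck {
  def main(args: Array[String]): Unit = {
    val n = 1 << 20
    val ntInt: Int = n * (n + 1) / 2           // wraps around to 524288
    val ntLong: Long = n.toLong * (n + 1) / 2  // true value, roughly 5.5e11
    println(s"Int arithmetic gives  $ntInt")
    println(s"Long arithmetic gives $ntLong")
    // An Array[Double] of length ntInt is far too small for the dspr update,
    // so indexing past it produces the ArrayIndexOutOfBoundsException above.
    if (ntLong > Int.MaxValue) {
      println(s"numCols = $n cannot be backed by a single Array[Double]; " +
        "an explicit error would be clearer than the current failure")
    }
  }
}
{code}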
[jira] [Updated] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleg Zhurakousky updated SPARK-3561: Description: Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark. The trait will define 4 only operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as execution-context:foo.bar.MyJobExecutionContext with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending form DefaultExecutionContext. Please see the attached design doc for more details. Pull Request will be posted shortly as well was: Currently Spark _integrates with external resource-managing platforms_ such as Apache Hadoop YARN and Mesos to facilitate execution of Spark DAG in a distributed environment provided by those platforms. However, this integration is tightly coupled within Spark's implementation making it rather difficult to introduce integration points with other resource-managing platforms without constant modifications to Spark's core (see comments below for more details). In addition, Spark _does not provide any integration points to a third-party **DAG-like** and **DAG-capable** execution environments_ native to those platforms, thus limiting access to some of their native features (e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management and monitoring and more) as well as specialization aspects of such execution environments (open source and proprietary). As an example, inability to gain access to such features are starting to affect Spark's viability in large scale, batch and/or ETL applications. Introducing a pluggable architecture would solve both of the issues mentioned above ultimately benefitting Spark's technology and community by allowing it to venture into co-existence and collaboration with a variety of existing Big Data platforms as well as the once yet to come to the market. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - as a non-public api (@DeveloperAPI). The trait will define 4 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. 
JobExecutionContext implementation will be accessed by SparkContext via master URL as _execution-context:foo.bar.MyJobExecutionContext_ with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation ensuring binary and source compatibility with older versions of Spark. An integrator will now have an option to provide custom implementation of JobExecutionContext by either implementing it from scratch or extending form DefaultExecutionContext. Please see the attached design doc and pull request for more details. Expose pluggable architecture to facilitate native integration with third-party execution environments. --- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext
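For readers skimming the SPARK-3561 proposal above, a rough sketch of what the proposed trait might look like follows. It is inferred purely from the description (four operations mirroring SparkContext); the names and heavily simplified signatures are illustrative only and are not taken from the actual pull request.

{code}
// Illustrative sketch only: inferred from the description, not the actual PR.
// Signatures are simplified; the real SparkContext methods take more parameters.
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

trait JobExecutionContext {
  def hadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def newAPIHadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]
  def runJob[T, U: ClassTag](sc: SparkContext, rdd: RDD[T], func: Iterator[T] => U): Array[U]
}

// A custom implementation would then be selected via the master URL, e.g.
// "execution-context:foo.bar.MyJobExecutionContext", per the description above.
{code}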
[jira] [Updated] (SPARK-3597) MesosSchedulerBackend does not implement `killTask`
[ https://issues.apache.org/jira/browse/SPARK-3597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3597: - Assignee: Brenden Matthews MesosSchedulerBackend does not implement `killTask` --- Key: SPARK-3597 URL: https://issues.apache.org/jira/browse/SPARK-3597 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Assignee: Brenden Matthews Fix For: 1.1.1, 1.2.0 The MesosSchedulerBackend class does not implement `killTask`, and therefore results in exceptions like this: 14/09/19 01:52:53 ERROR TaskSetManager: Task 238 in stage 1.0 failed 4 times; aborting job 14/09/19 01:52:53 INFO TaskSchedulerImpl: Cancelling stage 1 14/09/19 01:52:53 INFO DAGScheduler: Could not cancel tasks for stage 1 java.lang.UnsupportedOperationException at org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32) at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:194) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:192) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:192) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:192) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:185) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1211) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1197) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1197) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1197) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at 
akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
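A fix for SPARK-3597 would presumably just forward the kill to Mesos. The fragment below sketches what that might look like, assuming the backend keeps the registered Mesos SchedulerDriver in a `driver` field and that SchedulerBackend.killTask has the (taskId, executorId, interruptThread) signature from 1.1; this is a sketch of a method to add to the class, not necessarily the merged patch.

{code}
// Sketch of a possible MesosSchedulerBackend.killTask, assuming a `driver`
// field holding the registered org.apache.mesos.SchedulerDriver.
import org.apache.mesos.Protos.TaskID

override def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit = {
  // Mesos task ids are strings while Spark task ids are longs, so convert before killing.
  driver.killTask(TaskID.newBuilder().setValue(taskId.toString).build())
}
{code}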
[jira] [Closed] (SPARK-3597) MesosSchedulerBackend does not implement `killTask`
[ https://issues.apache.org/jira/browse/SPARK-3597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3597. Resolution: Fixed Fix Version/s: 1.2.0 Target Version/s: 1.1.1, 1.2.0 MesosSchedulerBackend does not implement `killTask` --- Key: SPARK-3597 URL: https://issues.apache.org/jira/browse/SPARK-3597 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Assignee: Brenden Matthews Fix For: 1.1.1, 1.2.0 The MesosSchedulerBackend class does not implement `killTask`, and therefore results in exceptions like this: 14/09/19 01:52:53 ERROR TaskSetManager: Task 238 in stage 1.0 failed 4 times; aborting job 14/09/19 01:52:53 INFO TaskSchedulerImpl: Cancelling stage 1 14/09/19 01:52:53 INFO DAGScheduler: Could not cancel tasks for stage 1 java.lang.UnsupportedOperationException at org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32) at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:194) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:192) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:192) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:192) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:185) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1211) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1197) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1197) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1197) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at 
akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads.
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleg Zhurakousky updated SPARK-3561: Summary: Native Hadoop/YARN integration for batch/ETL workloads. (was: Expose pluggable architecture to facilitate native integration with third-party execution environments.) Native Hadoop/YARN integration for batch/ETL workloads. --- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark. The trait will define 4 only operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as execution-context:foo.bar.MyJobExecutionContext with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending form DefaultExecutionContext. Please see the attached design doc for more details. Pull Request will be posted shortly as well -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads.
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159593#comment-14159593 ] Oleg Zhurakousky commented on SPARK-3561: - After giving it some thought, I am changing the title and the description back to the original as it would be more appropriate to discuss whatever question anyone may have via comments. Native Hadoop/YARN integration for batch/ETL workloads. --- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark. The trait will define 4 only operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as execution-context:foo.bar.MyJobExecutionContext with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending form DefaultExecutionContext. Please see the attached design doc for more details. Pull Request will be posted shortly as well -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2445) MesosExecutorBackend crashes in fine grained mode
[ https://issues.apache.org/jira/browse/SPARK-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159602#comment-14159602 ] Brenden Matthews edited comment on SPARK-2445 at 10/5/14 5:45 PM: -- [~drexin] can you confirm that SPARK-3535 resolves your issue? was (Author: brenden): [~drexin] can you confirm that https://issues.apache.org/jira/browse/SPARK-3535 resolves your issue? MesosExecutorBackend crashes in fine grained mode - Key: SPARK-2445 URL: https://issues.apache.org/jira/browse/SPARK-2445 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: Dario Rexin When multiple instances of the MesosExecutorBackend are running on the same slave, they will have the same executorId assigned (equal to the mesos slaveId), but will have a different port (which is randomly assigned). Because of this, it can not register a new BlockManager, because one is already registered with the same executorId, but a different BlockManagerId. More description and a fix can be found in this PR on GitHub: https://github.com/apache/spark/pull/1358 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2445) MesosExecutorBackend crashes in fine grained mode
[ https://issues.apache.org/jira/browse/SPARK-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159602#comment-14159602 ] Brenden Matthews commented on SPARK-2445: - [~drexin] can you confirm that https://issues.apache.org/jira/browse/SPARK-3535 resolves your issue? MesosExecutorBackend crashes in fine grained mode - Key: SPARK-2445 URL: https://issues.apache.org/jira/browse/SPARK-2445 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: Dario Rexin When multiple instances of the MesosExecutorBackend are running on the same slave, they will have the same executorId assigned (equal to the mesos slaveId), but will have a different port (which is randomly assigned). Because of this, it can not register a new BlockManager, because one is already registered with the same executorId, but a different BlockManagerId. More description and a fix can be found in this PR on GitHub: https://github.com/apache/spark/pull/1358 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3804) Output of Generator expressions is not stable after serialization.
Michael Armbrust created SPARK-3804: --- Summary: Output of Generator expressions is not stable after serialization. Key: SPARK-3804 URL: https://issues.apache.org/jira/browse/SPARK-3804 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust The output of a generator expression (such as explode) can change after serialization. This caused a nasty data corruption bug. While this bug has since been addressed, it would be good to fix this since it violates the general contract of query plans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159707#comment-14159707 ] Andrew Ash commented on SPARK-1860: --- Filed as SPARK-3805 Standalone Worker cleanup should not clean up running executors --- Key: SPARK-1860 URL: https://issues.apache.org/jira/browse/SPARK-1860 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Aaron Davidson Priority: Blocker The default values of the standalone worker cleanup code cleanup all application data every 7 days. This includes jars that were added to any executors that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Executor's log/data folders should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3805) Enable Standalone worker cleanup by default
Andrew Ash created SPARK-3805: - Summary: Enable Standalone worker cleanup by default Key: SPARK-3805 URL: https://issues.apache.org/jira/browse/SPARK-3805 Project: Spark Issue Type: Task Reporter: Andrew Ash Now that SPARK-1860 is fixed we should be able to turn on {{spark.worker.cleanup.enabled}} by default -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
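Until the default changes, the cleanup in SPARK-3805 can be switched on explicitly on each standalone worker. Something along these lines in conf/spark-env.sh should do it; the interval and TTL values below are just the documented defaults, shown for illustration.

{noformat}
# conf/spark-env.sh on each standalone worker (values are illustrative)
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
  -Dspark.worker.cleanup.interval=1800 \
  -Dspark.worker.cleanup.appDataTtl=604800"
{noformat}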
[jira] [Commented] (SPARK-3805) Enable Standalone worker cleanup by default
[ https://issues.apache.org/jira/browse/SPARK-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159711#comment-14159711 ] Apache Spark commented on SPARK-3805: - User 'ash211' has created a pull request for this issue: https://github.com/apache/spark/pull/2664 Enable Standalone worker cleanup by default --- Key: SPARK-3805 URL: https://issues.apache.org/jira/browse/SPARK-3805 Project: Spark Issue Type: Task Reporter: Andrew Ash Now that SPARK-1860 is fixed we should be able to turn on {{spark.worker.cleanup.enabled}} by default -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3464) Graceful decommission of executors
[ https://issues.apache.org/jira/browse/SPARK-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159712#comment-14159712 ] Andrew Ash commented on SPARK-3464: --- [~pwendell] what JIRA replaced this one? Graceful decommission of executors -- Key: SPARK-3464 URL: https://issues.apache.org/jira/browse/SPARK-3464 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Sandy Ryza Assignee: Andrew Or In most cases, even when an application is utilizing only a small fraction of its available resources, executors will still have tasks running or blocks cached. It would be useful to have a mechanism for waiting for running tasks on an executor to finish and migrating its cached blocks elsewhere before discarding it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3166) Custom serialisers can't be shipped in application jars
[ https://issues.apache.org/jira/browse/SPARK-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-3166: -- Target Version/s: 1.2.0 Custom serialisers can't be shipped in application jars --- Key: SPARK-3166 URL: https://issues.apache.org/jira/browse/SPARK-3166 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Graham Dennis Spark cannot currently use a custom serialiser that is shipped with the application jar. Trying to do this causes a java.lang.ClassNotFoundException when trying to instantiate the custom serialiser in the Executor processes. This occurs because Spark attempts to instantiate the custom serialiser before the application jar has been shipped to the Executor process. A reproduction of the problem is available here: https://github.com/GrahamDennis/spark-custom-serialiser I've verified this problem in Spark 1.0.2, and Spark master and 1.1 branches as of August 21, 2014. This issue is related to SPARK-2878, and my fix for that issue (https://github.com/apache/spark/pull/1890) also solves this. My pull request was not merged because it adds the user jar to the Executor processes' class path at launch time. Such a significant change was thought by [~rxin] to require more QA, and should be considered for inclusion in 1.2 at the earliest. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3755) Do not bind port 1 - 1024 to server in spark
[ https://issues.apache.org/jira/browse/SPARK-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159737#comment-14159737 ] Andrew Ash commented on SPARK-3755: --- I think you can only edit the title and description of a ticket when the ticket is open (and this is now closed) Do not bind port 1 - 1024 to server in spark Key: SPARK-3755 URL: https://issues.apache.org/jira/browse/SPARK-3755 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: wangfei Assignee: wangfei Fix For: 1.1.1, 1.2.0 A non-root user using ports 1-1024 to start the jetty server will get the exception java.net.SocketException: Permission denied -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
[ https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-3369: -- Labels: breaking_change (was: ) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator - Key: SPARK-3369 URL: https://issues.apache.org/jira/browse/SPARK-3369 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.0.2 Reporter: Sean Owen Priority: Critical Labels: breaking_change Attachments: FlatMapIterator.patch {{mapPartitions}} in the Scala RDD API takes a function that transforms an {{Iterator}} to an {{Iterator}}: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency though, because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself. Similarly for other {{mapPartitions*}} methods and other {{*FlatMapFunctions}}s in Java. (Is there a reason for this difference that I'm overlooking?) If I'm right that this was an inadvertent inconsistency, then the big issue here is that of course this is part of a public API. Workarounds I can think of: Promise that Spark will only call {{iterator()}} once, so implementors can use a hacky {{IteratorIterable}} that returns the same {{Iterator}}. Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the desired signature, and deprecate existing ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
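For reference, the "hacky IteratorIterable" workaround mentioned in the first SPARK-3369 option would be roughly the following. This is only a sketch, and it is only safe if Spark really does call iterator() exactly once.

{code}
// Sketch of the IteratorIterable hack: an Iterable that always hands back the
// same underlying Iterator, so it can only be traversed once.
class IteratorIterable[T](it: java.util.Iterator[T]) extends java.lang.Iterable[T] {
  override def iterator(): java.util.Iterator[T] = it
}

// Hypothetical use inside a FlatMapFunction over partitions:
//   new IteratorIterable(transformedIterator)
{code}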
[jira] [Resolved] (SPARK-3792) enable JavaHiveQLSuite
[ https://issues.apache.org/jira/browse/SPARK-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3792. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2652 [https://github.com/apache/spark/pull/2652] enable JavaHiveQLSuite -- Key: SPARK-3792 URL: https://issues.apache.org/jira/browse/SPARK-3792 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3645) Make caching using SQL commands eager by default, with the option of being lazy
[ https://issues.apache.org/jira/browse/SPARK-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3645. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2513 [https://github.com/apache/spark/pull/2513] Make caching using SQL commands eager by default, with the option of being lazy --- Key: SPARK-3645 URL: https://issues.apache.org/jira/browse/SPARK-3645 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3776) Wrong conversion to Catalyst for Option[Product]
[ https://issues.apache.org/jira/browse/SPARK-3776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3776. - Resolution: Fixed Issue resolved by pull request 2641 [https://github.com/apache/spark/pull/2641] Wrong conversion to Catalyst for Option[Product] Key: SPARK-3776 URL: https://issues.apache.org/jira/browse/SPARK-3776 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Renat Yusupov Fix For: 1.2.0 Method ScalaReflection.convertToCatalyst make wrong conversion for Option[Product] data. For example: case class A(intValue: Int) case class B(optionA: Option[A]) val b = B(Some(A(5))) ScalaReflection.convertToCatalyst(b) returns Seq(A(5)) instead of Seq(Seq(5)) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
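To spell out the expectation in the SPARK-3776 report, here is a toy recursive conversion (not Spark's ScalaReflection code) that treats Option and Product the way the reporter expects, unwrapping the Option before converting its contents.

{code}
// Toy conversion, for illustration only; Spark's real logic lives in ScalaReflection.
def toCatalyst(a: Any): Any = a match {
  case Some(v)    => toCatalyst(v)                            // unwrap the Option first
  case None       => null
  case p: Product => p.productIterator.map(toCatalyst).toSeq  // case classes become Seqs
  case other      => other                                    // primitives pass through
}

case class A(intValue: Int)
case class B(optionA: Option[A])

println(toCatalyst(B(Some(A(5)))))   // Seq(Seq(5)), the value the reporter expects
{code}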
[jira] [Resolved] (SPARK-3794) Building spark core fails due to inadvertent dependency on Commons IO
[ https://issues.apache.org/jira/browse/SPARK-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3794. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2662 [https://github.com/apache/spark/pull/2662] Building spark core fails due to inadvertent dependency on Commons IO - Key: SPARK-3794 URL: https://issues.apache.org/jira/browse/SPARK-3794 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5 Reporter: cocoatomo Labels: spark Fix For: 1.2.0 At the commit cf1d32e3e1071829b152d4b597bf0a0d7a5629a2, building spark core results in a compilation error when we specify some hadoop versions. To reproduce this issue, we should execute the following command with hadoop.version=1.1.0, 1.1.1, 1.1.2, 1.2.0, 1.2.1, or 2.2.0. {noformat} $ cd ./core $ mvn -Dhadoop.version=<hadoop.version> -DskipTests clean compile ... [ERROR] /Users/tomohiko/MyRepos/Scala/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:720: value listFilesAndDirs is not a member of object org.apache.commons.io.FileUtils [ERROR] val files = FileUtils.listFilesAndDirs(dir, TrueFileFilter.TRUE, TrueFileFilter.TRUE) [ERROR] ^ {noformat} Because that compilation uses commons-io version 2.1 and the FileUtils#listFilesAndDirs method was added in commons-io version 2.2, this compilation always fails. FileUtils#listFilesAndDirs → http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/FileUtils.html#listFilesAndDirs%28java.io.File,%20org.apache.commons.io.filefilter.IOFileFilter,%20org.apache.commons.io.filefilter.IOFileFilter%29 Because hadoop-client in those problematic versions depends on commons-io 2.1, not 2.4, we should assume that commons-io is version 2.1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
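One way to stay compatible with commons-io 2.1 is to avoid listFilesAndDirs altogether and walk the directory with plain java.io, as sketched below. This is illustrative only and not necessarily what pull request 2662 actually does.

{code}
// Illustrative replacement that needs no commons-io at all: returns the
// directory itself plus every file and sub-directory beneath it.
import java.io.File

def listFilesAndDirs(dir: File): Seq[File] = {
  val children = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
  dir +: children.flatMap(f => if (f.isDirectory) listFilesAndDirs(f) else Seq(f))
}
{code}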
[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms
[ https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-3568: -- Description: Include widely-used metrics for ranking algorithms, including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[Array[Double], Array[Double]]*. The following methods will be implemented: {code:title=RankingMetrics.scala|borderStyle=solid} class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) { /* Returns the precsion@k for each query */ lazy val precAtK: RDD[Array[Double]] /* Returns the average precision for each query */ lazy val avePrec: RDD[Double] /*Returns the mean average precision (MAP) of all the queries*/ lazy val meanAvePrec: Double /*Returns the normalized discounted cumulative gain for each query */ lazy val ndcg: RDD[Double] /* Returns the mean NDCG of all the queries */ lazy val meanNdcg: Double } {code} was: Include widely-used metrics for ranking algorithms, including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[Array[Double], Array[Double]]*. The following methods will be implemented: - *averagePrecision(position: Int): Double* this is the presicion@position - *meanAveragePrecision*: the average of precision@n for all values of n - *ndcg* Add metrics for ranking algorithms -- Key: SPARK-3568 URL: https://issues.apache.org/jira/browse/SPARK-3568 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Shuo Xiang Assignee: Shuo Xiang Include widely-used metrics for ranking algorithms, including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[Array[Double], Array[Double]]*. The following methods will be implemented: {code:title=RankingMetrics.scala|borderStyle=solid} class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) { /* Returns the precsion@k for each query */ lazy val precAtK: RDD[Array[Double]] /* Returns the average precision for each query */ lazy val avePrec: RDD[Double] /*Returns the mean average precision (MAP) of all the queries*/ lazy val meanAvePrec: Double /*Returns the normalized discounted cumulative gain for each query */ lazy val ndcg: RDD[Double] /* Returns the mean NDCG of all the queries */ lazy val meanNdcg: Double } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms
[ https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-3568: -- Description: Include common metrics for ranking algorithms (http://www-nlp.stanford.edu/IR-book/), including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[Array[Double], Array[Double]]*. The following methods will be implemented: {code:title=RankingMetrics.scala|borderStyle=solid} class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) { /* Returns the precsion@k for each query */ lazy val precAtK: RDD[Array[Double]] /* Returns the average precision for each query */ lazy val avePrec: RDD[Double] /*Returns the mean average precision (MAP) of all the queries*/ lazy val meanAvePrec: Double /*Returns the normalized discounted cumulative gain for each query */ lazy val ndcg: RDD[Double] /* Returns the mean NDCG of all the queries */ lazy val meanNdcg: Double } {code} was: Include common metrics for ranking algorithms(http://www-nlp.stanford.edu/IR-book/), including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[Array[Double], Array[Double]]*. The following methods will be implemented: {code:title=RankingMetrics.scala|borderStyle=solid} class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) { /* Returns the precsion@k for each query */ lazy val precAtK: RDD[Array[Double]] /* Returns the average precision for each query */ lazy val avePrec: RDD[Double] /*Returns the mean average precision (MAP) of all the queries*/ lazy val meanAvePrec: Double /*Returns the normalized discounted cumulative gain for each query */ lazy val ndcg: RDD[Double] /* Returns the mean NDCG of all the queries */ lazy val meanNdcg: Double } {code} Add metrics for ranking algorithms -- Key: SPARK-3568 URL: https://issues.apache.org/jira/browse/SPARK-3568 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Shuo Xiang Assignee: Shuo Xiang Include common metrics for ranking algorithms (http://www-nlp.stanford.edu/IR-book/), including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[Array[Double], Array[Double]]*. 
The following methods will be implemented: {code:title=RankingMetrics.scala|borderStyle=solid} class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) { /* Returns the precsion@k for each query */ lazy val precAtK: RDD[Array[Double]] /* Returns the average precision for each query */ lazy val avePrec: RDD[Double] /*Returns the mean average precision (MAP) of all the queries*/ lazy val meanAvePrec: Double /*Returns the normalized discounted cumulative gain for each query */ lazy val ndcg: RDD[Double] /* Returns the mean NDCG of all the queries */ lazy val meanNdcg: Double } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms
[ https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-3568: -- Description: Include common metrics for ranking algorithms(http://www-nlp.stanford.edu/IR-book/), including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[Array[Double], Array[Double]]*. The following methods will be implemented: {code:title=RankingMetrics.scala|borderStyle=solid} class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) { /* Returns the precsion@k for each query */ lazy val precAtK: RDD[Array[Double]] /* Returns the average precision for each query */ lazy val avePrec: RDD[Double] /*Returns the mean average precision (MAP) of all the queries*/ lazy val meanAvePrec: Double /*Returns the normalized discounted cumulative gain for each query */ lazy val ndcg: RDD[Double] /* Returns the mean NDCG of all the queries */ lazy val meanNdcg: Double } {code} was: Include widely-used metrics for ranking algorithms, including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[Array[Double], Array[Double]]*. The following methods will be implemented: {code:title=RankingMetrics.scala|borderStyle=solid} class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) { /* Returns the precsion@k for each query */ lazy val precAtK: RDD[Array[Double]] /* Returns the average precision for each query */ lazy val avePrec: RDD[Double] /*Returns the mean average precision (MAP) of all the queries*/ lazy val meanAvePrec: Double /*Returns the normalized discounted cumulative gain for each query */ lazy val ndcg: RDD[Double] /* Returns the mean NDCG of all the queries */ lazy val meanNdcg: Double } {code} Add metrics for ranking algorithms -- Key: SPARK-3568 URL: https://issues.apache.org/jira/browse/SPARK-3568 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Shuo Xiang Assignee: Shuo Xiang Include common metrics for ranking algorithms(http://www-nlp.stanford.edu/IR-book/), including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[Array[Double], Array[Double]]*. 
The following methods will be implemented: {code:title=RankingMetrics.scala|borderStyle=solid} class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) { /* Returns the precsion@k for each query */ lazy val precAtK: RDD[Array[Double]] /* Returns the average precision for each query */ lazy val avePrec: RDD[Double] /*Returns the mean average precision (MAP) of all the queries*/ lazy val meanAvePrec: Double /*Returns the normalized discounted cumulative gain for each query */ lazy val ndcg: RDD[Double] /* Returns the mean NDCG of all the queries */ lazy val meanNdcg: Double } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
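Assuming the RankingMetrics class proposed in SPARK-3568 above, usage would look roughly like the following. The RDD contents and the `sc` SparkContext are made up for illustration, and the method names follow the proposal, so they may change before merge.

{code}
// Hypothetical usage of the proposed RankingMetrics; names follow the sketch above.
import org.apache.spark.rdd.RDD

// For each query: the ranked document ids we predicted, and the relevant ids.
val predictionAndLabels: RDD[(Array[Double], Array[Double])] = sc.parallelize(Seq(
  (Array(1.0, 6.0, 2.0, 7.0), Array(1.0, 2.0, 3.0)),
  (Array(4.0, 1.0, 5.0, 6.0), Array(4.0, 5.0))
))

val metrics = new RankingMetrics(predictionAndLabels)
println(metrics.meanAvePrec)   // mean average precision over the two queries
println(metrics.meanNdcg)      // mean NDCG over the two queries
{code}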
[jira] [Created] (SPARK-3806) minor bug and exception in CliSuite
wangfei created SPARK-3806: -- Summary: minor bug and exception in CliSuite Key: SPARK-3806 URL: https://issues.apache.org/jira/browse/SPARK-3806 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei CliSuite throws an exception as follows: Exception in thread Thread-6 java.lang.IndexOutOfBoundsException: 6 at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43) at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47) at org.apache.spark.sql.hive.thriftserver.CliSuite.org$apache$spark$sql$hive$thriftserver$CliSuite$$captureOutput$1(CliSuite.scala:67) at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78) at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78) at scala.sys.process.ProcessLogger$$anon$1.out(ProcessLogger.scala:96) at scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135) at scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135) at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:175) at scala.sys.process.BasicIO$.processLinesFully(BasicIO.scala:179) at scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:164) at scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:162) at scala.sys.process.ProcessBuilderImpl$Simple$$anonfun$3.apply$mcV$sp(ProcessBuilderImpl.scala:73) at scala.sys.process.ProcessImpl$Spawn$$anon$1.run(ProcessImpl.scala:22) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3805) Enable Standalone worker cleanup by default
[ https://issues.apache.org/jira/browse/SPARK-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159837#comment-14159837 ] Patrick Wendell commented on SPARK-3805: I commented on the PR - but because this deletes user data, I don't think it's a good idea to enable it by default. Enable Standalone worker cleanup by default --- Key: SPARK-3805 URL: https://issues.apache.org/jira/browse/SPARK-3805 Project: Spark Issue Type: Task Reporter: Andrew Ash Now that SPARK-1860 is fixed we should be able to turn on {{spark.worker.cleanup.enabled}} by default -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads.
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159838#comment-14159838 ] Patrick Wendell commented on SPARK-3561: [~ozhurakousky] do you have a timeline for posting a more complete design doc for what the idea is with this? Native Hadoop/YARN integration for batch/ETL workloads. --- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark. The trait will define 4 only operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as execution-context:foo.bar.MyJobExecutionContext with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending form DefaultExecutionContext. Please see the attached design doc for more details. Pull Request will be posted shortly as well -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3561: --- Summary: Allow for pluggable execution contexts in Spark (was: Native Hadoop/YARN integration for batch/ETL workloads.) Allow for pluggable execution contexts in Spark --- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark. The trait will define 4 only operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as execution-context:foo.bar.MyJobExecutionContext with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending form DefaultExecutionContext. Please see the attached design doc for more details. Pull Request will be posted shortly as well -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159840#comment-14159840 ] Patrick Wendell commented on SPARK-3561: I also changed the title here that reflects the current design doc and JIRA. We have a culture in the project of having JIRA titles reflect accurately the current proposal. We can change it again if a new doc causes the scope of this to change. [~ozhurakousky] I'd prefer not to change the title back until there is a new design proposed. The problem is that people are confusing this with SPARK-3174 and SPARK-3797. Allow for pluggable execution contexts in Spark --- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark. The trait will define 4 only operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as execution-context:foo.bar.MyJobExecutionContext with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending form DefaultExecutionContext. Please see the attached design doc for more details. Pull Request will be posted shortly as well -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159840#comment-14159840 ] Patrick Wendell edited comment on SPARK-3561 at 10/6/14 3:55 AM: - I also changed the title here that reflects the current design doc and pull request. We have a culture in the project of having JIRA titles reflect accurately the current proposal. We can change it again if a new doc causes the scope of this to change. [~ozhurakousky] I'd prefer not to change the title back until there is a new design proposed. The problem is that people are confusing this with SPARK-3174 and SPARK-3797. was (Author: pwendell): I also changed the title here that reflects the current design doc and JIRA. We have a culture in the project of having JIRA titles reflect accurately the current proposal. We can change it again if a new doc causes the scope of this to change. [~ozhurakousky] I'd prefer not to change the title back until there is a new design proposed. The problem is that people are confusing this with SPARK-3174 and SPARK-3797. Allow for pluggable execution contexts in Spark --- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark. The trait will define 4 only operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as execution-context:foo.bar.MyJobExecutionContext with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending form DefaultExecutionContext. Please see the attached design doc for more details. Pull Request will be posted shortly as well -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3761) Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4
[ https://issues.apache.org/jira/browse/SPARK-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3761: --- Priority: Major (was: Blocker) Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4 - Key: SPARK-3761 URL: https://issues.apache.org/jira/browse/SPARK-3761 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Igor Tkachenko I have Scala code: val master = "spark://server address:7077" val sc = new SparkContext(new SparkConf() .setMaster(master) .setAppName("SparkQueryDemo 01") .set("spark.executor.memory", "512m")) val count2 = sc .textFile("hdfs://server address:8020/tmp/data/risk/account.txt") .filter(line => line.contains("Word")) .count() I've got such an error: [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host server address: java.lang.ClassNotFoundException: SimpleApp$$anonfun$1 My dependencies: object Version { val spark = "1.0.0-cdh5.1.0" } object Library { val sparkCore = "org.apache.spark" % "spark-assembly_2.10" % Version.spark } My OS is Win 7, sbt 13.5, Scala 2.10.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
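Not a confirmed diagnosis, but a common cause of ClassNotFoundException for ...$$anonfun$1 classes against a standalone master is that the application jar containing the compiled closures is never shipped to the executors. One conventional way to ship it from code is sketched below; the jar path is a made-up placeholder and whether this applies to the reporter's setup is not established here.
{code}
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://server-address:7077")   // placeholder master URL
      .setAppName("SparkQueryDemo 01")
      .set("spark.executor.memory", "512m")
      // Ship the application jar (which contains SimpleApp$$anonfun$1) to executors.
      // The path below is hypothetical; use the jar produced by `sbt package`.
      .setJars(Seq("target/scala-2.10/simpleapp_2.10-0.1.jar"))
    val sc = new SparkContext(conf)
    val count = sc.textFile("hdfs://server-address:8020/tmp/data/risk/account.txt")
      .filter(line => line.contains("Word"))
      .count()
    println(count)
    sc.stop()
  }
}
{code}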
[jira] [Commented] (SPARK-3806) minor bug and exception in CliSuite
[ https://issues.apache.org/jira/browse/SPARK-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159842#comment-14159842 ] Apache Spark commented on SPARK-3806: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/2666 minor bug and exception in CliSuite --- Key: SPARK-3806 URL: https://issues.apache.org/jira/browse/SPARK-3806 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Clisuite throw exception as follows: Exception in thread Thread-6 java.lang.IndexOutOfBoundsException: 6 at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43) at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47) at org.apache.spark.sql.hive.thriftserver.CliSuite.org$apache$spark$sql$hive$thriftserver$CliSuite$$captureOutput$1(CliSuite.scala:67) at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78) at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78) at scala.sys.process.ProcessLogger$$anon$1.out(ProcessLogger.scala:96) at scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135) at scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135) at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:175) at scala.sys.process.BasicIO$.processLinesFully(BasicIO.scala:179) at scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:164) at scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:162) at scala.sys.process.ProcessBuilderImpl$Simple$$anonfun$3.apply$mcV$sp(ProcessBuilderImpl.scala:73) at scala.sys.process.ProcessImpl$Spawn$$anon$1.run(ProcessImpl.scala:22) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3761) Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4
[ https://issues.apache.org/jira/browse/SPARK-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3761: --- Component/s: Spark Core Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4 - Key: SPARK-3761 URL: https://issues.apache.org/jira/browse/SPARK-3761 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Igor Tkachenko Priority: Blocker I have Scala code: val master = "spark://server address:7077" val sc = new SparkContext(new SparkConf() .setMaster(master) .setAppName("SparkQueryDemo 01") .set("spark.executor.memory", "512m")) val count2 = sc .textFile("hdfs://server address:8020/tmp/data/risk/account.txt") .filter(line => line.contains("Word")) .count() I've got such an error: [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host server address: java.lang.ClassNotFoundException: SimpleApp$$anonfun$1 My dependencies: object Version { val spark = "1.0.0-cdh5.1.0" } object Library { val sparkCore = "org.apache.spark" % "spark-assembly_2.10" % Version.spark } My OS is Win 7, sbt 13.5, Scala 2.10.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3761) Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4
[ https://issues.apache.org/jira/browse/SPARK-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3761: --- Priority: Critical (was: Major) Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4 - Key: SPARK-3761 URL: https://issues.apache.org/jira/browse/SPARK-3761 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Igor Tkachenko Priority: Critical I have Scala code: val master = "spark://server address:7077" val sc = new SparkContext(new SparkConf() .setMaster(master) .setAppName("SparkQueryDemo 01") .set("spark.executor.memory", "512m")) val count2 = sc .textFile("hdfs://server address:8020/tmp/data/risk/account.txt") .filter(line => line.contains("Word")) .count() I've got such an error: [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host server address: java.lang.ClassNotFoundException: SimpleApp$$anonfun$1 My dependencies: object Version { val spark = "1.0.0-cdh5.1.0" } object Library { val sparkCore = "org.apache.spark" % "spark-assembly_2.10" % Version.spark } My OS is Win 7, sbt 13.5, Scala 2.10.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3806) minor bug in CliSuite
[ https://issues.apache.org/jira/browse/SPARK-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei updated SPARK-3806: --- Summary: minor bug in CliSuite (was: minor bug and exception in CliSuite) minor bug in CliSuite - Key: SPARK-3806 URL: https://issues.apache.org/jira/browse/SPARK-3806 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Clisuite throw exception as follows: Exception in thread Thread-6 java.lang.IndexOutOfBoundsException: 6 at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43) at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47) at org.apache.spark.sql.hive.thriftserver.CliSuite.org$apache$spark$sql$hive$thriftserver$CliSuite$$captureOutput$1(CliSuite.scala:67) at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78) at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78) at scala.sys.process.ProcessLogger$$anon$1.out(ProcessLogger.scala:96) at scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135) at scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135) at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:175) at scala.sys.process.BasicIO$.processLinesFully(BasicIO.scala:179) at scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:164) at scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:162) at scala.sys.process.ProcessBuilderImpl$Simple$$anonfun$3.apply$mcV$sp(ProcessBuilderImpl.scala:73) at scala.sys.process.ProcessImpl$Spawn$$anon$1.run(ProcessImpl.scala:22) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3783) The type parameters for SparkContext.accumulable are inconsistent with Accumulable itself
[ https://issues.apache.org/jira/browse/SPARK-3783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3783: --- Assignee: Nathan Kronenfeld The type parameters for SparkContext.accumulable are inconsistent with Accumulable itself Key: SPARK-3783 URL: https://issues.apache.org/jira/browse/SPARK-3783 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Nathan Kronenfeld Assignee: Nathan Kronenfeld Priority: Minor Fix For: 1.2.0 Original Estimate: 10m Remaining Estimate: 10m SparkContext.accumulable takes type parameters [T, R] - and passes them to Accumulable, in that order. Accumulable takes type parameters [R, T]. So T for SparkContext.accumulable corresponds with R for Accumulable and vice versa. Minor, but very confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
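To make the naming mismatch concrete, here is a small, self-contained sketch written against the 1.1.x-era signatures as described above (local mode; the object and accumulator names are purely illustrative). The point is only that the positions line up while the type-parameter letters are swapped between the two declarations.
{code}
import org.apache.spark.{AccumulableParam, SparkConf, SparkContext}

object AccumulableNamingExample {
  // AccumulableParam is declared as AccumulableParam[R, T]:
  // R = the accumulated value (here List[Int]), T = the element type (Int).
  implicit object ListOfIntParam extends AccumulableParam[List[Int], Int] {
    def addAccumulator(acc: List[Int], elem: Int): List[Int] = elem :: acc
    def addInPlace(a: List[Int], b: List[Int]): List[Int] = a ::: b
    def zero(initial: List[Int]): List[Int] = Nil
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("accumulable-naming"))
    // SparkContext.accumulable is declared as accumulable[T, R], so here its
    // "T" is List[Int] (what Accumulable calls R) and its "R" is Int (what
    // Accumulable calls T): same positions, swapped letters.
    val acc = sc.accumulable(List.empty[Int])
    sc.parallelize(1 to 5).foreach(acc += _)
    println(acc.value)
    sc.stop()
  }
}
{code}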
[jira] [Resolved] (SPARK-3783) The type parameters for SparkContext.accumulable are inconsistent with Accumulable itself
[ https://issues.apache.org/jira/browse/SPARK-3783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3783. Resolution: Fixed Fix Version/s: 1.2.0 https://github.com/apache/spark/pull/2637 The type parameters for SparkContext.accumulable are inconsistent with Accumulable itself Key: SPARK-3783 URL: https://issues.apache.org/jira/browse/SPARK-3783 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Nathan Kronenfeld Priority: Minor Fix For: 1.2.0 Original Estimate: 10m Remaining Estimate: 10m SparkContext.accumulable takes type parameters [T, R] - and passes them to Accumulable, in that order. Accumulable takes type parameters [R, T]. So T for SparkContext.accumulable corresponds with R for Accumulable and vice versa. Minor, but very confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3782) AkkaUtils directly using log4j
[ https://issues.apache.org/jira/browse/SPARK-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159846#comment-14159846 ] Patrick Wendell commented on SPARK-3782: Would you mind pasting the exact exception? Also, what happens if you set spark.akka.logLifecycleEvents to true - does that work as a workaround? AkkaUtils directly using log4j -- Key: SPARK-3782 URL: https://issues.apache.org/jira/browse/SPARK-3782 Project: Spark Issue Type: Bug Reporter: Martin Gilday AkkaUtils is calling setLevel on Logger from log4j. This causes issues when using another implementation of SLF4J such as logback, as the log4j-over-slf4j.jar's implementation of this class does not contain this method on Logger. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
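For context, the failing pattern is roughly the one below: programmatically adjusting a log4j level. Against the real log4j this is fine; with log4j-over-slf4j on the classpath (e.g. when routing everything to logback) the ticket reports that the bridge's Logger class lacks setLevel, so the call fails at runtime. The logger name here is illustrative, not necessarily the exact one AkkaUtils touches.
{code}
import org.apache.log4j.{Level, Logger}

object Log4jLevelTweak {
  def quietRemoting(): Unit = {
    // Works against log4j proper; reportedly breaks under log4j-over-slf4j.
    Logger.getLogger("Remoting").setLevel(Level.ERROR)
  }
}
{code}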
[jira] [Updated] (SPARK-3782) Direct use of log4j in AkkaUtils interferes with certain logging configurations
[ https://issues.apache.org/jira/browse/SPARK-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3782: --- Summary: Direct use of log4j in AkkaUtils interferes with certain logging configurations (was: AkkaUtils directly using log4j) Direct use of log4j in AkkaUtils interferes with certain logging configurations Key: SPARK-3782 URL: https://issues.apache.org/jira/browse/SPARK-3782 Project: Spark Issue Type: Bug Reporter: Martin Gilday AkkaUtils is calling setLevel on Logger from log4j. This causes issues when using another implementation of SLF4J such as logback as log4j-over-slf4j.jars implementation of this class does not contain this method on Logger. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3314) Script creation of AMIs
[ https://issues.apache.org/jira/browse/SPARK-3314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3314. Resolution: Fixed Assignee: Patrick Wendell So I think the original goal of this JIRA was just to provide some way for users to re-create our AMI which has been solved by the work in spark-ec2. https://github.com/mesos/spark-ec2/blob/v3/create_image.sh Nick - maybe we can open a new issue with your work? It might be nice to see a design proposal for that as well. Script creation of AMIs --- Key: SPARK-3314 URL: https://issues.apache.org/jira/browse/SPARK-3314 Project: Spark Issue Type: Improvement Components: EC2 Reporter: holdenk Assignee: Patrick Wendell Priority: Minor The current Spark AMIs have been built up over time. It would be useful to provide a script which can be used to bootstrap from a fresh Amazon AMI. We could also update the AMIs in the project at the same time to be based on a newer version so we don't have to wait so long for the security updates to be installed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag
[ https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3694: --- Assignee: (was: Patrick Wendell) Allow printing object graph of tasks/RDD's with a debug flag Key: SPARK-3694 URL: https://issues.apache.org/jira/browse/SPARK-3694 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Labels: starter This would be useful for debugging extra references inside of RDD's Here is an example for inspiration: http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html We'd want to print this trace for both the RDD serialization inside of the DAGScheduler and the task serialization in the TaskSetManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
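The linked ObjectGraphWalker gives the flavor; below is a bare-bones, illustrative Scala sketch of the same idea: reflectively walking instance fields and printing the path to each reachable object. A real implementation hooked into the DAGScheduler and TaskSetManager serialization paths would also need to handle arrays, size accounting, and a debug flag to switch it on, none of which is shown here.
{code}
import java.lang.reflect.Modifier
import java.util.{IdentityHashMap => JIdentityMap}

object ObjectGraphPrinter {
  // Walk the object graph rooted at `root`, printing the field path to each
  // reference encountered. Depth-limited and cycle-safe via an identity set.
  def walk(root: AnyRef, maxDepth: Int = 5): Unit = {
    val seen = new JIdentityMap[AnyRef, AnyRef]()
    def visit(obj: AnyRef, path: String, depth: Int): Unit = {
      if (obj == null || depth > maxDepth || seen.containsKey(obj)) return
      seen.put(obj, obj)
      println(s"$path -> ${obj.getClass.getName}")
      var cls: Class[_] = obj.getClass
      while (cls != null) {
        for (f <- cls.getDeclaredFields
             if !Modifier.isStatic(f.getModifiers) && !f.getType.isPrimitive) {
          f.setAccessible(true)
          visit(f.get(obj), s"$path.${f.getName}", depth + 1)
        }
        cls = cls.getSuperclass
      }
    }
    visit(root, "root", 0)
  }
}
{code}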
[jira] [Updated] (SPARK-1190) Do not initialize log4j if slf4j log4j backend is not being used
[ https://issues.apache.org/jira/browse/SPARK-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1190: --- Assignee: Patrick Wendell (was: Patrick Cogan) Do not initialize log4j if slf4j log4j backend is not being used Key: SPARK-1190 URL: https://issues.apache.org/jira/browse/SPARK-1190 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 0.9.1, 1.0.0 I already have a patch here just need to test it and commit. IIRC there were some issues with the maven build. https://github.com/apache/incubator-spark/pull/573 https://github.com/pwendell/incubator-spark/commit/66594e88e5be50fca073a7ef38fa62db4082b3c8 initialization with: java.lang.StackOverflowError at java.lang.ThreadLocal.access$400(ThreadLocal.java:72) at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:376) at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:261) at java.lang.ThreadLocal.get(ThreadLocal.java:146) at java.lang.StringCoding.deref(StringCoding.java:63) at java.lang.StringCoding.encode(StringCoding.java:330) at java.lang.String.getBytes(String.java:916) at java.io.UnixFileSystem.getBooleanAttributes0(Native Method) at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242) at java.io.File.exists(File.java:813) at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1080) at sun.misc.URLClassPath$FileLoader.findResource(URLClassPath.java:1047) at sun.misc.URLClassPath.findResource(URLClassPath.java:176) at java.net.URLClassLoader$2.run(URLClassLoader.java:551) at java.net.URLClassLoader$2.run(URLClassLoader.java:549) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findResource(URLClassLoader.java:548) at java.lang.ClassLoader.getResource(ClassLoader.java:1147) at org.apache.spark.Logging$class.initializeLogging(Logging.scala:109) at org.apache.spark.Logging$class.initializeIfNecessary(Logging.scala:97) at org.apache.spark.Logging$class.log(Logging.scala:36) at org.apache.spark.util.Utils$.log(Utils.scala:47) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
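The gist of the linked patch, as I read it, is to check which backend SLF4J is actually bound to before touching any log4j API. A rough illustration of such a check is below; the exact condition used in the patch may differ.
{code}
import org.slf4j.impl.StaticLoggerBinder

object Slf4jBackendCheck {
  // True only when SLF4J is really backed by log4j, in which case it is safe
  // to apply log4j default configuration; otherwise skip log4j entirely.
  def usingLog4j: Boolean = {
    val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
    binderClass.contains("Log4jLoggerFactory")
  }
}
{code}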
[jira] [Updated] (SPARK-3805) Enable Standalone worker cleanup by default
[ https://issues.apache.org/jira/browse/SPARK-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3805: --- Component/s: Deploy Enable Standalone worker cleanup by default --- Key: SPARK-3805 URL: https://issues.apache.org/jira/browse/SPARK-3805 Project: Spark Issue Type: Task Components: Deploy Reporter: Andrew Ash Now that SPARK-1860 is fixed we should be able to turn on {{spark.worker.cleanup.enabled}} by default -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3174: --- Component/s: YARN Spark Core Provide elastic scaling within a Spark application -- Key: SPARK-3174 URL: https://issues.apache.org/jira/browse/SPARK-3174 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.0.2 Reporter: Sandy Ryza Assignee: Andrew Or Attachments: SPARK-3174design.pdf A common complaint with Spark in a multi-tenant environment is that applications have a fixed allocation that doesn't grow and shrink with their resource needs. We're blocked on YARN-1197 for dynamically changing the resources within executors, but we can still allocate and discard whole executors. I think it would be useful to have some heuristics that * Request more executors when many pending tasks are building up * Request more executors when RDDs can't fit in memory * Discard executors when few tasks are running / pending and there's not much in memory Bonus points: migrate blocks from executors we're about to discard to executors with free space. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3765) Add test information to sbt build docs
[ https://issues.apache.org/jira/browse/SPARK-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3765: --- Summary: Add test information to sbt build docs (was: add testing with sbt to doc) Add test information to sbt build docs -- Key: SPARK-3765 URL: https://issues.apache.org/jira/browse/SPARK-3765 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.1.0 Reporter: wangfei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3034) [Hive] java.sql.Date cannot be cast to java.sql.Timestamp
[ https://issues.apache.org/jira/browse/SPARK-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159998#comment-14159998 ] Venkata Ramana G commented on SPARK-3034: - Requires adding another data type DateType Modifications required in parser, datatype addition and DataType conversion to and from TimeStamp and String. Compatibility with Date supported in Hive 0.12.0. Date UDFs compatibility. Started working on the same. [HIve] java.sql.Date cannot be cast to java.sql.Timestamp - Key: SPARK-3034 URL: https://issues.apache.org/jira/browse/SPARK-3034 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.0.2 Reporter: pengyanhong Priority: Blocker run a simple HiveQL via yarn-cluster, got error as below: {quote} Exception in thread Thread-2 java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:199) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:127 failed 3 times, most recent failure: Exception failure in TID 141 on host A01-R06-I147-41.jd.local: java.lang.ClassCastException: java.sql.Date cannot be cast to java.sql.Timestamp org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaTimestampObjectInspector.getPrimitiveWritableObject(JavaTimestampObjectInspector.java:33) org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:251) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:486) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:439) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:423) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:200) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:192) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:662) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at
[jira] [Commented] (SPARK-3731) RDD caching stops working in pyspark after some time
[ https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160002#comment-14160002 ] Patrick Wendell commented on SPARK-3731: [~davies] any chance you can take a look at this? RDD caching stops working in pyspark after some time Key: SPARK-3731 URL: https://issues.apache.org/jira/browse/SPARK-3731 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Environment: Linux, 32bit, both in local mode and in standalone cluster mode Reporter: Milan Straka Assignee: Davies Liu Attachments: spark-3731.log, spark-3731.py, spark-3731.txt.bz2, worker.log Consider a file F which, when loaded with sc.textFile and cached, takes up slightly more than half of the free memory for the RDD cache. When in PySpark the following is executed: 1) a = sc.textFile(F) 2) a.cache().count() 3) b = sc.textFile(F) 4) b.cache().count() and then the following is repeated (for example 10 times): a) a.unpersist().cache().count() b) b.unpersist().cache().count() after some time, there are no RDDs cached in memory. Also, since that time, no other RDD ever gets cached (the worker always reports something like "WARN CacheManager: Not enough space to cache partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even if rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that all executors have 0MB memory used (which is consistent with the CacheManager warning). When doing the same in Scala, everything works perfectly. I understand that this is a vague description, but I do not know how to describe the problem better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3731) RDD caching stops working in pyspark after some time
[ https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3731: --- Priority: Critical (was: Major) RDD caching stops working in pyspark after some time Key: SPARK-3731 URL: https://issues.apache.org/jira/browse/SPARK-3731 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Environment: Linux, 32bit, both in local mode or in standalone cluster mode Reporter: Milan Straka Assignee: Davies Liu Priority: Critical Attachments: spark-3731.log, spark-3731.py, spark-3731.txt.bz2, worker.log Consider a file F which when loaded with sc.textFile and cached takes up slightly more than half of free memory for RDD cache. When in PySpark the following is executed: 1) a = sc.textFile(F) 2) a.cache().count() 3) b = sc.textFile(F) 4) b.cache().count() and then the following is repeated (for example 10 times): a) a.unpersist().cache().count() b) b.unpersist().cache().count() after some time, there are no RDD cached in memory. Also, since that time, no other RDD ever gets cached (the worker always reports something like WARN CacheManager: Not enough space to cache partition rdd_23_5 in memory! Free memory is 277478190 bytes., even if rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that all executors have 0MB memory used (which is consistent with the CacheManager warning). When doing the same in scala, everything works perfectly. I understand that this is a vague description, but I do no know how to describe the problem better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3731) RDD caching stops working in pyspark after some time
[ https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3731: --- Target Version/s: 1.1.1, 1.2.0 RDD caching stops working in pyspark after some time Key: SPARK-3731 URL: https://issues.apache.org/jira/browse/SPARK-3731 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Environment: Linux, 32bit, both in local mode or in standalone cluster mode Reporter: Milan Straka Assignee: Davies Liu Attachments: spark-3731.log, spark-3731.py, spark-3731.txt.bz2, worker.log Consider a file F which when loaded with sc.textFile and cached takes up slightly more than half of free memory for RDD cache. When in PySpark the following is executed: 1) a = sc.textFile(F) 2) a.cache().count() 3) b = sc.textFile(F) 4) b.cache().count() and then the following is repeated (for example 10 times): a) a.unpersist().cache().count() b) b.unpersist().cache().count() after some time, there are no RDD cached in memory. Also, since that time, no other RDD ever gets cached (the worker always reports something like WARN CacheManager: Not enough space to cache partition rdd_23_5 in memory! Free memory is 277478190 bytes., even if rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that all executors have 0MB memory used (which is consistent with the CacheManager warning). When doing the same in scala, everything works perfectly. I understand that this is a vague description, but I do no know how to describe the problem better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3731) RDD caching stops working in pyspark after some time
[ https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3731: --- Assignee: Davies Liu RDD caching stops working in pyspark after some time Key: SPARK-3731 URL: https://issues.apache.org/jira/browse/SPARK-3731 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Environment: Linux, 32bit, both in local mode or in standalone cluster mode Reporter: Milan Straka Assignee: Davies Liu Attachments: spark-3731.log, spark-3731.py, spark-3731.txt.bz2, worker.log Consider a file F which when loaded with sc.textFile and cached takes up slightly more than half of free memory for RDD cache. When in PySpark the following is executed: 1) a = sc.textFile(F) 2) a.cache().count() 3) b = sc.textFile(F) 4) b.cache().count() and then the following is repeated (for example 10 times): a) a.unpersist().cache().count() b) b.unpersist().cache().count() after some time, there are no RDD cached in memory. Also, since that time, no other RDD ever gets cached (the worker always reports something like WARN CacheManager: Not enough space to cache partition rdd_23_5 in memory! Free memory is 277478190 bytes., even if rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that all executors have 0MB memory used (which is consistent with the CacheManager warning). When doing the same in scala, everything works perfectly. I understand that this is a vague description, but I do no know how to describe the problem better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3742) Link to Spark UI sometimes fails when using H/A RM's
[ https://issues.apache.org/jira/browse/SPARK-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3742: --- Summary: Link to Spark UI sometimes fails when using H/A RM's (was: SparkUI page unable to open) Link to Spark UI sometimes fails when using H/A RM's Key: SPARK-3742 URL: https://issues.apache.org/jira/browse/SPARK-3742 Project: Spark Issue Type: Bug Components: YARN Reporter: meiyoula When running an application on yarn, the hyperlink on yarn page can't jump to sparkUI page. It happens sometimes. The error message is: This is standby RM. Redirecting to the current active RM: http://vm-181:8088/proxy/application_1409206382122_0037 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3732: --- Component/s: YARN Yarn Client: Add option to NOT System.exit() at end of main() - Key: SPARK-3732 URL: https://issues.apache.org/jira/browse/SPARK-3732 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.1.0 Reporter: Sotos Matzanas Original Estimate: 1h Remaining Estimate: 1h We would like to add the ability to create and submit Spark jobs programmatically via Scala/Java. We have found a way to hack this and submit jobs via Yarn, but since org.apache.spark.deploy.yarn.Client.main() exits with either 0 or 1 in the end, this will mean exit of our own program. We would like to add an optional spark conf param to NOT exit at the end of the main -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
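A sketch of the kind of opt-out being requested follows; note that the configuration key used here, spark.yarn.client.exitOnCompletion, is made up purely for illustration and is not an existing Spark setting.
{code}
import org.apache.spark.SparkConf

object YarnClientExitFlag {
  // Hypothetical: only call System.exit when the (made-up) flag is left at its default.
  def maybeExit(conf: SparkConf, exitCode: Int): Unit = {
    if (conf.getBoolean("spark.yarn.client.exitOnCompletion", true)) {
      System.exit(exitCode)
    }
  }
}
{code}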
[jira] [Updated] (SPARK-3742) SparkUI page unable to open
[ https://issues.apache.org/jira/browse/SPARK-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3742: --- Component/s: (was: Spark Core) YARN SparkUI page unable to open --- Key: SPARK-3742 URL: https://issues.apache.org/jira/browse/SPARK-3742 Project: Spark Issue Type: Bug Components: YARN Reporter: meiyoula When running an application on yarn, the hyperlink on yarn page can't jump to sparkUI page. It happens sometimes. The error message is: This is standby RM. Redirecting to the current active RM: http://vm-181:8088/proxy/application_1409206382122_0037 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2885) All-pairs similarity via DIMSUM
[ https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2885: --- Component/s: MLlib All-pairs similarity via DIMSUM --- Key: SPARK-2885 URL: https://issues.apache.org/jira/browse/SPARK-2885 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Reza Zadeh Assignee: Reza Zadeh Fix For: 1.2.0 Attachments: SimilarItemsSmallTest.java Build all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vector data, the all-pairs similarity problem is to find all similar vector pairs according to a similarity function such as cosine similarity, and a given similarity score threshold. Sometimes, this problem is called a “similarity join”. The brute force approach of considering all pairs quickly breaks, since it scales quadratically. For example, for a million vectors, it is not feasible to check all roughly trillion pairs to see if they are above the similarity threshold. Having said that, there exist clever sampling techniques to focus the computational effort on those pairs that are above the similarity threshold, which makes the problem feasible. DIMSUM has a single parameter (called gamma) to tradeoff computation time vs accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of computation vs accuracy from low computation to high accuracy. For a very large gamma, all cosine similarities are computed exactly with no sampling. Current PR: https://github.com/apache/spark/pull/1778 Justification for adding to MLlib: - All-pairs similarity is missing from MLlib and has been requested several times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see https://github.com/apache/spark/pull/1778#issuecomment-51300825) - Algorithm is used in large-scale production at Twitter. e.g. see https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco . Twitter also open-sourced their version in scalding: https://github.com/twitter/scalding/pull/833 - When used with the gamma parameter set high, this algorithm becomes the normalized gramian matrix, which is useful in RowMatrix alongside the computeGramianMatrix method already in RowMatrix More details about usage at Twitter: https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum For correctness proof, see Theorem 4.3 in http://stanford.edu/~rezab/papers/dimsum.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
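A hedged usage sketch, assuming the API from the linked PR is exposed as RowMatrix.columnSimilarities(threshold) (method name per the PR; verify against the version you actually run). The threshold is the gamma-style knob described above, trading computation for accuracy; calling it without a threshold would compute all pairs exactly.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object DimsumExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("dimsum"))
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 2.0),
      Vectors.dense(0.0, 3.0, 4.0),
      Vectors.dense(5.0, 6.0, 0.0)))
    val mat = new RowMatrix(rows)
    val exact  = mat.columnSimilarities()       // brute force, all column pairs
    val approx = mat.columnSimilarities(0.1)    // DIMSUM sampling with threshold 0.1
    approx.entries.collect().foreach(println)   // (i, j, approximate cosine similarity)
    sc.stop()
  }
}
{code}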
[jira] [Updated] (SPARK-2331) SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2331: --- Summary: SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T] (was: SparkContext.emptyRDD has wrong return type) SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T] -- Key: SPARK-2331 URL: https://issues.apache.org/jira/browse/SPARK-2331 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Ian Hummel The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder) val rdds = Seq(a, b, c).foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒ rdd.union(sc.textFile(path)) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
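A minimal, self-contained illustration of the annotation the ticket complains about, assuming Spark 1.0.x where emptyRDD returns EmptyRDD[T]; the input paths are hypothetical placeholders.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object EmptyRddExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("empty-rdd"))
    val paths = Seq("a.txt", "b.txt", "c.txt")   // hypothetical input paths

    // Without the explicit [RDD[String]] annotation, foldLeft infers
    // EmptyRDD[String] as the accumulator type and the union does not type-check.
    val union = paths.foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) =>
      rdd.union(sc.textFile(path))
    }
    println(union.count())
    sc.stop()
  }
}
{code}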
[jira] [Updated] (SPARK-3782) Direct use of log4j in AkkaUtils interferes with certain logging configurations
[ https://issues.apache.org/jira/browse/SPARK-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3782: --- Component/s: Spark Core Direct use of log4j in AkkaUtils interferes with certain logging configurations Key: SPARK-3782 URL: https://issues.apache.org/jira/browse/SPARK-3782 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Martin Gilday AkkaUtils is calling setLevel on Logger from log4j. This causes issues when using another implementation of SLF4J such as logback as log4j-over-slf4j.jars implementation of this class does not contain this method on Logger. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag
[ https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3694: --- Component/s: Spark Core Allow printing object graph of tasks/RDD's with a debug flag Key: SPARK-3694 URL: https://issues.apache.org/jira/browse/SPARK-3694 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Labels: starter This would be useful for debugging extra references inside of RDD's Here is an example for inspiration: http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html We'd want to print this trace for both the RDD serialization inside of the DAGScheduler and the task serialization in the TaskSetManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2331) SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2331: --- Component/s: Spark Core SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T] -- Key: SPARK-2331 URL: https://issues.apache.org/jira/browse/SPARK-2331 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Ian Hummel The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder) val rdds = Seq(a, b, c).foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒ rdd.union(sc.textFile(path)) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag
[ https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3694: --- Issue Type: Improvement (was: Bug) Allow printing object graph of tasks/RDD's with a debug flag Key: SPARK-3694 URL: https://issues.apache.org/jira/browse/SPARK-3694 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Labels: starter This would be useful for debugging extra references inside of RDD's Here is an example for inspiration: http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html We'd want to print this trace for both the RDD serialization inside of the DAGScheduler and the task serialization in the TaskSetManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3568) Add metrics for ranking algorithms
[ https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160022#comment-14160022 ] Apache Spark commented on SPARK-3568: - User 'coderxiang' has created a pull request for this issue: https://github.com/apache/spark/pull/2667 Add metrics for ranking algorithms -- Key: SPARK-3568 URL: https://issues.apache.org/jira/browse/SPARK-3568 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Shuo Xiang Assignee: Shuo Xiang Include common metrics for ranking algorithms (http://www-nlp.stanford.edu/IR-book/), including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[Array[Double], Array[Double]]*. The following methods will be implemented: {code:title=RankingMetrics.scala|borderStyle=solid} class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) { /* Returns the precsion@k for each query */ lazy val precAtK: RDD[Array[Double]] /* Returns the average precision for each query */ lazy val avePrec: RDD[Double] /*Returns the mean average precision (MAP) of all the queries*/ lazy val meanAvePrec: Double /*Returns the normalized discounted cumulative gain for each query */ lazy val ndcg: RDD[Double] /* Returns the mean NDCG of all the queries */ lazy val meanNdcg: Double } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
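A usage sketch against the proposed class above, purely to show the intended call pattern; the class only exists in the linked pull request (#2667) and member names may change before merging.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
// Proposed location per the description above; not yet in a released Spark.
import org.apache.spark.mllib.evaluation.RankingMetrics

object RankingMetricsUsage {
  def rankingExample(sc: SparkContext): Unit = {
    // Each pair is (predicted ranking of document ids, relevant document ids).
    val predictionAndLabels: RDD[(Array[Double], Array[Double])] = sc.parallelize(Seq(
      (Array(1.0, 6.0, 2.0, 7.0, 8.0), Array(1.0, 2.0, 3.0, 4.0, 5.0)),
      (Array(4.0, 1.0, 5.0, 6.0, 2.0), Array(4.0, 5.0, 6.0))))
    val metrics = new RankingMetrics(predictionAndLabels)
    println(metrics.meanAvePrec)   // mean average precision over all queries
    println(metrics.meanNdcg)      // mean NDCG over all queries
  }
}
{code}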
[jira] [Issue Comment Deleted] (SPARK-3034) [Hive] java.sql.Date cannot be cast to java.sql.Timestamp
[ https://issues.apache.org/jira/browse/SPARK-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Venkata Ramana G updated SPARK-3034: Comment: was deleted (was: Requires adding another data type DateType Modifications required in parser, datatype addition and DataType conversion to and from TimeStamp and String. Compatibility with Date supported in Hive 0.12.0. Date UDFs compatibility. Started working on the same.) [HIve] java.sql.Date cannot be cast to java.sql.Timestamp - Key: SPARK-3034 URL: https://issues.apache.org/jira/browse/SPARK-3034 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.0.2 Reporter: pengyanhong Priority: Blocker run a simple HiveQL via yarn-cluster, got error as below: {quote} Exception in thread Thread-2 java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:199) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:127 failed 3 times, most recent failure: Exception failure in TID 141 on host A01-R06-I147-41.jd.local: java.lang.ClassCastException: java.sql.Date cannot be cast to java.sql.Timestamp org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaTimestampObjectInspector.getPrimitiveWritableObject(JavaTimestampObjectInspector.java:33) org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:251) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:486) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:439) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:423) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:200) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:192) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:662) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)