[jira] [Commented] (SPARK-3169) make-distribution.sh failed
[ https://issues.apache.org/jira/browse/SPARK-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105277#comment-14105277 ] Sean Owen commented on SPARK-3169: -- Same as https://issues.apache.org/jira/browse/SPARK-2798 ? it's resolving similar problems in the Flume build. make-distribution.sh failed --- Key: SPARK-3169 URL: https://issues.apache.org/jira/browse/SPARK-3169 Project: Spark Issue Type: Bug Components: Build Reporter: Guoqiang Li Priority: Blocker {code}./make-distribution.sh -Pyarn -Phadoop-2.3 -Phive-thriftserver -Phive -Dhadoop.version=2.3.0 {code} = {noformat} java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289) at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229) at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415) at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356) Caused by: scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature in TestSuiteBase.class refers to term dstream in package org.apache.spark.streaming which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling TestSuiteBase.class. {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1449) Please delete old releases from mirroring system
[ https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105303#comment-14105303 ] Sean Owen commented on SPARK-1449: -- [~pwendell] can you or someone else on the PMC zap this one? should be straightforward. Please delete old releases from mirroring system Key: SPARK-1449 URL: https://issues.apache.org/jira/browse/SPARK-1449 Project: Spark Issue Type: Task Affects Versions: 0.8.0, 0.8.1, 0.9.0, 0.9.1 Reporter: Sebb To reduce the load on the ASF mirrors, projects are required to delete old releases [1] Please can you remove all non-current releases? Thanks! [Note that older releases are always available from the ASF archive server] Any links to older releases on download pages should first be adjusted to point to the archive server. [1] http://www.apache.org/dev/release.html#when-to-archive -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3202) Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD
[ https://issues.apache.org/jira/browse/SPARK-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109128#comment-14109128 ] Sean Owen commented on SPARK-3202: -- JIRA is not a good place to ask questions -- please use u...@spark.apache.org. This is for reporting issues, so I'd recommend closing this. Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD - Key: SPARK-3202 URL: https://issues.apache.org/jira/browse/SPARK-3202 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Hingorani, Vineet Hello all, Could someone help me with the manipulation of CSV file data. I have 'semicolon'-separated CSV data including doubles and strings. I want to calculate the maximum/average of a column. When I read the file using sc.textFile("test.csv").map(_.split(";")), each field is read as a string. Could someone help me with the above manipulation and how to do that. Or maybe if there is some way to take the transpose of the data and then manipulate the rows in some way? Thank you in advance, I have been struggling with this for quite some time. Regards, Vineet -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
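For reference, the kind of column computation being asked about can be sketched roughly as follows (a minimal sketch, assuming a semicolon-separated file named test.csv and, hypothetically, that the numeric column of interest is the third one; the column index is made up for illustration):
{code}
val rows = sc.textFile("test.csv").map(_.split(";"))
// Parse the (hypothetical) third column as Double
val values = rows.map(fields => fields(2).toDouble)
val maxValue = values.max()   // maximum of the column
val avgValue = values.mean()  // average of the column, via DoubleRDDFunctions
{code}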
[jira] [Commented] (SPARK-2798) Correct several small errors in Flume module pom.xml files
[ https://issues.apache.org/jira/browse/SPARK-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109675#comment-14109675 ] Sean Owen commented on SPARK-2798: -- [~tdas] Cool, I think this closes SPARK-3169 too, if I understand correctly. Correct several small errors in Flume module pom.xml files -- Key: SPARK-2798 URL: https://issues.apache.org/jira/browse/SPARK-2798 Project: Spark Issue Type: Bug Components: Build Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.1.0 (EDIT) Since the scalatest issue was resolved, this is now about a few small problems in the Flume Sink pom.xml - scalatest is not declared as a test-scope dependency - Its Avro version doesn't match the rest of the build - Its Flume version is not synced with the other Flume module - The other Flume module declares its dependency on Flume Sink slightly incorrectly, hard-coding the Scala 2.10 version - It depends on Scala Lang directly, which it shouldn't -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110463#comment-14110463 ] Sean Owen commented on SPARK-3098: -- [~matei] The question isn't whether distinct returns a particular ordering, or whether zipWithIndex assigns particular indices, but whether they would result in the same ordering and same assignments every time the RDD is evaluated: {code} val c = {...}.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2) {code} If so, then the same values should map to the same indices, and the self-join of c to itself should always pair the same value with itself. Regardless of what those un-guaranteed values are, they should be the same since it's the very same RDD. If not, obviously that explains the behavior then. The behavior at first glance had also surprised me, since I had taken RDDs to be deterministic and transparently recomputable on demand. That is the important first question -- is that supposed to be so or not? In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical The code to reproduce it: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
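As a side note, one way to take recomputation out of the picture while investigating is to persist the RDD before reusing it (a sketch only, not a fix for the underlying semantics; caching is best-effort and evicted partitions may still be recomputed):
{code}
val c = sc.parallelize(1 to 7899).flatMap { i =>
  (1 to 1).toSeq.map(p => i * 6000 + p)
}.distinct().zipWithIndex().cache()
c.count()  // force evaluation once so both sides of the join reuse the same materialized partitions
c.join(c).filter(t => t._2._1 != t._2._2).take(3)  // expected to be empty if c is not recomputed
{code}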
[jira] [Commented] (SPARK-3276) Provide a API to specify whether the old files need to be ignored in file input text DStream
[ https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113704#comment-14113704 ] Sean Owen commented on SPARK-3276: -- Given the nature of a stream processing framework, when would you want to keep reprocessing all old data? That is something you can do, but it doesn't require Spark Streaming. Provide a API to specify whether the old files need to be ignored in file input text DStream Key: SPARK-3276 URL: https://issues.apache.org/jira/browse/SPARK-3276 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.2 Reporter: Jack Hu Currently, there is only one API, textFileStream in StreamingContext, to create a text file DStream, and it always ignores old files. Sometimes, the old files are still useful. We need an API to let the user choose whether the old files should be ignored or not. The API currently in StreamingContext: def textFileStream(directory: String): DStream[String] = { fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString) } -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3274) Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream
[ https://issues.apache.org/jira/browse/SPARK-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113702#comment-14113702 ] Sean Owen commented on SPARK-3274: -- Same as the problem and solution in https://issues.apache.org/jira/browse/SPARK-1040 Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream -- Key: SPARK-3274 URL: https://issues.apache.org/jira/browse/SPARK-3274 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2 Reporter: Jack Hu Reproduce code: scontext.socketTextStream("localhost", 1) .mapToPair(new PairFunction<String, String, String>(){ public Tuple2<String, String> call(String arg0) throws Exception { return new Tuple2<String, String>("1", arg0); } }) .foreachRDD(new Function2<JavaPairRDD<String, String>, Time, Void>() { public Void call(JavaPairRDD<String, String> v1, Time v2) throws Exception { System.out.println(v2.toString() + ": " + v1.collectAsMap().toString()); return null; } }); Exception: java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lscala.Tuple2; at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:447) at org.apache.spark.api.java.JavaPairRDD.collectAsMap(JavaPairRDD.scala:464) at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:90) at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:88) at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282) at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobS -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()
[ https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113710#comment-14113710 ] Sean Owen commented on SPARK-3266: -- The method is declared in the superclass, JavaRDDLike: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala#L538 You are running a different version of Spark than you are compiling with, and the runtime version is perhaps too old to contain this method. This is not a Spark issue. JavaDoubleRDD doesn't contain max() --- Key: SPARK-3266 URL: https://issues.apache.org/jira/browse/SPARK-3266 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.1 Reporter: Amey Chaugule While I can compile my code, I see: Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I don't notice max() although it is clearly listed in the documentation. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()
[ https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114391#comment-14114391 ] Sean Owen commented on SPARK-3266: -- (Mea culpa! The example shows this is a legitimate question. I'll be quiet now.) JavaDoubleRDD doesn't contain max() --- Key: SPARK-3266 URL: https://issues.apache.org/jira/browse/SPARK-3266 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.1, 1.0.2 Reporter: Amey Chaugule Assignee: Josh Rosen Attachments: spark-repro-3266.tar.gz While I can compile my code, I see: Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I don't notice max() although it is clearly listed in the documentation. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3292) Shuffle Tasks run indefinitely even though there's no inputs
[ https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114829#comment-14114829 ] Sean Owen commented on SPARK-3292: -- Can you elaborate on this? It's not clear whether you're reporting that the process hangs, runs slowly, or creates too many files. Shuffle Tasks run indefinitely even though there's no inputs Key: SPARK-3292 URL: https://issues.apache.org/jira/browse/SPARK-3292 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2 Reporter: guowei Operations such as repartition, groupBy, join and cogroup are too expensive; for example, if I want to save the outputs as Hadoop files, then many empty files are generated. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3302) The wrong version information in SparkContext
[ https://issues.apache.org/jira/browse/SPARK-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14115126#comment-14115126 ] Sean Owen commented on SPARK-3302: -- This duplicates at least one of https://issues.apache.org/jira/browse/SPARK-2697 or https://issues.apache.org/jira/browse/SPARK-3273 The wrong version information in SparkContext - Key: SPARK-3302 URL: https://issues.apache.org/jira/browse/SPARK-3302 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Guoqiang Li Assignee: Guoqiang Li Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3324) YARN module has nonstandard structure which cause compile error In IntelliJ
[ https://issues.apache.org/jira/browse/SPARK-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116278#comment-14116278 ] Sean Owen commented on SPARK-3324: -- I agree, I've had a similar problem and just resolved it manually. I imagine the answer is, soon we're going to delete alpha anyway and then this is moot. YARN module has nonstandard structure which cause compile error In IntelliJ --- Key: SPARK-3324 URL: https://issues.apache.org/jira/browse/SPARK-3324 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Environment: Mac OS: 10.9.4 IntelliJ IDEA: 13.1.4 Scala Plugins: 0.41.2 Maven: 3.0.5 Reporter: Yi Tian Priority: Minor Labels: intellij, maven, yarn The YARN module has a nonstandard path structure like: {code} ${SPARK_HOME} |--yarn |--alpha (contains yarn api support for 0.23 and 2.0.x) |--stable (contains yarn api support for 2.2 and later) | |--pom.xml (spark-yarn) |--common (Common codes not depending on specific version of Hadoop) |--pom.xml (yarn-parent) {code} When we use Maven to compile the yarn module, Maven will import the 'alpha' or 'stable' module according to the profile setting. And a submodule like 'stable' uses the build properties defined in yarn/pom.xml to import the common code into its sourcePath. As a result, IntelliJ can't directly recognize the code in the common directory as a sourcePath. I think we should change the yarn module to a unified Maven jar project, and choose the version of the YARN API via a Maven profile setting in the pom.xml. That would resolve the compile error in IntelliJ and make the yarn module simpler and clearer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3326) can't access a static variable after init in mapper
[ https://issues.apache.org/jira/browse/SPARK-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116327#comment-14116327 ] Sean Owen commented on SPARK-3326: -- The call to Foo.getSome() occurs remotely, on a different JVM with a different copy of your class. You may initialize your instance in the driver, but this leaves it uninitialized in the remote workers. You can initialize this in a static block. Or you can simply reference the value of Foo.getSome() directly in your map function and then it is serialized in the closure. All that you send right now is a function that depends on what Foo.getSome() returns when it's called, not what it happens to return on the driver. Consider broadcast variables if it's large. If that's what's going on then this is normal behavior. can't access a static variable after init in mapper --- Key: SPARK-3326 URL: https://issues.apache.org/jira/browse/SPARK-3326 Project: Spark Issue Type: Bug Environment: CDH5.1.0 Spark1.0.0 Reporter: Gavin Zhang I wrote an object like: object Foo { private var bar: Bar = null; def init(bar: Bar) { this.bar = bar }; def getSome() = bar.someDef() } In the Spark main def, I read some text from HDFS and init this object, and then call getSome(). I was successful with this code: sc.textFile(args(0)).take(10).map(println(Foo.getSome())) However, when I changed it to write output to HDFS, I found the bar variable in the Foo object is null: sc.textFile(args(0)).map(line => Foo.getSome()).saveAsTextFile(args(1)) WHY? -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
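To make the second suggestion concrete, here is a sketch of capturing the value on the driver instead of calling the uninitialized singleton on the workers (an illustration only, assuming getSome() returns a serializable value):
{code}
// On the driver, after Foo.init(bar) has run:
val some = Foo.getSome()   // evaluated locally, while Foo is initialized
sc.textFile(args(0)).map(line => some).saveAsTextFile(args(1))

// Or, if the value is large, ship it once per executor as a broadcast variable:
val someBc = sc.broadcast(Foo.getSome())
sc.textFile(args(0)).map(line => someBc.value).saveAsTextFile(args(1))
{code}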
[jira] [Commented] (SPARK-3324) YARN module has nonstandard structure which cause compile error In IntelliJ
[ https://issues.apache.org/jira/browse/SPARK-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116667#comment-14116667 ] Sean Owen commented on SPARK-3324: -- Let me try to sketch what's funky about the structure. We have yarn/alpha, yarn/common, yarn/stable. Understanding the purpose, I would expect each to be a module, and that each has a src/ directory, and that alpha and stable depend on common, and the Spark parent activates either yarn/alpha or yarn/stable depending on profiles. IntelliJ is fine with that. However, what we have is that yarn/ is a module. But its source is in yarn/common. But it's a pom-only module. And yarn/alpha and yarn/stable list it as the parent and inherit all of their source directory info and dependencies from yarn/, which is not itself a module of code. So each compiles two source directories defined in different places. This plus profiles confused IntelliJ and required manual intervention. Maybe I'm overlooking a reason this had to be done, but rejiggering this as three simple modules should work. Again I imagine the question is, is it worth it versus removing yarn/alpha at some point in the future? Because it's trivial to fix how IntelliJ reads the POMs once by hand in the IDE. YARN module has nonstandard structure which cause compile error In IntelliJ --- Key: SPARK-3324 URL: https://issues.apache.org/jira/browse/SPARK-3324 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Environment: Mac OS: 10.9.4 IntelliJ IDEA: 13.1.4 Scala Plugins: 0.41.2 Maven: 3.0.5 Reporter: Yi Tian Priority: Minor Labels: intellij, maven, yarn The YARN module has a nonstandard path structure like: {code} ${SPARK_HOME} |--yarn |--alpha (contains yarn api support for 0.23 and 2.0.x) |--stable (contains yarn api support for 2.2 and later) | |--pom.xml (spark-yarn) |--common (Common codes not depending on specific version of Hadoop) |--pom.xml (yarn-parent) {code} When we use Maven to compile the yarn module, Maven will import the 'alpha' or 'stable' module according to the profile setting. And a submodule like 'stable' uses the build properties defined in yarn/pom.xml to import the common code into its sourcePath. As a result, IntelliJ can't directly recognize the sources in the common directory as a sourcePath. I think we should change the yarn module to a unified Maven jar project, and specify the version of the YARN API via a Maven profile setting. That would resolve the compile error in IntelliJ and make the yarn module simpler and clearer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3324) YARN module has nonstandard structure which cause compile error In IntelliJ
[ https://issues.apache.org/jira/browse/SPARK-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116801#comment-14116801 ] Sean Owen commented on SPARK-3324: -- [~tianyi] I seem to remember having a similar problem. I think that is straightforward to fix. It's a separate issue. But FWIW I would like to see that improved. YARN module has nonstandard structure which cause compile error In IntelliJ --- Key: SPARK-3324 URL: https://issues.apache.org/jira/browse/SPARK-3324 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Environment: Mac OS: 10.9.4 IntelliJ IDEA: 13.1.4 Scala Plugins: 0.41.2 Maven: 3.0.5 Reporter: Yi Tian Priority: Minor Labels: intellij, maven, yarn The YARN module has a nonstandard path structure like: {code} ${SPARK_HOME} |--yarn |--alpha (contains yarn api support for 0.23 and 2.0.x) |--stable (contains yarn api support for 2.2 and later) | |--pom.xml (spark-yarn) |--common (Common codes not depending on specific version of Hadoop) |--pom.xml (yarn-parent) {code} When we use Maven to compile the yarn module, Maven will import the 'alpha' or 'stable' module according to the profile setting. And a submodule like 'stable' uses the build properties defined in yarn/pom.xml to import the common code into its sourcePath. As a result, IntelliJ can't directly recognize the sources in the common directory as a sourcePath. I think we should change the yarn module to a unified Maven jar project, and specify the version of the YARN API via a Maven profile setting. That would resolve the compile error in IntelliJ and make the yarn module simpler and clearer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3330) Successive test runs with different profiles fail SparkSubmitSuite
Sean Owen created SPARK-3330: Summary: Successive test runs with different profiles fail SparkSubmitSuite Key: SPARK-3330 URL: https://issues.apache.org/jira/browse/SPARK-3330 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Maven-based Jenkins builds have been failing for a while: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/480/HADOOP_PROFILE=hadoop-2.4,label=centos/console One common cause is that on the second and subsequent runs of mvn clean test, at least two assembly JARs will exist in assembly/target. Because assembly is not a submodule of parent, mvn clean is not invoked for assembly. The presence of two assembly jars causes spark-submit to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3331) PEP8 tests fail in release because they check unzipped py4j code
Sean Owen created SPARK-3331: Summary: PEP8 tests fail in release because they check unzipped py4j code Key: SPARK-3331 URL: https://issues.apache.org/jira/browse/SPARK-3331 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Priority: Minor PEP8 tests run on files under ./python, but in the release packaging, py4j code is present in ./python/build/py4j. Py4J code fails style checks and thus release fails ./dev/run-tests now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3331) PEP8 tests fail because they check unzipped py4j code
[ https://issues.apache.org/jira/browse/SPARK-3331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3331: - Summary: PEP8 tests fail because they check unzipped py4j code (was: PEP8 tests fail in release because they check unzipped py4j code) PEP8 tests fail because they check unzipped py4j code - Key: SPARK-3331 URL: https://issues.apache.org/jira/browse/SPARK-3331 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Priority: Minor PEP8 tests run on files under ./python, but in the release packaging, py4j code is present in ./python/build/py4j. Py4J code fails style checks and thus release fails ./dev/run-tests now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3331) PEP8 tests fail because they check unzipped py4j code
[ https://issues.apache.org/jira/browse/SPARK-3331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3331: - Description: PEP8 tests run on files under ./python, but unzipped py4j code is found at ./python/build/py4j. Py4J code fails style checks and can fail ./dev/run-tests if this code is present locally. (was: PEP8 tests run on files under ./python, but in the release packaging, py4j code is present in ./python/build/py4j. Py4J code fails style checks and thus release fails ./dev/run-tests now.) PEP8 tests fail because they check unzipped py4j code - Key: SPARK-3331 URL: https://issues.apache.org/jira/browse/SPARK-3331 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Priority: Minor PEP8 tests run on files under ./python, but unzipped py4j code is found at ./python/build/py4j. Py4J code fails style checks and can fail ./dev/run-tests if this code is present locally. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3330) Successive test runs with different profiles fail SparkSubmitSuite
[ https://issues.apache.org/jira/browse/SPARK-3330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3330. -- Resolution: Won't Fix It would be more suitable to change Jenkins to run {{mvn clean}} followed by {{mvn ... package}} to address this, if anything. It's not yet clear this is the cause of the failure in the same test in Jenkins though. Successive test runs with different profiles fail SparkSubmitSuite -- Key: SPARK-3330 URL: https://issues.apache.org/jira/browse/SPARK-3330 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Maven-based Jenkins builds have been failing for a while: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/480/HADOOP_PROFILE=hadoop-2.4,label=centos/console One common cause is that on the second and subsequent runs of mvn clean test, at least two assembly JARs will exist in assembly/target. Because assembly is not a submodule of parent, mvn clean is not invoked for assembly. The presence of two assembly jars causes spark-submit to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
Sean Owen created SPARK-3369: Summary: Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator Key: SPARK-3369 URL: https://issues.apache.org/jira/browse/SPARK-3369 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.0.2 Reporter: Sean Owen {{mapPartitions}} in the Scala RDD API takes a function that transforms an {{Iterator}} to an {{Iterator}}: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency though because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself. Similarly for other {{mapPartitions*}} methods and other {{*FlatMapFunctions}}s in Java. (Is there a reason for this difference that I'm overlooking?) If I'm right that this was inadvertent inconsistency, then the big issue here is that of course this is part of a public API. Workarounds I can think of: Promise that Spark will only call {{iterator()}} once, so implementors can use a hacky {{IteratorIterable}} that returns the same {{Iterator}}. Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the desired signature, and deprecate existing ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-563) Run findBugs and IDEA inspections in the codebase
[ https://issues.apache.org/jira/browse/SPARK-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-563. - Resolution: Won't Fix This appears to be obsolete/stale too. Run findBugs and IDEA inspections in the codebase - Key: SPARK-563 URL: https://issues.apache.org/jira/browse/SPARK-563 Project: Spark Issue Type: Improvement Reporter: Ismael Juma I ran into a few instances of unused local variables and unnecessary usage of the 'return' keyword (the recommended practice is to avoid 'return' if possible) and thought it would be good to run findBugs and IDEA inspections to clean-up the code. I am willing to do this, but first would like to know whether you agree that this is a good idea and whether this is the right time to do it. These changes tend to affect many source files and can cause issues if there is major work ongoing in separate branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans
[ https://issues.apache.org/jira/browse/SPARK-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120314#comment-14120314 ] Sean Owen commented on SPARK-3384: -- [~rnowling] I think it is fairly important for speed in that section of code though. Using mutable data structures is not a problem if done correctly and for the right reason. Potential thread unsafe Breeze vector addition in KMeans Key: SPARK-3384 URL: https://issues.apache.org/jira/browse/SPARK-3384 Project: Spark Issue Type: Bug Components: MLlib Reporter: RJ Nowling In the KMeans clustering implementation, the Breeze vectors are accumulated using +=. For example, https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162 This is potentially a thread-unsafe operation. (This is what I observed in local testing.) I suggest changing the += to + -- a new object will be allocated but it will be thread safe since it won't write to an old location accessed by multiple threads. Further testing is required to reproduce and verify. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
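For illustration, one pattern where in-place += on mutable vectors is safe is when the accumulator is created inside the task itself, so it is never shared across threads (a simplified sketch of the general pattern, not the actual KMeans code; the dimension and input data here are made up for the example):
{code}
import breeze.linalg.{DenseVector => BDV}

val dim = 3
val data = sc.parallelize(Seq.fill(1000)(BDV.ones[Double](dim)))
val sum = data.mapPartitions { iter =>
  val acc = BDV.zeros[Double](dim)   // task-local accumulator
  iter.foreach(v => acc += v)        // in-place add, but acc never escapes this task
  Iterator(acc)
}.reduce(_ + _)                      // partial sums are combined without shared mutation
{code}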
[jira] [Resolved] (SPARK-640) Update Hadoop 1 version to 1.1.0 (especially on AMIs)
[ https://issues.apache.org/jira/browse/SPARK-640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-640. - Resolution: Fixed This looks stale right? Hadoop 1 version has been at 1.2.1 for some time. Update Hadoop 1 version to 1.1.0 (especially on AMIs) - Key: SPARK-640 URL: https://issues.apache.org/jira/browse/SPARK-640 Project: Spark Issue Type: New Feature Reporter: Matei Zaharia Hadoop 1.1.0 has a fix to the notorious trailing slash for directory objects in S3 issue: https://issues.apache.org/jira/browse/HADOOP-5836, so would be good to support on the AMIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-529) Have a single file that controls the environmental variables and spark config options
[ https://issues.apache.org/jira/browse/SPARK-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-529. - Resolution: Won't Fix This looks obsolete and/or fixed, as variables like SPARK_MEM are deprecated, and I suppose there is spark-env.sh too. Have a single file that controls the environmental variables and spark config options - Key: SPARK-529 URL: https://issues.apache.org/jira/browse/SPARK-529 Project: Spark Issue Type: Improvement Reporter: Reynold Xin E.g. multiple places in the code base uses SPARK_MEM and has its own default set to 512. We need a central place to enforce default values as well as documenting the variables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3404) SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1
Sean Owen created SPARK-3404: Summary: SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1 Key: SPARK-3404 URL: https://issues.apache.org/jira/browse/SPARK-3404 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Maven-based Jenkins builds have been failing for over a month. For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ It's SparkSubmitSuite that fails. For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull {code} SparkSubmitSuite ... - launch simple application with spark-submit *** FAILED *** org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... - spark submit includes jars passed in through --jar *** FAILED *** org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, local-cluster[2,1,512], --jars, file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... {code} SBT builds don't fail, so it is likely to be due to some difference in how the tests are run rather than a problem with test or core project. This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the cause identified in that JIRA is, at least, not the only cause. (Although, it wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins config to invoke {{mvn clean mvn ... package}} {{mvn ... clean package}}.) This JIRA tracks investigation into a different cause. Right now I have some further information but not a PR yet. Part of the issue is that there is no clue in the log about why {{spark-submit}} exited with status 1. 
See https://github.com/apache/spark/pull/2108/files and https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at least print stdout to the log too. The SparkSubmit program exits with 1 when the main class it is supposed to run is not found (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322) This is for example SimpleApplicationTest (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339) The test actually submits an empty JAR not containing this class. It relies on {{spark-submit}} finding the class within the compiled test-classes of the Spark project. However it does seem to be compiled and present even with Maven. If modified to print stdout and stderr, and dump the actual command, I see an empty stdout, and only the command to stderr: {code} Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_20.jdk/Contents/Home/bin/java -cp
[jira] [Commented] (SPARK-3404) SparkSubmitSuite fails with spark-submit exits with code 1
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121653#comment-14121653 ] Sean Owen commented on SPARK-3404: -- It's 100% repeatable in Maven for me locally, which seems to be Jenkins' experience too. I don't see the same problem with SBT (/dev/run-tests) locally, although I can't say I run that regularly. I could rewrite the SparkSubmitSuite to submit a JAR file that actually contains the class it's trying to invoke. Maybe that's smarter? the problem here seems to be the vagaries of what the run-time classpath is during an SBT vs Maven test. Would anyone second that? Separately it would probably not hurt to get in that change that logs stdout / stderr from the Utils method. SparkSubmitSuite fails with spark-submit exits with code 1 Key: SPARK-3404 URL: https://issues.apache.org/jira/browse/SPARK-3404 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2, 1.1.0 Reporter: Sean Owen Priority: Critical Maven-based Jenkins builds have been failing for over a month. For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ It's SparkSubmitSuite that fails. For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull {code} SparkSubmitSuite ... - launch simple application with spark-submit *** FAILED *** org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... 
- spark submit includes jars passed in through --jar *** FAILED *** org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, local-cluster[2,1,512], --jars, file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... {code} SBT builds don't fail, so it is likely to be due to some difference in how the tests are run rather than a problem with test or core project. This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the cause identified in that JIRA is, at least, not the only cause. (Although, it wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins config to invoke {{mvn clean mvn ... package}} {{mvn ... clean package}}.) This JIRA tracks investigation into a different cause. Right now I have some further information but not a PR yet. Part of the issue is that there is no clue in the log about why {{spark-submit}} exited with status 1. See https://github.com/apache/spark/pull/2108/files and https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at least print stdout to the log too. The SparkSubmit program exits with 1 when the main class it is supposed to run is not found
[jira] [Commented] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
[ https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122921#comment-14122921 ] Sean Owen commented on SPARK-3369: -- The API change is unlikely to happen. Making a bunch of flatMap2 methods is really ugly. I suppose you could try wrapping the Iterator in this: {code} public class IteratorIterable<T> implements Iterable<T> { private final Iterator<T> iterator; private boolean consumed; public IteratorIterable(Iterator<T> iterator) { this.iterator = iterator; } @Override public Iterator<T> iterator() { if (consumed) { throw new IllegalStateException("Iterator already consumed"); } consumed = true; return iterator; } } {code} If, as I suspect, Spark actually only calls iterator() once, this will work, and this may be the most tolerable workaround until Spark 2.x. If it doesn't work, and iterator() is called multiple times, this will fail fast and at least we'd know. Can you try something like this? Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator - Key: SPARK-3369 URL: https://issues.apache.org/jira/browse/SPARK-3369 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.0.2 Reporter: Sean Owen Attachments: FlatMapIterator.patch {{mapPartitions}} in the Scala RDD API takes a function that transforms an {{Iterator}} to an {{Iterator}}: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency though because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself. Similarly for other {{mapPartitions*}} methods and other {{*FlatMapFunctions}}s in Java. (Is there a reason for this difference that I'm overlooking?) If I'm right that this was inadvertent inconsistency, then the big issue here is that of course this is part of a public API. Workarounds I can think of: Promise that Spark will only call {{iterator()}} once, so implementors can use a hacky {{IteratorIterable}} that returns the same {{Iterator}}. Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the desired signature, and deprecate existing ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3442) Create LengthBoundedInputStream
[ https://issues.apache.org/jira/browse/SPARK-3442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14125948#comment-14125948 ] Sean Owen commented on SPARK-3442: -- This exists in Guava as LimitInputStream until Guava 14, and as ByteStreams.limit() from 14 onwards: https://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/io/LimitInputStream.java?spec=svn08d3526fc19293cf099e0c50dbf3bbc915f2e3f2r=08d3526fc19293cf099e0c50dbf3bbc915f2e3f2 http://docs.guava-libraries.googlecode.com/git-history/v14.0/javadoc/com/google/common/io/ByteStreams.html#limit(java.io.InputStream, long) It would be nice to reuse but you can see the version problem there. At least, maybe something that can be lifted and adapted. Create LengthBoundedInputStream --- Key: SPARK-3442 URL: https://issues.apache.org/jira/browse/SPARK-3442 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin To create a LengthBoundedInputStream, which is an InputStream decorator that limits the number of bytes returned from an underlying InputStream. This can be used to create an InputStream directly from a segment of a file (FileInputStream always reads till EOF). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
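The decorator itself is small enough to sketch directly if reusing Guava turns out to be awkward (a rough sketch of the idea only, not the Guava implementation or the eventual Spark class):
{code}
import java.io.{FilterInputStream, InputStream}

// Serves at most `limit` bytes from the underlying stream, then reports end-of-stream.
class LengthBoundedInputStream(in: InputStream, limit: Long) extends FilterInputStream(in) {
  private var remaining = limit

  override def read(): Int = {
    if (remaining <= 0) -1
    else {
      val b = in.read()
      if (b >= 0) remaining -= 1
      b
    }
  }

  override def read(buf: Array[Byte], off: Int, len: Int): Int = {
    if (remaining <= 0) -1
    else {
      val n = in.read(buf, off, math.min(len.toLong, remaining).toInt)
      if (n > 0) remaining -= n
      n
    }
  }
}
{code}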
[jira] [Resolved] (SPARK-3404) SparkSubmitSuite fails with spark-submit exits with code 1
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3404. -- Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Tests are now failing due to HiveQL test problems, but you can see they have passed SparkSubmitSuite: https://amplab.cs.berkeley.edu/jenkins/view/Spark/ I think this one's resolved now. SparkSubmitSuite fails with spark-submit exits with code 1 Key: SPARK-3404 URL: https://issues.apache.org/jira/browse/SPARK-3404 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2, 1.1.0 Reporter: Sean Owen Priority: Critical Fix For: 1.1.1, 1.2.0 Maven-based Jenkins builds have been failing for over a month. For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ It's SparkSubmitSuite that fails. For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull {code} SparkSubmitSuite ... - launch simple application with spark-submit *** FAILED *** org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... - spark submit includes jars passed in through --jar *** FAILED *** org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, local-cluster[2,1,512], --jars, file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... {code} SBT builds don't fail, so it is likely to be due to some difference in how the tests are run rather than a problem with test or core project. This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the cause identified in that JIRA is, at least, not the only cause. 
(Although, it wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins config to invoke {{mvn clean mvn ... package}} {{mvn ... clean package}}.) This JIRA tracks investigation into a different cause. Right now I have some further information but not a PR yet. Part of the issue is that there is no clue in the log about why {{spark-submit}} exited with status 1. See https://github.com/apache/spark/pull/2108/files and https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at least print stdout to the log too. The SparkSubmit program exits with 1 when the main class it is supposed to run is not found (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322) This is for example SimpleApplicationTest (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339) The test actually submits an empty JAR not containing this class. It relies on {{spark-submit}} finding the class within the compiled test-classes of the
[jira] [Commented] (SPARK-3470) Have JavaSparkContext implement Closeable/AutoCloseable
[ https://issues.apache.org/jira/browse/SPARK-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128232#comment-14128232 ] Sean Owen commented on SPARK-3470: -- If you implement {{AutoCloseable}}, then Spark will not work on Java 6, since this class does not exist before Java 7. Implementing {{Closeable}} is fine of course. I assume it would just call {{stop()}} Have JavaSparkContext implement Closeable/AutoCloseable --- Key: SPARK-3470 URL: https://issues.apache.org/jira/browse/SPARK-3470 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.2 Reporter: Shay Rojansky Priority: Minor After discussion in SPARK-2972, it seems like a good idea to allow Java developers to use Java 7 automatic resource management with JavaSparkContext, like so: {code:java} try (JavaSparkContext ctx = new JavaSparkContext(...)) { return br.readLine(); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
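As a sketch of the Closeable route (an illustration only, not the actual change to JavaSparkContext): close() would simply delegate to stop(), which is enough for Java 7 try-with-resources and for Closeable-based utilities on Java 6.
{code}
import java.io.Closeable
import org.apache.spark.api.java.JavaSparkContext

// A thin wrapper showing the idea; the real change would have JavaSparkContext
// itself implement Closeable with close() delegating to stop().
class CloseableSparkContext(val jsc: JavaSparkContext) extends Closeable {
  override def close(): Unit = jsc.stop()
}
{code}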
[jira] [Commented] (SPARK-3474) Rename the env variable SPARK_MASTER_IP to SPARK_MASTER_HOST
[ https://issues.apache.org/jira/browse/SPARK-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128300#comment-14128300 ] Sean Owen commented on SPARK-3474: -- (You can deprecate but still support old variable names, right? so SPARK_MASTER_IP has the effect of setting new SPARK_MASTER_HOST but generates a warning. You wouldn't want to or need to remove old vars immediately.) Rename the env variable SPARK_MASTER_IP to SPARK_MASTER_HOST Key: SPARK-3474 URL: https://issues.apache.org/jira/browse/SPARK-3474 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.1 Reporter: Chunjun Xiao There's some inconsistency regarding the env variable used to specify the spark master host server. In spark source code (MasterArguments.scala), the env variable is SPARK_MASTER_HOST, while in the shell script (e.g., spark-env.sh, start-master.sh), it's named SPARK_MASTER_IP. This will introduce an issue in some case, e.g., if spark master is started via service spark-master start, which is built based on latest bigtop (refer to bigtop/spark-master.svc). In this case, SPARK_MASTER_IP will have no effect. I suggest we change SPARK_MASTER_IP in the shell script to SPARK_MASTER_HOST. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3470) Have JavaSparkContext implement Closeable/AutoCloseable
[ https://issues.apache.org/jira/browse/SPARK-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128764#comment-14128764 ] Sean Owen commented on SPARK-3470: -- Spark retains compatibility with Java 6 on purpose AFAIK. But implementing Closeable is fine and also works with try-with-resources in Java 7, yes. Have JavaSparkContext implement Closeable/AutoCloseable --- Key: SPARK-3470 URL: https://issues.apache.org/jira/browse/SPARK-3470 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.2 Reporter: Shay Rojansky Priority: Minor After discussion in SPARK-2972, it seems like a good idea to allow Java developers to use Java 7 automatic resource management with JavaSparkContext, like so: {code:java} try (JavaSparkContext ctx = new JavaSparkContext(...)) { return br.readLine(); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-558) Simplify run script by relying on sbt to launch app
[ https://issues.apache.org/jira/browse/SPARK-558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129798#comment-14129798 ] Sean Owen commented on SPARK-558: - Is this stale too? given that SBT is less used, I don't imagine the run scripts will start relying on it for classpath generation. Simplify run script by relying on sbt to launch app --- Key: SPARK-558 URL: https://issues.apache.org/jira/browse/SPARK-558 Project: Spark Issue Type: Improvement Reporter: Ismael Juma The run script replicates SBT's functionality in order to build the classpath. This could be avoided by creating a task in sbt that is responsible for calling the appropriate main method, configuring the environment variables from the script and then invoking sbt with the task name and arguments. Is there a reason why we should not do this? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-683) Spark 0.7 with Hadoop 1.0 does not work with current AMI's HDFS installation
[ https://issues.apache.org/jira/browse/SPARK-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-683. - Resolution: Fixed I think this is likely long since obsolete or fixed, since Spark, Hadoop and AMI Hadoop versions have moved forward, and have not heard of this issue in recent memory. Spark 0.7 with Hadoop 1.0 does not work with current AMI's HDFS installation Key: SPARK-683 URL: https://issues.apache.org/jira/browse/SPARK-683 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 0.7.0 Reporter: Tathagata Das A simple saveAsObjectFile() leads to the following error. org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.NoSuchMethodException: org.apache.hadoop.hdfs.protocol.ClientProtocol.create(java.lang.String, org.apache.hadoop.fs.permission.FsPermission, java.lang.String, boolean, boolean, short, long) at java.lang.Class.getMethod(Class.java:1622) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-880) When built with Hadoop2, spark-shell and examples don't initialize log4j properly
[ https://issues.apache.org/jira/browse/SPARK-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129803#comment-14129803 ] Sean Owen commented on SPARK-880: - This should be resolved/obsoleted by subsequent updates to SLF4J and log4j integration, and the props file. When built with Hadoop2, spark-shell and examples don't initialize log4j properly - Key: SPARK-880 URL: https://issues.apache.org/jira/browse/SPARK-880 Project: Spark Issue Type: Bug Reporter: Matei Zaharia They print this: {code} log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jEventHandler). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. {code} It might have to do with not finding a log4j.properties file. I believe hadoop1 had one in its own JARs (or depended on an older log4j that came with a default)? but hadoop2 doesn't. We should probably have our own default one in conf. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3520) java version check in spark-class fails with openjdk
[ https://issues.apache.org/jira/browse/SPARK-3520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133226#comment-14133226 ] Sean Owen commented on SPARK-3520: -- Duplicated / subsumed by https://issues.apache.org/jira/browse/SPARK-3425 java version check in spark-class fails with openjdk Key: SPARK-3520 URL: https://issues.apache.org/jira/browse/SPARK-3520 Project: Spark Issue Type: Bug Components: Spark Shell Environment: Freebsd 10.1, Openjdk 7 Reporter: Radim Kolar Priority: Minor tested on current git master: (hsn@sanatana:pts/4):spark/bin% ./spark-shell /home/hsn/live/spark/bin/spark-class: line 111: [: openjdk version 1.7.0_65: integer expression expected (hsn@sanatana:pts/4):spark/bin% java -version openjdk version 1.7.0_65 OpenJDK Runtime Environment (build 1.7.0_65-b17) OpenJDK Server VM (build 24.65-b04, mixed mode) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3521) Missing modules in 1.1.0 source distribution - can't be built with Maven
[ https://issues.apache.org/jira/browse/SPARK-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133413#comment-14133413 ] Sean Owen commented on SPARK-3521: -- https://dist.apache.org/repos/dist/release/spark/spark-1.1.0/spark-1.1.0.tgz All of that source code is plainly in the distribution. It compiles with Maven for me, and this was verified by several people during the release. It sounds like something is quite corrupted about your copy. Missing modules in 1.1.0 source distribution - can't be built with Maven --- Key: SPARK-3521 URL: https://issues.apache.org/jira/browse/SPARK-3521 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Radim Kolar Priority: Minor The modules {{bagel}}, {{mllib}}, {{flume-sink}} and {{flume}} are missing from the source code distribution, so Spark can't be built with Maven. It can't be built by {{sbt/sbt}} either, due to another bug (_java.lang.IllegalStateException: impossible to get artifacts when data has not been loaded. IvyNode = org.slf4j#slf4j-api;1.6.1_) (hsn@sanatana:pts/6):work/spark-1.1.0% mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1 -DskipTests clean package [INFO] Scanning for projects... [ERROR] The build could not read 1 project - [Help 1] [ERROR] [ERROR] The project org.apache.spark:spark-parent:1.1.0 (/home/hsn/myports/spark11/work/spark-1.1.0/pom.xml) has 4 errors [ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/bagel of /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist [ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/mllib of /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist [ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/external/flume of /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist [ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/external/flume-sink/pom.xml of /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133760#comment-14133760 ] Sean Owen commented on SPARK-3530: -- A few high-level questions: Is this a rewrite of MLlib? I see the old code will be deprecated. I assume the algorithms will come along, but in a fairly different form. I think that's actually a good thing. But is this targeted at a 2.x release, or sooner? How does this relate to MLI and MLbase? I had thought they would in theory handle things like grid-search, but haven't seen activity or mention of these in a while. Is this at all a merge of the two or is MLlib going to take over these concerns? I don't think you will need or want to use this code, but the oryx project already has an implementation of grid search on Spark. At least another take on the API for such a thing to consider. https://github.com/OryxProject/oryx/tree/master/oryx-ml/src/main/java/com/cloudera/oryx/ml/param Big +1 for parameter tuning. That belongs as a first-class citizen. I'm also intrigued by doing better than trying every possible combination of parameters separately, and maybe sharing partial results to speed up several models' training. Is this realistic for any parameters besides things like # iterations? which isn't really a hyperparam. I don't know, for example, ways to build N models with N different overfitting params and share some work. I would love to know that's possible. Good to design for it anyway. I see mention of a Dataset abstraction, which I'm assuming contains some type information, like distinguishing categorical and numeric features. I think that's very good! I've always found the 'pipeline' part hard to build. It's tempting to construct a framework for feature extraction. To some degree you can by providing transformations, 1-hot encoding, etc. But I think that a framework for understanding arbitrary databases and fields and so on quickly becomes too endlessly large a scope. Spark Core to me is already the right abstraction for upstream ETL of data before entering an ML framework. I mention it just because it's in the first picture, but I don't see discussion of actually doing user/product attribute selection later. So maybe it's not meant to be part of the proposal. I'd certainly like to keep up more with your work here. This is a big step forward in making MLlib more relevant to production deployments rather than just pure algorithms implementations. Pipeline and Parameters --- Key: SPARK-3530 URL: https://issues.apache.org/jira/browse/SPARK-3530 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical This part of the design doc is for pipelines and parameters. I put the design doc at https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at: https://github.com/mengxr/spark-ml/ Please help review the design and post your comments here. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
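On the grid-search point above, a deliberately naive sketch (illustrative only, and not the API proposed in the design doc): every parameter combination is trained and evaluated independently, which is exactly the baseline the comment hopes the pipeline API can improve on by sharing work across fits.
{code}
// Illustrative only; Params, train and evaluate are placeholders.
case class Params(regParam: Double, numIterations: Int)

def gridSearch[M](
    grid: Seq[Params],
    train: Params => M,
    evaluate: M => Double): (Params, M, Double) =
  grid.map { p =>
    val model = train(p)              // one independent fit per combination
    (p, model, evaluate(model))
  }.maxBy(_._3)                       // keep the best-scoring combination
{code}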
[jira] [Resolved] (SPARK-3470) Have JavaSparkContext implement Closeable/AutoCloseable
[ https://issues.apache.org/jira/browse/SPARK-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3470. -- Resolution: Fixed Fix Version/s: 1.2.0 Have JavaSparkContext implement Closeable/AutoCloseable --- Key: SPARK-3470 URL: https://issues.apache.org/jira/browse/SPARK-3470 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.2 Reporter: Shay Rojansky Priority: Minor Fix For: 1.2.0 After discussion in SPARK-2972, it seems like a good idea to allow Java developers to use Java 7 automatic resource management with JavaSparkContext, like so: {code:java} try (JavaSparkContext ctx = new JavaSparkContext(...)) { return br.readLine(); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1895) Run tests on windows
[ https://issues.apache.org/jira/browse/SPARK-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133964#comment-14133964 ] Sean Owen commented on SPARK-1895: -- Can anyone still reproduce this? I know test temp file cleanup was improved in 1.0.x, and am not sure I have heard of this since. Run tests on windows Key: SPARK-1895 URL: https://issues.apache.org/jira/browse/SPARK-1895 Project: Spark Issue Type: Bug Components: PySpark, Windows Affects Versions: 0.9.1 Environment: spark-0.9.1-bin-hadoop1 Reporter: stribog Priority: Trivial bin\pyspark python\pyspark\rdd.py Sometimes tests complete without error _. Last tests fail log: 14/05/21 18:31:40 INFO Executor: Running task ID 321 14/05/21 18:31:40 INFO Executor: Running task ID 324 14/05/21 18:31:40 INFO Executor: Running task ID 322 14/05/21 18:31:40 INFO Executor: Running task ID 323 14/05/21 18:31:40 INFO PythonRDD: Times: total = 241, boot = 240, init = 1, finish = 0 14/05/21 18:31:40 INFO Executor: Serialized size of result for 324 is 607 14/05/21 18:31:40 INFO Executor: Sending result for 324 directly to driver 14/05/21 18:31:40 INFO Executor: Finished task ID 324 14/05/21 18:31:40 INFO TaskSetManager: Finished TID 324 in 248 ms on localhost (progress: 1/4) 14/05/21 18:31:40 INFO DAGScheduler: Completed ResultTask(80, 3) 14/05/21 18:31:40 INFO PythonRDD: Times: total = 518, boot = 516, init = 2, finish = 0 14/05/21 18:31:40 INFO Executor: Serialized size of result for 323 is 607 14/05/21 18:31:40 INFO Executor: Sending result for 323 directly to driver 14/05/21 18:31:40 INFO Executor: Finished task ID 323 14/05/21 18:31:40 INFO TaskSetManager: Finished TID 323 in 528 ms on localhost (progress: 2/4) 14/05/21 18:31:40 INFO DAGScheduler: Completed ResultTask(80, 2) 14/05/21 18:31:41 INFO PythonRDD: Times: total = 776, boot = 774, init = 2, finish = 0 14/05/21 18:31:41 INFO Executor: Serialized size of result for 322 is 607 14/05/21 18:31:41 INFO Executor: Sending result for 322 directly to driver 14/05/21 18:31:41 INFO Executor: Finished task ID 322 14/05/21 18:31:41 INFO TaskSetManager: Finished TID 322 in 785 ms on localhost (progress: 3/4) 14/05/21 18:31:41 INFO DAGScheduler: Completed ResultTask(80, 1) 14/05/21 18:31:41 INFO PythonRDD: Times: total = 1043, boot = 1042, init = 1, finish = 0 14/05/21 18:31:41 INFO Executor: Serialized size of result for 321 is 607 14/05/21 18:31:41 INFO Executor: Sending result for 321 directly to driver 14/05/21 18:31:41 INFO Executor: Finished task ID 321 14/05/21 18:31:41 INFO TaskSetManager: Finished TID 321 in 1049 ms on localhost (progress: 4/4) 14/05/21 18:31:41 INFO DAGScheduler: Completed ResultTask(80, 0) 14/05/21 18:31:41 INFO TaskSchedulerImpl: Removed TaskSet 80.0, whose tasks have all completed, from pool 14/05/21 18:31:41 INFO DAGScheduler: Stage 80 (top at doctest __main__.RDD.top[0]:1) finished in 1,051 s 14/05/21 18:31:41 INFO SparkContext: Job finished: top at doctest __main__.RDD.top[0]:1, took 1.053832912 s 14/05/21 18:31:41 INFO SparkContext: Starting job: top at doctest __main__.RDD.top[1]:1 14/05/21 18:31:41 INFO DAGScheduler: Got job 63 (top at doctest __main__.RDD.top[1]:1) with 4 output partitions (allowLocal=false) 14/05/21 18:31:41 INFO DAGScheduler: Final stage: Stage 81 (top at doctest __main__.RDD.top[1]:1) 14/05/21 18:31:41 INFO DAGScheduler: Parents of final stage: List() 14/05/21 18:31:41 INFO DAGScheduler: Missing parents: List() 14/05/21 18:31:41 INFO DAGScheduler: Submitting Stage 81 (PythonRDD[213] at top at doctest 
__main__.RDD.top[1]:1), which has no missing parents 14/05/21 18:31:41 INFO DAGScheduler: Submitting 4 missing tasks from Stage 81 (PythonRDD[213] at top at doctest __main__.RDD.top[1]:1) 14/05/21 18:31:41 INFO TaskSchedulerImpl: Adding task set 81.0 with 4 tasks 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:0 as TID 325 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:0 as 2594 bytes in 0 ms 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:1 as TID 326 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:1 as 2594 bytes in 0 ms 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:2 as TID 327 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:2 as 2594 bytes in 0 ms 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:3 as TID 328 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:3 as 2609 bytes in 1 ms 14/05/21 18:31:41 INFO Executor: Running task ID 326 14/05/21 18:31:41 INFO Executor:
[jira] [Resolved] (SPARK-1258) RDD.countByValue optimization
[ https://issues.apache.org/jira/browse/SPARK-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1258. -- Resolution: Won't Fix I'm taking the liberty of closing this, since this refers to an optimization using fastutil classes, which were removed from Spark. An equivalent optimization is employed now, using Spark's OpenHashMap. RDD.countByValue optimization - Key: SPARK-1258 URL: https://issues.apache.org/jira/browse/SPARK-1258 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.9.0 Reporter: Jaroslav Kamenik Priority: Trivial The class Object2LongOpenHashMap has a method add(key, incr) (addTo in the new version) for incrementing the value assigned to a key. It should be faster than the currently used map.put(v, map.getLong(v) + 1L). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
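For readers without the fastutil context, a plain-Scala sketch of the counting pattern in question (illustrative; per the comment above, Spark's own OpenHashMap now fills the role the fastutil map used to play):
{code}
import scala.collection.mutable

// Count occurrences of each distinct value. The JIRA's point was that a
// single add/increment call per element beats a separate get followed by a
// put; Spark's OpenHashMap-based implementation provides that optimization.
def countByValue[T](values: Iterator[T]): mutable.Map[T, Long] = {
  val counts = mutable.HashMap.empty[T, Long].withDefaultValue(0L)
  values.foreach(v => counts(v) += 1L)
  counts
}
{code}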
[jira] [Commented] (SPARK-3506) 1.1.0-SNAPSHOT in docs for 1.1.0 under docs/latest
[ https://issues.apache.org/jira/browse/SPARK-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133976#comment-14133976 ] Sean Owen commented on SPARK-3506: -- Yeah, I imagine that can be touched up right now. For the future, I imagine the issue was just that the site was built from the branch before the release plugin upped the version and created the artifacts? So the site might be better built from the final released source artifact. I imagine it's a release-process doc change, but I don't know where that lives. 1.1.0-SNAPSHOT in docs for 1.1.0 under docs/latest -- Key: SPARK-3506 URL: https://issues.apache.org/jira/browse/SPARK-3506 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Jacek Laskowski Priority: Trivial In https://spark.apache.org/docs/latest/ there are references to 1.1.0-SNAPSHOT: * This documentation is for Spark version 1.1.0-SNAPSHOT. * For the Scala API, Spark 1.1.0-SNAPSHOT uses Scala 2.10. It should be version 1.1.0 since that's the latest released version and the header says so, too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134096#comment-14134096 ] Sean Owen commented on SPARK-2620: -- FWIW, here is a mailing list comment that suggests 1.1 works with these case classes, although this is not a case where the REPL is being used: http://apache-spark-user-list.1001560.n3.nabble.com/Compiler-issues-for-multiple-map-on-RDD-td14248.html case class cannot be used as key for reduce --- Key: SPARK-2620 URL: https://issues.apache.org/jira/browse/SPARK-2620 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: reproduced on spark-shell local[4] Reporter: Gerard Maas Priority: Critical Labels: case-class, core Using a case class as a key doesn't seem to work properly on Spark 1.0.0. A minimal example: case class P(name: String) val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")) sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect [Spark shell local mode] res: Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1)) In contrast to the expected behavior, that should be equivalent to: sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) groupByKey and distinct also present the same behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3563) Shuffle data not always be cleaned
[ https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136899#comment-14136899 ] Sean Owen commented on SPARK-3563: -- I am no expert, but I believe this is on purpose, in order to reuse the shuffle if the RDD partition needs to be recomputed. You may need to set a lower spark.cleaner.ttl? Shuffle data not always be cleaned -- Key: SPARK-3563 URL: https://issues.apache.org/jira/browse/SPARK-3563 Project: Spark Issue Type: Bug Components: core Affects Versions: 1.0.2 Reporter: shenhong In our cluster, when we run a spark streaming job, after running for many hours, the shuffle data seems not all be cleaned, here is the shuffle data: -rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0 -rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1 -rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0 -rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1 -rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0 -rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1 -rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1 -rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0 -rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0 -rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1 -rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1 -rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0 -rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1 The shuffle data of stage 63, 65, 84, 132... are not cleaned. In ContextCleaner, it maintains a weak reference for each RDD, ShuffleDependency, and Broadcast of interest, to be processed when the associated object goes out of scope of the application. Actual cleanup is performed in a separate daemon thread. There must be some reference for ShuffleDependency , and it's hard to find out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
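A minimal sketch of the workaround hinted at in the comment above, using the spark.cleaner.ttl setting available in this Spark line (the value is illustrative; an appropriate TTL depends on the job):
{code}
import org.apache.spark.SparkConf

// Illustrative only: enable Spark's periodic metadata cleanup so old shuffle
// state from a long-running streaming job is eventually dropped.
val conf = new SparkConf()
  .setAppName("long-running-streaming-job")
  .set("spark.cleaner.ttl", "3600")  // seconds; example value only
{code}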
[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142383#comment-14142383 ] Sean Owen commented on SPARK-3621: -- My understanding is that this is fairly fundamentally not possible in Spark. The metadata and machinery necessary to operate on RDDs is with the driver. RDDs are not accessible within transformations or actions. I'm interested both in whether that is in fact true, how much of an issue it really is for Hive-on-Spark to use collect + broadcast, and whether these sorts of requirements can be met with join, cogroup, etc. Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access --- Key: SPARK-3621 URL: https://issues.apache.org/jira/browse/SPARK-3621 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Reporter: Xuefu Zhang In some cases, such as Hive's way of doing map-side join, it would be benefcial to allow client program to broadcast RDDs rather than just variables made of these RDDs. Broadcasting a variable made of RDDs requires all RDD data be collected to the driver and that the variable be shipped to the cluster after being made. It would be more performing if driver just broadcasts the RDDs and uses the corresponding data in jobs (such building hashmaps at executors). Tez has a broadcast edge which can ship data from previous stage to the next stage, which doesn't require driver side processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
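For reference, a rough sketch of the collect-and-broadcast pattern the comment refers to (key/value types and names are illustrative): the small side is materialized at the driver, broadcast to executors, and joined against on the map side without a shuffle.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Illustrative sketch of a map-side join via collect + broadcast.
def broadcastJoin(
    sc: SparkContext,
    small: RDD[(String, Int)],
    large: RDD[(String, String)]): RDD[(String, (String, Int))] = {
  val smallMap = small.collectAsMap()      // pulled back to the driver
  val bc = sc.broadcast(smallMap)          // shipped once to each executor
  large.flatMap { case (k, v) =>
    bc.value.get(k).map(n => (k, (v, n)))  // join without a shuffle
  }
}
{code}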
[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143009#comment-14143009 ] Sean Owen commented on SPARK-3621: -- If the data is shipped to the worker node, and the driver is the thing that can marshal the data to be sent, how is it different from a Broadcast variable? the broadcast can be done efficiently with the torrent-based broadcast, for example. Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access --- Key: SPARK-3621 URL: https://issues.apache.org/jira/browse/SPARK-3621 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Reporter: Xuefu Zhang In some cases, such as Hive's way of doing map-side join, it would be benefcial to allow client program to broadcast RDDs rather than just variables made of these RDDs. Broadcasting a variable made of RDDs requires all RDD data be collected to the driver and that the variable be shipped to the cluster after being made. It would be more performing if driver just broadcasts the RDDs and uses the corresponding data in jobs (such building hashmaps at executors). Tez has a broadcast edge which can ship data from previous stage to the next stage, which doesn't require driver side processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3625) In some cases, the RDD.checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143153#comment-14143153 ] Sean Owen commented on SPARK-3625: -- This prints 1000 both times for me, which is correct. When you say doesn't work, could you please elaborate? different count? exception? what is your environment? In some cases, the RDD.checkpoint does not work --- Key: SPARK-3625 URL: https://issues.apache.org/jira/browse/SPARK-3625 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Guoqiang Li Assignee: Guoqiang Li Priority: Blocker The reproduce code: {code} sc.setCheckpointDir(checkpointDir) val c = sc.parallelize((1 to 1000)) c.count c.checkpoint() c.count {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3625) In some cases, the RDD.checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143342#comment-14143342 ] Sean Owen commented on SPARK-3625: -- It still prints 1000 both times, which is correct. Your assertion is about something different. The assertion fails, but, the behavior you are asserting is not what the javadoc suggests: {quote} Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext.setCheckpointDir() and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation. {quote} This example calls count() before checkpoint(). If you don't, I think you get the expected behavior, since the dependency becomes a CheckpointRDD. This looks like not a bug. In some cases, the RDD.checkpoint does not work --- Key: SPARK-3625 URL: https://issues.apache.org/jira/browse/SPARK-3625 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Guoqiang Li Assignee: Guoqiang Li Priority: Blocker The reproduce code: {code} sc.setCheckpointDir(checkpointDir) val c = sc.parallelize((1 to 1000)).map(_ + 1) c.count val dep = c.dependencies.head.rdd c.checkpoint() c.count assert(dep != c.dependencies.head.rdd) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
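To make the ordering point concrete, a sketch of the usage the quoted javadoc describes, reusing the sc and checkpointDir from the reproduction above:
{code}
sc.setCheckpointDir(checkpointDir)
val c = sc.parallelize(1 to 1000).map(_ + 1)
c.checkpoint()  // mark for checkpointing BEFORE any action runs
c.count         // the first action materializes and writes the checkpoint
// per the comment above, the dependency then becomes a CheckpointRDD,
// so the assertion from the reproduction would hold with this ordering
{code}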
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143724#comment-14143724 ] Sean Owen commented on SPARK-3431: -- It's trivial to configure Maven surefire/failsafe to execute tests in parallel. It can parallelize by class or method, fork or not, control number of concurrent forks as a multiple of cores, etc. For example, it's no problem to make test classes use their own JVM, and not even reuse JVMs if you don't want. The harder part is making the tests play nice with each other on one machine when it comes to shared resources: files and ports, really. I think the tests have had several passes of improvements to reliably use their own temp space, and try to use an unused port, but this is one typical cause of test breakage. It's not yet clear that tests don't clobber each other by trying to use the same default Spark working dir or something. Finally, some tests that depend on a certain sequence of random numbers may need to be made more robust. but the parallelization is trivial in Maven, at least. Parallelize execution of tests -- Key: SPARK-3431 URL: https://issues.apache.org/jira/browse/SPARK-3431 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common strategy to cut test time down is to parallelize the execution of the tests. Doing that may in turn require some prerequisite changes to be made to how certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143835#comment-14143835 ] Sean Owen commented on SPARK-3431: -- For your experiments, scalatest just copies an old subset of surefire's config: http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin vs http://maven.apache.org/surefire/maven-surefire-plugin/test-mojo.html You can see discussion of how forkMode works: http://maven.apache.org/surefire/maven-surefire-plugin/examples/fork-options-and-parallel-execution.html Bad news is that scalatest's support is much more limited, but parallel=true and forkMode=once might do the trick. Otherwise... I guess we can figure out if it's realistic to use standard surefire instead of scalatest. Parallelize execution of tests -- Key: SPARK-3431 URL: https://issues.apache.org/jira/browse/SPARK-3431 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common strategy to cut test time down is to parallelize the execution of the tests. Doing that may in turn require some prerequisite changes to be made to how certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3656) IllegalArgumentException when I using sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144588#comment-14144588 ] Sean Owen commented on SPARK-3656: -- Duplicate of https://issues.apache.org/jira/browse/SPARK-3032 This was discussed even today on the mailing list. IllegalArgumentException when I using sort-based shuffle Key: SPARK-3656 URL: https://issues.apache.org/jira/browse/SPARK-3656 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.1.0 Reporter: yangping wu Original Estimate: 8h Remaining Estimate: 8h The code work fine in hash-based shuffle. {code} sc.textFile(file:///export1/spark/zookeeper.out).flatMap(l = l.split( )).map(w=(w,1)).reduceByKey(_ + _).collect {code} But when I test the program using sort-based shuffle,the program encounters an error: {code} scala sc.textFile(file:///export1/spark/zookeeper.out).flatMap(l = l.split( )).map(w=(w,1)).reduceByKey(_ + _).collect org.apache.spark.SparkException: Job aborted due to stage failure: Task 22 in stage 1.0 failed 1 times, most recent failure: Lost task 22.0 in stage 1.0 (TID 22, localhost): java.lang.IllegalArgumentException: Comparison method violates its general contract! org.apache.spark.util.collection.Sorter$SortState.mergeHi(Sorter.java:876) org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:495) org.apache.spark.util.collection.Sorter$SortState.mergeForceCollapse(Sorter.java:436) org.apache.spark.util.collection.Sorter$SortState.access$300(Sorter.java:294) org.apache.spark.util.collection.Sorter.sort(Sorter.java:137) org.apache.spark.util.collection.AppendOnlyMap.destructiveSortedIterator(AppendOnlyMap.scala:271) org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:212) org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:67) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:619) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe,
[jira] [Commented] (SPARK-3662) Importing pandas breaks included pi.py example
[ https://issues.apache.org/jira/browse/SPARK-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146017#comment-14146017 ] Sean Owen commented on SPARK-3662: -- Maybe I miss something, but, does this just mean you can't import pandas entirely? If you're modifying the example, you should import only what you need from pandas. Or, it may be that you need to modify the import random, indeed, to accommodate other modifications you want to make. But what is the problem with the included example? it runs fine without modifications, no? Importing pandas breaks included pi.py example -- Key: SPARK-3662 URL: https://issues.apache.org/jira/browse/SPARK-3662 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.1.0 Environment: Xubuntu 14.04. Yarn cluster running on Ubuntu 12.04. Reporter: Evan Samanas If I add import pandas at the top of the included pi.py example and submit using spark-submit --master yarn-client, I get this stack trace: {code} Traceback (most recent call last): File /home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi.py, line 39, in module count = sc.parallelize(xrange(1, n+1), slices).map(f).reduce(add) File /home/evan/pub_src/spark/python/pyspark/rdd.py, line 759, in reduce vals = self.mapPartitions(func).collect() File /home/evan/pub_src/spark/python/pyspark/rdd.py, line 723, in collect bytesInJava = self._jrdd.collect().iterator() File /home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError14/09/23 15:51:58 INFO TaskSetManager: Lost task 2.3 in stage 0.0 (TID 10) on executor SERVERNAMEREMOVED: org.apache.spark.api.python.PythonException (Traceback (most recent call last): File /yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py, line 75, in main command = pickleSer._read_with_length(infile) File /yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py, line 150, in _read_with_length return self.loads(obj) ImportError: No module named algos {code} The example works fine if I move the statement from random import random from the top and into the function (def f(_)) defined in the example. Near as I can tell, random is getting confused with a function of the same name within pandas.algos. Submitting the same script using --master local works, but gives a distressing amount of random characters to stdout or stderr and messes up my terminal: {code} ... 
@J@J@J@J@J@J@J@J@J@J@J@J@J@JJ@J@J@J@J @J!@J@J#@J$@J%@J@J'@J(@J)@J*@J+@J,@J-@J.@J/@J0@J1@J2@J3@J4@J5@J6@J7@J8@J9@J:@J;@J@J=@J@J?@J@@JA@JB@JC@JD@JE@JF@JG@JH@JI@JJ@JK@JL@JM@JN@JO@JP@JQ@JR@JS@JT@JU@JV@JW@JX@JY@JZ@J[@J\@J]@J^@J_@J`@Ja@Jb@Jc@Jd@Je@Jf@Jg@Jh@Ji@Jj@Jk@Jl@Jm@Jn@Jo@Jp@Jq@Jr@Js@Jt@Ju@Jv@Jw@Jx@Jy@Jz@J{@J|@J}@J~@J@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@JJJ�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@JAJAJAJAJAJAJAJAAJ AJ AJ AJ AJAJAJAJAJAJAJAJAJAJAJAJAJAJJAJAJAJAJ AJ!AJAJ#AJ$AJ%AJAJ'AJ(AJ)AJ*AJ+AJ,AJ-AJ.AJ/AJ0AJ1AJ2AJ3AJ4AJ5AJ6AJ7AJ8AJ9AJ:AJ;AJAJ=AJAJ?AJ@AJAAJBAJCAJDAJEAJFAJGAJHAJIAJJAJKAJLAJMAJNAJOAJPAJQAJRAJSAJTAJUAJVAJWAJXAJYAJZAJ[AJ\AJ]AJ^AJ_AJ`AJaAJbAJcAJdAJeAJfAJgAJhAJiAJjAJkAJlAJmAJnAJoAJpAJqAJrAJsAJtAJuAJvAJwAJxAJyAJzAJ{AJ|AJ}AJ~AJAJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJJJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�A14/09/23 15:42:09 INFO SparkContext: Job finished: reduce at /home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi_sframe.py:38, took 11.276879779 s J�AJ�AJ�AJ�AJ�AJ�AJ�A�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJBJBJBJBJBJBJBJBBJ BJ BJ BJ BJBJBJBJBJBJBJBJBJBJBJBJBJBJJBJBJBJBJ BJ!BJBJ#BJ$BJ%BJBJ'BJ(BJ)BJ*BJ+BJ,BJ-BJ.BJ/BJ0BJ1BJ2BJ3BJ4BJ5BJ6BJ7BJ8BJ9BJ:BJ;BJBJ=BJBJ?BJ@Be. �]qJ#1a. �]qJX4a. �]qJX4a. �]qJ#1a. �]qJX4a. �]qJX4a. �]qJ#1a. �]qJX4a. �]qJX4a. �]qJa. Pi is roughly 3.146136 {code} No idea if that's related, but thought I'd include it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3676) jdk version lead to spark sql test suite error
[ https://issues.apache.org/jira/browse/SPARK-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146040#comment-14146040 ] Sean Owen commented on SPARK-3676: -- (For the interested, I looked it up, since the behavior change sounds surprising. This is in fact a bug in Java 6 that was fixed in Java 7: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4428022 It may even be fixed in later versions of Java 6, but I have a very recent one and it is not.) jdk version lead to spark sql test suite error -- Key: SPARK-3676 URL: https://issues.apache.org/jira/browse/SPARK-3676 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Fix For: 1.2.0 System.out.println(1/500d) get different result in diff jdk version jdk 1.6.0(_31) 0.0020 jdk 1.7.0(_05) 0.002 this will lead to spark sql hive test suite failed (replay by set jdk version = 1.6.0_31)--- [info] - division *** FAILED *** [info] Results do not match for division: [info] SELECT 2 / 1, 1 / 2, 1 / 3, 1 / COUNT(*) FROM src LIMIT 1 [info] == Parsed Logical Plan == [info] Limit 1 [info]Project [(2 / 1) AS c_0#692,(1 / 2) AS c_1#693,(1 / 3) AS c_2#694,(1 / COUNT(1)) AS c_3#695] [info] UnresolvedRelation None, src, None [info] [info] == Analyzed Logical Plan == [info] Limit 1 [info]Aggregate [], [(CAST(2, DoubleType) / CAST(1, DoubleType)) AS c_0#692,(CAST(1, DoubleType) / CAST(2, DoubleType)) AS c_1#693,(CAST(1, DoubleType) / CAST(3, DoubleType)) AS c_2#694,(CAST(CAST(1, LongType), Doub leType) / CAST(COUNT(1), DoubleType)) AS c_3#695] [info] MetastoreRelation default, src, None [info] [info] == Optimized Logical Plan == [info] Limit 1 [info]Aggregate [], [2.0 AS c_0#692,0.5 AS c_1#693,0. AS c_2#694,(1.0 / CAST(COUNT(1), DoubleType)) AS c_3#695] [info] Project [] [info] MetastoreRelation default, src, None [info] [info] == Physical Plan == [info] Limit 1 [info]Aggregate false, [], [2.0 AS c_0#692,0.5 AS c_1#693,0. AS c_2#694,(1.0 / CAST(SUM(PartialCount#699L), DoubleType)) AS c_3#695] [info] Exchange SinglePartition [info] Aggregate true, [], [COUNT(1) AS PartialCount#699L] [info] HiveTableScan [], (MetastoreRelation default, src, None), None [info] [info] Code Generation: false [info] == RDD == [info] c_0c_1 c_2 c_3 [info] !== HIVE - 1 row(s) == == CATALYST - 1 row(s) == [info] !2.0 0.5 0. 0.002 2.0 0.5 0. 0.0020 (HiveComparisonTest.scala:370) [info] - timestamp cast #1 *** FAILED *** [info] Results do not match for timestamp cast #1: [info] SELECT CAST(CAST(1 AS TIMESTAMP) AS DOUBLE) FROM src LIMIT 1 [info] == Parsed Logical Plan == [info] Limit 1 [info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995] [info] UnresolvedRelation None, src, None [info] [info] == Analyzed Logical Plan == [info] Limit 1 [info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995] [info] MetastoreRelation default, src, None [info] [info] == Optimized Logical Plan == [info] Limit 1 [info]Project [0.0010 AS c_0#995] [info] MetastoreRelation default, src, None [info] [info] == Physical Plan == [info] Limit 1 [info]Project [0.0010 AS c_0#995] [info] HiveTableScan [], (MetastoreRelation default, src, None), None [info] [info] Code Generation: false [info] == RDD == [info] c_0 [info] !== HIVE - 1 row(s) == == CATALYST - 1 row(s) == [info] !0.001 0.0010 (HiveComparisonTest.scala:370) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
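The difference reduces to a one-line reproduction; the outputs noted in the comment are those reported in the issue and the linked JDK bug, not something the snippet itself guarantees:
{code}
object DoubleToStringCheck {
  def main(args: Array[String]): Unit = {
    // Reported output: "0.0020" on the affected Java 6 builds, "0.002" on Java 7
    println(1 / 500d)
  }
}
{code}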
[jira] [Updated] (SPARK-3586) Support nested directories in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3586: - Issue Type: Improvement (was: Bug) Summary: Support nested directories in Spark Streaming (was: spark streaming ) Support nested directories in Spark Streaming - Key: SPARK-3586 URL: https://issues.apache.org/jira/browse/SPARK-3586 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: wangxj Priority: Minor Labels: patch Fix For: 1.1.0 For text files, there is the method streamingContext.textFileStream(dataDirectory): Spark Streaming will monitor the directory dataDirectory and process any files created in that directory, but files written in nested directories are not supported, e.g. streamingContext.textFileStream("/test"). Look at the directory contents: /test/file1 /test/file2 /test/dr/file1 With this method, textFileStream can only read /test/file1, /test/file2 and /test/dr/, but the file /test/dr/file1 is not read. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3603) InvalidClassException on a Linux VM - probably problem with serialization
[ https://issues.apache.org/jira/browse/SPARK-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150700#comment-14150700 ] Sean Owen commented on SPARK-3603: -- Can you clarify what works and doesn't work -- works when both client and master use the same JVM, but, fails if they use different Linux JVMs? I'm wondering if it is actually going to be required to use the same JVMs, or at least, ones that generate serialVersionUID in the same way (which can be JVM-specific). Spark/Scala throw around so much serialized stuff that you notice pretty quickly, and setting a fixed serialVersionUID for the Scala classes is of course not feasible. You could try Kryo serialization too. InvalidClassException on a Linux VM - probably problem with serialization - Key: SPARK-3603 URL: https://issues.apache.org/jira/browse/SPARK-3603 Project: Spark Issue Type: Bug Affects Versions: 1.0.0, 1.1.0 Environment: Linux version 2.6.32-358.32.3.el6.x86_64 (mockbu...@x86-029.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Fri Jan 17 08:42:31 EST 2014 java version 1.7.0_25 OpenJDK Runtime Environment (rhel-2.3.10.4.el6_4-x86_64) OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode) Spark (either 1.0.0 or 1.1.0) Reporter: Tomasz Dudziak Priority: Critical Labels: scala, serialization, spark I have a Scala app connecting to a standalone Spark cluster. It works fine on Windows or on a Linux VM; however, when I try to run the app and the Spark cluster on another Linux VM (the same Linux kernel, Java and Spark - tested for versions 1.0.0 and 1.1.0) I get the below exception. This looks kind of similar to the Big-Endian (IBM Power7) Spark Serialization issue (SPARK-2018), but... my system is definitely little endian and I understand the big endian issue should be already fixed in Spark 1.1.0 anyway. I'd appreaciate your help. 
01:34:53.251 WARN [Result resolver thread-0][TaskSetManager] Lost TID 2 (task 1.0:2) 01:34:53.278 WARN [Result resolver thread-0][TaskSetManager] Loss was due to java.io.InvalidClassException java.io.InvalidClassException: scala.reflect.ClassTag$$anon$1; local class incompatible: stream classdesc serialVersionUID = -4937928798201944954, local class serialVersionUID = -8102093212602380348 at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1769) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1891) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at
[jira] [Resolved] (SPARK-1279) Stage.name return apply at Option.scala:120
[ https://issues.apache.org/jira/browse/SPARK-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1279. -- Resolution: Duplicate Obvious accidental dupe of SPARK-1280 Stage.name return apply at Option.scala:120 -- Key: SPARK-1279 URL: https://issues.apache.org/jira/browse/SPARK-1279 Project: Spark Issue Type: Bug Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2960) Spark executables fail to start via symlinks
[ https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2960. -- Resolution: Duplicate Fix Version/s: (was: 1.0.2) I suggest this be marked a dupe of https://issues.apache.org/jira/browse/SPARK-3482 as the latter appears to be the same and has an open PR. Spark executables fail to start via symlinks Key: SPARK-2960 URL: https://issues.apache.org/jira/browse/SPARK-2960 Project: Spark Issue Type: Bug Reporter: Shay Rojansky Priority: Minor The current scripts (e.g. pyspark) fail to run when they are executed via symlinks. A common Linux scenario would be to have Spark installed somewhere (e.g. /opt) and have a symlink to it in /usr/bin. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3504) KMeans clusterer is slow, can be sped up by 75%
[ https://issues.apache.org/jira/browse/SPARK-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150709#comment-14150709 ] Sean Owen commented on SPARK-3504: -- Is this the same as https://issues.apache.org/jira/browse/SPARK-3424 ? The latter has a pull request, so should this resolve as duplicate in favor of the other? KMeans clusterer is slow, can be sped up by 75% --- Key: SPARK-3504 URL: https://issues.apache.org/jira/browse/SPARK-3504 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.2 Reporter: Derrick Burns The 1.0.2 implementation of the KMeans clusterer is VERY inefficient because recomputes all distances to all cluster centers on each iteration. In later iterations of Lloyd's algorithm, points don't change clusters and clusters don't move. By 1) tracking which clusters move and 2) tracking for each point which cluster it belongs to and the distance to that cluster, one can avoid recomputing distances in many cases with very little increase in memory requirements. I implemented this new algorithm and the results were fantastic. Using 16 c3.8xlarge machines on EC2, the clusterer converged in 13 iterations on 1,714,654 (182 dimensional) points and 20,000 clusters in 24 minutes. Here are the running times for the first 7 rounds: 6 minutes and 42 second 7 minutes and 7 seconds 7 minutes 13 seconds 1 minutes 18 seconds 30 seconds 18 seconds 12 seconds Without this improvement, all rounds would have taken roughly 7 minutes, resulting in Lloyd's iterations taking 7 * 13 = 91 minutes. In other words, this improvement resulting in a reduction of roughly 75% in running time with no loss of accuracy. My implementation is a rewrite of the existing 1.0.2 implementation. It is not a simple modification of the existing implementation. Please let me know if you are interested in this new implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3324) YARN module has nonstandard structure which cause compile error In IntelliJ
[ https://issues.apache.org/jira/browse/SPARK-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3324. -- Resolution: Won't Fix I propose considering this WontFix, as it will go away when yarn-alpha goes away anyway, and that seems to be coming in the medium term. YARN module has nonstandard structure which cause compile error In IntelliJ --- Key: SPARK-3324 URL: https://issues.apache.org/jira/browse/SPARK-3324 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Environment: Mac OS: 10.9.4 IntelliJ IDEA: 13.1.4 Scala Plugins: 0.41.2 Maven: 3.0.5 Reporter: Yi Tian Assignee: Patrick Wendell Priority: Minor Labels: intellij, maven, yarn The YARN module has nonstandard path structure like: {code} ${SPARK_HOME} |--yarn |--alpha (contains yarn api support for 0.23 and 2.0.x) |--stable (contains yarn api support for 2.2 and later) | |--pom.xml (spark-yarn) |--common (Common codes not depending on specific version of Hadoop) |--pom.xml (yarn-parent) {code} When we use maven to compile yarn module, maven will import 'alpha' or 'stable' module according to profile setting. And the submodule like 'stable' use the build propertie defined in yarn/pom.xml to import common codes to sourcePath. It will cause IntelliJ can't directly recognize sources in common directory as sourcePath. I thought we should change the yarn module to a unified maven jar project, and add specify different version of yarn api via maven profile setting. It will resolve the compile error in IntelliJ and make the yarn module more simple and clear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3195) Can you add some statistics to do logistic regression better in mllib?
[ https://issues.apache.org/jira/browse/SPARK-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3195: - Priority: Minor (was: Major) Target Version/s: (was: 1.3.0) Fix Version/s: (was: 1.3.0) Labels: (was: test) This sounds like a question more than a JIRA, and questions are better discussed on the mailing list. Can you clarify what you want, and propose a PR, or else close? Can you add some statistics to do logistic regression better in mllib? -- Key: SPARK-3195 URL: https://issues.apache.org/jira/browse/SPARK-3195 Project: Spark Issue Type: New Feature Components: MLlib Reporter: miumiu Priority: Minor Original Estimate: 1m Remaining Estimate: 1m Hi, in logistic regression practice, tests of the regression coefficients and of overall model fit are very important. Can you add effective support for these aspects? For example, the likelihood ratio test or the Wald test is often used to test coefficients, and the Hosmer-Lemeshow test is used to evaluate model fit. We have ROC and Precision-Recall already, but can you also provide the KS statistic, which is widely used for model evaluation? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1046) Enable to build behind a proxy.
[ https://issues.apache.org/jira/browse/SPARK-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1046. -- Resolution: Fixed Enable to build behind a proxy. --- Key: SPARK-1046 URL: https://issues.apache.org/jira/browse/SPARK-1046 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.8.1 Reporter: Kousuke Saruta Priority: Minor I tried to build spark-0.8.1 behind a proxy and failed, although I set http/https.proxyHost, proxyPort, proxyUser and proxyPassword. I found it's caused by accessing GitHub using the git protocol (git://). The URL is hard-coded in SparkPluginBuild.scala as follows.
{code}
lazy val junitXmlListener = uri("git://github.com/ijuma/junit_xml_listener.git#fe434773255b451a38e8d889536ebc260f4225ce")
{code}
After I rewrote the URL as follows, I could build successfully.
{code}
lazy val junitXmlListener = uri("https://github.com/ijuma/junit_xml_listener.git#fe434773255b451a38e8d889536ebc260f4225ce")
{code}
I think we should be able to build whether we are behind a proxy or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3101) Missing volatile annotation in ApplicationMaster
[ https://issues.apache.org/jira/browse/SPARK-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3101. -- Resolution: Fixed This was actually subsumed by the commit for SPARK-2933 Missing volatile annotation in ApplicationMaster Key: SPARK-3101 URL: https://issues.apache.org/jira/browse/SPARK-3101 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Kousuke Saruta In ApplicationMaster, a field variable 'isLastAMRetry' is used as a flag but it's not declared as volatile though it's used from multiple threads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
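The fix amounts to a one-line annotation; a minimal sketch, with the field name taken from the report and the surrounding class reduced to a stub:
{code}
class ApplicationMaster /* ... */ {
  // Without @volatile, a write from one thread may never become visible to
  // other threads reading the flag; @volatile maps to a Java volatile field.
  @volatile private var isLastAMRetry: Boolean = true
}
{code}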
[jira] [Resolved] (SPARK-2815) Compilation failed upon the hadoop version 2.0.0-cdh4.5.0
[ https://issues.apache.org/jira/browse/SPARK-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2815. -- Resolution: Won't Fix Target Version/s: (was: 1.1.0) This looks like a WontFix, given the discussion. Compilation failed upon the hadoop version 2.0.0-cdh4.5.0 - Key: SPARK-2815 URL: https://issues.apache.org/jira/browse/SPARK-2815 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: pengyanhong Assignee: Guoqiang Li compile fail via SPARK_HADOOP_VERSION=2.0.0-cdh4.5.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt assembly, finally get error message : [error] (yarn-stable/compile:compile) Compilation failed, the following is the detail error on console: [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:26: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.YarnClient [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:40: not found: value YarnClient [error] val yarnClient = YarnClient.createYarnClient [error]^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:32: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:33: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:36: object util is not a member of package org.apache.hadoop.yarn.webapp [error] import org.apache.hadoop.yarn.webapp.util.WebAppUtils [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:64: value RM_AM_MAX_ATTEMPTS is not a member of object org.apache.hadoop.yarn.conf.YarnConfiguration [error] YarnConfiguration.RM_AM_MAX_ATTEMPTS, YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS) [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:66: not found: type AMRMClient [error] private var amClient: AMRMClient[ContainerRequest] = _ [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:92: not found: value AMRMClient [error] amClient = AMRMClient.createAMRMClient() [error]^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:137: not found: value WebAppUtils [error] val proxy = WebAppUtils.getProxyHostAndPort(conf) [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:40: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:618: not found: type AMRMClient [error] amClient: AMRMClient[ContainerRequest], [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:596: not 
found: type AMRMClient [error] amClient: AMRMClient[ContainerRequest], [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:577: not found: type AMRMClient [error] amClient: AMRMClient[ContainerRequest], [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:410: value CONTAINER_ID is not a member of object org.apache.hadoop.yarn.api.ApplicationConstants.Environment [error] val containerIdString = System.getenv(ApplicationConstants.Environment.CONTAINER_ID.name()) [error] ^ [error]
[jira] [Commented] (SPARK-3652) upgrade spark sql hive version to 0.13.1
[ https://issues.apache.org/jira/browse/SPARK-3652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150719#comment-14150719 ] Sean Owen commented on SPARK-3652: -- Yes, there is a lot to updating to 0.13 as you can see in the other JIRA, and the other seems like a cleaner approach. But even that may not be committed. I suggest resolving this one as a duplicate and trying to help the effort in SPARK-2706 instead, which is being used as a patch by some already evidently. upgrade spark sql hive version to 0.13.1 Key: SPARK-3652 URL: https://issues.apache.org/jira/browse/SPARK-3652 Project: Spark Issue Type: Dependency upgrade Components: SQL Affects Versions: 1.1.0 Reporter: wangfei now spark sql hive version is 0.12.0, compile with 0.13.1 will get errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2655) Change the default logging level to WARN
[ https://issues.apache.org/jira/browse/SPARK-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150740#comment-14150740 ] Sean Owen commented on SPARK-2655: -- Given the PR discussion, this is a WontFix? Change the default logging level to WARN Key: SPARK-2655 URL: https://issues.apache.org/jira/browse/SPARK-2655 Project: Spark Issue Type: Improvement Reporter: Davies Liu The current logging level INFO is pretty noisy; reducing this unnecessary logging would provide a better experience for users. Spark is much more stable and mature than before, so users do not need that much logging in normal cases. But some high-level information is still helpful, such as messages about job and task progress; we could change this important logging to WARN level as a hack, otherwise we would need to change all other logging to DEBUG. PS: it would be better to have a one-line progress indicator in the terminal (also in the title). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
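As a user-side workaround, independent of what the default becomes, the root log4j level can be raised before the SparkContext is created; a minimal sketch assuming the log4j 1.x API that Spark 1.x ships with:
{code}
import org.apache.log4j.{Level, Logger}

// Silence Spark's INFO chatter for this application only; WARN and above still appear.
Logger.getRootLogger.setLevel(Level.WARN)
{code}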
[jira] [Commented] (SPARK-2331) SparkContext.emptyRDD has wrong return type
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150746#comment-14150746 ] Sean Owen commented on SPARK-2331: -- Yes your analysis is right-er, I think. I imagine this won't be changed as it introduces an API change, but yes, I think the return type should have been {{RDD[T]}}. As a workaround you can do...
{code}
val empty: RDD[String] = sc.emptyRDD
val rdds = Seq(a, b, c).foldLeft(empty) { (rdd, path) => rdd.union(sc.textFile(path)) }
{code}
Anyone else? WontFix, at least for 1.x? SparkContext.emptyRDD has wrong return type --- Key: SPARK-2331 URL: https://issues.apache.org/jira/browse/SPARK-2331 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Ian Hummel The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder): val rdds = Seq(a, b, c).foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒ rdd.union(sc.textFile(path)) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2247) Data frame (or Pandas) like API for structured data
[ https://issues.apache.org/jira/browse/SPARK-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150748#comment-14150748 ] Sean Owen commented on SPARK-2247: -- For pandas, is this what Sparkling Pandas provides? https://github.com/holdenk/sparklingpandas And for R, is this covered by SparkR? http://amplab-extras.github.io/SparkR-pkg/ Is this something that should therefore live outside Spark? Data frame (or Pandas) like API for structured data --- Key: SPARK-2247 URL: https://issues.apache.org/jira/browse/SPARK-2247 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core, SQL Affects Versions: 1.0.0 Reporter: venu k tangirala Labels: features It would be nice to have R- or Python-pandas-like data frames on Spark: 1) to be able to access the RDD data frame from Python with pandas, 2) to be able to access the RDD data frame from R, 3) to be able to access the RDD data frame from Scala's Saddle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2153) CassandraTest fails for newer Cassandra due to case insensitive key space
[ https://issues.apache.org/jira/browse/SPARK-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2153: - Summary: CassandraTest fails for newer Cassandra due to case insensitive key space (was: Spark Examples) CassandraTest fails for newer Cassandra due to case insensitive key space - Key: SPARK-2153 URL: https://issues.apache.org/jira/browse/SPARK-2153 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.0.0 Reporter: vishnu Priority: Minor Labels: examples Fix For: 1.0.0 Original Estimate: 12h Remaining Estimate: 12h The Spark example CassandraTest.scala cannot be built against newer versions of Cassandra. I tried it on Cassandra 2.0.8. This is because Cassandra is case sensitive about keyspaces and stores all keyspace names in lowercase, while in the example the keyspace is casDemo, so the program fails with an error stating that the keyspace was not found. The new Cassandra jars also no longer have org.apache.cassandra.db.IColumn, so we have to use org.apache.cassandra.db.Column instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1895) Run tests on windows
[ https://issues.apache.org/jira/browse/SPARK-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1895. -- Resolution: Cannot Reproduce Run tests on windows Key: SPARK-1895 URL: https://issues.apache.org/jira/browse/SPARK-1895 Project: Spark Issue Type: Bug Components: PySpark, Windows Affects Versions: 0.9.1 Environment: spark-0.9.1-bin-hadoop1 Reporter: stribog Priority: Trivial bin\pyspark python\pyspark\rdd.py Sometimes tests complete without error _. Last tests fail log: {noformat} 14/05/21 18:31:40 INFO Executor: Running task ID 321 14/05/21 18:31:40 INFO Executor: Running task ID 324 14/05/21 18:31:40 INFO Executor: Running task ID 322 14/05/21 18:31:40 INFO Executor: Running task ID 323 14/05/21 18:31:40 INFO PythonRDD: Times: total = 241, boot = 240, init = 1, finish = 0 14/05/21 18:31:40 INFO Executor: Serialized size of result for 324 is 607 14/05/21 18:31:40 INFO Executor: Sending result for 324 directly to driver 14/05/21 18:31:40 INFO Executor: Finished task ID 324 14/05/21 18:31:40 INFO TaskSetManager: Finished TID 324 in 248 ms on localhost (progress: 1/4) 14/05/21 18:31:40 INFO DAGScheduler: Completed ResultTask(80, 3) 14/05/21 18:31:40 INFO PythonRDD: Times: total = 518, boot = 516, init = 2, finish = 0 14/05/21 18:31:40 INFO Executor: Serialized size of result for 323 is 607 14/05/21 18:31:40 INFO Executor: Sending result for 323 directly to driver 14/05/21 18:31:40 INFO Executor: Finished task ID 323 14/05/21 18:31:40 INFO TaskSetManager: Finished TID 323 in 528 ms on localhost (progress: 2/4) 14/05/21 18:31:40 INFO DAGScheduler: Completed ResultTask(80, 2) 14/05/21 18:31:41 INFO PythonRDD: Times: total = 776, boot = 774, init = 2, finish = 0 14/05/21 18:31:41 INFO Executor: Serialized size of result for 322 is 607 14/05/21 18:31:41 INFO Executor: Sending result for 322 directly to driver 14/05/21 18:31:41 INFO Executor: Finished task ID 322 14/05/21 18:31:41 INFO TaskSetManager: Finished TID 322 in 785 ms on localhost (progress: 3/4) 14/05/21 18:31:41 INFO DAGScheduler: Completed ResultTask(80, 1) 14/05/21 18:31:41 INFO PythonRDD: Times: total = 1043, boot = 1042, init = 1, finish = 0 14/05/21 18:31:41 INFO Executor: Serialized size of result for 321 is 607 14/05/21 18:31:41 INFO Executor: Sending result for 321 directly to driver 14/05/21 18:31:41 INFO Executor: Finished task ID 321 14/05/21 18:31:41 INFO TaskSetManager: Finished TID 321 in 1049 ms on localhost (progress: 4/4) 14/05/21 18:31:41 INFO DAGScheduler: Completed ResultTask(80, 0) 14/05/21 18:31:41 INFO TaskSchedulerImpl: Removed TaskSet 80.0, whose tasks have all completed, from pool 14/05/21 18:31:41 INFO DAGScheduler: Stage 80 (top at doctest __main__.RDD.top[0]:1) finished in 1,051 s 14/05/21 18:31:41 INFO SparkContext: Job finished: top at doctest __main__.RDD.top[0]:1, took 1.053832912 s 14/05/21 18:31:41 INFO SparkContext: Starting job: top at doctest __main__.RDD.top[1]:1 14/05/21 18:31:41 INFO DAGScheduler: Got job 63 (top at doctest __main__.RDD.top[1]:1) with 4 output partitions (allowLocal=false) 14/05/21 18:31:41 INFO DAGScheduler: Final stage: Stage 81 (top at doctest __main__.RDD.top[1]:1) 14/05/21 18:31:41 INFO DAGScheduler: Parents of final stage: List() 14/05/21 18:31:41 INFO DAGScheduler: Missing parents: List() 14/05/21 18:31:41 INFO DAGScheduler: Submitting Stage 81 (PythonRDD[213] at top at doctest __main__.RDD.top[1]:1), which has no missing parents 14/05/21 18:31:41 INFO DAGScheduler: Submitting 4 missing tasks from Stage 81 
(PythonRDD[213] at top at doctest __main__.RDD.top[1]:1) 14/05/21 18:31:41 INFO TaskSchedulerImpl: Adding task set 81.0 with 4 tasks 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:0 as TID 325 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:0 as 2594 bytes in 0 ms 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:1 as TID 326 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:1 as 2594 bytes in 0 ms 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:2 as TID 327 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:2 as 2594 bytes in 0 ms 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:3 as TID 328 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:3 as 2609 bytes in 1 ms 14/05/21 18:31:41 INFO Executor: Running task ID 326 14/05/21 18:31:41 INFO Executor: Running task ID 328 14/05/21 18:31:41 INFO Executor: Running task ID 327 14/05/21 18:31:41 INFO Executor: Running task ID 325 14/05/21
[jira] [Commented] (SPARK-2517) Remove as many compilation warning messages as possible
[ https://issues.apache.org/jira/browse/SPARK-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150753#comment-14150753 ] Sean Owen commented on SPARK-2517: -- [~rxin] I think you resolved this? I don't see these warnings anymore. (Hurrah.) Remove as many compilation warning messages as possible --- Key: SPARK-2517 URL: https://issues.apache.org/jira/browse/SPARK-2517 Project: Spark Issue Type: Improvement Reporter: Reynold Xin Assignee: Yin Huai Priority: Minor We should probably treat warnings as failures in Jenkins. Some examples: {code} [warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala:138: abstract type ExpectedAppender is unchecked since it is eliminated by erasure [warn] assert(appender.isInstanceOf[ExpectedAppender]) [warn] ^ [warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala:143: abstract type ExpectedRollingPolicy is unchecked since it is eliminated by erasure [warn] rollingPolicy.isInstanceOf[ExpectedRollingPolicy] [warn] ^ {code} {code} [warn] /scratch/rxin/spark/streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala:386: method connect in class IOManager is deprecated: use the new implementation in package akka.io instead [warn] override def preStart = IOManager(context.system).connect(new InetSocketAddress(port)) [warn] ^ [warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:207: non-variable type argument String in type pattern Map[String,Any] is unchecked since it is eliminated by erasure [warn] case (key: String, struct: Map[String, Any]) = { [warn] ^ [warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:238: non-variable type argument String in type pattern java.util.Map[String,Object] is unchecked since it is eliminated by erasure [warn] case map: java.util.Map[String, Object] = [warn] ^ [warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:243: non-variable type argument Object in type pattern java.util.List[Object] is unchecked since it is eliminated by erasure [warn] case list: java.util.List[Object] = [warn] ^ [warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:323: non-variable type argument String in type pattern Map[String,Any] is unchecked since it is eliminated by erasure [warn] case value: Map[String, Any] = toJsonObjectString(value) [warn] ^ [info] Compiling 2 Scala sources to /scratch/rxin/spark/repl/target/scala-2.10/test-classes... 
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:382: method mapWith in class RDD is deprecated: use mapPartitionsWithIndex [warn] val randoms = ones.mapWith( [warn]^ [warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:400: method flatMapWith in class RDD is deprecated: use mapPartitionsWithIndex and flatMap [warn] val randoms = ones.flatMapWith( [warn]^ [warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:421: method filterWith in class RDD is deprecated: use mapPartitionsWithIndex and filter [warn] val sample = ints.filterWith( [warn] ^ [warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala:76: method mapWith in class RDD is deprecated: use mapPartitionsWithIndex [warn] x.mapWith(x = x.toString)((x,y)=x + uc.op(y)) [warn] ^ [warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala:82: method filterWith in class RDD is deprecated: use mapPartitionsWithIndex and filter [warn] x.filterWith(x = x.toString)((x,y)=uc.pred(y)) [warn] ^ [warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/VectorSuite.scala:29: class Vector in package util is deprecated: Use Vectors.dense from Spark's mllib.linalg package instead. [warn] def verifyVector(vector: Vector, expectedLength: Int) = { [warn]^ [warn] one warning found {code} {code} [warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:238: non-variable type argument String in type pattern java.util.Map[String,Object] is unchecked since it is eliminated by
[jira] [Commented] (SPARK-3359) `sbt/sbt unidoc` doesn't work with Java 8
[ https://issues.apache.org/jira/browse/SPARK-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150778#comment-14150778 ] Sean Owen commented on SPARK-3359: -- Yeah I noticed this. The problem is that {{sbt-unidoc}} uses {{genjavadoc}}, and it looks like it generates invalid Java like the snippet you quote (top-level classes can't be private). That's almost all of the extra warnings. It seems to have been fixed in {{genjavadoc}} 0.8: https://github.com/typesafehub/genjavadoc/blob/v0.8/plugin/src/main/scala/com/typesafe/genjavadoc/AST.scala#L107 I can see how to update the plugin in the Maven build, but not yet in the SBT build. If someone who gets SBT can explain how to set {{unidocGenjavadocVersion}} to 0.8 in the {{genjavadocSettings}} that is inherited in {{project/SparkBuild.scala}}, I bet that would fix it. https://github.com/sbt/sbt-unidoc/blob/master/src/main/scala/sbtunidoc/Plugin.scala#L22 `sbt/sbt unidoc` doesn't work with Java 8 - Key: SPARK-3359 URL: https://issues.apache.org/jira/browse/SPARK-3359 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Xiangrui Meng Priority: Minor It seems that Java 8 is stricter on JavaDoc. I got many error messages like {code} [error] /Users/meng/src/spark-mengxr/core/target/java/org/apache/hadoop/mapred/SparkHadoopMapRedUtil.java:2: error: modifier private not allowed here [error] private abstract interface SparkHadoopMapRedUtil { [error] ^ {code} This is minor because we can always use Java 6/7 to generate the doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
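Something along these lines in {{project/SparkBuild.scala}} might work -- a hedged sketch only, assuming sbt-unidoc exposes {{unidocGenjavadocVersion}} as a settable key alongside {{genjavadocSettings}} as in the source linked above:
{code}
// In project/SparkBuild.scala: reuse the plugin's settings but pin genjavadoc to 0.8.
import sbtunidoc.Plugin._
import sbtunidoc.Plugin.UnidocKeys._

lazy val javadocSettings = genjavadocSettings ++ Seq(
  unidocGenjavadocVersion := "0.8"
)
{code}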
[jira] [Commented] (SPARK-3714) Spark workflow scheduler
[ https://issues.apache.org/jira/browse/SPARK-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151422#comment-14151422 ] Sean Owen commented on SPARK-3714: -- Another meta-question for everyone: at what point should a project like this simply be a separate add-on project? For example, Oozie is a stand-alone project. Not everything needs to happen directly under the Spark umbrella, which is already broad. One upside to including it is that it perhaps gets more attention. Spark is then forced to maintain it and keep it compatible, which is also a downside I suppose. There is also the effect that you create an official workflow engine and discourage others. I am more asking the question than suggesting an answer, but my reaction was that this could live outside Spark just fine. Spark workflow scheduler Key: SPARK-3714 URL: https://issues.apache.org/jira/browse/SPARK-3714 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Egor Pakhomov Priority: Minor [Design doc | https://docs.google.com/document/d/1q2Q8Ux-6uAkH7wtLJpc3jz-GfrDEjlbWlXtf20hvguk/edit?usp=sharing] The Spark stack is currently hard to use in production processes due to the lack of the following features: * scheduling Spark jobs * retrying a failed Spark job in a big pipeline * sharing a context among jobs in a pipeline * queueing jobs. A typical use case for such a platform would be: wait for new data, process the new data, learn ML models on it, compare the new model with the previous one, and on success overwrite the current production model in its HDFS directory with the new one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3274) Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream
[ https://issues.apache.org/jira/browse/SPARK-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151432#comment-14151432 ] Sean Owen commented on SPARK-3274: -- I don't think that's the same thing. It is just saying you are reading a {{SequenceFile}} of {{Text}} and then pretending they are Strings. Are you sure the first {{return}} statement works? They will both work as expected if you just call {{.toString()}} on the {{Text}} objects you are actually operating on. Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream -- Key: SPARK-3274 URL: https://issues.apache.org/jira/browse/SPARK-3274 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2 Reporter: Jack Hu Reproduce code:
{code}
scontext
  .socketTextStream("localhost", 1)
  .mapToPair(new PairFunction<String, String, String>() {
    public Tuple2<String, String> call(String arg0) throws Exception {
      return new Tuple2<String, String>("1", arg0);
    }
  })
  .foreachRDD(new Function2<JavaPairRDD<String, String>, Time, Void>() {
    public Void call(JavaPairRDD<String, String> v1, Time v2) throws Exception {
      System.out.println(v2.toString() + ": " + v1.collectAsMap().toString());
      return null;
    }
  });
{code}
Exception: java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lscala.Tuple2; at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:447) at org.apache.spark.api.java.JavaPairRDD.collectAsMap(JavaPairRDD.scala:464) at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:90) at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:88) at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282) at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
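In Scala terms, the {{.toString()}} advice above looks roughly like this -- a hedged sketch, with {{path}} as a placeholder for an actual SequenceFile location:
{code}
import org.apache.hadoop.io.Text

// Convert the mutable, reused Text objects to plain Strings as soon as they are
// read, before any collect/collectAsMap, so later code really works with Strings.
val strings = sc.sequenceFile(path, classOf[Text], classOf[Text])
  .map { case (k, v) => (k.toString, v.toString) }
{code}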
[jira] [Resolved] (SPARK-2159) Spark shell exit() does not stop SparkContext
[ https://issues.apache.org/jira/browse/SPARK-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2159. -- Resolution: Won't Fix Fix Version/s: (was: 1.2.0) The discussion in the PR suggests this is WontFix. https://github.com/apache/spark/pull/1230#issuecomment-54045637 Spark shell exit() does not stop SparkContext - Key: SPARK-2159 URL: https://issues.apache.org/jira/browse/SPARK-2159 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or Priority: Minor If you type exit() in spark shell, it is equivalent to a Ctrl+C and does not stop the SparkContext. This is used very commonly to exit a shell, and it would be good if it is equivalent to Ctrl+D instead, which does stop the SparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2643) Stages web ui has ERROR when pool name is None
[ https://issues.apache.org/jira/browse/SPARK-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2643. -- Resolution: Fixed Discussion suggests this was fixed by a related change: https://github.com/apache/spark/pull/1854#issuecomment-55061571 Stages web ui has ERROR when pool name is None -- Key: SPARK-2643 URL: https://issues.apache.org/jira/browse/SPARK-2643 Project: Spark Issue Type: Bug Components: Web UI Reporter: YanTang Zhai Priority: Minor 14/07/23 16:01:44 WARN servlet.ServletHandler: /stages/ java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:313) at scala.None$.get(Option.scala:311) at org.apache.spark.ui.jobs.StageTableBase.stageRow(StageTable.scala:132) at org.apache.spark.ui.jobs.StageTableBase.org$apache$spark$ui$jobs$StageTableBase$$renderStageRow(StageTable.scala:150) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:52) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:52) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:61) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:61) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077) at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980) at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980) at scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969) at scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969) at scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.xml.NodeBuffer.$amp$plus(NodeBuffer.scala:38) at scala.xml.NodeBuffer.$amp$plus(NodeBuffer.scala:40) at org.apache.spark.ui.jobs.StageTableBase.stageTable(StageTable.scala:60) at org.apache.spark.ui.jobs.StageTableBase.toNodeSeq(StageTable.scala:52) at org.apache.spark.ui.jobs.JobProgressPage.render(JobProgressPage.scala:91) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:65) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:65) at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:70) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:370) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) at 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667) at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at
[jira] [Resolved] (SPARK-1208) after some hours of working the :4040 monitoring UI stops working.
[ https://issues.apache.org/jira/browse/SPARK-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1208. -- Resolution: Fixed This appears to be a similar, if not the same issue, as in SPARK-2643. The discussion in the PR indicates this was resolved by a subsequent change: https://github.com/apache/spark/pull/1854#issuecomment-55061571 after some hours of working the :4040 monitoring UI stops working. -- Key: SPARK-1208 URL: https://issues.apache.org/jira/browse/SPARK-1208 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 0.9.0 Reporter: Tal Sliwowicz This issue is inconsistent, but it did not exist in prior versions. The Driver app otherwise works normally. The log file below is from the driver. 2014-03-09 07:24:55,837 WARN [qtp1187052686-17453] AbstractHttpConnection - /stages/ java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:313) at scala.None$.get(Option.scala:311) at org.apache.spark.ui.jobs.StageTable.org$apache$spark$ui$jobs$StageTable$$stageRow(StageTable.scala:114) at org.apache.spark.ui.jobs.StageTable$$anonfun$toNodeSeq$1.apply(StageTable.scala:39) at org.apache.spark.ui.jobs.StageTable$$anonfun$toNodeSeq$1.apply(StageTable.scala:39) at org.apache.spark.ui.jobs.StageTable$$anonfun$stageTable$1.apply(StageTable.scala:57) at org.apache.spark.ui.jobs.StageTable$$anonfun$stageTable$1.apply(StageTable.scala:57) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.ui.jobs.StageTable.stageTable(StageTable.scala:57) at org.apache.spark.ui.jobs.StageTable.toNodeSeq(StageTable.scala:39) at org.apache.spark.ui.jobs.IndexPage.render(IndexPage.scala:81) at org.apache.spark.ui.jobs.JobProgressUI$$anonfun$getHandlers$3.apply(JobProgressUI.scala:59) at org.apache.spark.ui.jobs.JobProgressUI$$anonfun$getHandlers$3.apply(JobProgressUI.scala:59) at org.apache.spark.ui.JettyUtils$$anon$1.handle(JettyUtils.scala:61) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1040) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:976) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:363) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:483) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:920) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:982) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:635) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:628) at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:662) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3203) ClassNotFoundException in spark-shell with Cassandra
[ https://issues.apache.org/jira/browse/SPARK-3203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3203: - Summary: ClassNotFoundException in spark-shell with Cassandra (was: ClassNotFound Exception) ClassNotFoundException in spark-shell with Cassandra Key: SPARK-3203 URL: https://issues.apache.org/jira/browse/SPARK-3203 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Environment: Ubuntu 12.04, openjdk 64 bit 7u65 Reporter: Rohit Kumar I am using Spark with as processing engine over cassandra. I have only one master and a worker node. I am executing following code in spark-shell : sc.stop import org.apache.spark.SparkContext import org.apache.spark.SparkConf import com.datastax.spark.connector._ val conf = new SparkConf(true).set(spark.cassandra.connection.host, 127.0.0.1) val sc = new SparkContext(spark://L-BXP44Z1:7077, Cassandra Connector Test, conf) val rdd = sc.cassandraTable(test, kv) println(rdd.map(_.getInt(value)).sum) I am getting following error: 14/08/25 18:47:17 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 14/08/25 18:49:39 INFO CoarseGrainedExecutorBackend: Got assigned task 0 14/08/25 18:49:39 INFO Executor: Running task ID 0 14/08/25 18:49:39 ERROR Executor: Exception in task ID 0 java.lang.ClassNotFoundException: $line29.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at
[jira] [Updated] (SPARK-1381) Spark to Shark direct streaming
[ https://issues.apache.org/jira/browse/SPARK-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1381: - Priority: Major (was: Blocker) It sounds like this is WontFix at this point, if there was a problem to begin with, as Shark is deprecated. Spark to Shark direct streaming --- Key: SPARK-1381 URL: https://issues.apache.org/jira/browse/SPARK-1381 Project: Spark Issue Type: Question Components: Documentation, Examples, Input/Output, Java API, Spark Core Affects Versions: 0.8.1 Reporter: Abhishek Tripathi Labels: performance Hi, I'm trying to push data coming from Spark Streaming to a Shark cached table. I thought of using the JDBC API, but Shark (0.81) does not support a direct insert statement, i.e. insert into emp values(2, Apia). I don't want to store the Spark Streaming output in HDFS and then copy that data to a Shark table. Can somebody please help: 1. How can I directly point Spark Streaming data to a Shark table/cached table? Or the other way around, how can Shark pick up data directly from Spark Streaming? 2. Does Shark 0.81 have a direct insert statement that does not refer to another table? This is really stopping us from using Spark further. We need your assistance urgently. Thanks in advance. Abhishek -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1381) Spark to Shark direct streaming
[ https://issues.apache.org/jira/browse/SPARK-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1381. -- Resolution: Won't Fix Spark to Shark direct streaming --- Key: SPARK-1381 URL: https://issues.apache.org/jira/browse/SPARK-1381 Project: Spark Issue Type: Question Components: Documentation, Examples, Input/Output, Java API, Spark Core Affects Versions: 0.8.1 Reporter: Abhishek Tripathi Labels: performance Hi, I'm trying to push data coming from Spark Streaming to a Shark cached table. I thought of using the JDBC API, but Shark (0.81) does not support a direct insert statement, i.e. insert into emp values(2, Apia). I don't want to store the Spark Streaming output in HDFS and then copy that data to a Shark table. Can somebody please help: 1. How can I directly point Spark Streaming data to a Shark table/cached table? Or the other way around, how can Shark pick up data directly from Spark Streaming? 2. Does Shark 0.81 have a direct insert statement that does not refer to another table? This is really stopping us from using Spark further. We need your assistance urgently. Thanks in advance. Abhishek -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1313) Shark- JDBC driver
[ https://issues.apache.org/jira/browse/SPARK-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1313: - Priority: Minor (was: Blocker) Issue Type: Question (was: Task) Shark- JDBC driver --- Key: SPARK-1313 URL: https://issues.apache.org/jira/browse/SPARK-1313 Project: Spark Issue Type: Question Components: Documentation, Examples, Java API Reporter: Abhishek Tripathi Priority: Minor Labels: Hive,JDBC, Shark, Hi, I'm trying to find a JDBC (or any other) driver that can connect to Shark from Java and execute Shark/Hive queries. Can you please advise if such a connector/driver is available? Thanks Abhishek -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1313) Shark- JDBC driver
[ https://issues.apache.org/jira/browse/SPARK-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1313. -- Resolution: Not a Problem This looks like it was a question more than anything, and was answered. Shark- JDBC driver --- Key: SPARK-1313 URL: https://issues.apache.org/jira/browse/SPARK-1313 Project: Spark Issue Type: Question Components: Documentation, Examples, Java API Reporter: Abhishek Tripathi Priority: Minor Labels: Hive,JDBC, Shark, Hi, I'm trying to find a JDBC (or any other) driver that can connect to Shark from Java and execute Shark/Hive queries. Can you please advise if such a connector/driver is available? Thanks Abhishek -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1884) Shark failed to start
[ https://issues.apache.org/jira/browse/SPARK-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1884. -- Resolution: Won't Fix This appears to be a protobuf version mismatch, which suggests Shark is being used with an unsupported version of Hadoop. As Shark is deprecated and unlikely to take steps to support anything else -- and because there is a sort of clear path to workaround here if one cared to -- I think this is a WontFix too? Shark failed to start - Key: SPARK-1884 URL: https://issues.apache.org/jira/browse/SPARK-1884 Project: Spark Issue Type: Bug Affects Versions: 0.9.1 Environment: ubuntu 14.04, spark 0.9.1, hive 0.13.0, hadoop 2.4.0 (stand alone), scala 2.11.0 Reporter: Wei Cui Priority: Blocker the hadoop, hive, spark works fine. when start the shark, it failed with the following messages: Starting the Shark Command Line Client 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative 14/05/19 16:47:22 WARN conf.Configuration: org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 14/05/19 16:47:22 WARN conf.Configuration: org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to override final parameter: mapreduce.cluster.local.dir; Ignoring. 14/05/19 16:47:22 WARN conf.Configuration: org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 14/05/19 16:47:22 WARN conf.Configuration: org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to override final parameter: mapreduce.cluster.temp.dir; Ignoring. Logging initialized using configuration in jar:file:/usr/local/shark/lib_managed/jars/edu.berkeley.cs.shark/hive-common/hive-common-0.11.0-shark-0.9.1.jar!/hive-log4j.properties Hive history file=/tmp/root/hive_job_log_root_14857@ubuntu_201405191647_897494215.txt 6.004: [GC 279616K-18440K(1013632K), 0.0438980 secs] 6.445: [Full GC 59125K-7949K(1013632K), 0.0685160 secs] Reloading cached RDDs from previous Shark sessions... 
(use -skipRddReload flag to skip reloading) 7.535: [Full GC 104136K-13059K(1013632K), 0.0885820 secs] 8.459: [Full GC 61237K-18031K(1013632K), 0.0820400 secs] 8.662: [Full GC 29832K-8958K(1013632K), 0.0869700 secs] 8.751: [Full GC 13433K-8998K(1013632K), 0.0856520 secs] 10.435: [Full GC 72246K-14140K(1013632K), 0.1797530 secs] Exception in thread main org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1072) at shark.memstore2.TableRecovery$.reloadRdds(TableRecovery.scala:49) at shark.SharkCliDriver.init(SharkCliDriver.scala:283) at shark.SharkCliDriver$.main(SharkCliDriver.scala:162) at shark.SharkCliDriver.main(SharkCliDriver.scala) Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1139) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.init(RetryingMetaStoreClient.java:51) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:61) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2288) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2299) at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1070) ... 4 more Caused by:
[jira] [Commented] (SPARK-3725) Link to building spark returns a 404
[ https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152050#comment-14152050 ] Sean Owen commented on SPARK-3725: -- Yes of course, it's already in the repo and has been for a while. It was just renamed with a redirect from the old URL. But, that update hasn't hit the public site yet. Link to building spark returns a 404 Key: SPARK-3725 URL: https://issues.apache.org/jira/browse/SPARK-3725 Project: Spark Issue Type: Documentation Reporter: Anant Daksh Asthana Priority: Minor Original Estimate: 1m Remaining Estimate: 1m The README.md link to Building Spark returns a 404 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3725) Link to building spark returns a 404
[ https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152138#comment-14152138 ] Sean Owen commented on SPARK-3725: -- No, that links to the raw markdown. Truly, the fix is to rebuild the site. The source is fine. Link to building spark returns a 404 Key: SPARK-3725 URL: https://issues.apache.org/jira/browse/SPARK-3725 Project: Spark Issue Type: Documentation Reporter: Anant Daksh Asthana Priority: Minor Original Estimate: 1m Remaining Estimate: 1m The README.md link to Building Spark returns a 404 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3730) Any one else having building spark recently
[ https://issues.apache.org/jira/browse/SPARK-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152140#comment-14152140 ] Sean Owen commented on SPARK-3730: -- (The profile is hadoop-2.3 but that's not the issue.) I have seen this too and it's a {{scalac}} bug as far as I can tell, as you can see from the stack trace. It's not a Spark issue. Any one else having building spark recently --- Key: SPARK-3730 URL: https://issues.apache.org/jira/browse/SPARK-3730 Project: Spark Issue Type: Question Reporter: Anant Daksh Asthana Priority: Minor I get an assertion error in spark/core/src/main/scala/org/apache/spark/HttpServer.scala while trying to build. I am building using mvn -Pyarn -PHadoop-2.3 -DskipTests -Phive clean package Here is the error i get http://pastebin.com/Shi43r53 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152912#comment-14152912 ] Sean Owen commented on SPARK-3732: -- FWIW, I was also surprised in the past that there is no way to submit a job programmatically. That would be great for embedding Spark. An option seems like overkill here. System.exit() is not a great idea in general and I agree that removing it is better. I can confirm that the JVM exit status is 0 on success and 1 on exception anyway, so this doesn't even change semantics. That is, you still get exit status 1 if an exception is thrown. The stack trace is also printed, so printing the exception is also redundant and the try block can go. Yarn Client: Add option to NOT System.exit() at end of main() - Key: SPARK-3732 URL: https://issues.apache.org/jira/browse/SPARK-3732 Project: Spark Issue Type: Improvement Affects Versions: 1.1.0 Reporter: Sotos Matzanas Original Estimate: 1h Remaining Estimate: 1h We would like to add the ability to create and submit Spark jobs programmatically via Scala/Java. We have found a way to hack this and submit jobs via Yarn, but since org.apache.spark.deploy.yarn.Client.main() exits with either 0 or 1 in the end, this would mean the exit of our own program as well. We would like to add an optional Spark conf param to NOT exit at the end of main(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.
[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156311#comment-14156311 ] Sean Owen commented on SPARK-3764: -- This is correct and as intended. Without any additional flags, yes, the version of Hadoop referenced by Spark would be 1.0.4. You should not rely on this though. If your app uses Spark but not Hadoop, it's not relevant as you are not packaging Spark or Hadoop dependencies in your app. If you use Spark and Hadoop APIs, you need to explicitly depend on the version of Hadoop you use on your cluster (but still not bundle with your app). Invalid dependencies of artifacts in Maven Central Repository. -- Key: SPARK-3764 URL: https://issues.apache.org/jira/browse/SPARK-3764 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Takuya Ueshin While testing my spark applications locally using spark artifacts downloaded from Maven Central, the following exception was thrown: {quote} ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This is because the hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2. I guess the spark artifacts in Maven Central were built against hadoop-2 with Maven, but the depending version of hadoop in {{pom.xml}} remains 1.0.4, so the hadoop version mismatch is happend. FYI: sbt seems to publish 'effective pom'-like pom file, so the dependencies are correctly resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
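In sbt terms, the advice above amounts to something like the following -- an illustrative sketch, with versions standing in for whatever your cluster actually runs:
{code}
libraryDependencies ++= Seq(
  // Spark itself is provided by the cluster; do not bundle it in your application jar.
  "org.apache.spark"  %% "spark-core"    % "1.1.0" % "provided",
  // If you also use Hadoop APIs directly, depend explicitly on your cluster's Hadoop
  // version rather than relying on Spark's transitive 1.0.4 default.
  "org.apache.hadoop" %  "hadoop-client" % "2.4.0" % "provided"
)
{code}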
[jira] [Commented] (SPARK-2809) update chill to version 0.5.0
[ https://issues.apache.org/jira/browse/SPARK-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156362#comment-14156362 ] Sean Owen commented on SPARK-2809: -- PS chill 0.5.0 is the first to support Scala 2.11, so now this is actionable. http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22chill_2.11%22 update chill to version 0.5.0 - Key: SPARK-2809 URL: https://issues.apache.org/jira/browse/SPARK-2809 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati Assignee: Guoqiang Li First twitter chill_2.11 0.4 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
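For reference, the coordinates of the artifact linked above, expressed as a hypothetical sbt dependency line (the actual Spark build manages this through its own Maven/sbt configuration; with scalaVersion 2.11 this resolves chill_2.11):
{code}
libraryDependencies += "com.twitter" %% "chill" % "0.5.0"
{code}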
[jira] [Commented] (SPARK-1834) NoSuchMethodError when invoking JavaPairRDD.reduce() in Java
[ https://issues.apache.org/jira/browse/SPARK-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156477#comment-14156477 ] Sean Owen commented on SPARK-1834: -- Weird, I can reproduce this. I have a new test case for {{JavaAPISuite}} and am investigating. It compiles fine but fails at runtime. I sense Scala shenanigans. NoSuchMethodError when invoking JavaPairRDD.reduce() in Java Key: SPARK-1834 URL: https://issues.apache.org/jira/browse/SPARK-1834 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1 Environment: Redhat Linux, Java 7, Hadoop 2.2, Scala 2.10.4 Reporter: John Snodgrass I get a java.lang.NoSuchMethod error when invoking JavaPairRDD.reduce(). Here is the partial stack trace: Exception in thread "main" java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:39) at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2; at JavaPairRDDReduceTest.main(JavaPairRDDReduceTest.java:49)... I'm using Spark 0.9.1. I checked to ensure that I'm compiling with the same version of Spark as I am running on the cluster. The reduce() method works fine with JavaRDD, just not with JavaPairRDD. Here is a code snippet that exhibits the problem: ArrayList<Integer> array = new ArrayList(); for (int i = 0; i < 10; ++i) { array.add(i); } JavaRDD<Integer> rdd = javaSparkContext.parallelize(array); JavaPairRDD<String, Integer> testRDD = rdd.map(new PairFunction<Integer, String, Integer>() { @Override public Tuple2<String, Integer> call(Integer t) throws Exception { return new Tuple2("" + t, t); } }).cache(); testRDD.reduce(new Function2<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple2<String, Integer>>() { @Override public Tuple2<String, Integer> call(Tuple2<String, Integer> arg0, Tuple2<String, Integer> arg1) throws Exception { return new Tuple2(arg0._1 + arg1._1, arg0._2 * 10 + arg0._2); } }); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1834) NoSuchMethodError when invoking JavaPairRDD.reduce() in Java
[ https://issues.apache.org/jira/browse/SPARK-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156477#comment-14156477 ] Sean Owen edited comment on SPARK-1834 at 10/2/14 12:46 PM: Weird, I can reproduce this. It compiles fine but fails at runtime. Another example that doesn't even use lambdas:
{code}
@Test
public void pairReduce() {
  JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 1, 2, 3, 5, 8, 13));
  JavaPairRDD<Integer, Integer> pairRDD = rdd.mapToPair(
    new PairFunction<Integer, Integer, Integer>() {
      @Override
      public Tuple2<Integer, Integer> call(Integer i) {
        return new Tuple2<Integer, Integer>(i, i + 1);
      }
    });
  // See SPARK-1834
  Tuple2<Integer, Integer> reduced = pairRDD.reduce(
    new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
      @Override
      public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> t1, Tuple2<Integer, Integer> t2) {
        return new Tuple2<Integer, Integer>(t1._1() + t2._1(), t1._2() + t2._2());
      }
    });
  Assert.assertEquals(33, reduced._1().intValue());
  Assert.assertEquals(40, reduced._2().intValue());
}
{code}
but...
{code}
java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2;
{code}
I decompiled the class and it really looks like the method is there with the expected signature:
{code}
public scala.Tuple2<K, V> reduce(org.apache.spark.api.java.function.Function2<scala.Tuple2<K, V>, scala.Tuple2<K, V>, scala.Tuple2<K, V>>);
{code}
Color me pretty confused. was (Author: srowen): Weird, I can reproduce this. I have a new test case for {{JavaAPISuite}} and am investigating. It compiles fine but fails at runtime. I sense Scala shenanigans. NoSuchMethodError when invoking JavaPairRDD.reduce() in Java Key: SPARK-1834 URL: https://issues.apache.org/jira/browse/SPARK-1834 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1 Environment: Redhat Linux, Java 7, Hadoop 2.2, Scala 2.10.4 Reporter: John Snodgrass I get a java.lang.NoSuchMethod error when invoking JavaPairRDD.reduce(). Here is the partial stack trace: Exception in thread "main" java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:39) at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2; at JavaPairRDDReduceTest.main(JavaPairRDDReduceTest.java:49)... I'm using Spark 0.9.1. I checked to ensure that I'm compiling with the same version of Spark as I am running on the cluster. The reduce() method works fine with JavaRDD, just not with JavaPairRDD. 
Here is a code snippet that exhibits the problem: ArrayList<Integer> array = new ArrayList(); for (int i = 0; i < 10; ++i) { array.add(i); } JavaRDD<Integer> rdd = javaSparkContext.parallelize(array); JavaPairRDD<String, Integer> testRDD = rdd.map(new PairFunction<Integer, String, Integer>() { @Override public Tuple2<String, Integer> call(Integer t) throws Exception { return new Tuple2("" + t, t); } }).cache(); testRDD.reduce(new Function2<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple2<String, Integer>>() { @Override public Tuple2<String, Integer> call(Tuple2<String, Integer> arg0, Tuple2<String, Integer> arg1) throws Exception { return new Tuple2(arg0._1 + arg1._1, arg0._2 * 10 + arg0._2); } }); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.
[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156581#comment-14156581 ] Sean Owen commented on SPARK-3764: -- The artifacts themselves don't contain any Hadoop code. The default disposition of the pom would link to Hadoop 1, but apps are not meant to depend on this (this is generally good Maven practice). Yes, you always need to add Hadoop dependencies if you use Hadoop APIs. That's not specific to Spark. In fact, you will want to mark Spark and Hadoop as provided dependencies when making an app for use with spark-submit. You can use the Spark artifacts to build a Spark app that works with Hadoop 2 or Hadoop 1. The instructions you see are really about creating a build of Spark itself to deploy on a cluster, rather than an app for Spark. Invalid dependencies of artifacts in Maven Central Repository. -- Key: SPARK-3764 URL: https://issues.apache.org/jira/browse/SPARK-3764 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Takuya Ueshin While testing my spark applications locally using spark artifacts downloaded from Maven Central, the following exception was thrown: {quote} ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This is because the hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2. I guess the spark artifacts in Maven Central were built against hadoop-2 with Maven, but the depending version of hadoop in {{pom.xml}} remains 1.0.4, so the hadoop version mismatch is happend. FYI: sbt seems to publish 'effective pom'-like pom file, so the dependencies are correctly resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
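To make the "provided" advice above concrete, here is a hypothetical build.sbt fragment for an app that will be launched with spark-submit. The versions are placeholders only and should match whatever Spark and Hadoop are actually deployed on the cluster.
{code}
// Hypothetical sketch; artifact versions are examples only
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"    % "1.1.0" % "provided",
  "org.apache.hadoop" % "hadoop-client" % "2.4.0" % "provided"
)
{code}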
[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.
[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156723#comment-14156723 ] Sean Owen commented on SPARK-3764: -- I'm not sure what you mean. Spark compiles versus most versions of Hadoop 1 and 2. You can see the profiles in the build that help support this. These are however not relevant to someone that is just building a Spark app. Invalid dependencies of artifacts in Maven Central Repository. -- Key: SPARK-3764 URL: https://issues.apache.org/jira/browse/SPARK-3764 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Takuya Ueshin While testing my spark applications locally using spark artifacts downloaded from Maven Central, the following exception was thrown: {quote} ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This is because the hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2. I guess the spark artifacts in Maven Central were built against hadoop-2 with Maven, but the depending version of hadoop in {{pom.xml}} remains 1.0.4, so the hadoop version mismatch is happend. FYI: sbt seems to publish 'effective pom'-like pom file, so the dependencies are correctly resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3769) SparkFiles.get gives me the wrong fully qualified path
[ https://issues.apache.org/jira/browse/SPARK-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157156#comment-14157156 ] Sean Owen commented on SPARK-3769: -- My understanding is that you execute:
{code}
sc.addFile("/opt/tom/SparkFiles.sas");
...
SparkFiles.get("SparkFiles.sas");
{code}
I would not expect the key used by remote workers to encode the location on the driver that the file came from. The path may not be absolute in all cases anyway. I can see the argument that it feels like both should be the same key, but really the key being set is the file name, not the path. You don't have to parse it by hand though. Usually you might do something like this anyway:
{code}
File myFile = new File(args[1]);
sc.addFile(myFile.getAbsolutePath());
String fileName = myFile.getName();
...
SparkFiles.get(fileName);
{code}
AFAIK this is as intended. SparkFiles.get gives me the wrong fully qualified path -- Key: SPARK-3769 URL: https://issues.apache.org/jira/browse/SPARK-3769 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2, 1.1.0 Environment: linux host, and linux grid. Reporter: Tom Weber Priority: Minor My spark pgm running on my host, (submitting work to my grid). JavaSparkContext sc = new JavaSparkContext(conf); final String path = args[1]; sc.addFile(path); /* args[1] = /opt/tom/SparkFiles.sas */ The log shows: 14/10/02 16:07:14 INFO Utils: Copying /opt/tom/SparkFiles.sas to /tmp/spark-4c661c3f-cb57-4c9f-a0e9-c2162a89db77/SparkFiles.sas 14/10/02 16:07:15 INFO SparkContext: Added file /opt/tom/SparkFiles.sas at http://10.20.xx.xx:49587/files/SparkFiles.sas with timestamp 1412280434986 those are paths on my host machine. The location that this file gets on grid nodes is: /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/SparkFiles.sas While the call to get the path in my code that runs in my mapPartitions function on the grid nodes is: String pgm = SparkFiles.get(path); And this returns the following string: /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/./opt/tom/SparkFiles.sas So, am I expected to take the qualified path that was given to me and parse it to get only the file name at the end, and then concatenate that to what I get from the SparkFiles.getRootDirectory() call in order to get this to work? Or pass only the parsed file name to the SparkFiles.get method? Seems as though I should be able to pass the same file specification to both sc.addFile() and SparkFiles.get() and get the correct location of the file. Thanks, Tom -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.
[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3764. -- Resolution: Not a Problem Invalid dependencies of artifacts in Maven Central Repository. -- Key: SPARK-3764 URL: https://issues.apache.org/jira/browse/SPARK-3764 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Takuya Ueshin While testing my spark applications locally using spark artifacts downloaded from Maven Central, the following exception was thrown: {quote} ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This is because the hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2. I guess the spark artifacts in Maven Central were built against hadoop-2 with Maven, but the depending version of hadoop in {{pom.xml}} remains 1.0.4, so the hadoop version mismatch is happend. FYI: sbt seems to publish 'effective pom'-like pom file, so the dependencies are correctly resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3794) Building spark core fails with specific hadoop version
[ https://issues.apache.org/jira/browse/SPARK-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159524#comment-14159524 ] Sean Owen commented on SPARK-3794: -- The real problem here is that commons-io is not a dependency of Spark, and should not be, but it began to be used a few days ago in commit https://github.com/apache/spark/commit/cf1d32e3e1071829b152d4b597bf0a0d7a5629a2 for SPARK-1860. So it is accidentally depending on the version of Commons IO brought in by third party dependencies. I will propose a PR that removes this usage in favor of Guava or Java APIs. Building spark core fails with specific hadoop version -- Key: SPARK-3794 URL: https://issues.apache.org/jira/browse/SPARK-3794 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5 Reporter: cocoatomo Labels: spark At the commit cf1d32e3e1071829b152d4b597bf0a0d7a5629a2, building spark core result in compilation error when we specify some hadoop versions. To reproduce this issue, we should execute following command with hadoop.version=1.1.0, 1.1.1, 1.1.2, 1.2.0, 1.2.1, or 2.2.0. {noformat} $ cd ./core $ mvn -Dhadoop.version=hadoop.version -DskipTests clean compile ... [ERROR] /Users/tomohiko/MyRepos/Scala/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:720: value listFilesAndDirs is not a member of object org.apache.commons.io.FileUtils [ERROR] val files = FileUtils.listFilesAndDirs(dir, TrueFileFilter.TRUE, TrueFileFilter.TRUE) [ERROR] ^ {noformat} Because that compilation uses commons-io version 2.1 and FileUtils#listFilesAndDirs method was added at commons-io version 2.2, this compilation always fails. FileUtils#listFilesAndDirs → http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/FileUtils.html#listFilesAndDirs%28java.io.File,%20org.apache.commons.io.filefilter.IOFileFilter,%20org.apache.commons.io.filefilter.IOFileFilter%29 Because a hadoop-client in those problematic version depends on commons-io 2.1 not 2.4, we should have assumption that commons-io is version 2.1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
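The kind of replacement suggested above could look roughly like the following sketch, which walks a directory tree using only java.io. This is illustrative only, assuming a simple recursive helper, and is not the actual patch proposed for SPARK-3794.
{code}
import java.io.File

// List a directory, all subdirectories, and all files underneath it,
// without relying on commons-io's FileUtils.listFilesAndDirs (2.2+ only).
def listFilesAndDirs(dir: File): Seq[File] = {
  val children = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
  dir +: children.flatMap { f =>
    if (f.isDirectory) listFilesAndDirs(f) else Seq(f)
  }
}
{code}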
[jira] [Commented] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159573#comment-14159573 ] Sean Owen commented on SPARK-3561: -- I'd be interested to see a more specific motivating use case. Is this about using Tez, for example? Where does it help to stack Spark on Tez on YARN, or MR2, etc.? Spark Core and Tez overlap, to be sure, and I'm not sure how much value it adds to run one on the other. Kind of like running Oracle on MySQL or something. Whatever it is: might it not be more natural to integrate the feature into Spark itself? It would be great if this were all just a matter of one extra trait and interface. In practice I suspect there are a number of hidden assumptions throughout the code that may leak through attempts at this abstraction. I am definitely asking rather than asserting, and am curious to see more specifics about the upside. Expose pluggable architecture to facilitate native integration with third-party execution environments. --- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark _integrates with external resource-managing platforms_ such as Apache Hadoop YARN and Mesos to facilitate execution of the Spark DAG in a distributed environment provided by those platforms. However, this integration is tightly coupled within Spark's implementation, making it rather difficult to introduce integration points with other resource-managing platforms without constant modifications to Spark's core (see comments below for more details). In addition, Spark _does not provide any integration points to third-party **DAG-like** and **DAG-capable** execution environments_ native to those platforms, thus limiting access to some of their native features (e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management and monitoring and more) as well as specialization aspects of such execution environments (open source and proprietary). As an example, the inability to gain access to such features is starting to affect Spark's viability in large-scale, batch and/or ETL applications. Introducing a pluggable architecture would solve both of the issues mentioned above, ultimately benefitting Spark's technology and community by allowing it to venture into co-existence and collaboration with a variety of existing Big Data platforms as well as the ones yet to come to the market. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - as a non-public api (@DeveloperAPI). The trait will define only 4 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be accessed by SparkContext via a master URL such as _execution-context:foo.bar.MyJobExecutionContext_, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to such an implementation and ensuring binary and source compatibility with older versions of Spark. An integrator will now have an option to provide a custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext. Please see the attached design doc and pull request for more details. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
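For readers skimming the proposal above, a rough Scala sketch of a trait with the four named operations might look like the following. The signatures are simplified and illustrative only; they are not taken from the attached design doc or pull request.
{code}
import scala.reflect.ClassTag
import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Illustrative only: parameter lists are reduced to the essentials
trait JobExecutionContext {
  def hadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def newAPIHadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]
  def runJob[T, U: ClassTag](
      sc: SparkContext,
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U): Array[U]
}
{code}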
[jira] [Commented] (SPARK-3803) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents
[ https://issues.apache.org/jira/browse/SPARK-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159587#comment-14159587 ] Sean Owen commented on SPARK-3803: -- I agree with your assessment. It would take some work, though not terribly much, to rewrite this method to correctly handle A with more than 46340 columns. At n = 46340, the Gramian already consumes about 8.5GB of memory, so it's kinda getting big to realistically use in core anyway. At the least, an error should be raised if n is too large. Any one else think this should be supported though? Would be nice, but, practically helpful? ArrayIndexOutOfBoundsException found in executing computePrincipalComponents Key: SPARK-3803 URL: https://issues.apache.org/jira/browse/SPARK-3803 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Masaru Dobashi When I executed computePrincipalComponents method of RowMatrix, I got java.lang.ArrayIndexOutOfBoundsException. {code} 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at RDDFunctions.scala:111 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161 org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157) scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) scala.collection.AbstractIterator.aggregate(Iterator.scala:1157) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} The RowMatrix instance was generated from the result of TF-IDF like 
the following.
{code}
scala> val hashingTF = new HashingTF()
scala> val tf = hashingTF.transform(texts)
scala> import org.apache.spark.mllib.feature.IDF
scala> tf.cache()
scala> val idf = new IDF().fit(tf)
scala> val tfidf: RDD[Vector] = idf.transform(tf)
scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
scala> val mat = new RowMatrix(tfidf)
scala> val pc = mat.computePrincipalComponents(2)
{code}
I think this was because I created HashingTF instance with default numFeatures and Array is used in RowMatrix#computeGramianMatrix method like the following.
{code}
/**
 * Computes the Gramian matrix `A^T A`.
 */
def computeGramianMatrix(): Matrix = {
  val n = numCols().toInt
  val nt: Int = n * (n + 1) / 2
  // Compute the upper triangular part of the gram matrix.
  val GU = rows.treeAggregate(new BDV[Double](new Array[Double](nt)))(
    seqOp = (U, v) =>
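To make the 46340-column limit mentioned in the comment above concrete, here is a small standalone arithmetic sketch (not MLlib code). The upper-triangle length {{n * (n + 1) / 2}} is computed with Int arithmetic, so the intermediate product {{n * (n + 1)}} overflows Int.MaxValue as soon as n reaches 46341, and with HashingTF's default 2^20 features the true length would not fit in any Int-indexed array at all.
{code}
object GramianSizeCheck {
  def main(args: Array[String]): Unit = {
    val ok  = 46340
    val bad = 46341
    println(ok.toLong * (ok + 1) / 2)    // 1073720970 entries (~8.5 GB of doubles)
    println(bad * (bad + 1) / 2)         // Int overflow: prints a negative number
    println(bad.toLong * (bad + 1) / 2)  // 1073767311, the intended value
    val n = 1 << 20                      // HashingTF default numFeatures
    println(n.toLong * (n + 1) / 2)      // 549756338176: far too large for any array
  }
}
{code}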
[jira] [Resolved] (SPARK-3784) Support off-loading computations to a GPU
[ https://issues.apache.org/jira/browse/SPARK-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3784. -- Resolution: Duplicate Duplicate of SPARK-3785 Support off-loading computations to a GPU - Key: SPARK-3784 URL: https://issues.apache.org/jira/browse/SPARK-3784 Project: Spark Issue Type: Brainstorming Components: MLlib Reporter: Thomas Darimont Priority: Minor Are there any plans to adding support for off-loading computations to the GPU, e.g. via an open-cl binding? http://www.jocl.org/ https://code.google.com/p/javacl/ http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU
[ https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160139#comment-14160139 ] Sean Owen commented on SPARK-3785: -- In broad terms, I find there are few computations that a GPU can speed up, just because of the overhead of getting data from the JVM into the GPU and back. It makes sense for large computations where the compute cost is disproportionately large compared to the data (like maybe solving a big linear system). Such cases exist but are rare. Is there something specific to Spark here? You can use any JVM-based library you like to do what you like in a Spark program. Support off-loading computations to a GPU - Key: SPARK-3785 URL: https://issues.apache.org/jira/browse/SPARK-3785 Project: Spark Issue Type: Brainstorming Components: MLlib Reporter: Thomas Darimont Priority: Minor Are there any plans to adding support for off-loading computations to the GPU, e.g. via an open-cl binding? http://www.jocl.org/ https://code.google.com/p/javacl/ http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org