[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203316#comment-14203316 ] Lianhui Wang commented on SPARK-2468: - ok, thanks. [~adav] I will try to do as you say. Netty-based block server / client module Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user-space buffer, and then back into a kernel send buffer before it reaches the NIC. It makes multiple copies of the data and incurs context switches between kernel and user space. It also creates unnecessary buffers in the JVM, which increases GC pressure. Instead, we should use FileChannel.transferTo, which handles this in kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty-based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
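For illustration, here is a minimal Scala sketch of the zero-copy path described above, built around FileChannel.transferTo. It is not the actual Spark transport code; the file and destination address are placeholders.
{code}
import java.io.{File, FileInputStream}
import java.net.InetSocketAddress
import java.nio.channels.SocketChannel

object ZeroCopySketch {
  // Send a block file over a socket without copying its bytes into a JVM buffer:
  // transferTo moves data from the page cache to the socket in kernel space.
  def sendBlock(blockFile: File, dest: InetSocketAddress): Unit = {
    val socket = SocketChannel.open(dest)
    val in = new FileInputStream(blockFile).getChannel
    try {
      var position = 0L
      val size = in.size()
      while (position < size) {
        // transferTo may send fewer bytes than requested, so loop until done
        position += in.transferTo(position, size - position, socket)
      }
    } finally {
      in.close()
      socket.close()
    }
  }
}
{code}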
[jira] [Created] (SPARK-4305) yarn-alpha profile won't build due to network/yarn module
Sean Owen created SPARK-4305: Summary: yarn-alpha profile won't build due to network/yarn module Key: SPARK-4305 URL: https://issues.apache.org/jira/browse/SPARK-4305 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Sean Owen Priority: Minor SPARK-3797 introduced the {{network/yarn}} module, but its YARN code depends on YARN APIs not present in the older versions covered by the {{yarn-alpha}} profile. As a result, builds like {{mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package}} fail. The solution is simply not to build {{network/yarn}} under the {{yarn-alpha}} profile. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4305) yarn-alpha profile won't build due to network/yarn module
[ https://issues.apache.org/jira/browse/SPARK-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203331#comment-14203331 ] Apache Spark commented on SPARK-4305: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/3167 yarn-alpha profile won't build due to network/yarn module - Key: SPARK-4305 URL: https://issues.apache.org/jira/browse/SPARK-4305 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Sean Owen Priority: Minor SPARK-3797 introduced the {{network/yarn}} module, but its YARN code depends on YARN APIs not present in the older versions covered by the {{yarn-alpha}} profile. As a result, builds like {{mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package}} fail. The solution is simply not to build {{network/yarn}} under the {{yarn-alpha}} profile. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-899) Outdated Bagel documentation
[ https://issues.apache.org/jira/browse/SPARK-899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-899. - Resolution: Won't Fix Trawling old issues again... I assume this is a WontFix because GraphX has superseded Bagel. Outdated Bagel documentation Key: SPARK-899 URL: https://issues.apache.org/jira/browse/SPARK-899 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 0.7.3 Reporter: Matteo Ceccarello The documentation for Bagel at http://spark.incubator.apache.org/docs/latest/bagel-programming-guide.html seems to be outdated. In the code example it refers to an Edge class that does not exist in Bagel. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-952) Python version of Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-952. - Resolution: Duplicate I assume this is superseded, if anything, by SPARK-3588 Python version of Gaussian Mixture Model Key: SPARK-952 URL: https://issues.apache.org/jira/browse/SPARK-952 Project: Spark Issue Type: Story Components: Examples Affects Versions: 0.7.3 Reporter: caizhua Priority: Minor Labels: Learning This piece of code is written by Shangyu Luo at Rice University. The code is to learn the Gaussian Mixture Model. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-951) Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-951. - Resolution: Duplicate I assume this is superseded, if anything, by SPARK-3588. Gaussian Mixture Model -- Key: SPARK-951 URL: https://issues.apache.org/jira/browse/SPARK-951 Project: Spark Issue Type: Story Components: Examples Affects Versions: 0.7.3 Reporter: caizhua Priority: Critical Labels: Learning, Machine, Model This includes the code for a Gaussian Mixture Model. The file Gmm_spark.tbl is the input for this program. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-956) The Spark python program for Lasso
[ https://issues.apache.org/jira/browse/SPARK-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-956. - Resolution: Won't Fix I assume this is WontFix as there was no followup. Spark has L1 regularization implemented already anyway. The Spark python program for Lasso -- Key: SPARK-956 URL: https://issues.apache.org/jira/browse/SPARK-956 Project: Spark Issue Type: Story Components: Examples Affects Versions: 0.7.3 Reporter: caizhua The code describes the Spark python implementation of Lasso -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-954) One repeated sampling, and I am not sure if it is correct.
[ https://issues.apache.org/jira/browse/SPARK-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-954. - Resolution: Won't Fix From the discussion, and later ones about guarantees of determinism in RDDs, it sounds like this is working as intended. One repeated sampling, and I am not sure if it is correct. -- Key: SPARK-954 URL: https://issues.apache.org/jira/browse/SPARK-954 Project: Spark Issue Type: Story Affects Versions: 0.7.3 Reporter: caizhua This piece of code reads the dataset and then performs two operations on it. If I consider the RDD as a view definition, I think the result is correct. However, since the first iteration does result_sample.count(), I was wondering whether we should repeat the computation in the initialize_doc_topic_word_count(.) function when we run the second result_sample.map(lambda (block_id, doc_prob): doc_prob).count(). Since people write Spark as a program, not as a database view, this is sometimes confusing. For example, considering that initialize_doc_topic_word_count(.) is a statistical function with runtime seeds, I am not sure if this has an impact on the result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-976) WikipediaPageRank doesn't work anymore
[ https://issues.apache.org/jira/browse/SPARK-976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-976. - Resolution: Won't Fix I assume this is also WontFix as it is a Bagel example. WikipediaPageRank doesn't work anymore -- Key: SPARK-976 URL: https://issues.apache.org/jira/browse/SPARK-976 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 0.7.3 Reporter: Konstantin Boudnik Looks like Wikipedia doesn't publish the page info in WEX format anymore, but instead does page dumps in XML format. Because of that, the example fails with an IOOBE (IndexOutOfBoundsException) as it expects tab-separated input strings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-987) Cannot start workers successfully with hadoop 2.2.0
[ https://issues.apache.org/jira/browse/SPARK-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-987. - Resolution: Not a Problem This just looks like a classic version mismatch between client and server. The app perhaps has embedded Hadoop libs instead of using 'provided' libs from the server installation. Cannot start workers successfully with hadoop 2.2.0 --- Key: SPARK-987 URL: https://issues.apache.org/jira/browse/SPARK-987 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.0 Reporter: 刘旭 Priority: Minor Labels: 2.2.0, hadoop Cannot start workers successfully with Hadoop 2.2.0. I built with: {{./make-distribution.sh --hadoop 2.2.0}} P.S. It works well with Hadoop 2.0.5-alpha, but it cannot connect to Hadoop 2.2.0 successfully, failing with this exception:
{code}
scala> var lines = sc.textFile("hdfs://localhost:9000/user/hadoop/hadoop/hadoop-hadoop-jobtracker-master.log")
lines: org.apache.spark.rdd.RDD[String] = MappedRDD[3] at textFile at <console>:12
scala> lines.count
java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Message missing required fields: callId, status; Host Details : local host is: master/192.168.3.103; destination host is: localhost:9000;
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-819) netty: ChannelInboundByteHandlerAdapter no longer exists in 4.0.3.Final
[ https://issues.apache.org/jira/browse/SPARK-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-819. - Resolution: Fixed This must have been fixed along the way, as the class ChannelInboundByteHandlerAdapter is not used in the code now, nor is that version of Netty. netty: ChannelInboundByteHandlerAdapter no longer exists in 4.0.3.Final -- Key: SPARK-819 URL: https://issues.apache.org/jira/browse/SPARK-819 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 0.8.0 Reporter: Thomas Graves It appears the Netty shuffle code uses Netty version 4.0.0.Beta2, which by the tag was in beta. They now have 4.0.2.Final, which doesn't include the API ChannelInboundByteHandlerAdapter that is used by the FileClientHandler. We should move to a stable API. It looks like it was replaced with ChannelInboundHandlerAdapter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
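As a rough sketch (not the actual FileClientHandler code), a handler on the stable Netty 4.0.x API that replaced the beta class looks like this; the class name and message handling are hypothetical.
{code}
import io.netty.channel.{ChannelHandlerContext, ChannelInboundHandlerAdapter}
import io.netty.util.ReferenceCountUtil

// channelRead receives each inbound message, replacing the removed
// ChannelInboundByteHandlerAdapter style of handler from the beta API.
class BlockFetchHandler extends ChannelInboundHandlerAdapter {
  override def channelRead(ctx: ChannelHandlerContext, msg: AnyRef): Unit = {
    try {
      // process the decoded message here
    } finally {
      ReferenceCountUtil.release(msg)
    }
  }

  override def exceptionCaught(ctx: ChannelHandlerContext, cause: Throwable): Unit = {
    ctx.close()
  }
}
{code}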
[jira] [Resolved] (SPARK-905) Not able to run Job on remote machine
[ https://issues.apache.org/jira/browse/SPARK-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-905. - Resolution: Cannot Reproduce This looks like something that was either long since fixed, or just a matter of not having the Spark installation set up on each machine. The compute-classpath.sh script does exist in bin/ in the tree and distro. Not able to run Job on remote machine -- Key: SPARK-905 URL: https://issues.apache.org/jira/browse/SPARK-905 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.7.3 Reporter: Ayush I have two machines A and B. On machine A, I run ./run spark.deploy.master.Master to start the master. The master URL is spark://abc-vostro.local:7077. Now on machine B, I run ./run spark.deploy.worker.Worker spark://abc-vostro.local:7077, and the worker is registered with the master. Now I want to run a simple job on the cluster. Here is SimpleJob.scala:
{code}
package spark.examples

import spark.SparkContext
import SparkContext._

object SimpleJob {
  def main(args: Array[String]) {
    val logFile = "s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY/File Name"
    val sc = new SparkContext("spark://abc-vostro.local:7077", "Simple Job", System.getenv("SPARK_HOME"),
      Seq("/home/abc/spark-scala-2.10/examples/target/scala-2.10/spark-examples_2.10-0.8.0-SNAPSHOT.jar"))
    val logData = sc.textFile(logFile)
    val numsa = logData.filter(line => line.contains("a")).count
    val numsb = logData.filter(line => line.contains("b")).count
    println("total a : %s, total b : %s".format(numsa, numsb))
  }
}
{code}
This file is located at /home/abc/spark-scala-2.10/examples/src/main/scala/spark/examples on machine A. Now on machine A, I run sbt/sbt package. When I run MASTER=spark://abc-vostro.local:7077 ./run spark.examples.SimpleJob to run my job, I get the exception below on both machines A and B: (class java.io.IOException: Cannot run program "/home/abc/spark-scala-2.10/bin/compute-classpath.sh" (in directory "."): error=2, No such file or directory) Could you please help me to resolve this? It is probably some configuration I'm missing on my end. Thanks in advance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1074) JavaPairRDD as Object File
[ https://issues.apache.org/jira/browse/SPARK-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1074. -- Resolution: Not a Problem Am I right in thinking that if you want to save a JavaPairRDD to HDFS, you have key-value pairs, and so you want to use JavaPairRDD.saveAsNewAPIHadoopFile, and SparkContext.sequenceFile to read it? This works. objectFile doesn't seem like the right approach anyway. JavaPairRDD as Object File -- Key: SPARK-1074 URL: https://issues.apache.org/jira/browse/SPARK-1074 Project: Spark Issue Type: Bug Components: Input/Output, Java API Affects Versions: 0.9.0 Reporter: Kevin Mader Priority: Minor So I can perform a save command on a JavaPairRDD:
{code:java}
static public void HSave(JavaPairRDD<D3int, int[]> baseImg, String path) {
  final String outpath = (new File(path)).getAbsolutePath();
  baseImg.saveAsObjectFile(outpath);
}
{code}
When I use the objectFile command from the JavaSparkContext:
{code:java}
static public void ReadObjectFile(JavaSparkContext jsc, final String path) {
  JavaPairRDD<D3int, int[]> newImage = (JavaPairRDD<D3int, int[]>) jsc.objectFile(path);
}
{code}
I get an error: cannot cast from JavaRDD to JavaPairRDD. Is there a way to get back to a JavaPairRDD, or will I need to map my data to a JavaRDD, save, load, then remap the JavaRDD back to the JavaPairRDD? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
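For reference, a small Scala sketch of the round trip being suggested (the Java API has analogous methods on JavaPairRDD and JavaSparkContext); the path and the sample data are placeholders.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // implicits for saveAsSequenceFile / sequenceFile

// Write key-value pairs as a Hadoop SequenceFile and read them back as a pair RDD,
// instead of relying on objectFile, which only yields a plain RDD.
def roundTrip(sc: SparkContext, path: String): Unit = {
  val pairs = sc.parallelize(Seq((1, "a"), (2, "b")))
  pairs.saveAsSequenceFile(path)
  val restored = sc.sequenceFile[Int, String](path)  // RDD[(Int, String)]
  restored.collect().foreach(println)
}
{code}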
[jira] [Commented] (SPARK-1227) Diagnostics for Classification & Regression
[ https://issues.apache.org/jira/browse/SPARK-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203354#comment-14203354 ] Sean Owen commented on SPARK-1227: -- Is this still relevant now that conventional classifier and regressor metrics are implemented in MLlib? You wouldn't be able to compare models by their loss function in general anyway. Diagnostics for Classification & Regression - Key: SPARK-1227 URL: https://issues.apache.org/jira/browse/SPARK-1227 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Martin Jaggi Assignee: Martin Jaggi Currently, the attained objective function is not computed (for efficiency reasons, as one evaluation requires one full pass through the data). For diagnostics and for comparing different algorithms, we should, however, provide this as a separate function (one MR). Doing this requires the loss and regularizer functions themselves, not only their gradients (which are currently in the Gradient class). How about adding the new function directly on the corresponding models in classification/* and regression/*? Any thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
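To make the request concrete, here is a hypothetical Scala sketch of the kind of diagnostic being asked for: a pass over the data to evaluate the attained objective (logistic loss plus an L2 term) for a given weight vector. It is not part of MLlib's actual API, and the helper name is made up.
{code}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

// Average logistic loss over the examples, plus 0.5 * regParam * ||w||^2.
def logisticObjective(data: RDD[LabeledPoint], weights: Vector, regParam: Double): Double = {
  val n = data.count()
  val dataLoss = data.map { p =>
    val margin = p.features.toArray.zip(weights.toArray).map { case (x, w) => x * w }.sum
    val y = 2.0 * p.label - 1.0          // map {0, 1} labels to {-1, +1}
    math.log1p(math.exp(-y * margin))    // log(1 + exp(-y * margin))
  }.sum() / n
  dataLoss + 0.5 * regParam * weights.toArray.map(w => w * w).sum
}
{code}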
[jira] [Resolved] (SPARK-1196) val variables not available within RDD map on cluster app; are on shell or local
[ https://issues.apache.org/jira/browse/SPARK-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1196. -- Resolution: Cannot Reproduce val variables not available within RDD map on cluster app; are on shell or local Key: SPARK-1196 URL: https://issues.apache.org/jira/browse/SPARK-1196 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Andrew Kerr When this code
{code}
def foo = "foo"
val bar = "bar"
val data = sc.parallelize(Seq("a"))
data.map{a => print((1,foo,bar));a}.map{a => print((2,foo,bar));a}.map{a => print((3,foo,bar));a}.collect()
{code}
is run on a cluster on the spark shell, a slave's stdout is
{code}
(1,foo,bar)(2,foo,bar)(3,foo,bar)
{code}
as expected. However, when the code
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object twitterAggregation extends App {
  val conf = new SparkConf()
    .setMaster("spark://xx.compute-1.amazonaws.com:7077")
    .setAppName("testCase")
    .setJars(List("target/scala-2.10/spark-test-case_2.10-1.0.jar"))
    .setSparkHome("/root/spark/")
  val sc = new SparkContext(conf)
  def foo = "foo"
  val bar = "bar"
  val data = sc.parallelize(Seq("a"))
  data.map{a => print((1,foo,bar));a}.map{a => print((2,foo,bar));a}.map{a => print((3,foo,bar));a}.collect()
}
{code}
is run against a cluster as an application via sbt, the stdout on a slave is
{code}
(1,foo,null)(2,foo,null)(3,foo,null)
{code}
The variable declared with val is now null when the anonymous functions in the map are executed. When the application is run in local mode the output is
{code}
(1,foo,bar)(2,foo,bar)(3,foo,bar)
{code}
as expected. build.sbt is
{code}
name := "spark-test-case"

version := "1.0"

scalaVersion := "2.10.3"

resolvers ++= Seq("Akka Repository" at "http://repo.akka.io/releases/")

libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.10" % "0.9.0-incubating")
{code}
To avoid firewall and NAT issues, the project directory is rsynced onto the master, where it is built with SBT 0.13.1:
{code}
wget http://repo.scala-sbt.org/scalasbt/sbt-native-packages/org/scala-sbt/sbt/0.13.1/sbt.rpm
rpm --install sbt.rpm
sbt package
sbt run
{code}
Cluster created with scripts in the hadoop2 0.9.0 download. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
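If the culprit is the {{App}} trait's delayed initialization (a common explanation for vals showing up as null in closures shipped to executors, though not confirmed in this ticket), the usual workaround is an explicit main method. A hypothetical sketch:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// With an explicit main method, vals are initialized before the closures that
// capture them are serialized; "extends App" defers that initialization via
// DelayedInit, which can leave captured vals null on executors.
object TwitterAggregation {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("testCase")
    val sc = new SparkContext(conf)
    val bar = "bar"
    val data = sc.parallelize(Seq("a"))
    data.map { a => print((1, bar)); a }.collect()
    sc.stop()
  }
}
{code}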
[jira] [Commented] (SPARK-1344) Scala API docs for top methods
[ https://issues.apache.org/jira/browse/SPARK-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203370#comment-14203370 ] Apache Spark commented on SPARK-1344: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/3168 Scala API docs for top methods -- Key: SPARK-1344 URL: https://issues.apache.org/jira/browse/SPARK-1344 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 0.9.0 Reporter: Diana Carroll The RDD.top() methods are documented as follows: bq. Returns the top *K* elements from this RDD using the natural ordering for *T*. bq. Returns the top *K* elements from this RDD as defined by the specified Comparator[[T]]. I believe those should read bq. Returns the top *num* elements from this RDD using the natural ordering for *K*. bq. Returns the top *num* elements from this RDD as defined by the specified Comparator[[K]]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
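As a quick illustration of the parameter in question (assuming a SparkContext {{sc}} is in scope), {{num}} is the count of elements returned, and the ordering is over the RDD's element type:
{code}
val rdd = sc.parallelize(Seq(3, 7, 1, 5))
rdd.top(2)  // Array(7, 5): the two largest elements under the natural ordering
{code}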
[jira] [Commented] (SPARK-971) Link to Confluence wiki from project website / documentation
[ https://issues.apache.org/jira/browse/SPARK-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203372#comment-14203372 ] Apache Spark commented on SPARK-971: User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/3169 Link to Confluence wiki from project website / documentation Key: SPARK-971 URL: https://issues.apache.org/jira/browse/SPARK-971 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Josh Rosen Spark's Confluence wiki (https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage) is really hard to find; try a Google search for "apache spark wiki", for example. We should link to the wiki from the Spark project website and documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-1267) Add a pip installer for PySpark
[ https://issues.apache.org/jira/browse/SPARK-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabin Banka closed SPARK-1267. --- Resolution: Unresolved Closing this PR for now. Add a pip installer for PySpark --- Key: SPARK-1267 URL: https://issues.apache.org/jira/browse/SPARK-1267 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Prabin Banka Priority: Minor Labels: pyspark Please refer to this mail archive, http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3CCAOEPXP7jKiw-3M8eh2giBcs8gEkZ1upHpGb=fqoucvscywj...@mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1509) add zipWithIndex & zipWithUniqueId methods to java api
[ https://issues.apache.org/jira/browse/SPARK-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1509: - Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) add zipWithIndex & zipWithUniqueId methods to java api Key: SPARK-1509 URL: https://issues.apache.org/jira/browse/SPARK-1509 Project: Spark Issue Type: Improvement Components: Java API Reporter: Guoqiang Li Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1509) add zipWithIndex & zipWithUniqueId methods to java api
[ https://issues.apache.org/jira/browse/SPARK-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1509. -- Resolution: Fixed This was actually fixed in about 1.1 add zipWithIndex & zipWithUniqueId methods to java api Key: SPARK-1509 URL: https://issues.apache.org/jira/browse/SPARK-1509 Project: Spark Issue Type: Improvement Components: Java API Reporter: Guoqiang Li Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1530) Streaming UI test can hang indefinitely
[ https://issues.apache.org/jira/browse/SPARK-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1530. -- Resolution: Fixed I assume this got fixed along the way, either by [~tdas]'s changes or [~shaneknapp]'s changes to Jenkins? Streaming UI test can hang indefinitely --- Key: SPARK-1530 URL: https://issues.apache.org/jira/browse/SPARK-1530 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Tathagata Das This has been causing Jenkins to hang recently: {code} pool-1-thread-1 prio=10 tid=0x7f4b9449f000 nid=0x6c37 runnable [0x7f4b8a26c000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:152) at java.net.SocketInputStream.read(SocketInputStream.java:122) at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) at java.io.BufferedInputStream.read1(BufferedInputStream.java:275) at java.io.BufferedInputStream.read(BufferedInputStream.java:334) - locked 0x0007cad700d0 (a java.io.BufferedInputStream) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323) - locked 0x0007cad662b8 (a sun.net.www.protocol.http.HttpURLConnection) at java.net.URL.openStream(URL.java:1037) at scala.io.Source$.fromURL(Source.scala:140) at scala.io.Source$.fromURL(Source.scala:130) at org.apache.spark.ui.UISuite$$anonfun$2$$anonfun$apply$mcV$sp$2$$anonfun$apply$2.apply$mcV$sp(UISuite.scala:57) at org.apache.spark.ui.UISuite$$anonfun$2$$anonfun$apply$mcV$sp$2$$anonfun$apply$2.apply(UISuite.scala:56) at org.apache.spark.ui.UISuite$$anonfun$2$$anonfun$apply$mcV$sp$2$$anonfun$apply$2.apply(UISuite.scala:56) at org.scalatest.concurrent.Eventually$class.makeAValiantAttempt$1(Eventually.scala:394) at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:408) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:437) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:477) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:477) at org.apache.spark.ui.UISuite$$anonfun$2$$anonfun$apply$mcV$sp$2.apply(UISuite.scala:56) at org.apache.spark.ui.UISuite$$anonfun$2$$anonfun$apply$mcV$sp$2.apply(UISuite.scala:54) at org.apache.spark.LocalSparkContext$.withSpark(LocalSparkContext.scala:60) at org.apache.spark.ui.UISuite$$anonfun$2.apply$mcV$sp(UISuite.scala:54) at org.apache.spark.ui.UISuite$$anonfun$2.apply(UISuite.scala:54) at org.apache.spark.ui.UISuite$$anonfun$2.apply(UISuite.scala:54) at org.scalatest.FunSuite$$anon$1.apply(FunSuite.scala:1265) at org.scalatest.Suite$class.withFixture(Suite.scala:1974) at org.apache.spark.ui.UISuite.withFixture(UISuite.scala:37) at org.scalatest.FunSuite$class.invokeWithFixture$1(FunSuite.scala:1262) at org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271) at org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:198) at org.scalatest.FunSuite$class.runTest(FunSuite.scala:1271) at org.apache.spark.ui.UISuite.runTest(UISuite.scala:37) at org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304) at org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304) at 
org.scalatest.SuperEngine$$anonfun$org$scalatest$SuperEngine$$runTestsInBranch$1.apply(Engine.scala:260) at org.scalatest.SuperEngine$$anonfun$org$scalatest$SuperEngine$$runTestsInBranch$1.apply(Engine.scala:249) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:249) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:326) at org.scalatest.FunSuite$class.runTests(FunSuite.scala:1304) at org.apache.spark.ui.UISuite.runTests(UISuite.scala:37) at org.scalatest.Suite$class.run(Suite.scala:2303) at org.apache.spark.ui.UISuite.org$scalatest$FunSuite$$super$run(UISuite.scala:37) at org.scalatest.FunSuite$$anonfun$run$1.apply(FunSuite.scala:1310) at
[jira] [Resolved] (SPARK-1229) train on array (in addition to RDD)
[ https://issues.apache.org/jira/browse/SPARK-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1229. -- Resolution: Not a Problem If I may be so bold: generally, you train on lots of data, so an RDD makes more sense than an Array as training input. That said, you can always parallelize an Array as an RDD and train on it, which is most of the use case here. If you really mean you need a sample method that returns an RDD, yes, that exists as you see. train on array (in addition to RDD) --- Key: SPARK-1229 URL: https://issues.apache.org/jira/browse/SPARK-1229 Project: Spark Issue Type: Story Components: MLlib Reporter: Arshak Navruzyan Since the predict method accepts either an RDD or an Array, for consistency so should train (particularly since RDD.takeSample() returns an Array). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
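A small Scala sketch of the workaround Sean describes: parallelize the local Array into an RDD and hand it to an existing train method. The algorithm and iteration count are arbitrary choices for illustration.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

// Turn a local Array of examples (e.g. the result of RDD.takeSample) back into
// an RDD so the RDD-based train methods can be reused as-is.
def trainFromArray(sc: SparkContext, samples: Array[LabeledPoint]) = {
  val rdd = sc.parallelize(samples)
  LogisticRegressionWithSGD.train(rdd, 100)  // numIterations = 100
}
{code}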
[jira] [Commented] (SPARK-1887) Update Contributors Guide with useful data from past threads
[ https://issues.apache.org/jira/browse/SPARK-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203409#comment-14203409 ] Sean Owen commented on SPARK-1887: -- [~sujeetv] do you plan to work on this? you just need to open a PR, and don't need this assigned to you. Update Contributors Guide with useful data from past threads -- Key: SPARK-1887 URL: https://issues.apache.org/jira/browse/SPARK-1887 Project: Spark Issue Type: Documentation Components: Documentation Environment: Development Reporter: Sujeet Varakhedi Assignee: Sujeet Varakhedi Priority: Minor Labels: documentation, newbie Original Estimate: 168h Remaining Estimate: 168h The goal here is to mine through dev email threads and add useful data to the Contributors page. This will save quite a bit of time for newcomers like me to ramp up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib
Varadharajan created SPARK-4306: --- Summary: LogisticRegressionWithLBFGS support for PySpark MLlib Key: SPARK-4306 URL: https://issues.apache.org/jira/browse/SPARK-4306 Project: Spark Issue Type: Task Components: MLlib, PySpark Reporter: Varadharajan Currently we support LogisticRegressionWithSGD in the PySpark MLlib interface. This task is to add support for the LogisticRegressionWithLBFGS algorithm, as well as include examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
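For context, a minimal sketch of the existing Scala entry point that a PySpark wrapper would expose; the training data is assumed to be an RDD[LabeledPoint] prepared elsewhere.
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint

// Fit a logistic regression model with the L-BFGS optimizer; the returned
// model can then be used for predict().
def fit(training: RDD[LabeledPoint]) =
  new LogisticRegressionWithLBFGS().run(training)
{code}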
[jira] [Comment Edited] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203480#comment-14203480 ] Varadharajan edited comment on SPARK-4306 at 11/8/14 4:01 PM: -- I would like to work on this. Please assign this task to me. was (Author: srinathsmn): Please assign this task to me. LogisticRegressionWithLBFGS support for PySpark MLlib -- Key: SPARK-4306 URL: https://issues.apache.org/jira/browse/SPARK-4306 Project: Spark Issue Type: Task Components: MLlib, PySpark Reporter: Varadharajan Labels: newbie Original Estimate: 48h Remaining Estimate: 48h Currently we support LogisticRegressionWithSGD in the PySpark MLlib interface. This task is to add support for the LogisticRegressionWithLBFGS algorithm, as well as include examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203480#comment-14203480 ] Varadharajan commented on SPARK-4306: - Please assign this task to me. LogisticRegressionWithLBFGS support for PySpark MLlib -- Key: SPARK-4306 URL: https://issues.apache.org/jira/browse/SPARK-4306 Project: Spark Issue Type: Task Components: MLlib, PySpark Reporter: Varadharajan Labels: newbie Original Estimate: 48h Remaining Estimate: 48h Currently we support LogisticRegressionWithSGD in the PySpark MLlib interface. This task is to add support for the LogisticRegressionWithLBFGS algorithm, as well as include examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varadharajan updated SPARK-4306: Issue Type: New Feature (was: Task) LogisticRegressionWithLBFGS support for PySpark MLlib -- Key: SPARK-4306 URL: https://issues.apache.org/jira/browse/SPARK-4306 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Varadharajan Labels: newbie Original Estimate: 48h Remaining Estimate: 48h Currently we support LogisticRegressionWithSGD in the PySpark MLlib interface. This task is to add support for the LogisticRegressionWithLBFGS algorithm, as well as include examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2624) Datanucleus jars not accessible in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203631#comment-14203631 ] Jim Lim commented on SPARK-2624: Line 223 in {{org/apache/spark/deploy/yarn/ClientBase.scala}} seems like a good spot.
{code}
def prepareLocalResources(...) = {
  val cachedSecondaryJarLinks = ListBuffer.empty[String]
  ...
}
{code}
The datanucleus jars that need to be added are: datanucleus-api-jdo-3.2.1.jar, datanucleus-core-3.2.2.jar, datanucleus-rdbms-3.2.1.jar. Datanucleus jars not accessible in yarn-cluster mode Key: SPARK-2624 URL: https://issues.apache.org/jira/browse/SPARK-2624 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.1 Reporter: Andrew Or Assignee: Jim Lim Fix For: 1.2.0 This is because we add it to the class path of the command that launches spark-submit, but the containers never get it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4301) StreamingContext should not allow start() to be called after calling stop()
[ https://issues.apache.org/jira/browse/SPARK-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-4301: - Affects Version/s: (was: 1.2.0) 1.0.1 StreamingContext should not allow start() to be called after calling stop() --- Key: SPARK-4301 URL: https://issues.apache.org/jira/browse/SPARK-4301 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0 Reporter: Josh Rosen Assignee: Josh Rosen Fix For: 1.1.1, 1.2.0 In Spark 1.0.0+, calling {{stop()}} on a StreamingContext that has not been started is a no-op which has no side-effects. This allows users to call {{stop()}} on a fresh StreamingContext followed by {{start()}}. I believe that this almost always indicates an error and is not behavior that we should support. Since we don't allow {{start() stop() start()}} then I don't think it makes sense to allow {{stop() start()}}. The current behavior can lead to resource leaks when StreamingContext constructs its own SparkContext: if I call {{stop(stopSparkContext=True)}}, then I expect StreamingContext's underlying SparkContext to be stopped irrespective of whether the StreamingContext has been started. This is useful when writing unit test fixtures. Prior discussions: - https://github.com/apache/spark/pull/3053#discussion-diff-19710333R490 - https://github.com/apache/spark/pull/3121#issuecomment-61927353 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
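A brief Scala sketch of the test-fixture pattern described above (names and durations are placeholders): a StreamingContext that owns its SparkContext is torn down without ever having been started, and the caller expects the underlying SparkContext to be stopped as well.
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("fixture")
// The StreamingContext creates its own SparkContext here.
val ssc = new StreamingContext(conf, Seconds(1))

// ... the test never calls ssc.start() ...

// Tear-down: the underlying SparkContext should be stopped regardless of
// whether the StreamingContext was ever started.
ssc.stop(stopSparkContext = true)
{code}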
[jira] [Resolved] (SPARK-4301) StreamingContext should not allow start() to be called after calling stop()
[ https://issues.apache.org/jira/browse/SPARK-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-4301. -- Resolution: Fixed StreamingContext should not allow start() to be called after calling stop() --- Key: SPARK-4301 URL: https://issues.apache.org/jira/browse/SPARK-4301 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0 Reporter: Josh Rosen Assignee: Josh Rosen Fix For: 1.1.1, 1.2.0 In Spark 1.0.0+, calling {{stop()}} on a StreamingContext that has not been started is a no-op which has no side-effects. This allows users to call {{stop()}} on a fresh StreamingContext followed by {{start()}}. I believe that this almost always indicates an error and is not behavior that we should support. Since we don't allow {{start() stop() start()}} then I don't think it makes sense to allow {{stop() start()}}. The current behavior can lead to resource leaks when StreamingContext constructs its own SparkContext: if I call {{stop(stopSparkContext=True)}}, then I expect StreamingContext's underlying SparkContext to be stopped irrespective of whether the StreamingContext has been started. This is useful when writing unit test fixtures. Prior discussions: - https://github.com/apache/spark/pull/3053#discussion-diff-19710333R490 - https://github.com/apache/spark/pull/3121#issuecomment-61927353 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4301) StreamingContext should not allow start() to be called after calling stop()
[ https://issues.apache.org/jira/browse/SPARK-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-4301: - Fix Version/s: 1.2.0 1.1.1 StreamingContext should not allow start() to be called after calling stop() --- Key: SPARK-4301 URL: https://issues.apache.org/jira/browse/SPARK-4301 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0 Reporter: Josh Rosen Assignee: Josh Rosen Fix For: 1.1.1, 1.2.0 In Spark 1.0.0+, calling {{stop()}} on a StreamingContext that has not been started is a no-op which has no side-effects. This allows users to call {{stop()}} on a fresh StreamingContext followed by {{start()}}. I believe that this almost always indicates an error and is not behavior that we should support. Since we don't allow {{start() stop() start()}} then I don't think it makes sense to allow {{stop() start()}}. The current behavior can lead to resource leaks when StreamingContext constructs its own SparkContext: if I call {{stop(stopSparkContext=True)}}, then I expect StreamingContext's underlying SparkContext to be stopped irrespective of whether the StreamingContext has been started. This is useful when writing unit test fixtures. Prior discussions: - https://github.com/apache/spark/pull/3053#discussion-diff-19710333R490 - https://github.com/apache/spark/pull/3121#issuecomment-61927353 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4301) StreamingContext should not allow start() to be called after calling stop()
[ https://issues.apache.org/jira/browse/SPARK-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-4301: - Fix Version/s: 1.0.3 StreamingContext should not allow start() to be called after calling stop() --- Key: SPARK-4301 URL: https://issues.apache.org/jira/browse/SPARK-4301 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0 Reporter: Josh Rosen Assignee: Josh Rosen Fix For: 1.1.1, 1.2.0, 1.0.3 In Spark 1.0.0+, calling {{stop()}} on a StreamingContext that has not been started is a no-op which has no side-effects. This allows users to call {{stop()}} on a fresh StreamingContext followed by {{start()}}. I believe that this almost always indicates an error and is not behavior that we should support. Since we don't allow {{start() stop() start()}} then I don't think it makes sense to allow {{stop() start()}}. The current behavior can lead to resource leaks when StreamingContext constructs its own SparkContext: if I call {{stop(stopSparkContext=True)}}, then I expect StreamingContext's underlying SparkContext to be stopped irrespective of whether the StreamingContext has been started. This is useful when writing unit test fixtures. Prior discussions: - https://github.com/apache/spark/pull/3053#discussion-diff-19710333R490 - https://github.com/apache/spark/pull/3121#issuecomment-61927353 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203757#comment-14203757 ] Shivaram Venkataraman commented on SPARK-3821: -- [~nchammas] Thanks for putting this together -- this is looking great! I just had a couple of quick questions and clarifications. 1. My preference would be to just have a single AMI across Spark versions, for a couple of reasons. First, it reduces steps for every release (even though creating AMIs is definitely much simpler now!). Also, the number of AMIs we maintain could get large if we do this for every minor and major release like 1.1.1. [~pwendell] could probably comment more on the release process etc. 2. Could you clarify if Hadoop is pre-installed in the new AMIs or is it still installed on startup? The flexibility we currently have of switching between Hadoop 1, Hadoop 2, YARN etc. is useful for testing. (Related Packer question: are the [init scripts| https://github.com/nchammas/spark-ec2/blob/packer/packer/spark-packer.json#L129] run during AMI creation or during startup?) 3. Do you have some benchmarks for the new AMI without Spark 1.1.0 pre-installed? [We currently have old AMI vs. new AMI with Spark|https://github.com/nchammas/spark-ec2/blob/packer/packer/proposal.md#new-amis---latest-os-updates-and-spark-110-pre-installed-single-run]. I see a couple of huge wins in the new AMI (from SSH wait time, ganglia init etc.) which I guess we should get even without Spark being pre-installed. Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4307) Initialize FileDescriptor lazily in FileRegion
Reynold Xin created SPARK-4307: -- Summary: Initialize FileDescriptor lazily in FileRegion Key: SPARK-4307 URL: https://issues.apache.org/jira/browse/SPARK-4307 Project: Spark Issue Type: Improvement Components: Shuffle Reporter: Reynold Xin Assignee: Reynold Xin We use Netty's DefaultFileRegion to do zero-copy sends. However, DefaultFileRegion requires a FileDescriptor, which results in a large number of open files in larger workloads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
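To illustrate the idea (this is not the actual Netty FileRegion implementation or the Spark patch), a Scala sketch where the FileChannel, and hence the file descriptor, is only opened on the first transfer rather than when the region object is created:
{code}
import java.io.{File, FileInputStream}
import java.nio.channels.{FileChannel, WritableByteChannel}

class LazyFileSender(file: File, offset: Long, length: Long) {
  private var opened = false
  // The descriptor is acquired only when the first transfer actually happens.
  private lazy val channel: FileChannel = {
    opened = true
    new FileInputStream(file).getChannel
  }

  def transferTo(target: WritableByteChannel, position: Long): Long =
    channel.transferTo(offset + position, length - position, target)

  def close(): Unit = if (opened) channel.close()
}
{code}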
[jira] [Commented] (SPARK-4307) Initialize FileDescriptor lazily in FileRegion
[ https://issues.apache.org/jira/browse/SPARK-4307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203778#comment-14203778 ] Apache Spark commented on SPARK-4307: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/3172 Initialize FileDescriptor lazily in FileRegion -- Key: SPARK-4307 URL: https://issues.apache.org/jira/browse/SPARK-4307 Project: Spark Issue Type: Improvement Components: Shuffle Reporter: Reynold Xin Assignee: Reynold Xin We use Netty's DefaultFileRegion to do zero-copy sends. However, DefaultFileRegion requires a FileDescriptor, which results in a large number of open files in larger workloads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2213) Sort Merge Join
[ https://issues.apache.org/jira/browse/SPARK-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203783#comment-14203783 ] Apache Spark commented on SPARK-2213: - User 'Ishiihara' has created a pull request for this issue: https://github.com/apache/spark/pull/3173 Sort Merge Join --- Key: SPARK-2213 URL: https://issues.apache.org/jira/browse/SPARK-2213 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Assignee: Liquan Pei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203786#comment-14203786 ] Nicholas Chammas commented on SPARK-3821: - Thanks for the feedback [~shivaram]. {quote} 1. My preference would be to just have a single AMI across Spark versions for a couple of reasons. {quote} I agree. Maintaining images for specific versions of Spark is worth it only if you're really crazy about getting the lowest cluster launch times possible. Well, that was my [original motivation | http://apache-spark-developers-list.1001551.n3.nabble.com/EC2-clusters-ready-in-launch-time-30-seconds-td7262.html] for doing this work, but ultimately I agree the complexity is not worth it at the moment. I'll take this out unless someone wants to advocate for leaving it in. {quote} 2. Could you clarify if Hadoop is pre-installed in the new AMIs or is it still installed on startup? {quote} Currently, I have it set to install Hadoop 2 on the AMIs with Spark pre-installed. Again, this was done with the intention of aiming for the lowest launch time possible, but if we'd like to do away with the Spark-pre-installed AMIs then this is not an issue. {quote} Are the init scripts run during AMI creation or during startup? {quote} For the AMIs with Spark pre-installed, they are run during AMI creation. That's why the [init runtimes in the second benchmark | https://github.com/nchammas/spark-ec2/blob/214d5e4cac392a0eac21f949fe25c0075044411f/packer/proposal.md#new-amis---latest-os-updates-and-spark-110-pre-installed-single-run] are all 0 ms; the init script sees that such and such is already installed and just exits. {quote} 3. Do you have some benchmarks for the new AMI without Spark 1.1.0 pre-installed? {quote} Nope, but I can run one and get back to you on Monday or Tuesday with those numbers. Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org