Re: Use mvn to build Spark 1.2.0 failed
I just tried the exact same command and do not see any error. Maybe you can make sure you're starting from a clean extraction of the distro, and check your environment. I'm on OS X with Maven 3.2 and Java 8, but I don't know that any of those would be relevant.

On Mon, Dec 22, 2014 at 4:10 AM, wyphao.2007 wyphao.2...@163.com wrote:

Hi all, Today I downloaded the Spark source from the http://spark.apache.org/downloads.html page and used

  ./make-distribution.sh --tgz -Phadoop-2.2 -Pyarn -DskipTests -Dhadoop.version=2.2.0 -Phive

to build the release, but I encountered the following error:

[INFO] --- build-helper-maven-plugin:1.8:add-source (add-scala-sources) @ spark-parent ---
[INFO] Source directory: /home/q/spark/spark-1.2.0/src/main/scala added.
[INFO]
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ spark-parent ---
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ................... FAILURE [1.015s]
[INFO] Spark Project Networking ................... SKIPPED
[INFO] Spark Project Shuffle Streaming Service .... SKIPPED
[INFO] Spark Project Core ......................... SKIPPED
[INFO] Spark Project Bagel ........................ SKIPPED
[INFO] Spark Project GraphX ....................... SKIPPED
[INFO] Spark Project Streaming .................... SKIPPED
[INFO] Spark Project Catalyst ..................... SKIPPED
[INFO] Spark Project SQL .......................... SKIPPED
[INFO] Spark Project ML Library ................... SKIPPED
[INFO] Spark Project Tools ........................ SKIPPED
[INFO] Spark Project Hive ......................... SKIPPED
[INFO] Spark Project REPL ......................... SKIPPED
[INFO] Spark Project YARN Parent POM .............. SKIPPED
[INFO] Spark Project YARN Stable API .............. SKIPPED
[INFO] Spark Project Assembly ..................... SKIPPED
[INFO] Spark Project External Twitter ............. SKIPPED
[INFO] Spark Project External Flume Sink .......... SKIPPED
[INFO] Spark Project External Flume ............... SKIPPED
[INFO] Spark Project External MQTT ................ SKIPPED
[INFO] Spark Project External ZeroMQ .............. SKIPPED
[INFO] Spark Project External Kafka ............... SKIPPED
[INFO] Spark Project Examples ..................... SKIPPED
[INFO] Spark Project YARN Shuffle Service ......... SKIPPED
[INFO]
[INFO] BUILD FAILURE
[INFO]
[INFO] Total time: 1.644s
[INFO] Finished at: Mon Dec 22 10:56:35 CST 2014
[INFO] Final Memory: 21M/481M
[INFO]
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (default) on project spark-parent: Error finding remote resources manifests: /home/q/spark/spark-1.2.0/target/maven-shared-archive-resources/META-INF/NOTICE (No such file or directory) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

But the NOTICE file is present in the downloaded Spark release:

[wyp@spark /home/q/spark/spark-1.2.0]$ ll
total 248
drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 assembly
drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 bagel
drwxrwxr-x 2 1000 1000  4096 Dec 10 18:02 bin
drwxrwxr-x 2 1000 1000  4096 Dec 10 18:02 conf
-rw-rw-r-- 1 1000 1000   663 Dec 10 18:02 CONTRIBUTING.md
drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 core
drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 data
drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 dev
drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 docker
drwxrwxr-x 7 1000 1000  4096 Dec 10 18:02 docs
drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 ec2
drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 examples
drwxrwxr-x 8 1000 1000  4096 Dec 10 18:02 external
drwxrwxr-x 5 1000 1000  4096 Dec 10 18:02 extras
drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 graphx
-rw-rw-r-- 1 1000 1000 45242 Dec 10 18:02 LICENSE
-rwxrwxr-x 1 1000 1000  7941 Dec 10 18:02 make-distribution.sh
drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 mllib
drwxrwxr-x 5 1000 1000  4096 Dec 10 18:02 network
-rw-rw-r-- 1 1000 1000 22559 Dec 10 18:02 NOTICE
-rw-rw-r-- 1 1000 1000 49002 Dec 10 18:02 pom.xml
drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 project
drwxrwxr-x 6 1000 1000  4096 Dec 10 18:02 python
-rw-rw-r-- 1 1000 1000  3645 Dec 10 18:02
Spark exception when sending message to akka actor
Hi All, I have Akka remote actors running on 2 nodes. I submitted a Spark application from node1. In the Spark code, in one of the RDDs, I am sending messages to the actor running on node1. My Spark code is as follows:

  import scala.concurrent.Await
  import scala.concurrent.duration._
  import akka.actor._
  import akka.pattern.ask
  import akka.util.Timeout
  import org.apache.spark.{SparkConf, SparkContext}

  class ActorClient extends Actor with Serializable {
    import context._

    val currentActor: ActorSelection =
      context.system.actorSelection("akka.tcp://ActorSystem@192.168.145.183:2551/user/MasterActor")
    implicit val timeout = Timeout(10 seconds)

    def receive = {
      case msg: String => {
        if (msg.contains("Spark")) {
          currentActor ! msg
          sender ! "Local"
        } else {
          println("Received.." + msg)
          val future = currentActor ? msg
          val result = Await.result(future, timeout.duration).asInstanceOf[String]
          if (result.contains("ACK"))
            sender ! "OK"
        }
      }
      case PoisonPill => context.stop(self)
    }
  }

  object SparkExec extends Serializable {
    implicit val timeout = Timeout(10 seconds)
    val actorSystem = ActorSystem("ClientActorSystem")
    val actor = actorSystem.actorOf(Props(classOf[ActorClient]), name = "ClientActor")

    def main(args: Array[String]) = {
      val conf = new SparkConf().setAppName("DeepLearningSpark")
      val sc = new SparkContext(conf)
      val textrdd = sc.textFile("hdfs://IMPETUS-DSRV02:9000/deeplearning/sample24k.csv")

      val rdd1 = textrdd.map { line =>
        println("In Map...")
        val future = actor ? "Hello..Spark"
        val result = Await.result(future, timeout.duration).asInstanceOf[String]
        if (result.contains("Local")) {
          println("Received in map" + result)
          //actorSystem.shutdown
        }
        (10)
      }

      val rdd2 = rdd1.map { x =>
        val future = actor ? "Done"
        val result = Await.result(future, timeout.duration).asInstanceOf[String]
        if (result.contains("OK")) {
          actorSystem.stop(actor)
          actorSystem.shutdown
        }
        (2)
      }
      rdd2.saveAsTextFile("/home/padma/SparkAkkaOut")
    }
  }

In my ActorClient actor, I identify the remote actor through actorSelection and send it messages. Once the messages are sent, in rdd2, after receiving the ack from the remote actor, I kill the ActorClient actor and shut down the ActorSystem.
The above code is throwing the following exception:

14/12/22 19:04:36 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, IMPETUS-DSRV05.impetus.co.in): java.lang.ExceptionInInitializerError:
        com.impetus.spark.SparkExec$$anonfun$2.apply(SparkExec.scala:166)
        com.impetus.spark.SparkExec$$anonfun$2.apply(SparkExec.scala:159)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984)
        org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        java.lang.Thread.run(Thread.java:722)

14/12/22 19:04:36 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, IMPETUS-DSRV05.impetus.co.in): java.lang.NoClassDefFoundError: Could not initialize class com.impetus.spark.SparkExec$
        com.impetus.spark.SparkExec$$anonfun$2.apply(SparkExec.scala:166)
        com.impetus.spark.SparkExec$$anonfun$2.apply(SparkExec.scala:159)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984)
        org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
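The java.lang.NoClassDefFoundError: Could not initialize class com.impetus.spark.SparkExec$ in the second trace usually means the static initializer of the SparkExec object ran and failed on an executor: the map closures reference SparkExec.actor, so each executor JVM tries to construct the object's vals, including the ActorSystem, when the task deserializes. A minimal sketch of one common workaround (not a tested fix for this job; the names mirror the code above) is to create the actor plumbing inside mapPartitions, so that nothing driver-side is captured:

  import scala.concurrent.Await
  import scala.concurrent.duration._
  import akka.actor.ActorSystem
  import akka.pattern.ask
  import akka.util.Timeout

  // Sketch: build the ActorSystem on the executor, per partition, instead of
  // referencing one created in a driver-side singleton object.
  val rdd2 = rdd1.mapPartitions { iter =>
    val system = ActorSystem("WorkerActorSystem") // hypothetical name
    implicit val timeout = Timeout(10.seconds)
    val remote = system.actorSelection(
      "akka.tcp://ActorSystem@192.168.145.183:2551/user/MasterActor")

    val results = iter.map { x =>
      val future = remote ? "Done"
      Await.result(future, timeout.duration).asInstanceOf[String]
    }.toList // force evaluation before the system is shut down

    system.shutdown()
    results.iterator
  }

Creating an ActorSystem per partition is heavyweight; a lazily initialized per-executor singleton would amortize that cost, but the sketch keeps the lifecycle obvious.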
Tuning Spark Streaming jobs
Hi, After facing issues with the performance of some of our Spark Streaming jobs, we invested quite some effort in figuring out the factors that affect the performance characteristics of a Streaming job. We defined an empirical model that helps us reason about Streaming jobs, and applied it to tune them in order to maximize throughput. We have summarized our findings in a blog post, with the intention of collecting feedback and in the hope that it is useful to other Spark Streaming users facing similar issues. http://www.virdata.com/tuning-spark/ Your feedback is welcome. With kind regards, Gerard. Data Processing Team Lead Virdata.com @maasg
Re: Tuning Spark Streaming jobs
Hi Gerard, Really nice guide! I'm particularly interested in the Mesos scheduling side, to more evenly distribute cores across the cluster. I wonder if you are using coarse-grained mode or fine-grained mode? I'm making changes to the Spark Mesos scheduler and I think we can propose a best way to achieve what you mentioned. Tim Sent from my iPhone
Re: Use mvn to build Spark 1.2.0 failed
I also couldn't reproduce this issue.
cleaning up cache files left by SPARK-2713
Is there a reason not to go ahead and move the _cache and _lock files created by Utils.fetchFiles into the work directory, so they can be cleaned up more easily? I saw comments to that effect in the discussion of the PR for 2713, but it doesn't look like it got done. And no, I didn't just have a machine fill up the /tmp directory, why do you ask? :)
Re: cleaning up cache files left by SPARK-2713
https://github.com/apache/spark/pull/3705 -- Marcelo
Re: spark-yarn_2.10 1.2.0 artifacts
Thank you, Sean, using spark-network-yarn seems to do the trick. On 12/19/2014 12:13 PM, Sean Owen wrote: I believe spark-yarn does not exist from 1.2 onwards. Have a look at spark-network-yarn for where some of that went, I believe. On Fri, Dec 19, 2014 at 5:09 PM, David McWhorter mcwhor...@ccri.com wrote: Hi all, Thanks for your work on Spark! I am trying to locate the spark-yarn jars for the new 1.2.0 release. The jars for spark-core, etc., are on Maven Central, but the spark-yarn jars are missing. Confusingly, and perhaps relatedly, I also can't seem to get the spark-yarn artifact to install on my local computer when I run 'mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean install'. At the install plugin stage, Maven reports:

[INFO] --- maven-install-plugin:2.5.1:install (default-install) @ spark-yarn_2.10 ---
[INFO] Skipping artifact installation

Any help or insights into how to use spark-yarn_2.10 1.2.0 in a Maven build would be appreciated. David -- David McWhorter Software Engineer Commonwealth Computer Research, Inc. 1422 Sachem Place, Unit #1 Charlottesville, VA 22901 mcwhor...@ccri.com | 434.299.0090x204
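For anyone wiring 1.2.0 into a build after this change: the artifact that replaced the published spark-yarn pieces is spark-network-yarn, as confirmed above. A minimal sbt sketch (coordinates assumed to be org.apache.spark / spark-network-yarn_2.10 / 1.2.0; verify against Maven Central before relying on them):

  // build.sbt sketch: depend on the YARN shuffle-service module that
  // replaced the published spark-yarn artifact in 1.2.0.
  libraryDependencies += "org.apache.spark" %% "spark-network-yarn" % "1.2.0"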
Re: Tuning Spark Streaming jobs
Hi Tim, That would be awesome. We have seen some really disparate Mesos allocations for our Spark Streaming jobs (like (7,4,1) cores over 3 executors for 4 Kafka consumers, instead of the ideal (3,3,3,3)). For network-dependent consumers, achieving an even deployment would provide reliable and reproducible streaming job execution from the performance point of view. We're deploying in coarse-grained mode. Not sure Spark Streaming would work well in fine-grained mode, given the added latency to acquire a worker. You mention that you're changing the Mesos scheduler. Is there a JIRA where this work is taking place? -kr, Gerard.
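For reference, the deployment mode Gerard mentions is controlled by the spark.mesos.coarse flag. A minimal sketch of a coarse-grained setup (the master URL and core cap are illustrative values, not from this thread):

  import org.apache.spark.{SparkConf, SparkContext}

  // Sketch: coarse-grained Mesos mode, so executors hold their cores for the
  // lifetime of the streaming job instead of acquiring workers per task.
  val conf = new SparkConf()
    .setAppName("streaming-job")
    .setMaster("mesos://zk://host1:2181/mesos") // illustrative Mesos master URL
    .set("spark.mesos.coarse", "true")
    .set("spark.cores.max", "12") // illustrative cap; Mesos still decides placement
  val sc = new SparkContext(conf)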
Re: Data source interface for making multiple tables available for query
I agree, and this is something that we have discussed in the past. Essentially, I think that instead of creating a RelationProvider that returns a single table, we'll have something like an external catalog that can return multiple base relations. On Sun, Dec 21, 2014 at 6:43 PM, Venkata ramana gollamudi ramana.gollam...@huawei.com wrote: Hi, In the data source DDL (ddl.scala), CREATE TEMPORARY TABLE makes one table at a time available as a temp table. What about the case where multiple (or all) tables from some data source need to be available for query, just like Hive tables? I think we also need an interface to connect such data sources. Please comment. Regards, Ramana
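To make the contrast concrete: a data source today implements the single-table RelationProvider from org.apache.spark.sql.sources, and the idea is a catalog-style hook that returns many relations at once. A rough sketch (the MultiTableProvider trait below is purely hypothetical, not an existing API):

  import org.apache.spark.sql.SQLContext
  import org.apache.spark.sql.sources.{BaseRelation, RelationProvider}

  // Today: one CREATE TEMPORARY TABLE maps to one BaseRelation.
  class MySourceProvider extends RelationProvider {
    override def createRelation(
        sqlContext: SQLContext,
        parameters: Map[String, String]): BaseRelation = ???
  }

  // Hypothetical external-catalog shape discussed in this thread: a single
  // registration exposes every table the source knows about, keyed by name.
  trait MultiTableProvider {
    def createRelations(
        sqlContext: SQLContext,
        parameters: Map[String, String]): Map[String, BaseRelation]
  }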
Announcing Spark Packages
Dear Spark users and developers, I’m happy to announce Spark Packages (http://spark-packages.org), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install packages for any version of Spark, and makes it easy for developers to contribute packages. Spark Packages will feature integrations with various data sources, management tools, higher level domain-specific libraries, machine learning algorithms, code samples, and other Spark content. Thanks to the package authors, the initial listing of packages includes scientific computing libraries, a job execution server, a connector for importing Avro data, tools for launching Spark on Google Compute Engine, and many others. I’d like to invite you to contribute and use Spark Packages and provide feedback! As a disclaimer: Spark Packages is a community index maintained by Databricks and (by design) will include packages outside of the ASF Spark project. We are excited to help showcase and support all of the great work going on in the broader Spark community! Cheers, Xiangrui
More general submitJob API
Fellow Sparkers, I'm rather puzzled at the submitJob API. I can't quite figure out how it is supposed to be used. Is there any more documentation about it? Also, is there any simpler way to multiplex jobs on the cluster, such as starting multiple computations in as many threads in the driver and reaping all the results when they are available? Thanks, Alex
Re: Announcing Spark Packages
Hi Xiangrui, That link is currently returning a 503 Over Quota error message. Would you mind pinging back out when the page is back up? Thanks! Andrew
Re: Announcing Spark Packages
Xiangrui asked me to report that it's back and running :) On Mon, Dec 22, 2014 at 3:21 PM, peng pc...@uowmail.edu.au wrote: Me 2 :)
Re: Announcing Spark Packages
Hello Xiangrui, If you have not already done so, you should look at http://www.apache.org/foundation/marks/#domains for the policy on use of ASF trademarked terms in domain names. thanks -- Hitesh
Re: More general submitJob API
Hi Alex, SparkContext.submitJob() is marked as experimental -- most client programs shouldn't be using it. What are you looking to do? For multiplexing jobs, one thing you can do is have multiple threads in your client JVM each submit jobs on your shared SparkContext. This is described here in the docs: http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application Andrew
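A minimal sketch of that pattern (the RDDs and thread-pool size are illustrative): each Future body calls an action on the shared SparkContext, so each runs as an independent, concurrently scheduled job.

  import java.util.concurrent.Executors
  import scala.concurrent.{Await, ExecutionContext, Future}
  import scala.concurrent.duration._
  import org.apache.spark.{SparkConf, SparkContext}

  object MultiJobExample { // hypothetical driver program
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("multiplex-jobs"))
      implicit val ec =
        ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

      // Each Future submits its own job via an action on the shared context.
      val jobs = Seq(
        Future { sc.parallelize(1 to 1000000).sum() },
        Future { sc.parallelize(1 to 1000000).filter(_ % 2 == 0).count().toDouble }
      )

      // Reap all the results once they are available.
      val results = Await.result(Future.sequence(jobs), 10.minutes)
      println(results.mkString(", "))
      sc.stop()
    }
  }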
Re: More general submitJob API
Andrew, Thanks, yes, this is what I wanted: basically just to start multiple jobs concurrently in threads. Alex
Re: More general submitJob API
A SparkContext is thread-safe, so you can just have different threads that create their own RDDs and do actions, etc. - Patrick
Re: Announcing Spark Packages
Hey Nick, I think Hitesh was just trying to be helpful and point out the policy - not necessarily saying there was an issue. We've taken a close look at this and I think we're in good shape here vis-a-vis this policy. - Patrick

On Mon, Dec 22, 2014 at 5:29 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Hitesh, From your link: "You may not use ASF trademarks such as 'Apache' or 'ApacheFoo' or 'Foo' in your own domain names if that use would be likely to confuse a relevant consumer about the source of software or services provided through your website, without written approval of the VP, Apache Brand Management or designee." The title on the packages website is "A community index of packages for Apache Spark." Furthermore, the footnote of the website reads "Spark Packages is a community site hosting modules that are not part of Apache Spark." I think there's nothing on there that would confuse a relevant consumer about the source of software. It's pretty clear that the Spark Packages name is well within the ASF's guidelines. Have I misunderstood the ASF's policy? Nick
Re: Announcing Spark Packages
Okie doke! (I just assumed there was an issue since the policy was brought up.)
Re: [ANNOUNCE] Requiring JIRA for inclusion in release credits
Does this include contributions made against the spark-ec2 repo (https://github.com/mesos/spark-ec2)? On Wed Dec 17 2014 at 12:29:19 AM Patrick Wendell pwend...@gmail.com wrote: Hey All, Due to the very high volume of contributions, we're switching to an automated process for generating release credits. This process relies on JIRA for categorizing contributions, so it's not possible for us to provide credits when users submit pull requests with no associated JIRA. This needed to be automated because, with more than 1000 commits per release, finding proper names for every commit and summarizing contributions was taking on the order of days. For 1.2.0 there were around 100 commits that did not have JIRAs. I'll try to manually merge these into the credits, but please e-mail me directly if you are not credited once the release notes are posted. The notes should be posted within 48 hours of right now. We already ask that users include a JIRA for pull requests, but now it will be required for proper attribution. I've updated the contributing guide on the wiki to reflect this. - Patrick
Re: [ANNOUNCE] Requiring JIRA for inclusion in release credits
Hey Josh, We don't explicitly track contributions to spark-ec2 in the Apache Spark release notes. The main reason is that usually updates to spark-ec2 include a corresponding update to Spark, so we get it there. This may not always be the case, though, so let me know if you think there is something missing we should add. - Patrick On Mon, Dec 22, 2014 at 6:17 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Does this include contributions made against the spark-ec2 repo?
Re: [ANNOUNCE] Requiring JIRA for inclusion in release credits
s/Josh/Nick/ - sorry!